15th USENIX Security Symposium

15th USENIX Security Symposium

Pp. 179–192 of the Proceedings

Static Detection of Security Vulnerabilities
in Scripting Languages

Yichen Xie Alex Aiken
Computer Science Department
Stanford University
Stanford, CA 94305
{yxie,aiken}@cs.stanford.edu

Abstract: We present a static analysis algorithm for detecting security vulnerabilities in PHP, a popular server-side scripting language for building web applications. Our analysis employs a novel three-tier architecture to capture information at decreasing levels of granularity at the intrablock, intraprocedural, and interprocedural level. This architecture enables us to handle dynamic features of scripting languages that have not been adequately addressed by previous techniques.

We demonstrate the effectiveness of our approach on six popular open source PHP code bases, finding 105 previously unknown security vulnerabilities, most of which we believe are remotely exploitable.

1 Introduction

Web-based applications have proliferated rapidly in recent years and have become the de facto standard for delivering online services ranging from discussion forums to security sensitive areas such as banking and retailing. As such, security vulnerabilities in these applications represent an increasing threat to both the providers and the users of such services. During the second half of 2004, Symantec cataloged 670 vulnerabilities affecting web applications, an 81% increase over the same period in 2003 [17]. This trend is likely to continue for the foreseeable future.

According to the same report, these vulnerabilities are typically caused by programming errors in input validation and improper handling of submitted requests [17]. Since vulnerabilities are usually deeply embedded in the program logic, traditional network-level defense (e.g., firewalls) does not offer adequate protection against such attacks. Testing is also largely ineffective because attackers typically use the least expected input to exploit these vulnerabilities and compromise the system.

A natural alternative is to find these errors using static analysis. This approach has been explored in WebSSARI [7] and by Minamide [10]. WebSSARI has been used to find a number of security vulnerabilities in PHP scripts, but has a large number of false positives and negatives due to its intraprocedural type-based analysis. Minamide's system checks syntactic correctness of HTML output from PHP scripts and does not seem to be effective for finding security vulnerabilities. The main message of this paper is that analysis of scripting languages need not be significantly more difficult than analysis of conventional languages. While a scripting language stresses different aspects of static analysis, an analysis suitably designed to address the important aspects of scripting languages can identify many serious vulnerabilities in scripts reliably and with a high degree of automation. Given the importance of scripting in real world applications, we believe there is an opportunity for static analysis to have a significant impact in this new domain.

In this paper, we apply static analysis to finding security vulnerabilities in PHP, a server-side scripting language that has become one of the most widely adopted platforms for developing web applications.¹ Our goal is a bug detection tool that automatically finds serious vulnerabilities with high confidence. This work, however, does not aim to verify the absence of bugs.

This paper makes the following contributions:

We present an interprocedural static analysis algorithm for PHP. A language as dynamic as PHP presents unique challenges for static analysis: language constructs (e.g., include) that allow dynamic inclusion of program code, variables whose types change during execution, operations with semantics that depend on the runtime types of the operands (e.g., <), and pervasive use of hash tables and regular expression matching are just some features that must be modeled well to produce useful results.

To faithfully model program behavior in such a language, we use a three-tier analysis that captures information at decreasing levels of granularity at the intrablock, intraprocedural, and interprocedural levels. This architecture allows the analysis to be precise where it matters the most–at the intrablock and, to a lesser extent, the intraprocedural levels–and use agressive abstraction at the natural abstraction boundary along function calls to achieve scalability. We use symbolic execution to model dynamic features inside basic blocks and use block summaries to hide that complexity from intra- and inter-procedural analysis. We believe the same techniques can be applied easily to other scripting languages (e.g., Perl).
We show how to use our static analysis algorithm to find SQL injection vulnerabilities. Once configured, the analysis is fully automatic. Although we focus on SQL injections in this work, the same techniques can be applied to detecting other vulnerabilities such as cross site scripting (XSS) and code injection in web applications.
We experimentally validate our approach by implementing the analysis algorithm and running it on six popular web applications written in PHP, finding 105 previously unknown security vulnerabilities. We analyzed two reported vulnerabilities in PHP-fusion, a mature, widely deployed content management system, and construct exploits for both that allow an attacker to control or damage the system.²

The rest of the paper is organized as follows. We start with a brief introduction to PHP and show examples of SQL vulnerabilities in web application code (Section 2). We then present our analysis algorithm and show how we use it to find SQL injection vulnerabilities (Section 3). Section 4 describes the implementation, experimental results, and two case studies of exploitable vulnerabilities in PHP-fusion. Section 5 discusses related work and Section 6 concludes.

2 Background

This section briefly introduces the PHP language and shows examples of SQL injection vulnerabilities in PHP.

PHP was created a decade ago by Rasmus Lerdorf as a simple set of Perl scripts for tracking accesses to his online resume. It has since evolved into one of the most popular server-side scripting languages for building web applications. According to a recent Security Space survey, PHP is installed on 44.6% of Apache web servers [16], adopted by millions of developers, and used or supported by Yahoo, IBM, Oracle, and SAP, among others [14].

Although the PHP language has undergone two major re-designs over the past decade, it retains a Perl-like syntax and dynamic (interpreted) nature, which contributes to its most frequently claimed advantage of being simple and flexible.

PHP has a suite of programming constructs and special operations that ease web development. We give three examples:

Natural integration with SQL: PHP provides nearly native support for database operations. For example, using inline variables in strings, most SQL queries can be concisely expressed with a simple function call
```
$rows=mysql_query("UPDATE users SET 
    pass=`$pass' WHERE userid=`$userid'");
```
Contrast this code with Java, where a database is typically accessed through prepared statements: one creates a statement template and fills in the values (along with their types) using bind variables:
```
PreparedStatement s = con.prepareStatement
  ("UPDATE users SET pass = ? 
    WHERE userid = ?");
s.setString(1, pass); s.setInt(2, userid);
int rows = s.executeUpdate();
```
Dynamic types and implicit casting to and from strings: PHP, like other scripting languages, has extensive support for string operations and automatic conversions between strings and other types. These features are handy for web applications because strings serve as the common medium between the browser, the web server, and the database backend. For example, we can convert a number into a string without an explicit cast:
```
if ($userid < 0) exit;
$query = "SELECT * from users 
          WHERE userid = `$userid'";
```
Variable scoping and the environment: PHP has a number of mechanisms that minimize redundancy when accessing values from the execution environment. For example, HTTP get and post requests are automatically imported into the global name space as hash tables $_GET and $_POST. To access the name field of a submitted form, one can simply use $_GET[`name'] directly in the program.

If this still sounds like too much typing, PHP provides an extract operation that automatically imports all key-value pairs of a hash table into the current scope. In the example above, one can use extract(_GET, EXTR_OVERWRITE) to import data submitted using the HTTP get method. To access the $name field, one now simply types $name, which is preferred by some to $_GET[`name'].

However, these conveniences come with security implications:

SQL injection made easy: Bind variables in Java have the benefit of assuring the programmer that any data passed into an SQL query remains data. The same cannot be said for the PHP example where malformed data from a malicious attacker may change the meaning of an SQL statement and cause unintended operations to the database. These are commonly called SQL injection attacks.

In the example above (case 1), suppose $userid is controlled by the attacker and has value
' OR `1' = `1
The query string becomes
```
UPDATE users SET pass='...' 
WHERE userid='' OR '1'='1'
```
which has the effect of updating the password for all users in the database.
Unexpected conversions: Consider the following code:
```
if ($userid == 0) echo $userid;
```
One would expect that if the program prints anything, it should be “0”. Unfortunately, PHP implicitly casts string values into numbers before comparing them with an integer. Non-numerical values (e.g., “abc”) convert to 0 without complaint, so the code above can print anything other than a non-zero number. We can imagine a potential SQL injection vulnerability if $userid is subsequently used to construct an SQL query as in the previous case.
Uninitialized variables under user control: In PHP, uninitialized variables default to null. Some programs rely on this fact for correct behavior; consider the following code:
```
extract($_GET, EXTR_OVERWRITE);
for ($i=0;$i<=7;$i++)
  $new_pass .= chr(rand(97, 122)); // append one char
mysql_query("UPDATE ... $new_pass ...");
```
This program generates a random password and inserts it into the database. However, due to the extract operation on line 1, a malicious user can introduce an arbitrary initial value for $new_pass by adding an unexpected new_pass field into the submitted HTTP form data.

3 Analysis

Given a PHP source file, our tool carries out static analysis in the following steps:

We parse the PHP source into abstract syntax trees (ASTs). Our parser is based on the standard open-source implementation of PHP 5.0.5 [13]. Each PHP source file contains a main section (referred to as the main function hereafter although it is not part of any function definition) and zero or more user-defined functions. We store the user-defined functions in the environment and start the analysis from the main function.
```
CFG := build_control_flow_graph(AST);
foreach (basic_block b in CFG)
  summaries[b] := simulate_block(b);
return make_function_summary(CFG, summaries);
```
Figure 1: Pseudo-code for the analysis of a function.
The analysis of a single function is summarized in Figure 1. For each function in the program, the analysis performs a standard conversion from the abstract syntax tree (AST) of the function body into a control flow graph (CFG). The nodes of the CFG are basic blocks: maximal single entry, single exit sequences of statements. The edges of the CFG are the jump relationships between blocks. For conditional jumps, the corresponding CFG edge is labeled with the branch predicate.
Each basic block is simulated using symbolic execution. The goal is to understand the collective effects of statements in a block on the global state of the program and summarize their effects into a concise block summary (which describes, among other things, the set of variables that must be sanitized³ before entering the block). We describe the simulation algorithm in Section 3.1.
After computing a summary for each basic block, we use a standard reachability analysis to combine block summaries into a function summary. The function summary describes the pre- and post-conditions of a function (e.g., the set of sanitized input variables after calling the current function). We discuss this step in Section 3.2.
During the analysis of a function, we might encounter calls to other user-defined functions. We discuss modeling function calls, and the order in which functions are analyzed, in Section 3.3.

3.1 Simulating Basic Blocks


function simulate_block(BasicBlock b) : BlockSummary
{
  state := init_simulation_state();
  foreach (Statement s in b) {
    state := simulate(s, state);
    if (state.has_returned || state.has_exited)
      break;
  }
  summary := make_block_summary(state);
  return summary;
}

Figure 2: Pseudo-code for intra-block simulation.

3.1.1 Outline

Figure 2 gives pseudo-code outlining the symbolic simulation process. Recall each basic block contains a linear sequence of statements with no jumps or jump targets in the middle. The simulation starts in an initial state, which maps each variable x to a symbolic initial value x₀. It processes each statement in the block in order, updating the simulator state to reflect the effect of that statement. The simulation continues until it encounters any of the following:

the end of the block;
a return statement. In this case, the current block is marked as a return block, and the simulator evaluates and records the return value;
an exit statement. In this case the current block is marked as an exit block;
a call to a user-defined function that exits the program. This condition is automatically determined using the function summary of the callee (see Sections 3.2 and 3.3).

Note that in the last case execution of the program has also terminated and therefore we remove any ensuing statements and outgoing CFG edges from the current block.

After a basic block is simulated, we use information contained in the final state of the simulator to summarize the effect of the block into a block summary, which we store for use during the intraprocedural analysis (see Section 3.2). The state itself is discarded after simulation.

The following subsections describe the simulation process in detail. We start with a definition of the subset of PHP that we model (�3.1.2) and discuss the representation of the simulation state and program values (�3.1.3, �3.1.4) during symbolic execution. Using the value representation, we describe how the analyzer simulates expressions (�3.1.5) and statements (�3.1.6). Finally, we describe how we represent and infer block summaries (�3.1.7).

3.1.2 Language

Type (τ) ::= str | bool | int | ⊤

Const (c) ::= string | int | true | false | null

L-val (lv) ::= x | Arg#i | lv[e]

Expr (e) ::= c | lv | e binop e | unop e | (τ) e

Stmt (S) ::= lv ← e | lv ← f(e₁, …, e_n)

| return e | exit | include e

binop ∈ { +, −, concat, ==, !=, <, >, … }

unop ∈ { −, ¬ }

Figure 3: Language Definition

Value Representation

Loc (l) ::= x | l[string] | l[⊤]

Init-Values (o) ::= l₀

Segment (β) ::= string | contains(σ) | o | ⊥

String (s) ::= ⟨ β₁, ..., β_n ⟩

Boolean (b) ::= true | false | untaint(σ₀, σ₁)

Loc-set (σ) ::= { l₁, ..., l_n }

Integer (i) ::= k

Value (v) ::= s | b | i | o | ⊤

Simulation State

State (Γ) : Loc → Value

(a) Value representation and simulation state.

Locations

Γ ⊢ x

Lv

⇒

x

var

Γ ⊢ Arg#n

Lv

⇒

Arg#n

arg

Γ ⊢ e

E

⇒

l

Γ ⊢ e'

E

⇒

v'    v'' = cast(v', str)

Γ ⊢ e[e']

Lv

⇒

⎧
⎨
⎩

l[α]    if v'' = ⟨“α”⟩

l[⊤]    otherwise

dim

(b) L-values.

Expressions

Type casts:

cast(k, bool) = ⎧
⎨
⎩

true    if k ≠ 0

false    otherwise

cast(true, str) = ⟨“1”⟩

cast(false, str) = ⟨⟩

cast(v = ⟨ β₁, ..., β_n ⟩, bool)

       = ⎧
⎪
⎪
⎪
⎨
⎪
⎪
⎪
⎩

true

if (v ≠ ⟨“0” ⟩) ∧

n

∨

i=1
¬ is_empty(β_i)

false

if (v = ⟨“0” ⟩) ∨

n

∧

i=1
is_empty(β_i)

⊤    otherwise

…

Evaluation Rules:

Γ ⊢ lv

Lv

⇒

l

Γ ⊢ lv

E

⇒

Γ(l)

L-val

Γ ⊢ e₁

E

⇒

v₁    cast(v₁, str) = ⟨ β₁, ..., β_n ⟩

Γ ⊢ e₂

E

⇒

v₂    cast(v₂, str) = ⟨ β_n+1, ..., β_m ⟩

Γ ⊢ e₁ concat e₂

E

⇒

⟨ β₁, ..., β_m ⟩

concat

Γ ⊢ e

E

⇒

v    cast(v, bool) = v'

Γ ⊢ ¬ e

E

⇒

⎧
⎪
⎨
⎪
⎩

true      if v' = false

false      if v' = true

untaint(σ₁, σ₀)      if v' = untaint(σ₀, σ₁)

⊤      otherwise



not

(c) Expressions.

Figure 4: Intrablock simulation algorithm.

Figure 3 gives the definition of a small imperative language that captures a subset of PHP constructs that we believe is relevant to SQL injection vulnerabilities. Like PHP, the language is dynamically typed. We model three basic types of PHP values: strings, booleans and integers. In addition, we introduce a special ⊤ type to describe objects whose static types are undetermined (e.g., input parameters).⁴

Expressions can be constants, l-values, unary and binary operations, and type casts. The definition of l-values is worth mentioning because in addition to variables and function parameters, we include a named subscript operation to give limited support to the array and hash table accesses used extensively in PHP programs.

A statement can be an assignment, function call, return, exit, or include. The first four statement types require no further explanation. The include statement is a commonly used feature unique to scripting languages, which allows programmers to dynamically insert code into the program. In our language, include evaluates its string argument, and executes the program file designated by the string as if it is inserted at that program point (e.g., it shares the same scope). We describe how we simulate such behavior in Section 3.1.6.

3.1.3 State

Figure 4(a) gives the definition of values and states during simulation. The simulation state maps memory locations to their value representations, where a memory location is either a program variable (e.g. x), or an entry in a hash table accessed via another location (e.g. x[key]). Note the definition of locations is recursive, so multi-level hash dereferences are supported in our algorithm.

On entry to the function, each location l is implicitly initialized to a symbolic initial value l₀, which makes up the initial state of the simulation. The values we represent in the state can be classified into three categories based on type:

Strings: Strings are the most fundamental type in many scripting languages, and precision in modeling strings directly determines analysis precision. Strings are typically constructed through concatenation. For example, user inputs (via HTTP get and post methods) are often concatenated with a pre-constructed skeleton to form an SQL query. Similarly, results from the query can be concatenated with HTML templates to form output. Modeling concatenation well enables an analysis to better understand information flow in a script. Thus, our string representations is based on concatenation. String values are represented as an ordered concatenation of string segments, which can be one of the following: a string constant, the initial value of a memory location on entry to the current block (l₀), or a string that contains initial values of zero or more elements from a set of memory locations (contains(σ)). We use the last representation to model return values from function calls, which may non-deterministically contain a combination of global variables and input parameters. For example, in


function f($a, $b) {
   if (...) return $a;
   else return $b;
}
$ret = f($x.$y, $z);

we represent the return value on line 5 as contains({x, y, z}) to model the fact that it may contain any element in the set as a sub-string.

The string representation described above has the following benefits:

First, we get automatic constant folding for strings within the current block, which is often useful for resolving hash keys and distinguishing between hash references (e.g., in $key = “key”; return $hash[$key];).

Second, we can track how the contents of one input variable flow into another by finding occurrences of initial values of the former in the final representation of the latter. For example, in: $a = $a . $b, the final representation of $a is ⟨a₀, b₀⟩. We know that if either $a or $b contains unsanitized user input on entry to the current block, so does $a upon exit.

Finally, interprocedural dataflow is possible by tracking function return values based on function summaries using contains(σ). We describe this aspect in more detail in Section 3.3.

Booleans: In PHP, a common way to perform input validation is to call a function that returns trueor falsedepending on whether the input is well-formed or not. For example, the following code sanitizes $userid:


$ok = is_safe($userid);
if (!$ok) exit;

The value of Boolean variable $ok after the call is undetermined, but it is correlated with the safety of $userid. This motivates untaint(σ₀, σ₁) as a representation for such Booleans: σ₀ (resp. σ₁) represents the set of validated l-values when the Boolean is false(resp. true). In the example above, $ok has representation untaint({}, {userid}).

Besides untaint, representation for Booleans also include constants (trueand false) and unknown (⊤).

Integers: Integer operations are less emphasized in our simulation. We track integer constants and binary and unary operations between them. We also support type casts from integers to Boolean and string values.

3.1.4 Locations and L-values

In the language definition in Figure 3, hash references may be aliased through assignments and l-values may contain hash accesses with non-constant keys. The same l-value may refer to different memory locations depending on the value of both the host and the key, and therefore, l-values are not suitable as memory locations in the simulation state.

Figure 4(b) gives the rules we use to resolve l-values into memory locations. The var and arg rules map each program variable and function argument to a memory location identified by its name, and the dim rule resolves hash accesses by first evaluating the hash table to a location and then appending the key to form the location for the hash entry.

These rules are designed to work in the presence of simple aliases. Consider the following program:


$hash = $_POST;
$key = 'userid';
$userid = $hash[$key];

The program first creates an alias ($hash) to hash table $_POST and then accesses the userid entry using that alias. On entry to the block, the initial state maps every location to its initial value:

Γ = { hash ⇒hash₀ , key ⇒key₀, _POST ⇒_POST₀,

_POST[userid] ⇒ _POST[userid]₀ }

According to the var rule, each variable maps to its own unique location. After the first two assignments, the state is:

Γ = { hash ⇒ _POST₀, key ⇒ ⟨`userid'⟩, … }

We use the dim rule to resolve $hash[$key] on line 3: $hash evaluates to _POST₀, and $key evaluates to constant string 'userid'. Therefore, the l-value $hash[$key] evaluates to location _POST[userid], and thus the analysis assigns the desired value _POST[userid]₀ to $userid.

3.1.5 Expressions

We perform abstract evaluation of expressions based on the value representation described above. Because PHP is a dynamically typed language, operands are implicitly cast to appropriate types for operations in an expression. Figure 4(c) gives a representative sample of cast rules simulating cast operations in PHP. For example, Boolean value true, when used in a string context, evaluates to “1”. false, on the other hand, is converted to the empty string instead of “0”. In cases where exact representation is not possible, the result of the cast is unknown (⊤).

Figure 4(c) also gives three representative rules for evaluating expressions. The first rule handles l-values, and the result is obtained by first resolving the l-value into a memory location, and then looking up the location in the evaluation context (recall that Γ(l) = l₀ on entry to the block).

The second rule models string concatenation. We first cast the value of both operands to string values, and the result is the concatenation of both.

The final rule handles Boolean negation. The interesting case involves untaint values. Recall that untaint(σ₀, σ₁) denotes an unknown Boolean value that is false(resp. true) if l-values in the set σ₀ (resp. σ₁) are sanitized. Given this definition, the negation of untaint(σ₀, σ₁) is untaint(σ₁, σ₀).

The analysis of an expression is ⊤ if we cannot determine a more precise representation, which is a potential source of false negatives.

3.1.6 Statements

We model assignments, function calls, return, exit, and include statements in the program. The assignment rule resolves the left-hand side to a memory location l, and evaluates the right-hand side to a value v. The updated simulation state after the assignment maps l to the new value v:

Γ ⊢ lv

Lv

⇒

l Γ ⊢ e

E

⇒

Γ ⊢ lv ← e

S

⇒

Γ[l ↦ v]

assignment

Function calls are similar. The return value of a function call f(e₁, …, e_n) is modeled using either contains(σ) (if f returns a string) or untaint(σ₀, σ₁) (if f returns a Boolean) depending on the inferred summary for f. We defer discussion of the function summaries and the return value representation to Sections 3.2 and 3.3. For the purpose of this section, we use the uninterpreted value f(v₁, …, v_n) as a place holder for the actual representation of the return value:

Γ ⊢ lv

Lv

⇒

l Γ ⊢ e₁

E

⇒

v₁ … Γ ⊢ e_n

E

⇒

v_n

Γ ⊢ lv ← f(e₁, …, e_n)

S

⇒

Γ[l ↦ f(v₁, …, v_n) ]

fun

In addition to the return value, certain functions have pre- and post-conditions depending on the operation they perform. Pre- and post-conditions are inferred and stored in the callee's summary, which we describe in detail in Sections 3.2 and 3.3. Here we show two examples to illustrate their effects:


function validate($x) {
 if (!is_numeric($x)) exit;
 return;
}
function my_query($q) {
  global $db;
  mysql_db_query($db, $q);
}
validate($a.$b);
my_query("SELECT...WHERE a = '$a' AND c = '$c'");

The validate function tests whether the argument is a number (and thus safe) and aborts if it is not. Therefore, line 9 sanitizes both $a and $b. We record this fact by inspecting the value representation of the actual parameter (in this case ⟨a₀, b₀⟩), and remembering the set of non-constant segments that are sanitized.

The second function my_query uses its argument as a database query string by calling mysql_db_query. To prevent SQL injection attacks, any user input must be sanitized before it becomes part of the first parameter. Again, we enforce this requirement by inspecting the value representation of the actual parameter. We record any unsanitized non-constant segments (in this case $c, since $a is sanitized on line 9) and require they be sanitized as part of the pre-condition for the current block.

Sequences of assignments and function calls are simulated by using the output environment of the previous statement as the input environment of the current statement:

Γ ⊢ s₁

S

⇒

Γ' Γ' ⊢ s₂

S

⇒

Γ''

Γ ⊢ (s₁; s₂)

S

⇒

Γ''

seq

The final simulation state is the output state of the final statement.

The return and exit statements terminate control flow⁵ and require special treatment. For a return, we evaluate the return value and use it in calculating the function summary. In case of an exit statement, we mark the current block as an exit block.

Finally, include statements are a commonly used feature unique to scripting languages allowing programmers to dynamically insert code and function definitions from another script. In PHP, the included code inherits the variable scope at the point of the include statement. It may introduce new variables and function definitions, and change or sanitize existing variables before the next statement in the block is executed.

We process include statements by first parsing the included file, and adding any new function definitions to the environment. We then splice the control flow graph of the included main function at the current program point by a) removing the include statement, b) breaking the current basic block into two at that point, c) linking the first half of the current block to the start of the main function, and all return blocks (those containing a return statement) in the included CFG to the second half, and d) replacing the return statements in the included script with assignments to reflect the fact that control flow resumes in the current script.

3.1.7 Block summary

The final step for the symbolic simulator is to characterize the behavior of a CFG block into a concise summary. A block summary is represented as a six-tuple ⟨ E, D, F, T, R, U ⟩:

Error set ( E): the set of input variables that must be sanitized before entering the current block. These are accumulated during simulation of function calls that require sanitized input.
Definitions ( D): the set of memory locations defined in the current block. For example, in
```
$a = $a.$b; $c = 123;
```
we have D = { a, c }.
Value flow ( F): the set of pairs of locations (l₁, l₂) where the string value of l₁ on entry becomes a sub-string of l₂ on exit. In the example above, F = { (a, a), (b, a) }.
Termination predicate ( T): trueif the current block contains an exit statement, or if it calls a function that causes the program to terminate.
Return value ( R): records the representation for the return value if any, undefined otherwise. Note that if the current block has no successors, either R has a value or T is true.
Untaint set ( U): for each successor of the current CFG block, we compute the set of locations that are sanitized if execution continues onto that block. Sanitization can occur via function calls, casting to safe types (e.g., int, etc), regular expression matching, and other tests. The untaint set for different successors might differ depending on the value of branch predicates. We show an example below.
```
validate($a);
$b = (int) $c;
if (is_numeric($d))
      ...
```
As mentioned earlier, validate exits if $a is unsafe. Casting to integer also returns a safe result. Therefore, the untaint set is { a, b, d } for the true branch, and { a, b } for the false branch.

3.2 Intraprocedural Analysis

Based on block summaries computed in the previous step, the intraprocedural analysis computes the following summary ⟨ E, R, S, X ⟩ for each function:

Error set ( E): the set of memory locations (variables, parameters, and hash accesses) whose value may flow into a database query, and therefore must be sanitized before invoking the current function. For the main function, the error set must not include any user-defined variables (e.g. $_GET[`...'] or $_POST[`...'])—the analysis emits an error message for each such violation.

We compute E by a backwards reachability analysis that propagates the error set of each block (using the E, D, F, and U components in the block summaries) to the start block of the function.
Return set ( R): the set of parameters or global variables whose value may be a substring of the return value of the function. R is only computed for functions that may return string values. For example, in the following code, the return set includes both function arguments and the global variable $table (i.e., R = { table, Arg#1, Arg#2 }).
```
function make_query($user, $pass) {
  global $table;
  return "SELECT * from $table ".
    "where user = $user and pass = $pass";
}
```
We compute the function return set by using a forward reachability analysis that expresses each return value (recorded in the block summaries as R) as a set of function parameters and global variables.
Sanitized values ( S): the set of parameters or global variables that are sanitized on function exit. We compute the set by using a forward reachability analysis to determine the set of sanitized inputs at each return block, and we take the intersection of those sets to arrive at the final result.

If the current function returns a Boolean value as its result, we distinguish the sanitized value set when the result is trueversus when it is false(mirroring the untaint representation for Boolean values above). The following example motivates this distinction:
```
function is_valid($x) {
  if (is_numeric($x)) return true;
  return false;
}
```
The parameter is sanitized if the function returns true, and the return value is likely to be used by the caller to determine the validity of user input. In the example above,
S = ( false ⇒ {}, true ⇒ { Arg#1 })
For comparison, the validate function defined previously has S = ( * ⇒ { Arg#1 }). In the next section, we describe how we make use of this information in the caller.
Program Exit ( X): a Boolean which indicates whether the current function terminates program execution on all paths. Note that control flow can leave a function either by returning to the caller or by terminating the program. We compute the exit predicate by enumerating over all CFG blocks that have no successors, and identify them as either return blocks or exit blocks (the T and R component in the block summary). If there are no return blocks in the CFG, the current function is an exit function.

The dataflow algorithms used in deriving these facts are fairly standard fix-point computations. We omit the details for brevity.

3.3 Interprocedural Analysis

This section describes how we conduct interprocedural analysis using summaries computed in the previous step. Assuming f has summary ⟨ E, R, S, X ⟩, we process a function call f(e₁, …, e_n) during intrablock simulation as follows:

Pre-conditions: We use the error set ( E) in the function summary to identify the set of parameters and global variables that must be sanitized before calling this function. We substitute actual parameters for formal parameters in E and record any unsanitized non-constant segments of strings in the error set as the sanitization pre-condition for the current block.
Exit condition: If the callee is marked as an exit function (i.e., X is true), we remove any statements that follow the call and delete all outgoing edges from the current block. We further mark the current block as an exit block.
Post-conditions: If the function unconditionally sanitizes a set of input parameters and global variables, we mark this set of values as safe in the simulation state after substituting actual parameters for formal parameters.

If sanitization is conditional on the return value (e.g., the is_valid function defined above), we record the intersection of its two component sets as being unconditionally sanitized (i.e., σ₀ ∩ σ₁ if the untaint set is (false ⇒ σ₀, true ⇒ σ₁)).

Return value: If the function returns a Boolean value and it conditionally sanitizes a set of input parameters and global variables, we use the untaint representation to model that correlation:

Γ ⊢ lv

Lv

⇒

l Γ ⊢ e₁

E

⇒

v₁ … Γ ⊢ e_n

E

⇒

v_n

Summary(f) = ⟨ E, R, S, X ⟩

S = (false ⇒ σ₀, true ⇒ σ₁) σ_* = σ₀ ∩ σ₁

σ₀' = subst

(σ₀ − σ_*) σ₁' = subst

(σ₁ − σ_*)

Γ ⊢ lv ← f(e₁, …, e_n)

S

⇒

Γ[l ↦ untaint(σ₀', σ₁')]

In the rule above, subst_v(σ) substitutes actual parameters (v_i) for formal parameters in σ.

If the callee returns a string value, we use the return set component of the function summary ( R) to determine the set of input parameters and global variables that might become a substring of the return value:

Γ ⊢ lv

Lv

⇒

l Γ ⊢ e₁

E

⇒

v₁ … Γ ⊢ e_n

E

⇒

v_n

Summary(f) = ⟨ E, R, S, X ⟩ σ' = subst

( R)

Γ ⊢ lv ← f(e₁, …, e_n)

S

⇒

Γ[l ↦ contains(σ')]

Since we require the summary information of a function before we can analyze its callers, the order in which functions are analyzed is important. Due to the dynamic nature of PHP (e.g., include statements), we analyze functions on demand—a function f is analyzed and summarized when we first encounter a call to f. The summary is then memoized to avoid redundant analysis. Effectively, our algorithm analyzes the source codebase in topological order based on the static function call graph. If we encounter a cycle during the analysis, the current implementation uses a dummy “no-op” summary as a model for the second invocation (i.e., we do not compute fix points for recursive functions). In theory, this is a potential source of false negatives, which can be removed by adding a simple iterative algorithm that handles recursion. However, practically, such an algorithm may be unnecessary given the rare occurrence of recursive calls in PHP programs.

4 Experimental Results

The analysis described in Section 3 has been implemented as two separate parts: a frontend based on the open source PHP 5.0.5 distribution that parses the source files into abstract syntax trees and a backend written in OCaml [8] that reads the ASTs into memory and carries out the analysis. This separation ensures maximum compatibility while minimizing dependence on the PHP implementation.

The decision to use different levels of abstraction in the intrablock, intraprocedural, and interprocedural levels enabled us to fine tune the amount of information we retain at one level independent of the algorithm used in another and allowed us to quickly build a usable tool. The checker is largely automatic and requires little human intervention for use. We seed the checker with a small set of query functions (e.g. mysql_query) and sanitization operations (e.g. is_numeric). The checker infers the rest automatically.

Regular expression matching presents a challenge to automation. Regular expressions are used for a variety of purposes including, but not limited to, input validation. Some regular expressions match well-formed input while others detect malformed input; assuming one way or the other results in either false positives or false negatives. Our solution is to maintain a database of previously seen regular expressions and their effects, if any. Previously unseen regular expressions are assumed by default to have no sanitization effects, so as not to miss any errors due to incorrect judgment. To make it easy for the user to specify the sanitization effects of regular expressions, the checker has an interactive mode where the user is prompted when the analysis encounters a previously unseen regular expression and the user's answers are recorded for future reference.⁶ Having the user declare the role of regular expressions has the real potential to introduce errors into the analysis; however, practically, we found this approach to be very effective and it helped us find at least two vulnerabilities caused by overly lenient regular expressions being used for sanitization.⁷ Our tool collected information for 49 regular expressions from the user over all our experiments (the user replies with one keystroke for each inquiry), so the burden on the user is minimal.

The checker detects errors by using information from the summary of the main function—the checker marks all variables that are required to be sanitized on entry as potential security vulnerabilities. From the checker's perspective, these variables are defined in the environment and used to construct SQL queries without being sanitized. In reality, however, these variables are either defined by the runtime environment or by some language constructs that the checker does not fully understand (e.g., the extract operation in PHP which we describe in a case study below). The tool emits an error message if the variable is known to be controlled by the user (e.g. $_GET[`…'], $_POST[`…'], $_COOKIE[`…'], etc). For others, the checker emits a warning.

We conducted our experiments on the latest versions of six open source PHP code bases: e107 0.7, Utopia News Pro 1.1.4, mybloggie 2.1.3beta, DCP Portal v6.1.1, PHP Webthings 1.4patched, and PHP fusion 6.00.204. Table 1 summarizes our findings for the first five. The analysis terminates within seconds for each script examined (which may dynamically include other source files). Our checker emitted a total of 99 error messages for the first five applications, where unsanitized user input (from $_GET, $_POST, etc) may flow into SQL queries. We manually inspected the error reports and believe all 99 represent real vulnerabilities.⁸ We have notified the developers about these errors and will publish security advisories once the errors have been fixed. We have not inspected warning messages—unsanitized variables of unresolved origin (e.g. from database queries, configuration files, etc) that are subsequently used in SQL queries due to the high likelihood of false positives.

Application (KLOC) Err Msgs Bugs (FP) Warn

News Pro (6.5) 8 8 (0) 8

myBloggie (9.2) 16 16 (0) 23

PHP Webthings (38.3) 20 20 (0) 6

DCP Portal (121) 39 39 (0) 55

e107 (126) 16 16 (0) 23

Total 99 99 (0) 115

Table 1: Summary of experiments. LOC statistics include embedded HTML, and thus is a rough estimate of code complexity. Err Msgs: number of reported errors. Bugs: number of confirmed bugs from error reports. FP: number of false positives. Warn: number of unique warning messages for variables of unresolved origin (uninspected).

PHP-fusion is different from the other five code bases because it does not directly access HTTP form data from input hash tables such as $_GET and $_POST. Instead it uses the extract operation to automatically import such information into the current variable scope. We describe our findings for PHP-fusion in the following subsection.

4.1 Case Study: Two Exploitable SQL Injection Attacks in PHP-fusion

In this section, we show two case studies of exploitable SQL injection vulnerabilities in PHP-fusion detected by our tool. PHP-fusion is an open-source content management system (CMS) built on PHP and MySQL. Excluding locale specific customization modules, it consists of over 16,000 lines of PHP code and has a wide user-base because of its speed, customizability and rich features. Browsing through the code, it is obvious that the author programmed with security in mind and has taken extra care in sanitizing input before use in query strings.

Our experiments were conducted on the then latest 6.00.204 version of the software. Unlike other code bases we have examined, PHP-fusion uses the extract operation to import user input into the current scope. As an example, extract($_POST, EXTR_OVERWRITE) has the effect of introducing one variable for each key in the $_POST hash table into the current scope, and assigning the value of $_POST[key] to that variable. This feature reduces typing, but introduces confusion for the checker and security vulnerabilities into the software—both of the exploits we constructed involve use of uninitialized variables whose values can be manipulated by the user because of the extract operation.

Since PHP-fusion does not directly read user input from input hashes such as $_GETor $_POST, there are no direct error messages generated by our tool. Instead we inspect warnings (recall the discussion about errors and warnings above), which correspond to security sensitive variables whose definition is unresolved by the checker (e.g., introduced via the extract operation, or read from configuration files).

We ran our checker on all top level scripts in PHP-fusion. The tool generated 22 unique warnings, a majority of which relate to configuration variables that are used in the construction of a large number of queries.⁹ After filtering those out, 7 warnings in 4 different files remain.

We believe all but one of the 7 warnings may result in exploitable security vulnerabilities. The lone false positive arises from an unanticipated sanitization:


/* php-files/lostpassword.php */
if (!preg_match("/^[0-9a-z]{32}$/", $account)) 
        $error = 1;
if (!$error) { /* database access using $account */ }
if ($error) redirect("index.php");

Instead of terminating the program immediately based on the result from preg_match, the program sets the $error flag to trueand delays error handling, which is in general not a good practice. This idiom can be handled by adding slightly more information in the block summary.

We investigated the first two of the remaining warnings for potential exploits and confirmed that both are indeed exploitable on a test installation. Unsurprisingly both errors are made possible because of the extract operation. We explain these two errors in detail below.

1) Vulnerability in script for recovering lost password. This is a remotely exploitable vulnerability that allows any registered user to elevate his privileges via a carefully constructed URL. We show the relevant code below:


/* php-files/lostpassword.php */
for ($i=0;$i<=7;$i++)
   $new_pass .= chr(rand(97, 122)); 
...
$result = dbquery("UPDATE ".$db_prefix."users 
  SET user_password=md5('$new_pass') 
  WHERE user_id='".$data['user_id']."'");

Our tool issued a warning for $new_pass, which is uninitialized on entry and thus defaults to the empty string during normal execution. The script proceeds to add seven randomly generated letters to $new_pass (lines 2-3), and uses that as the new password for the user (lines 5-7). The SQL request under normal execution takes the following form:


UPDATE users SET user_password=md5('???????') 
  WHERE user_id='userid'

However, a malicious user can simply add a new_pass field to his HTTP request by appending, for example, the following string to the URL for the password reminder site:
&new_pass=abc%27%29%2cuser_level=%27103%27%2cuser_aim=%28%27
The extract operation described above will magically introduce $new_pass in the current variable scope with the following initial value:
abc'), user_level='103', user_aim=('
The SQL request is now constructed as:


UPDATE users SET user_password=md5('abc'), 
      user_level='103', user_aim=('???????') 
  WHERE user_id='userid'

Here the password is set to “abc”, and the user privilege is elevated to 103, which means “Super Administrator.” The newly promoted user is now free to manipulate any content on the website.

2) Vulnerability in the messaging sub-system. This vulnerability exploits another use of potentially uninitialized variable $result_where_message_id in the messaging sub system. We show the relevant code in Figure 5.


if (isset($msg_view)) {
  if (!isNum($msg_view)) fallback("messages.php");
  $result_where_message_id="message_id=".$msg_view;
} elseif (isset($msg_reply)) {
  if (!isNum($msg_reply)) fallback("messages.php");
  $result_where_message_id="message_id=".$msg_reply;
}
... /* ~100 lines later */ ...
} elseif (isset($_POST['btn_delete']) || 
  isset($msg_delete)) { // delete message
  $result = dbquery("DELETE FROM ".$db_prefix.
   "messages WHERE ".$result_where_message_id. // BUG
   " AND ".$result_where_message_to);

Figure 5: An exploitable vulnerability in PHP-fusion 6.00.204.

Our tool warns about unsanitized use of $result_where_message_id. On normal input, the program initializes $result_where_message_id using a cascading if statement. As shown in the code, the author is very careful about sanitizing values that are used to construct $result_where_message_id. However, the cascading sequence of if statements does not have a default branch. And therefore, $result_where_message_id might be uninitialized on malformed input. We exploit this fact, and append
&request_where_message_id=1=1/*
The query string submitted on line 11-13 thus becomes:


DELETE FROM messages WHERE 1=1 /* AND ...

Whatever follows “/*” is treated as comments in MySQL and thus ignored. The result is loss of all private messages in the system. Due to the complex control and data flow, this error is unlikely to be discovered via code review or testing.

We reported both exploits to the author of PHP-fusion, who immediately fixed these vulnerabilities and released a new version of the software.

5 Related Work

5.1 Static techniques

WebSSARI is a type-based analyzer for PHP [7]. It uses a simple intraprocedural tainting analysis to find cases where user controlled values flow into functions that require trusted input (i.e. sensitive functions). The analysis relies on three user written “prelude” files to provide information regarding: 1) the set of all sensitive functions–those require sanitized input; 2) the set of all untainting operations; and 3) the set of untrusted input variables. Incomplete specification results in both substantial numbers of false positives and false negatives.

WebSSARI has several key limitations that restrict the precision and analysis power of the tool:

WebSSARI uses an intraprocedural algorithm and thus only models information flow that does not cross function boundaries.

Large PHP codebases typically define a number of application specific subroutines handling common operations (e.g., query string construction, authentication, sanitization, etc) using a small number of system library functions (e.g., mysql_query). Our algorithm is able to automatically infer information flow and pre- and post-conditions for such user-defined functions whereas WebSSARI relies on the user to specify the constraints of each, a significant burden that needs to be repeated for each source codebase examined. Examples in Section 3.3 represent some common forms of user-defined functions that WebSSARI is not able to model without annotations.

To show how much interprocedural analysis improves the accuracy of our analysis, we turned off function summaries and repeated our experiment on News Pro, the smallest of the five codebases. This time, the analysis generated 19 error messages (as opposed to 8 with interprocedural analysis). Upon inspection, all 11 extra reports are false positives due to user-defined sanitization operations.
WebSSARI does not seem to model conditional branches, which represent one of the most common forms of sanitization in the scripts we have analyzed. For example, we believe it will report a false warning on the following code:
```
if (!is_numeric($_GET['x'])) 
    exit;
mysql_query(``... $_GET['x'] ...'');
```
Furthermore, interprocedural conditional sanitization (see the example in Section 3.1.6) is also fairly common in codebases.
WebSSARI uses an algorithm based on static types that does not specifically model dynamic features in scripts. For example, dynamic typing may introduce subtle errors that WebSSARI misses. The include statement, used extensively in PHP scripts, dynamically inserts code to the program which may contain, induce, or prevent errors.

We are unable to directly compare the experimental results due to the fact that neither the bug reports nor the WebSSARI tool are available publicly. Nor are we able to compare false positive rates since WebSSARI reports per-file statistics which may underestimate the false positive ratio. A file with 100 false positives and 1 real bug is considered to be “vulnerable” and therefore does not contribute to the false positive rate computed in [7].

Livshits and Lam [9] develop a static detector for security vulnerabilities (e.g., SQL injection, cross site scripting, etc) in Java applications. The algorithm uses a BDD-based context-sensitive pointer analysis [19] to find potential flow from untrusted sources (e.g., user input) to trusting sinks (e.g., SQL queries). One limitation of this analysis is that it does not model control flow in the program and therefore may misflag sanitized input that subsequently flows into SQL queries. Sanitization with conditional branching is common in PHP programs, so techniques that ignore control flow are likely to cause large numbers of false positives on such code bases.

Other tainting analysis that are proven effective on C code include CQual [4], MECA [21], and MC [6, 2]. Collectively they have found hundreds of previously unknown security errors in the Linux kernel.

Christensen et. al. [3] develop a string analysis that approximates string values in a Java program using a context free grammar. The result is widened into a regular language and checked against a specification of expected output to determine syntactic correctness. However, syntactic correctness does not entail safety, and therefore it is unclear how to adapt this work to the detection of SQL injection vulnerabilities. Minamide [10] extends the approach and construct a string analyzer for PHP, citing SQL injection detection as a possible application. However, the analyzer models a small set of string operations in PHP (e.g., concatenation, string matching and replacement) and ignores more complex features such as dynamic typing, casting, and predicates. Furthermore, the framework only seems to model sanitization with string replacement, which represents a small subset of all sanitization in real code. Therefore, accurately pinpointing injection attacks remains challenging.

Gould et. al. [5] combines string analysis with type checking to ensure not only syntactic correctness but also type correctness for SQL queries constructed by Java programs. However, type correctness does not imply safety, which is the focus of our analysis.

5.2 Dynamic Techniques

Scott and Sharp [15] propose an application-level firewall to centralize sanitization of client input. Firewall products are also commercially available from companies such as NetContinuum, Imperva, Watchfire, etc. Some of these firewalls detect and guard against previously known attack patterns, while others maintain a white list of valid inputs. The main limitation here is that the former is susceptible to both false positives and false negatives, and the latter is reliant on correct specifications, which are difficult to come by.

The Perl taint mode [12] enables a set of special security checks during execution in an unsafe environment. It prevents the use of untrusted data (e.g., all command line arguments, environment variables, data read from files, etc) in operations that require trusted input (e.g., any command that invokes a sub-shell). Nguyen-Tuong [11] proposes a taint mode for PHP, which, unlike the Perl taint mode, not define sanitizing operations. Instead, it tracks each character in the user input individually, and employs a set of heuristics to determine whether a query is safe when it contains fragments of user input. For example, among others, it detects an injection if an operator symbol (e.g., “(”, “)”, “%”, etc) is marked as tainted. This approach is susceptible to both false positives and false negatives. Note that static analyses are also susceptible to both false positives and false negatives. The key distinction is that in static analyses, inaccuracies are resolved at compile time instead of at runtime, which is much less forgiving.

6 Conclusion

We have presented a static analysis algorithm for detecting security vulnerabilities in PHP. Our analysis employs a novel three-tier architecture that enables us to handle dynamic features unique to scripting languages such as dynamic typing and code inclusion. We demonstrate the effectiveness of our approach by running our tool on six popular open source PHP code bases and finding 105 previously unknown security vulnerabilities, most of which we believe are remotely exploitable.

Acknowledgement

This research is supported in part by NSF grants SA4899-10808PG, CCF-0430378, and an IBM Ph.D. fellowship. We would like to thank our shepherd Andrew Myers and the anonymous reviewers for their helpful comments and feedback.

References

[1]: A. Aiken, E. Wimmers, and T. Lakshman. Soft typing with conditional types. In Proceedings of the 21st Annual Symposium on Principles of Programming Languages, 1994.
[2]: K. Ashcraft and D. Engler. Using programmer-written compiler extensions to catch security holes. In 2002 IEEE Symposium on Security and Privacy, 2002.
[3]: A. Christensen, A. Moller, and M. Schwartzbach. Precise analysis of string expressions. In Proceedings of the 10th Static Analysis Symposium, 2003.
[4]: J. S. Foster, T. Terauchi, and A. Aiken. Flow-sensitive type qualifiers. In Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 1–12, June 2002.
[5]: C. Gould, Z. Su, and P. Devanbu. Static checking of dynamically generated queries in database applications. In Proceedings of the 26th International Conference on Software Engineering, 2004.
[6]: S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system and language for building system-specific, static analyses. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, Berlin, Germany, June 2002.
[7]: Y.-W. Huang, F. Yu, C. Hang, C.-H. Tsai, D. Lee, and S.-Y. Kuo. Securing web application code by static analysis and runtime protection. In Proceedings of the 13th International World Wide Web Conference, 2004.
[8]: X. Leroy, D. Doligez, J. Garrigue, and J. Vouillon. The Objective Caml system. Software and documentation available on the web, http://caml.inria.fr.
[9]: V. Livshits and M. Lam. Finding security vulnerabilities in Java applications with static analysis. In Proceedings of the 14th Usenix Security Symposium, 2005.
[10]: Y. Minamide. Approximation of dynamically generated web pages. In Proceedings of the 14th International World Wide Web Conference, 2005.
[11]: A. Nguyen-Tuong, S. Guarnieri, D. Greene, J. Shirley, and D. Evans. Automatically hardening web applications using precise tainting. In Proceedings of the 20th International Information Security Conference, 2005.
[12]: Perl documentation: Perlsec. http://search. cpan.org/dist/perl/pod/perlsec.pod.
[13]: PHP: Hypertext Preprocessor. http://www.php. net.
[14]: PHP usage statistics. http://www.php.net/ usage.php.
[15]: D. Scott and R. Sharp. Abstracting application-level web security. In Proceedings of the 11th International World Wide Web Conference, 2002.
[16]: Security space apache module survey (Oct 2005). http://www.securityspace.com/s_survey/ data/man.200510/apachemods.html.
[17]: Symantec Internet security threat report: Vol. VII. Technical report, Symantec Inc., Mar. 2005.
[18]: TIOBE programming community index for November 2005.
http://www.tiobe.com/tpci.htm.
[19]: J. Whaley and M. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, 2004.
[20]: A. Wright and R. Cartwright. A practical soft type system for Scheme. ACM Trans. Prog. Lang. Syst., 19(1):87–152, Jan. 1997.
[21]: J. Yang, T. Kremenek, Y. Xie, and D. Engler. MECA: an extensible, expressive system and language for statically checking security properties. In Proceedings of the 10th Conference on Computer and Communications Security, 2003.

1: Installed on over 23 million Internet domains [14], and is ranked fourth on the TIOBE programming community index [18].
2: Both vulnerabilities have been reported to and fixed by the PHP-fusion developers.
3: Sanitization is an operation that ensures that user input can be safely used in an SQL query (e.g., no unescaped quotes or spaces).
4: In general, in a dynamically typed language, a more precise static approximation in this case would be a sum (aka. soft typing) [1, 20]. We have not found it necessary to use type sums in this work.
5: So do function calls that exits the program, in which case we remove any ensuing statements and outgoing edges from the current CFG block. See Section 3.3.
6: Here we assume that a regular expression used to sanitize input in one context will have the same effect in another, which, based on our experience, is the common case. Our implementation now provides paranoid users with a special switch that ignores recorded answers and repeatedly ask the user the same question over and over if so desired.
7: For example, Utopia News Pro misused “[0-9]+” to validate some user input. This regular expression only checks that the string contains a number, instead of ensuring that the input is actually a number. The correct regular expression in this case is “^[0-9]+$”.
8: Information about the results, along with the source codebases, are available online at:
http://glide.stanford.edu/yichen/research/.
9: Database configuration variables such as $db_prefix accounted for 3 false positives, and information derived from the database queries and configuration settings (e.g. locale settings) caused the remaining 12.

This document was translated from L^AT_EX by H^EV^EA.

Last changed: 21 June 2006 jel

Static Detection of Security Vulnerabilities in Scripting Languages

Yichen Xie Alex Aiken Computer Science Department Stanford University Stanford, CA 94305 {yxie,aiken}@cs.stanford.edu

1 Introduction

2 Background

3 Analysis

3.1 Simulating Basic Blocks

3.1.1 Outline

3.1.2 Language

3.1.3 State

3.1.4 Locations and L-values

3.1.5 Expressions

3.1.6 Statements

3.1.7 Block summary

3.2 Intraprocedural Analysis

3.3 Interprocedural Analysis

4 Experimental Results

4.1 Case Study: Two Exploitable SQL Injection Attacks in PHP-fusion

5 Related Work

5.1 Static techniques

5.2 Dynamic Techniques

6 Conclusion

Acknowledgement

References

Static Detection of Security Vulnerabilities
in Scripting Languages

Yichen Xie Alex Aiken
Computer Science Department
Stanford University
Stanford, CA 94305
{yxie,aiken}@cs.stanford.edu