Bouwers > Eric > Blog: August 2006

Introducing eval-php

After spending a weekend in the woods, hearing a lot of information about the World Jamboree, I dived into the operators of PHP. Some of them could already be evaluated, but I wanted to add support for other operators as well. I have made a tool to check the evaluation and named it 'eval-php'. This tool is capable of evaluating a subset of PHP. It cannot handle control-flow (yet?), but it supports several operators on the integer, string, null, Boolean and float types. The following operators are supported:

A small note about the arithmetic operator 'mod': it has no support for floats until issue STR-630 is fixed.

This also means that the bitwise operators, error control operators and execution operators are not supported at this moment. But as said before, it supports a subset. Supporting the Error Control operator when there are no errors generated is a bit strange anyway.

It is fun to be able to evaluate things, but it is even more fun to see the things that are evaluated. So the tool produces evaluated output. This means that everything that is not within PHP-tags is output as-is. There is also support for the 'echo' statement, so this produces output too. I have put a small test-file online here. The output of eval-php is available here; it can be compared with the real output of PHP itself here (take a look at the source of the page).
I think it is all pretty fancy :)

Working on evaluation

I have been working on evaluation the past few days. Not the evaluation of the SoC, which is something I am going to do next Monday, but that of PHP. The reason for this evaluation can be found in the first blog of Monday.

There is currently support for arithmetic operators on the following types: integer, float, Boolean, null and string. Types can also be juggled to another type. It took some time to get this juggling right, especially the juggling from strings to numbers. The string is now more or less parsed within Stratego to extract the leading part that can be recognized as an integer or float. But it works :)
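The string-to-number juggling described above could be sketched roughly as follows. This is a Python illustration, not the Stratego code from the project; the function name `juggle_to_number` and the regular expression are assumptions made for the example, modeled on PHP's rule that the leading numeric part of a string decides its value and a string without such a part counts as 0.

```python
import re

# Optional leading whitespace, optional sign, digits with an optional
# fraction (or a bare fraction), and an optional exponent.
_NUMERIC_PREFIX = re.compile(
    r"\s*([+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)"
)


def juggle_to_number(text):
    """Return the int or float found at the start of text, or 0 if none."""
    match = _NUMERIC_PREFIX.match(text)
    if match is None:
        return 0
    prefix = match.group(1)
    # A decimal point or an exponent makes the result a float.
    if any(ch in prefix for ch in ".eE"):
        return float(prefix)
    return int(prefix)
```

With this sketch, a string such as "12abc" is juggled to the integer 12, while "abc" becomes 0.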

Speeding things up

A major performance boost was implemented today. The build and tests of the package took about 7 hours in the build farm, which is a lot, even by Stratego standards and even though it is built on 3 different platforms. The problem was the processing of the test-input. A lot of small inputs were processed by writing each one to a file, parsing the file and reading the result. This involves a lot of IO actions, so the inputs are not written to files anymore. This helps a little, but the major boost came from caching the parse tables. The parse-table is now opened once per process instead of once per test / input.
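The idea of opening the parse-table once per process can be sketched in a few lines. This is a Python illustration under assumptions, not the Stratego implementation: `open_parse_table`, `parse` and the table file name are hypothetical stand-ins, and the "table" is just a dictionary.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def open_parse_table(path):
    """Stand-in for an expensive load; runs once per distinct path."""
    return {"path": path}


def parse(source, table_path="php.tbl"):
    """Parse an in-memory input without touching the file system."""
    table = open_parse_table(table_path)  # served from the cache after the first call
    return (table["path"], source)
```

Every call to `parse` after the first reuses the cached table, which is where the bulk of the saving would come from when there are many small inputs.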

Splitting up

The project holds two libraries in its source. The first library is php-front and holds a parser, pretty-printer, a reflection part and a set of common strategies. The other library holds the strategies that are used within PSAT. The PSAT-part depends on php-front. The base library can be useful for more libraries and will be branched from psat soon. It will be a separate package called, how surprising, php-front. It will be available here. You can also find a link there to the page that will be the home for php-sat. They will both be filled soon.
Note that the name has slightly changed. This is to prevent confusion with possible tools that will statically analyze Perl, Python, Pascal, PL or Prolog. The other reason is that psat.org was already taken. All the names will hopefully be updated as soon as the packages are separated from each other.

Project deadline

The project deadline is today at 17.00 hours, at least here in Holland. This means that the project is due in a few minutes. My mentor needs to fill in the evaluation based on the code that I have now.

This deadline does not mean that the project will stop. I am satisfied with my results up until now, but I will continue to improve the code and the tool. A more elaborate report about the SoC and the progress of the project will probably follow in a few days.

Including files and hidden tasks

File inclusion within PHP can really get on your nerves. I can remember an occasion when it took me about an hour to find out why a file was not included. This was because PHP first looks in the working-directory and then in the current directory. If there is a file with the same name in both directories, the first one is included. If I had taken a look at the documentation, this would not have come as a surprise to me.

Why this paragraph about including files? Because this feature has been added to php-front. There is now a strategy available that uses the 'require' and 'include' statements within a PHP-file to find and parse other PHP-files. These PHP-files are added to the environment and can be accessed anywhere in the traversal.

The implementation of this inclusion strategy included the modeling of the different paths that PHP used. As mentioned above, PHP uses two different directories to search for files. The working directory, where the process was initiated, and the current directory, where the current file resides. These directories are the same if there is only one file to be processed.

I did not want to mess with the current-directory of the tool when including files, so the paths are kept internally. This requires some string manipulation to make up the right include-path, but it works. By keeping track of the include-path and working-directory within the application there is even support for the functions ini_set and chdir!
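The lookup order described above (working directory first, then the directory of the current file) can be sketched without changing the process's own directory. This is a Python illustration under assumptions: `resolve_include` and its parameters are hypothetical names, and the configured include-path is left out to keep the example short.

```python
import os


def resolve_include(filename, working_dir, current_dir, exists=os.path.isfile):
    """Return the first existing candidate path, or None if nothing matches.

    The directories are plain strings tracked by the caller, so the
    process never has to chdir while resolving includes.
    """
    for base in (working_dir, current_dir):
        candidate = os.path.normpath(os.path.join(base, filename))
        if exists(candidate):
            return candidate
    return None
```

Passing the `exists` check as a parameter is a design choice for the sketch only; it makes the lookup easy to exercise against a fake file list.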

All this could not have been done without moving things around and adding some small but useful strategies. A separate section that holds common strategies has been added to the library, the parsing has been moved to the library, and the environment can now store complete ASTs. All of these things are small and easily done. But unfortunately, many small things take up a lot of time.

The mechanism of including files can be improved by adding partial evaluation of PHP-code to the library. This will add annotations with known values to terms. By using this, and some constant-propagation, the following will succeed:

<?php
    $path = './some/path/';

    require_once $path.'filename.php';
?>

This is commonly used within projects so this should be supported. So on to the simple evaluation and constant-propagation!
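How constant propagation would make such an include resolvable can be sketched on a tiny expression tree. This is a Python illustration under assumptions: the tuple-based terms and the name `known_value` are invented for the example and do not reflect the ATerm representation or the annotations used in php-front.

```python
def known_value(expr, env):
    """Return the statically known string value of expr, or None if unknown.

    expr is a tuple: ("str", text), ("var", name) or ("concat", left, right).
    env maps variable names to values recorded by earlier assignments.
    """
    kind = expr[0]
    if kind == "str":
        return expr[1]
    if kind == "var":
        return env.get(expr[1])
    if kind == "concat":
        left = known_value(expr[1], env)
        right = known_value(expr[2], env)
        if left is None or right is None:
            return None  # one side is unknown, so the whole path is unknown
        return left + right
    return None


# $path = './some/path/'; require_once $path.'filename.php';
env = {"path": "./some/path/"}
target = ("concat", ("var", "path"), ("str", "filename.php"))
```

Here `known_value(target, env)` yields the full file name, while an expression that uses an unassigned variable stays unknown and the include would be skipped.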

One week later

I flew back from Spain yesterday evening, no problems with my luggage, and spent this day getting everything organized again. I had planned on going to Lowlands this weekend, but I have to skip it this year. I hope to join the party again next year.

But what have we done the past week? I have added a few patterns to the code:

The last one needs some work though, the rest of the function-names should be added to the list.

I have also spent some time on improving the pretty-printer. I have added unit-tests for the different parts (expressions, statements, documents) and made sure that things are consistent. An example of this is that every construct that has a list of expressions is printed in the same way. The expressions in the list are separated with a comma and a space. A small example:

  // used to be:
  array($foo,$bar,$fred);

  // is now:
  array($foo, $bar, $fred);

I have also made the strategy that extracts the safety-type build a default value of 'Safe()'. This makes the application more conservative because if it has no knowledge about a variable it will consider it to be safe. This should prevent a lot of false positives.

The last item of interest is a small document with some considerations about constant propagation and including files. The response to the user should show the code that the user has entered. So constant propagation should not be done on the current ATerm, but the information should be added to it. The only thing I could come up with is to add the information in an annotation. But if you have a better idea don't hesitate to comment.

Relaxed working

It is not easy to get some work done while there is a private pool nearby, but I managed to get some things done. I have added the syntactic rules that make sure that an end-tag ends a line-comment. This involved another rule that recognizes nothing as a constructor. The follow-restrictions are really important in this case, but I think I managed to get them right.

I have also added the right syntax for backticks. This syntax is quite similar to the syntax for double-quoted strings, so the productions that make up the double-quoted strings could be reused to make the syntax for backticks.

A small note

I have added two more patterns to the correctness category. They have the codes C002 and C003. The last one is also mentioned in the PHP manual, but it can't hurt to point this out to the programmer. The other thing that I have added is the test-suite-file for the tests of phase 4. This suite contains only two tests for now, but more will be added during my vacation.
"Until next time, take care of yourself and each other."

Working on other things

I have done interesting things today, but they did not all involve psat. What I did for psat was add a strategy "analyse-safetylevel" to replace the "annotate-sources" strategy. This strategy is set up in a different way to make it easier to add support for other constructs.

What I have also done for psat is changing the way safety types are combined. My first thought was to combine safety types if they have the same level. Variables can then have the safety types "escaped-html" and "escaped-slashes" at the same time. This is something one would expect.
However, combining safety types of other levels would result in variables that can have both the safety types "formatted-string" and "encoded-string", or even "object-type" and "integer-type"! This is definitely not desirable. So the only types that can be combined from now on are the types that are on the "escaped-something"-level. Writing this makes me realize that I will have to add some more terminology.
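The combination rule can be sketched as follows. This is a Python illustration under assumptions: the names `is_escaped_level` and `combine` are hypothetical, the level is recognized purely by the "escaped-" prefix of the type name, and letting a non-combinable type replace the existing ones is a guess made for the example, since the behaviour in that case is not described here.

```python
def is_escaped_level(safety_type):
    """True for types on the "escaped-something" level, judged by name."""
    return safety_type.startswith("escaped-")


def combine(current, new_type):
    """Combine a set of safety types with a newly derived type.

    Only types on the escaped level are merged. In every other case the
    new type replaces the old ones (an assumption of this sketch).
    """
    if is_escaped_level(new_type) and all(is_escaped_level(t) for t in current):
        return frozenset(current) | {new_type}
    return frozenset({new_type})
```

So a variable can end up with both "escaped-html" and "escaped-slashes", but never with both "formatted-string" and "encoded-string".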

You might want to know what other activities I did today. If you are not interested you might want to skip this paragraph. I have had my first experience with compiling and installing PHP on my Linux machine, a nice little laptop. Quite nice, but the really cool activity was a pair-programming session with my mentor. The commit-message of the result can be found here. And the module with my name in it is located over here. It is always nice to have your name in a software package that you are using a lot.

I have also finished setting up my laptop as a development environment. This had to be done because I will be flying to Spain on Saturday. One and a half weeks of nice weather, a private pool and a lot of relaxing. I will be working on some things during my vacation, so don't panic :)

Straight to phase 4

The tool has been upgraded from phase 2 to phase 4. This is the current phase that needs to be implemented. I will give a short sketch of what the tool supports at this moment, besides a handful of common patterns.

The tool is configured with a text-file of which some sections were explained yesterday. A default configuration file can be found here. This is definitely not complete, but it already produces useful results.
Two features are added to the configuration file. The first feature makes it possible to define a precondition for the language-constructs 'echo', 'die', 'exit' and 'print'. The syntax for this is:

 construct: construct-name (  precondition  )

The second feature is the possibility to define functions with a default level. This can be done in the '[function result]' section. The functions that are specified there are assumed to always return values with the specified safety-type. This is not limited to built-in functions; user-defined functions can also be assigned a default safety-type result.

But which results are produced? The following example shows two things that are supported right now:

<?php

      echo "hello ", $_GET['name'];   //is flagged

      print $_POST['param'];  //is flagged
?>

Keep in mind that the results depend on the configuration. These results will appear when the default configuration is used. A more precise configuration will give more precise results.

From phase one into phase two

Major leap today. The implementation of phase one is ready. This phase consisted of parsing a configuration file and using this configuration to identify tainted sources. A typical configuration file for tainted sources will look like this:

 [tainted sources]
     array: _GET
     array: _POST
     function: file_get_contents

But then there will be a lot more entries, of course.
The configuration file is used because people should be able to easily tweak the tool. This is why I have also included the option to specify a certain level for each source. A little example:

 [tainted sources]
     array: _GET   level: escaped-slashes,escaped-shell

This means that the variables coming from the $_GET-array return values in which both slashes and shell-characters are escaped. These safety-levels are used in the static analysis.
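Reading such a section could look roughly like the sketch below. It is a Python illustration, not the parser used in the tool: the function name `parse_tainted_sources` and the regular expressions are assumptions, and an entry without a 'level:' part is given an empty list because the default level is not stated here.

```python
import re

SECTION = re.compile(r"^\[(.+)\]$")
ENTRY = re.compile(r"^(array|function):\s*(\S+)(?:\s+level:\s*(.+))?$")


def parse_tainted_sources(text):
    """Map (kind, name) to the list of safety levels given for that source."""
    sources = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = SECTION.match(line)
        if header:
            # Only entries under [tainted sources] are collected.
            in_section = header.group(1) == "tainted sources"
            continue
        if not in_section:
            continue
        entry = ENTRY.match(line)
        if entry is None:
            raise ValueError(f"cannot parse entry: {line!r}")
        kind, name, levels = entry.groups()
        sources[(kind, name)] = (
            [level.strip() for level in levels.split(",")] if levels else []
        )
    return sources
```

Fed the examples above, the `_GET` entry would carry the two levels "escaped-slashes" and "escaped-shell", and the other entries would carry none.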

That was phase one, on to phase two. This is also a configuration file-issue, namely the identification of sensitive sinks. They can be specified in almost the same way as tainted-sources.

[sensitive sinks]
     function: foo ( safe )
     function: bar ( int-type, esc-h )

This configuration file means that the function "foo" expects one argument and this argument should have the safety-type "safe". Function "bar" has two parameters which should be of "integer-type" and "escaped-html" level. For each parameter the safety-type should be specified. Parameters in which the safety-type does not matter can be assigned the type 'unsafe'.
One can also specify that a certain parameter needs to have one of several types, or that it must have multiple types at once.

[sensitive sinks]
     function: foo ( esc-h || esc-sh , esc-h && esc-sh)

This means that the first parameter needs to have either the type escaped-html, or the type escaped-shell. The second parameter needs to have both types, and thus levels, of safety.
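Checking arguments against such a specification can be sketched as follows. This is a Python illustration under assumptions: the function names are hypothetical, '&&' is assumed to bind tighter than '||', and the check is plain set membership, so it ignores any ordering between safety levels.

```python
def parse_precondition(spec):
    """Split a parameter spec into alternatives, each a set of required types."""
    return [
        {part.strip() for part in alternative.split("&&")}
        for alternative in spec.split("||")
    ]


def satisfies(types, spec):
    """True if the given set of types fulfils at least one alternative."""
    return any(required <= set(types) for required in parse_precondition(spec))


def check_arguments(sink_spec, argument_types):
    """Check each argument's types against the matching parameter spec."""
    params = [param.strip() for param in sink_spec.split(",")]
    if len(params) != len(argument_types):
        raise ValueError("number of arguments does not match the sink")
    return [satisfies(types, param) for types, param in zip(argument_types, params)]
```

For the sink above, an argument that is only escaped-html passes the first parameter but fails the second, which requires both types.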
There are two things to consider when writing such a configuration file. The first one is that it does not allow functions without parameters. This is not that strange, because what is the use of a sink into which nothing can be thrown? The other thing is that the and- and or-operator should be used with care. Since the levels that are specified represent the minimal level of safety, the safety-types that are used in operators should be of the same level. The following represents something that must be both safe and unsafe, which can only be true when the variable given has type 'safe'.

[sensitive sinks]
     function: foo ( safe && unsafe )

Tomorrow I will be adding some more configuration details for the sensitive sinks. Some language constructs must be configurable and it does not take very long to add this. I will also be working on checking the safety-types of parameters given to sensitive sinks. So the first useful results should appear soon.