Reflection support

Sometimes you have to reflect on something, and that is now even possible in php-front. The inspection of functions during a traversal was already possible, but the first implementation also retrieved all the functions within classes. Since these functions are not in the global scope, this could result in some serious errors. This is now fixed, and support has been added for classes and interfaces.

To be able to do reflection in an analysis you have to create an environment and initialize it so that it contains all the functions. This is of course captured in some strategies, so it is very easy. The only thing that you have to do is choose between an environment for PHP4 or PHP5. The grammar is split up in this way and I found it cleaner to also separate the environment. This is also useful when the internal functions are added to the environments, since PHP5 has a few more of those.

The reflection library is constructed in the same way as the dryad-library. It is actually set up in an OO way: there is a notion of a PHPEnvironment, which is abstract, and a PHPFunction, which is also abstract. Each version has its own kind of environment and function, which makes it easier to reason about.

The only problem is the implementation of all this. It requires very careful thinking and precise notation. It is very easy to confuse some strategies, and this can result in some long debug sessions. Luckily each part of the reflection is carefully tested, but that is normal, right ;)

Results from the meeting

So this afternoon I talked to my mentor. Several things were discussed and all of them were positive. The structure of the code for the project was good and organized; there were only some minor points about the formatting of the code. Really not a bad result. He also had a cool idea about the organization of the patterns. We might be able to organize it in such a way that patterns can be plugged in at will, which will really boost the extensibility of the tool. But we will have a look at that later.

Then there was the issue of the type-states. After the response of Christian yesterday I have come up with a set of safety levels. Each variable in the program will be assigned such a safety level. When a computation is made, the result will get a level of safety which is normally the minimum of the levels of the variables involved. The sensitive sinks will get a precondition for each parameter. This precondition describes the minimal level of safety needed for that parameter. When a variable does not meet the precondition, the function will be flagged by the tool. Simple, isn't it :)
Since the preconditions will not be equal for all applications, and some functions can be trusted sometimes, the lists of sensitive sinks and tainted-data sources need to be configurable. I will be writing a small syntax for three ini-files that allow everyone to tweak the application.
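To make the idea of safety levels a bit more concrete, here is a minimal sketch in Python (not the Stratego code of the tool). The level names, the choice of sinks and the required levels are all assumptions made up for this illustration; the real levels and the ini-file syntax are not described in this post.

```python
from enum import IntEnum


class Safety(IntEnum):
    """Hypothetical safety levels, ordered from least to most safe."""
    UNSAFE = 0
    ESCAPED = 1
    SAFE = 2


def combine(*levels):
    """Level of a computed value: the minimum of the levels involved."""
    return min(levels)


# Hypothetical preconditions: for each sensitive sink, the minimal level
# required for each parameter (by position). In the tool these would come
# from a configuration file rather than being hard-coded.
PRECONDITIONS = {
    "mysql_query": [Safety.ESCAPED],
    "eval": [Safety.SAFE],
}


def check_call(function, arg_levels):
    """Return the positions of arguments that do not meet the precondition."""
    required = PRECONDITIONS.get(function, [])
    return [
        position
        for position, (need, got) in enumerate(zip(required, arg_levels))
        if got < need
    ]


# A value built from a safe and an unsafe variable is unsafe.
query_level = combine(Safety.SAFE, Safety.UNSAFE)
flagged = check_call("mysql_query", [query_level])
```

A function that has no entry in the table is simply not treated as a sink, so `check_call` returns an empty list for it.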

The last thing that we did was making a start with the reflection-library. When traversing the tree of a program it is useful to have access to the functions and classes that are defined. Since some of the functions are defined by the user, there should be a way to access the implementation of the functions. This can be done by traversing the tree every time a function is needed. You can imagine that this is expensive. So the trick is to traverse the tree once and build up hash-tables for things that one wants to access frequently.
So it is now possible to get the AST of a function by its name. When a function call is made, the implementation of the called function can be retrieved on the spot. I will have to add support for classes and some more strategies to get interesting properties of the ASTs, but I now know how to do it. A lot of new and fun stuff to do!
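The traverse-once-and-look-up-later trick can be sketched as follows. This is Python rather than Stratego, and the tree is modelled as nested tuples of the form (constructor, children...). The constructor names "Program", "FunctionDecl", "Echo" and so on are assumptions for the example, not the constructors of php-front.

```python
def build_function_table(tree, table=None):
    """Walk the tree once and map each function name to its declaration."""
    if table is None:
        table = {}
    if isinstance(tree, tuple) and tree:
        if tree[0] == "FunctionDecl":
            # Hypothetical node shape: ("FunctionDecl", name, params, body)
            table[tree[1]] = tree
        for child in tree[1:]:
            build_function_table(child, table)
    elif isinstance(tree, list):
        for child in tree:
            build_function_table(child, table)
    return table


program = ("Program", [
    ("FunctionDecl", "greet", ["name"], [("Echo", ("Var", "name"))]),
    ("Expr", ("Call", "greet", [("String", "world")])),
])

# One traversal up front ...
functions = build_function_table(program)

# ... and afterwards every call site is a cheap dictionary lookup.
declaration = functions["greet"]
```

The dictionary plays the role of the hash-table: the cost of the traversal is paid once, no matter how many calls are inspected afterwards.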

Adding patterns

Today I have added the documentation of eleven bug patterns: C001, O000, O001, O002, O003, S001, S002, S003, S004, S005 and S006. You can find the description of each of the patterns here. Remember that each category has its own prefix, so you will have to look within the directories.

I have also added the tests and implementation for patterns C001, O000 and O001. The patterns that are not implemented yet have [Not implemented] in their description. They will be added (hopefully soon).

There was also some disappointment today. I received a reply from Christian, whom I had mailed yesterday. They cannot share the research that they have done regarding the list with preconditions and type states. The algorithm has been put in the products of Armorize Technologies. Sharing the crucial part of your products is not a good idea for a commercial company, it seems :)

But he gave me some pointers on how to start, so that is what I will be doing tomorrow morning. I will be having a meeting with my mentor about the set-up of the tool, but I think it is okay.

Testing 3 4

I have added support for SUnit tests to the project today. This also includes some tests for the two patterns that were already in the tool. I have also added tests and the implementation for a style-pattern. This one has the code, you can probably guess it, S000.

It took me some time to get all the tests working, and the compilation of the tools is not that fast. So I have the feeling that I have not been making that much progress the last few days. But adding patterns and such should go a lot faster now because there is a method to do it. So I can focus on the implementation of the patterns instead of how to add them.

I have also sent an email to the authors of the paper I mentioned yesterday. I hope to hear from them soon because I am reluctant to start the implementation of the phases. They should have a lot of useful information which I can use. I will wait until Friday and otherwise I will just have to start with the phases. So I will be adding patterns to the tool as fast as possible until Friday :)

A long one

After a whole day of doing research I just had to do some coding. So that is what I did today. I have added the basic stuff to add 'bug patterns' to psat. This should not have taken long because the detection of these patterns can be very simple. I could have done this by simply dumping the rule in a file and being done with it. The rule is certainly put into a file, but I have also added a directory structure within the library to make sure that the patterns stay organised.

As my mentor mentioned, the patterns should be organised. It is natural to put each pattern into a certain category. The following categories seem to be useful to start with:
  1. Style
  2. Correctness
  3. Performance
  4. Malicious code vulnerability
  5. Information leak
The first three are straightforward. The fourth category is mostly dedicated to the initial goal of PSAT. The last category will hold bug patterns that can expose data which is not intended for normal users and should not be shown in a production environment. Each category will get its own directory and each pattern will get its own file. Each pattern will also get a unique code which holds the category and a sequence number.

The first bug pattern to be in PSAT is using the If-construct to check the existence of a variable. This pattern falls into the category of Correctness and therefore has the code C000. I know it is a bit optimistic, reserving space for 1000 bug patterns in one category, but you can never be sure :)
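The numbering scheme is easy to write down as a small sketch. This is Python for illustration only. The prefixes C and S are the ones that appear in these posts; the prefixes of the other categories are left out because they are not stated here. The directory names and the `.str` file extension are assumptions of the sketch, not the actual layout of the library.

```python
from pathlib import PurePosixPath

# Only the prefixes that are named in the posts; others are not guessed.
PREFIXES = {"Correctness": "C", "Style": "S"}


def pattern_code(category, number):
    """Build a code such as C000: category prefix plus a 3-digit number."""
    if not 0 <= number <= 999:
        raise ValueError("pattern number must fit in three digits")
    return f"{PREFIXES[category]}{number:03d}"


def pattern_path(category, number):
    """Hypothetical location: one directory per category, one file per pattern."""
    return PurePosixPath(category.lower()) / f"{pattern_code(category, number)}.str"


first_pattern = pattern_code("Correctness", 0)
first_path = str(pattern_path("Correctness", 0))
```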

I also wanted to write something about the categories of the functions. While I was reviewing them I thought about how different functions need different escapes. A query can hold HTML-characters but needs to have escapes for all the quotes to prevent SQL-injection. Data that is sent to the user should escape slashes, but also HTML-tags. So I will have to go over the collected functions and give each sensitive sink a certain level of safety needed. This is the same approach as the one taken in [1] and it seems to work there. They have implemented this in a tool called WebSSARI, but the site of this tool seems to have been taken down. So I will try to get in contact with the authors of the tool, but it probably means a few hours of research.

1: Yao-Wen Huang, Fang Yu, Christian Hang, Chung-Hung Tsai, D. T. Lee, and Sy-Yen Kuo. Verifying web applications using bounded model checking. In DSN, 2004.

Doing research

I have to do some research before I can start with the different phases of PSAT. Some of that research is relaxing and fun to do, another part is quite annoying. Let's start with the first part. Since I want to add the possibility of providing feedback about code smells, I have to find code smells to give feedback about. Besides following the topic on GoT, I started catching up on my subscription to PHPArchitect. I still had to read four issues, now two. You might wonder why this is fun to do. Well, I have a balcony with just enough space for a hammock and I have printed the issues. You can figure out the rest yourself.

But there was also a more annoying part. I have to divide the internal functions of PHP into three major categories:
  1. Functions that can return tainted data
  2. Functions that can untaint data
  3. Functions that are sensitive sinks
See the Terminology section of the Phase description for some explanation about the terms.
When I was going over the list I also made a list of functions whose information should not go to the user, such as functions that retrieve all kinds of information about the system.
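Such a classification could be stored as a simple table. The sketch below is Python, and the handful of entries are my own illustrative picks of well-known PHP functions, not the actual list that was compiled; a function may also belong to more than one category, which is why each entry holds a set.

```python
from enum import Enum


class Category(Enum):
    TAINT_SOURCE = "can return tainted data"
    SANITIZER = "can untaint data"
    SENSITIVE_SINK = "sensitive sink"
    INFO_LEAK = "reveals information about the system"


# Illustrative entries only; the real list covers thousands of functions.
CLASSIFICATION = {
    "file_get_contents": {Category.TAINT_SOURCE},
    "htmlspecialchars": {Category.SANITIZER},
    "mysql_query": {Category.SENSITIVE_SINK},
    "system": {Category.SENSITIVE_SINK},
    "phpinfo": {Category.INFO_LEAK},
}


def functions_in(category):
    """Return the names of all functions in a category, sorted by name."""
    return sorted(
        name for name, categories in CLASSIFICATION.items()
        if category in categories
    )


sinks = functions_in(Category.SENSITIVE_SINK)
```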

It is interesting to check out which functions PHP has, but it becomes less interesting when there are over 3500(!) internal functions. So I spent my day reading function descriptions, but I have at least seen all functions. I will explain some more things about the categories tomorrow. I will also add some stuff to the tool that will make it actually useful :)

And what have we done today?

This morning I worked on the actual psat-tool. It parses a file (using parse-php), applies a strategy to the ATerm and pretty-prints it (using pp-php). The strategy it now applies is:
strategies
dangerous-variables = topdown(try(matchthing))

rules
matchthing: Simple(_) -> Simple("foo")

So what does this do? It matches on a Simple-node and gives it the name 'foo'. This means that every variable ('$name') is renamed to '$foo'. Not really useful, actually quite bad, but it shows the power of what can be done. Two actual lines of code and all variables are renamed.
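For readers who do not know Stratego, the same idea can be imitated in Python. This is only a sketch of how a top-down traversal with a "try" wrapper behaves, not how Stratego implements it. Terms are modelled as nested tuples (constructor, children...); "Simple" comes from the rule above, while the "Echo" and "Variable" constructors are assumptions for the example.

```python
def try_(rule):
    """Apply the rule; if it fails (returns None), keep the term unchanged."""
    def apply(term):
        result = rule(term)
        return term if result is None else result
    return apply


def topdown(strategy):
    """Apply the strategy to a term first, then to all of its children."""
    def apply(term):
        term = strategy(term)
        if isinstance(term, tuple):
            return (term[0],) + tuple(apply(child) for child in term[1:])
        if isinstance(term, list):
            return [apply(child) for child in term]
        return term
    return apply


def matchthing(term):
    """Simple(_) -> Simple("foo"); returns None when the rule does not match."""
    if isinstance(term, tuple) and len(term) == 2 and term[0] == "Simple":
        return ("Simple", "foo")
    return None


dangerous_variables = topdown(try_(matchthing))

tree = ("Echo", [("Variable", ("Simple", "name")),
                 ("Variable", ("Simple", "other"))])
renamed = dangerous_variables(tree)
```

Without the `try_` wrapper the traversal would have to deal with a failing rule at every node that is not a Simple-node, which is nearly all of them.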

This rule will be deleted in the next commit because it is not useful. But what is useful then? I have asked the people at GoT and gotten some reactions. Still hoping for more feedback of course.

The other thing I worked on today was the specification of the phases. I had a plan two posts back, but I rewrote it a bit after some feedback from my mentor. The result can be read here.

The last thing I did today, besides writing this entry, was making the pretty-printer simpler. I had a strategy that rewrote a list of statements, but the framework could do this for me. This cleaned up about 30 lines of code, which is always nice :)

Adjusting the plan

Today was filled with some bug fixing and new ideas. I have fixed a few small things in the pretty-printer that caused the failure of test-cases. I admit, I was a bit optimistic yesterday. The results of the round-trip tests are looking very cool now; if you want to see a lot of sunglasses I suggest you take a look.

The new idea came from Martin in his reply to the phases-description (which I will rewrite/improve tomorrow). The idea was to add other indicators for bugs to psat. Just simple things that can help developers to improve their code. One example of this is the usage of a variable that is not declared. This can create great problems if register_globals is on. Another example comes from the PEAR coding standard for including files. Adding these useful remarks would turn PSAT into a PHP-equivalent of FindBugs, which has a long list with descriptions. Some of them are also useful for PHP.

But I could not find lists of specific code smells for PHP. There are some general code smells that can be used, but I think there are more things specifically for PHP. So if you have any ideas about what a PHP-programmer should not do, or tricks that can improve the performance of PHP-scripts, please let me know.

Pretty problems

Today was a wonderful and, by Dutch standards, very hot day. The fan in my room was pointing at my computer instead of me to keep the temperature of the processor at a decent level. So what has been done in this heat for psat? The pretty printer is somewhat finished! If you take a look at the parse-results you will see a lot of cool smileys. There are just some minor problems to be solved: a trailing slash at the end of inlineHtml and some diffs that do not go as planned. Some of the constructors could also be printed prettier, but this is something that can always be done.

I decided to split the pretty-printer into three different tools, just as the parser is: one where you can specify the release and two tools for the specific releases. Consistency is always nice. I have also added constructors and pretty-print rules for the alternative syntax. The reason for this is that otherwise the same constructor could have children of different formats, which introduces if-then clauses in the rewrite rules that are not really intuitive. With specific constructors I could also print the alternative syntax in the same way.

The next phase(s)

The round trip has been added to the test-script. The new smileys should be popping up soon in the parse-results page.

Since the pretty printing is coming along fine, I considered the phases for PSAT itself. I have also posted this to the psat-dev list, but it can of course also be discussed here. Any comments or suggestions are most welcome.

Phase 1:
Identify and annotate the variables that come from the user with 'un-safe'
This includes the research of which variables can be altered by the
user. Some info

Phase 2:
Identify and annotate the functions that can cause vulnerabilities
to occur if an 'un-safe' variable is used.

Phase 3:
Generate a report in some form to show the user the possible vulnerability.
At the end of this phase the following should give useful info:
 echo $_GET['name'];
The design of the report is also considered in this phase.

Phase 4:
Identify and annotate the functions that make variables safe for use
within the 'un-safe' functions. The annotation should also be added to the variables used and propagated.

Phase 5:
Add support for assignment variables and the (simple) propagation thereof

Phase 6:
Add support for reference variables and the (simple) propagation thereof

Phase 7:
Add support for the propagation within and coming from (user-defined) functions

Phase 8:
Add support for the propagation within and coming from objects

Notice that the phases are built upon each other. The base-foundation
is made in the first four phases, the simplest case. The other phases
each add a layer of complexity. Each phase should result in a working
tool that supports the mentioned constructions.
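
To give an impression of what the first phases plus simple assignment propagation (phase 5) amount to, here is a small sketch in Python. It works on a flat list of statements instead of a real PHP tree, and the statement shapes ("assign", target, sources) and ("echo", variable) are assumptions made for the illustration. The superglobal names are real PHP names, but which variables count as user input is exactly the research mentioned in phase 1.

```python
# Variables that come from the user; this short set is only an example.
USER_INPUT = {"$_GET", "$_POST", "$_COOKIE"}


def find_unsafe_echoes(statements):
    """Return the indexes of echo statements that print an 'un-safe' variable."""
    unsafe = set(USER_INPUT)
    findings = []
    for index, statement in enumerate(statements):
        if statement[0] == "assign":
            _, target, sources = statement
            # Simple propagation: the target is un-safe if any source is.
            if any(source in unsafe for source in sources):
                unsafe.add(target)
            else:
                unsafe.discard(target)
        elif statement[0] == "echo" and statement[1] in unsafe:
            findings.append(index)
    return findings


script = [
    ("assign", "$name", ["$_GET"]),      # $name = $_GET['name'];
    ("assign", "$greeting", ["$name"]),  # $greeting = "Hello " . $name;
    ("assign", "$title", []),            # $title = "Welcome";
    ("echo", "$greeting"),               # flagged
    ("echo", "$title"),                  # fine
    ("echo", "$_GET"),                   # flagged: echo $_GET['name'];
]
report = find_unsafe_echoes(script)
```

Functions that make a variable safe again (phase 4), references, user-defined functions and objects are all left out of this sketch.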

Back from the woods

Great weather, great terrain and nobody has become ill from my cooking: a successful trip. Now that I am back the fun is starting again. The pretty-printer can handle all literals such as variables, strings and operators. There is some support for expressions, but more will be added tomorrow.

The test-script will also be expanded with a pretty-print, parse, pretty-print and diff round trip. The difference between the original pretty-printed version and the parsed-and-pretty-printed version should of course be totally empty. This will show the 'correctness' of the pretty-printer. If everything goes according to plan then this will also be finished tomorrow. The script that is...
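
The round trip itself is a small recipe, sketched here in Python. The real script calls the parse-php and pp-php tools; in this sketch the parser and pretty-printer are passed in as plain functions, and the toy ones used below (splitting on whitespace and joining with single spaces) are stand-ins invented for the example.

```python
import difflib


def round_trip_diff(source, parse, pretty_print):
    """Pretty-print, parse, pretty-print again, and diff the two outputs."""
    first = pretty_print(parse(source))
    second = pretty_print(parse(first))
    return list(difflib.unified_diff(
        first.splitlines(), second.splitlines(),
        "first", "second", lineterm="",
    ))


def toy_parse(text):
    """Toy stand-in for the parser: split the text into tokens."""
    return text.split()


def toy_pretty_print(tokens):
    """Toy stand-in for the pretty-printer: one space between tokens."""
    return " ".join(tokens)


def broken_pretty_print(tokens):
    """A printer that adds output of its own, so the round trip is not stable."""
    return " ".join(tokens) + " ;"


stable = round_trip_diff("echo   $name ;", toy_parse, toy_pretty_print)
unstable = round_trip_diff("echo   $name ;", toy_parse, broken_pretty_print)
```

An empty diff does not prove that the printed code means the same as the original, only that the printer produces something the parser reads back to the same output, hence the quotes around 'correctness'.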

Tools!

What is the thing that every programmer wants? Tools of course! The current revision 74 has a few tools that let you parse php-files to the ATerm-format. 'parse-php' is a common tool with an option to choose the release you want. For everyone who wants to have a separate tool for each version we have 'parse-php4' and 'parse-php5'.

The other tool that is available is 'pp-php'. This tool will pretty-print a php-file in ATerm format to an actual PHP-file. This will also be tested with the distribution files, but it is not yet finished. It supports variables and operators. The rest will be supported as soon as I come back. Yes I'm going away for a couple of days to cook for a group of girl-scouts. So no ATerms, SDF or command-line tools for a few days.

Version 5 support

Well, that did not seem very hard. There were some minor problems with the new features in version 5, but as of revision 63 the files from the distribution of PHP 5 are also parsed correctly. There are no ambiguities and the SDF passes almost 500 unit-tests. If you find anything that should be parsed and fails with the SDF, please let me know.

For the people that are interested, here is a (short?) list of differences between version 4 and 5. It lists what each version has that the other version does not. This only includes the syntactic differences, not the semantic differences.
  • Version 4 has old_function support
  • Version 5 can use the result of a function as if it were an object
  • Version 5 has __METHOD__ as magic constant
  • Version 5 can access class constants with '::'
  • Version 5 has 'clone' and 'instanceof'
  • Version 5 supports references within foreach-variables
  • Version 5 has support for final and abstract classes
  • Version 5 has interface support
  • Version 5 has support for different types of functions within classes (e.g. private, public, etc.)
  • Version 5 has constant class variables
  • Version 5 has try-catch and throw support (no 'throws' for functions though)
  • Version 5 has type hinting for functions

Version 4 support

The last few days were filled with 'last-thing'-activities: the things that should be done before the big vacation starts. I took the last exams of this year on Tuesday and Wednesday, and they went reasonably well. Thursday was my last day of work as an assistant teacher at the Sint Aloysius, a Dutch elementary school.

But these are not the only 'last-thing'-activities. I have also fixed the last things for the SDF of version 4. All the files in the distribution of PHP4 are parsable by the SDF as of revision 58. A little note about the test-files from the distribution: I removed one file and others were slightly adjusted. The removed file was a test for a parse-error. The adjustments I made were commenting out two small pieces of code. These pieces of code were not parsable by the distribution itself. For example, there is a file with an old_function declaration in the files for version 5. This seems to be incorrect because the parser-definition of version 5 does not have a token for old_function.

Another thing that may be interesting is that I encountered some strange behaviour with the syntax for if-elseif-else-constructions. This involves the dangling else problem, which is normally easily solved. The problem seemed to be solved in the standard way, but it reappeared when there was an extra line break at the end of the if-if-else-construction. This was eventually solved by adding an extra rule for empty else-if blocks, but this could be a bug in the SDF-parser.

So a lot of things are finished, but the real work is about to begin. Next stop: full PHP5 support!

Highlighting the Context

I am one of the people that use Context for working with files that are purely text. This editor has support for finding matching brackets and for highlighting a lot of source code, and I am used to it. But it did not have support for highlighting SDF files. I say 'did not' because I have put a highlighter together. This is not a real big deal, if you look at the contents of the file you will see this, but it might save someone some time. So a highlighter for SDF in Context can be found here, and for the testsuite-file here.

And of course a little status update. The SDF can parse and is tested against expressions and a few statements. It also does a pretty good job on the files from the distribution of version 4. So we are definitely going in the right direction.