A Doh-situation

I had a real 'Doh'-situation yesterday, it happened when I was trying out php-sat on several PHP-projects. I began with trying php-sat on the sources of Phorum, phpBB, phpMyAdmin and Joomla.

The results for these projects are very nice, considering the current status of the tool. I was surprised to see that the pattern C003, which is very simple and even documented in the PHP-documentation, was actually flagged in both phpMyAdmin and Joomla. The functions to blame can be found in Table.class.php (generateAlter and generateFieldSpec) from the phpMyAdmin-source, and database.php (database) from the Joomla-source.

A pattern that was found quiet often in phpMyAdmin is pattern O001.
I have done some simple tests and it turns out that passing multiple parameters is roughly twice as fast as using concatenation. This blog-entry confirms this by explaining what happens behind the scenes in both cases.
So the pattern describes a possible optimization that is not very useful if this patterns occurs only once. But if it happens in 32 cases in a single source file that is called common.lib.php, which is probably used commonly in the project, then it might be useful to take a look at it.

But back to the 'Doh'-situation. After trying php-sat in the projects I wanted to run it on a large php-codebase. A great resource of PHP-code can be found in the PEAR repository. The PHP-code for PEAR itself is about 30000 LOC so I gave this to php-sat first. I was not surprised that it took a while because the code is pretty large. But I got a little suspicious after about 10 minutes of just seeing files being parsed. So I put some more log-messages in the php-front-library and tried again. It turned out that php-sat was including 3 files in a cycle, but the includes happened because of a 'require_once'. I was pretty sure that I had fixed this problem by checking whether a file was previously included, and this was indeed the case. The only problem was that I put the files that where included in the environment after going into a file. The solution is pretty obvious, but such situations just call for a slap on your own forehead.

Propagation

So this week was still filled with constant propagation techniques, but also with software metrics. I have given a presentation about functional software metrics which was based on this PhD thesis. It is very hard to find papers about software metrics for functional languages, but it is even harder to find them for scripting languages! Please let me know if you know a website that covers metrics specifically for scripting languages, I don't expect there to be any papers about this subject.

I am looking for software metrics for scripting languages because I think that metrics can help in finding the weak spots in a program, but there has to be evidence to support this. Validation of software metrics is rather time-consuming, so hopefully done by somebody else :) It shouldn't be hard to make a tool based on php-front to generate a rapport with software metrics about your program, any volunteers?

But back to the constant propagation. I have added support for the propagation within loop-constructs, if-statements and switch-statements. This was not very hard because I could use the techniques described in this paper. The construct that was not discusses in this paper, or on any slide of the pt course (which is not given anymore, unfortunately), is the if-elseif-else construct. It is not very hard to come up with the (hopefully correct) semantics for this, but I could be wrong of course. I will explain my interpretation in the next section, but it might be hard to follow. Feel free to ask for a better explanation!

The if-elseif-else construct is evaluated as a list of list of statements. The first statement is used as an unit-element in a foldr over this list. This foldr applies a given strategy to every element, which is to be expected. But it also performs an split-and-intersect of the dynamic-rule-set between every two elements. The result of this is that the rule-set will only contain the dynamic rules that are stable in all branches of the if-elseif-else statement. The implementation of the switch-construct was pretty easy after I made the infra-structure for this "split-and-intersect"-foldr.

So the constant propagation is coming along very fine and is also added to the php-sat-tool. The new flag --complex-inclusion triggers the tool to perform constant-propagation and include the files that can be included in that way. The constant propagation is not completely finished, to consider this an 'alpha-feature in this 'beta'-tool.

Not the only one

It is always nice to know that you belong to a group. I discovered this week that I am part of a large group of people that still works on their SoC project. It turns out that many people still try to work on their project, but at a slower pace. I am one of those people :) I am also one of the people that can walk around in a Summer of Code 2006 t-shirt (front,back), it arrived this week. But enough about bragging, let's talk about PHP-Sat again..

You will probably know already that Google has released Code Search, a search engine specifically for code. It turns out that it can be used to find vulnerabilities in PHP sources. The regexp for XSS due to echoing raw input, which is heavenly quoted on the Internet, gives around 7000 results. There are off course false positives in these results, but many of the results would be 'useful'. Could there be a better way to show that PHP-Sat is needed?

So I have been working on PHP-Sat and, more specifically, on the completion of the constant-propagation. I have rearranged the tests for the simple-evaluation. There used to be one big testsuite with over 450 tests on almost 3000 LOC. There are now separate testsuites for expressions, operators, primitives and so on.

Functionality is added in the sense that constants can now be defined and actually used. I have also begun with the implementation of references. The manual says that references are just pointers to the same memory location. I have mimicked this behavior by adding an extra step between a variable and its value. Every variable is rewritten to a 'variable-identifier', this 'variable-identifier' is rewritten to the actual value. Introducing a reference is now a simply introducing a rewrite from a variable to a 'variable-identifier'.

Oh, and I have also updated the implementation of pattern C002 to handle static function calls. Mmmm ... I should really put more time into those patterns.

New and Improved inclusion

This week was not only filled with going to the university and playing around with PHP-SAT, it was also filled with (a lot of) preparation for the Flitsjacht, a game for scouts in our region. Organizing a game that is played by over 150 people in 11 teams in an area of about 10 square kilometers takes a lot of time, but the photo's are priceless.

Organizing this game does not mean that there is no progress on PHP-SAT, revision 256 holds a few new things. It includes some new tests, a few extra lines of source code and a lot of new functionality! The basis for the '--complex-inclusion'-flag is finished with the evaluation of the internal functions 'include','include_once', 'require' and 'require_once'.

The implementation of this evaluation was not that easy because of a small detail in the strategy 'find-in-path'. This strategy takes a list of paths and tries to find a file in those paths with a given file name. It does this by checking the existence of each 'path + file name'. This would always return the full 'path + file name', the next paragraph explains why this is useful, and worked like a charm. Until I tried it with a file name that was available at the current-directory! I did not get back the absolute path, but the original given file name. It took me a while to figure out that the strategy first checks for a file with only the given file name without taking the paths into account. I solved this by just looping over the list with directories and now everything is fine.

But why do we need the absolute path to include files? This makes it a lot easier to implement the logic of the 'include_once' and 'require_once' functions. Before we include a file we check if the file was already included by using the absolute path. The absolute path is necessary to prevent false positives in this case. False negatives are prevented by normalizing every path after a file is found. This normalization removes all '.' and '..' steps in a path and returns the direct and absolute path.

Next step: evaluate some more internal functions, especially the 'define' function that defines constants.

I love my test suite

When I started to implement the evaluation of control-statements for the constant propagation I realized that I made a wrong decision. This occurred to me when I implemented the control-flow with a single if statement. One of the side-effects of such a statement is that it should undefine variables that are changed within the body when the condition can not be evaluated.
For example:
<?php
$foo = 'bar';
if($bar == 1){
$foo = 'foo';
}
//Value of $foo is unknown
?>
Because we do not know the value of $bar we do not know the value of $foo after the condition. So the dynamic rule that maps $foo to the string-value "bar" should be undefined. Stratego has special syntax for these kind of situations and this makes the implementation very easy.

When I wanted to do this for arrays I realized that arrays where implemented as a key-value mapping in a hashtable. This is correct, useful and it works, but undefining mappings takes a lot of work because the special notation would not work with this implementation. So I had to change this implementation and use dynamic rules.

My test suite really helped me in the refactoring of this implementation. It showed which part where updated correctly and I could use the tests to analyze why the rest still failed. It also made me aware of the fact the when a dynamic rule is undefined the key still exists. The application to this undefined key will always fail, as expected, but the key still remains in the set of all keys of a dynamic rule. The test suite also identified some hidden assumptions in the old implementation, which where immediately removed.

All this information would not have been available to me without the test suite. To quote a large company: "I'm loving it!"