Bouwers > Eric > Blog: PHP-SAT

Showing posts with label PHP-SAT. Show all posts

Migration to the nine-headed monster

After some recent activity on the psat-dev mailinglist I became aware of the (lack of) available builds for php-sat. Even though the development speed is not what I would want it to be (so much fun things to do, so little hours in a day!) I still believe it is important to release early and often.

Fortunately, Eelco Dolstra had some time to migrate php-front, php-sat and php-tools to Hydra, the new Nix-based continuous build system. After some tweaking we now again have access to unstable build for all PSAT-projects. Go Hydra!

So much interesting to read ...

... and less time to do it!

Since I started my full-time job I noticed that it is a lot harder to stay up to date with all the latest (online) developments. During the writing of my thesis I had a considerable amount of time which I could spend on reading blogs and news-items, and following mailing-lists. When you spend most of your day actually working you learn to prioritize which things you want to read :)

One of the things I did managed to read, albeit a bit late, is this thread on the PHP.internals list written by Wietse Venema. It is a follow-up on this thread in which a first proposal was made to integrate a perl-style taint-mode into the core of PHP. The results posted in the follow-up thread look promising. It is also interesting to see that he has gone from a black-and-white taint-mode, to a more leveled approach. Currently the proposal only contains a subset of the levels available in PHP-Sat, but I think that the most fundamental ones are definitely there.

Even though Wietse is developing a prototype I think he will have a hard time getting this taint-mode into the actual core of PHP. Within both threads the general opinion seems to be that the idea is nice, but the developers of PHP seem to think of many situations in which it could fail. I hope to see more results of this idea soon!

Going over some other interesting threads in the internals-list I found a reference to a tool called PHPLint. (Isn't is funny to see that there are all sorts of initiatives popping up that that try to make PHP more secure/stricter). I haven't have time to take a better look at this tool, but a first glance definitely showed potential. I'll try to examine this tool more thoroughly at the end of this week.

While there is less time to read on-line material, there is more time to read off-line stuff. Since I am using public transportation to get to work I have an extra hour a day to read actual books and publications. One of the books I have read in the past few weeks is a printed version of "Producing Open Source Software" (ProducingOss) written by Karl Fogel. My conclusion: absolutely worth reading!

ProducingOss contains all sorts of tips, hints and best practices. Even if you are not involved in an open source project it is still useful to read. Almost everything in the book can also be applied to closed-source projects. Furthermore, it contains many pointers to other interesting literature. One of these pointers lead me to The Cathedral and the Bazaar, my current read-while-traveling-to-and-from-work book.

I intend to use several things from ProducingOSS within PHP-Sat. I will just have to think about how I can fit the project in my current schedule, but it will definitely be fitted in.

Where you at?

Now that the propagation of safety-types seems to go smoothly it was time to dive into another subject: accessing the location of terms. In this case, the location of a term is defined as the location of the text in the original file that was parsed to that term. Since the strategy to annotate the AST with position-info is available in the standard libraries nowadays, it should be easy to access these locations and finally solve PSAT-91 right? Lets find out!

The first thing I did was to add a separate module to handle the low-level stuff of getting the location annotations. This module contains several getter-strategies that can retrieve, for example, the number of the start-line. The location info is captured in six different numbers: start-line, end-line, start-column, end-column, offset and length. A getter-strategy is available for all of them. Furthermore, the name of the file in which the term is defined can be retrieved.

Although these getter-strategies are useful, they are not meant to be called directly. I figured that the most common use of these functions would be reporting the values in some kind of (formatted) message. In order to capture this kind of behavior the strategy format-location-string(|message) is defined. This strategy takes a message with holes in the form of [STARTLINE] as parameter and fills these holes with values from the current term. A rather useful strategy if I say so myself.

To practice with this new piece of functionality I have added an extra option to the tool input-vector of the php-tools-project. This option allows the user to choose between the normal list, or the same list with line-numbers printed for each access. More information about this option and how to add an option yourself can be found here.

After this was done I moved to php-sat to make the output more concise. It was actually pretty easy to implement. The algorithm is nothing more then get-terms-with-annotations, make-nice-output. I actually spend more time on creating a test-setup for calling php-sat through a shell-script then on generating the more concise format. The only problem was that the adding of position-info everywhere interfered with the dynamic-rules. A few well-places rm-annotations where needed to fix this. Please let me know if you like the new output, or whether something should be added.

The next applications of the location info is the tracking of where untainted data enters an application. When a function is called with a parameter $foo which is tainted, it would be nice to show when it was tainted. I think this is not too difficult to add, but bugs always seem to lurk in 'I-think-it-is-easy-to-add'-features.

A last remark about locations is a small problem without an actual solution. Eventually php-sat must support function-calls. The algorithm to analyze function-calls is not complicated, but how can bug-patterns within a function be reported? A message before each call to this function? Within the file in which the function is defined? And what about cases in which one call is flagged and the other one isn't? And can we also handle object-creation in the same way? I haven't figured out how to handle this, so if you have any ideas please let me know.

Some new features

Last Friday, I had a conversation with nEUrOO in #stratego about the results he was having with php-sat (see his blog for more details). After the conversation, and after looking at the test-file he mentioned in his blog, I got started and added 2 new features to php-sat.

The first new feature is aimed at the usability of the tool itself, and is visible by a new command-line option: -ra CODE. This option accepts one of MCV, COR, EI, STY or OPT as input, and makes sure that the only analysis that is run is the one that belongs to the given code. In other words, you can now run, for example, only the correctness checks by calling php-sat with: --ra COR. This gives you a somewhat coarse-grained control over the behavior of the tool. A plan for more fine-grained control (on the level of patterns) is also mentioned some time ago, but the implementation of this level of control requires some more thoughts. In the meantime, please enjoy running a single kind of bug-patterns :).

A second new feature is added to the analysis of safety-levels. Consider the following example:

<?php
  echo addslashes(htmlentities($_GET['name']));
?>

The default configuration for echo requires a parameter to have both the level EscapedHTML as well as EscapedSlashes. Furthermore, the default configuration defines the return-type of the functions as:

function: addslashes       level: escaped-slashes
function: htmlentities     level: escaped-html

So this piece of code should not be flagged by php-sat. Unfortunately, previous revisions did flag this piece of code!

The problem here is that the analysis uses the safety-level of a function that is mentioned in the configuration file without considering the parameter of the function. This behavior works well for most functions, but when functions only add a property to their parameter it becomes incorrect. Because of this behavior, the echo-statement is flagged because the call to addslashes is only annotated with EscapedSlashes.

Fortunately, the solution is not that complicated. Since we know that there are several functions that add a certain property to their parameter, we add the possibility to specify this in the configuration file. The new syntax for functions that add a safety-level is a '+' after the level of the function. This '+' forces a function to inspect the level of its (only) parameter and combine this with the level that is specified as its safety-level. The combination of the levels is the result of the function call. (Small note: this behavior is (currently) only supported for functions with a single parameter.)

So from now on, when the following configuration is used:

function: addslashes       level: escaped-slashes +
function: htmlentities     level: escaped-html +

the example above is not flagged anymore because the call to addslashes is annotated with its own safety-level (EscapedSlashes), as well as the safety-level of its parameter (EscapedHTML). A pretty useful feature I would say.

Merging fun

Yesterday, commit number 379 and 380 introduced a renewed implementation of the constant-propagation, and new functionality for finding vulnerabilities. You guessed it, the merging of constant-propagation and vulnerability analysis has taken off! The cool things are that 1) the old tests all pass, 2) some new tests pass and 3) I get to write a lot more tests!

The main reason for starting the integration of both analysis was the fact that I saw a lot of code duplication popping up. This duplication was caused by the fact that the bookkeeping of internal structures is the same for all strategies, the code only differs for the properties of values.

With this last piece of information I started to merging. My first approach consisted of trying to pass a list of get-set strategies to the main-strategy and calling these strategies dynamically. This was obviously not a very good or nice start because it always seem to result in numerous segmentation faults.

Thinking about the problem made the second attempt somewhat more pragmatic. Instead of generalizing I just wanted to make it work for constant propagation in combination with a second analysis. So I rewrote the strategy to receive two strategies, one for getting the properties of literals and one for getting the properties of operators. The choice for these language constructs is based on the reasoning that these constructs are the only ones that make or manipulate the actual properties of values. The other language constructs manipulate the flow of the values instead of the actual values.

When this all seemed to work the more difficult challenge had to be solved, manipulation of variables and arrays. It turned out to be more simplistic then I thought because of the indirection between the variables and their values. For the purpose of dealing with aliasing, a variable does not point to a value but rather to a value-identifier. This identifier points to the actual value. This makes the creation of a reference easier, we just create a mapping from a variable to the value-identifier of the referenced variable. Because of this indirection we can simply make the value-identifier point to more then one property, implemented by a dynamic-rule for each property. This makes the merging of sets of dynamic rules a bit harder, but not impossible.

It might sound simple (or incomprehensible), but getting everything right was still a bit tricky. For example, when an assignment is made the property of the RHS must be known before the LHS can be assigned this property. So what happens when the constant propagation cannot compute a value, should we simply fail to assign a property?
The answer is no, the second analysis might still be successful. These kind of little problems made the implementation a little less straight-forward, and the code a little less beautiful.

However, the result of it all is that the following example is now flagged correctly:

<?php
 $foo = $_GET['asdf'];
 $bar = 1;
 $bar =& $foo;
 echo $bar;
?>

In this case, echo $bar will be flagged by the latest php-sat.

I experienced one problem with the implementation that is related to the semantics of Stratego. My first attempt in adding an annotation to a term was something like this:

 add-php-simple-value(|val):
    t{a*} -> t{annos}
      where  b*    :=  a*
           ; annos := [PHPSimpleValue(val) | b*]

This works perfectly, the annotations are matched as a list by the *-syntax, and a list is added as an annotation to the term again. The only problem with this is that the second time this rule is applied it matches the annotations as a list of a list of annotations, which was not the behavior I desired. This problem is easily solved by also adding a * to build the term:

 add-php-simple-value(|val):
    t{a*} -> t{annos*}
      where  b*     :=  a*
           ; annos* := [PHPSimpleValue(val) | b*]

Now the list of annotations is not wrapped in an actual list anymore. I know it is documented somewhere, but this little explanation might save some others from an headache or a long debug-session.

The next step in the analysis for vulnerabilities is a rather important one: testing. Even though the basic parts of variables and assignments are already tested, there exists a large number of scenarios that need to be tested on this new integration-strategy.
But hey, testing is fun!

Thinking of Merging

I have been dealing with some challenges in the last couple of days. For example my practical assignment for DBA, to be implemented in MIL, and styling the GUI for my thesis, which is coming along just fine I guess. Both of these tasks are easy for people who invented/work with it every day, but I sometimes find it hard to wrap my mind around the problem or keeping track of every detail. Good luck for me there exists a topic where I can work on without repeatedly searching on Google, analyzing PHP!

Speaking of analyzing PHP, Martin has written a blog about the generation of rules for operator-precedence in PHP. He mentions some interesting ideas for work on grammar engineering, so if you are looking for a (thesis)-project you should definitely check it out. Otherwise it is just an interesting post to read.

As for my work in operators in PHP, I have finished the rest of the operators in PHP-Sat. Furthermore, the implementation of the constant-propagation regarding operators is revised within PHP-Front. This is done because I am thinking about merging the constant-propagation and the safety-type analysis into one big analysis. I know, it sounds like premature optimization (a.k.a. the root of all evil), but I can explain why it is necessary.

Consider the following piece of code:

  $foo = array(1,2);
  $foo[] = $_GET['bar'];
  echo $foo[2];

When we consider the constant-propagation we first assign the values 1 and 2 to the first two indexes of the array. The value of $_GET['foo'] is then assigned to the third index of the array which is the parameter to echo in the last statement. We know that the value is assigned to the third index because PHP-Front keeps track of the internal index-count of arrays.
Now lets look at the safety-type analysis. We first assign the safety-type IntegerType to the first two indexes of the array. The safety-type of $_GET['foo'] is then assigned to the third index of the array which is the parameter to echo in the last statement. We know that the safety-type is assigned to the third index because PHP-Sat keeps track of the internal index-count of arrays.

You might have noticed that both paragraphs are almost identical, except for the kind of value that is assigned. Thinking about this case, cases for function calls and cases for objects it turns out that performing the safety-analysis involves a lot of bookkeeping. This bookkeeping, for example the internal index-count, is not specific for the propagated values, it encodes the internal semantics of PHP. Therefore, in order to provide a good safety-type analysis, we have to embed this bookkeeping.

In order to avoid code duplication, which might be worse then premature optimization, I believe we must merge the two analyzes together. By separating the assignment of values from the bookkeeping we can visit a node, perform the bookkeeping and then perform as much analyzes on the node as we want. The only thing needed is a list of strategies that encodes a single analysis on the node.

The idea might sound a bit vague, but while I was working on the operators I already saw some duplication creeping in. Some of the strategies for both analysis only differ on the strategies to fetch or add some value, a perfect change to generalize a bit I would say.

The operator type

My first intention was to name this entry after one of these quotes about assumptions, because in the past week I found out (again) that they are totally correct.

Last Sunday I started to work on typing the operators of PHP. My first attempt was really basic, simply follow the documentation and everything will be alright. So I started with the arithmetic operators '+','-' and '*'. They all act the same on the type level in the sense that the result is always integer, unless there is a float involved. The next operator in line was division ('/') which always returns an float (according to the documentation). I finished off with the modulus operator which behavior turns out to be ...... not documented.

Since I want PHP-Sat to follow the semantics of PHP I needed to know the exact semantic of this operator in PHP. Naturally this is not hard to figure out because we simply run PHP on some test-data and use the function gettype. I wrote a little script with some lines like:

...
echo "Type of 5/2 = ", gettype(5/2), "<br />";
...

which is not only tedious, but also a complete violation of the DRY principle.

In order to comply with DRY I turned to the eval function of PHP. This function evaluates a string as PHP code and can actually return the result of a its computation! I used this function within a loop to dynamically calculate the result of applying an operator to different types. You can find the complete script here, but I will show the critical part here:

foreach($types2 as $key2 => $val2){
    $code   = 'return ' . $val1 .' '.$op.' '.$val2. ';' ;
    $result = eval($code);

    echo ''. gettype($result) . ' (' . showval($result)  . ') ';     
}

The foreach-loop resides within another foreach-loop looping through a copy of the array $types2. This gives us access to two different types and the current operator. These are combined within a string which is executed. We then extract the type and the result of the execution and display it.

Basically you just simply pass an array with types and a operator to the function and it will output the type and result of all possible combinations. I put the output of this script here, you can find the types and values that I used before the actual result. The first few sections contain most of the unary operators, the binary operators are organized in tables. Within the tables each row contains the first value given to the operator, each column is the second value. I apologize for for the ugly formatting, just trying to be a good programmer.

It was now a trivial task to read type(s) of the modulus operator from the table and be amazed. The result type is almost always an integer, but it can an also be a boolean type! This comes from the fact that whenever the RHS is a '0'-value the modulus cannot be calculated. Where other languages would simply halt execution, PHP returns a 'false'-value and issues a warning. This is also the case for division and is now also handled correctly in PHP-Sat.

You might wonder why I have chosen to test all of the operators instead of just the modulus. The first reason for this is that I could simply not resist the temptation, it was just to easy to add all the operators. The second reason was that the first run of the script also involved the testing of the division operator. According to the documentation it always always return a float, but the table tells us that it return either a float, a boolean or an integer! I might be able to see that the boolean return-value is not documented, it is almost certainly a mistake to divide by zero, but the documentation specifically states that division never returns an integer! Luckily (yes I believe it is a good thing), this problem is already mentioned by someone else and I hope it will be fixed before the year ends.

Back to PHP-Sat, the todo-list regarding operators has been heavily shortened. The only ones left are the assignment- and string-operators. The first group involve some dynamic rules which I want to implement when I am in a more awakened state-of-mind. For the second group I still have to figure out which safety-type is to be assigned when the safety-type of both side differ. I currently believe it should be the lowest safety-type, but please feel free to provide me with a counter-example that shows that this belief is wrong.

Results from a meeting

Last Tuesday I went to Delft for a meeting with Martin after a long period of only communicating through email/IRC. Although these form of text-communication work fine a real-life meeting will make it easier to discuss things.

One of the subjects of the discussion was my paper for the STC. I already gave my presentation for this course, but finishing the paper was not an easy task for me. As you might have noticed it is quite a challenge for me to write a text in understandable English. When the text must also contain some formal definitions and clear examples the speed of my writings quickly degenerates. However, I still managed to put together a reasonable result. After we discussed which parts needed some modifications I improved the paper in the past couple of days. The result can be found at the new talks-and-paper-page on php-sat.org. Please feel free to provide constructive feedback.
When you followed the link you might have also seen the new page for the PHP-Tools package. This package is now also build within the build-farm, thank you Martin, and available for download. If you have a cool idea for a tool please let me know via the bug-tracker.
Other topics that we discussed will help to improve the path-strategies for file-inclusion, improving the SDF-definition for HereDoc, develop more concise error-reporting and generating the priorities for precedence from the YACC-definition. I have worked a little on all of these issues in the last two weeks, but now I have better ideas to handle the problems I ran into.

A last topic of discussion was an idea I had for an algorithm to use in my thesis. This algorithm uses a set of allowed rewrite rules to ``guess'' the rule a student wanted to apply. I already wrote an example-scenario for my proposal, which is coming along great by the way, so after I reread this I will discuss it in a blog. I'll keep you posted.

PHP-Sat.org finished

It took me a ((very) long) time, but I finally finished all the texts for PHP-Sat.org. It is always a challenge to think of the right subjects and placement of texts for a website, but I believe that everything is in the right place now.

The website contains all the basic things a project need: general information, source repository, mailing-lists and bug-tracker. The current information is quit basic, but it will grow over time.

But the website also contains some extensive documentation on the following topics:

This is also the reason it took some time to finish the site, but I think that it was useful. It gave me a a change to think about several aspects of the project.

Please let me know if you find any (spelling)-mistakes in the texts, still practicing my English over here.

Pimping my environment(s)

You probably already noticed that the blog has been pimped. I have adopted the quote that was mentioned at the SUD, updated the links, added the logo and an overview of all the labels. The blog now represents more of what it already was, a place to write about the projects I am involved in. And yes, me is also a project I am involved in :)

Apart from the blog I also updated the PHP-Sat website. There is now some documentation for PHP-Sat and the bug-patterns. The documentation of PHP-Front will also be updated soon. The last thing I have to do is writing the friendly and catchy welcome page, always a difficult task.

We are traveling further away from the project when I tell you that I also updated my computer configuration. You might recall that I used to work on a virtual machine, which tend to be a bit slow. So my current configuration is a dual-boot system with Windows2000 and Fedora Core 6 (default).
The documentation for setting up the working environment was more or less a guide for myself to get everything working again. It was definitely worth the work of backing-up all my configurations. A complete compilation of PHP-Front and PHP-Sat now takes 10m47s instead of the old 23m37s.

The last environment that I have pimped has nothing to do with any of the projects, except the me-project. I have cleaned my room and moved some stuff around. It is surprising to see how many useless things I had and how much space you gain when you throw them out. Although I am one of those people that believes in: 'everything you throw away will be useful the next day, my trust in this claim is fading away. It has been two days now and I still do not need the four, 10 centimeters long, lightsabers collected from cereal-breakfast-boxes.

What's the deal with ... ?

What's the deal with the lack of updating of PHP-Sat?
The last few weeks were filled with all sorts of other interesting activities, at least for me. My days are mostly filled with doing research for my master thesis and writing my proposal. In the past two weeks my free time was filled with the SUD, the visit to Google and the Grammar Engineering Tools. So plenty of projects, but little time to work on PHP-Sat.

What's the deal with this thesis then?
The thesis is the final project/course I will have to finish before I graduate. I will do some research to validate a certain algorithm that will improve feedback in educational programs. A more detailed explanation of the subject/algorithm/approach will be put into a blog soon. It's a totally different subject, but still interesting for people in computer science as in education.

What's the deal with this Grammar Engineering Tools project?
As mentioned before, the paper that was the result of the project was already submitted. Some improvements are made and the tools are (going to be) updated. My work on this project was still in the interest of PHP-Sat though. We have gotten a great insight in which combinations of operators are valid and which are not. This information will be put on the web as soon as we have a reasonable format for it, so stay tuned!

What's the deal with those labels/link underneath the posts?
The new version of blogger allows you to have labels under a blog. I thought is would be nice to order the posts according to certain topics. People that only want to read about my thesis can look here, those that only want to read about PHP-Sat here. I hope that the people behind blogger will add a RSS-feed per label in the future, but I can only hope :)

Who's idea is it anyway?

I started with the research for my thesis proposal this week. For those who
wonder about the subject: please read on. For everybody else: skip to the next paragraph. My master thesis will explore an idea of Johan Jeuring
and Harrie Passier. The eventual goal is to provide feedback in educational tools (well actually educational tools that allow you to rewrite a certain input step by step to an answer), as if the feedback comes from a teacher. One of the hurdles that has to be taken is the guessing of which step the student wanted to take when a faulty answer is inserted in the program.
The idea for guessing this step is that you take the last correct tree and then generates all the trees that can be obtained by rewrite-rules that are allowed. You then calculate the difference between these trees and the faulty answer to find out which tree is nearest to the faulty answer. The rewrite-rule used to get this nearest tree is probably the rule that the student wanted to apply.
This simple and short explanation does not expose all the problems very well, but I will probably come back to that in a later blog. For more information about the idea you can take a look at this paper.

Last Monday I was in Leusden to meet the people from 'TeamInternet' again. Apart from going through some new code of the scoutshop, I also showed PHP-Sat to the people there. Bjarni van Berkum came up with a new idea for a bugpattern and this resulted in another idea. He told me that he was interested in the results of PHP-Sat on the code base because it would probably find the first pattern a few times. I told him that it shouldn't be hard to implement, but it turns out that this is not completely true.
I was right about the fact that it is easy to implement, for normal and static function calls that is. The retrieving of these functions are pretty straightforward because they are directly available in the environment. The problems lays in the fact that most of the codebase of TI is ObjectOriented, which means that there are lots of objects passed to lots of functions. But we currently do not descent into functions, nor do we know about objects. This poses somewhat of a problem if you want to find out which objects are used within functions. So unfortunately, there is a lot of work to be done in order to support this.

But there where more ideas that came from people at TI. Frits Zwegers talked about an idea for a tool that can give an overview of which files are needed for a given file. This can be done by collecting all directly included files and all the files that declare a class or a function that is used within the file. The implementation of this tool is also delayed because of the problem mentioned above, but I am pretty sure that it will be available some day :)

The people mentioned above already came up with great ideas, so if you also have a great idea: share it!

The first presentation

The last three days where filled with only 1 thought:
Prepare Presentation

Every student that follows the master program at the Center For Software Technology has to give a talk as part of the Software Technology Colloquium. Students usually pick a topic that is researched by other people and present projects, ideas or tools that are the result of this research.

Since the ideas and concepts of PHP-Sat could be interesting for others as well, and since I already know some things about it, I started working on the slides and preparing the presentation. On Tuesday I practiced the talk and got some pointers on how to improve the structure of the talk. I gave the presentation today and it was actually fun to do. I was a bit nervous before I started, but after a few slides the story just came out very smoothly. So the first presentation of PHP-Sat was a success, on to the next one?

(For those who are interested, the announcement and the slides can be found here)

Something to celebrate

This is the 50th post in this blog, a number to celebrate because it is a nice round number. It is also the first blog-entry that contains some real-world results, another thing to celebrate. It also comes at the time that I am reading the book The best software writing, thank you Michiel, which might help me to improve my blog-entries. Something that is definitely worth celebrating.

So let's look at some results. I found out that there 91 detections of pattern O000 in the PhpDocumentor-project. Since this suggests that you should always pre-calculate the size of an array I decided to do a little test. I first used the original sources of PhpDocumentor to extract the documentation of 'phorum', 'phpBB2' and 'phpMyAdmin'. After this I modified all the for-loops of PhpDocumentor that where flagged by PHP-Sat and extracted the documentation again.
The results where rather disappointing, maybe a single second somewhere, but nothing really substantial. When I showed this my girlfriend I realized that none of the projects have phpdoc-comments. So I found out that this optimization does not matter for projects that you would not process. Anyone else want to state the obvious?
So I repeated the experiment with the sources of PhpDocumenter itself, which are well-documented. The results for these sources are less disappointing. It turns out that pre-computing the size of the array saves 6 seconds in the frames/default configuration, and up to 9 seconds in the DOM/default configuration. These number are not very large, but it seems that this simple optimization really saves some time. If you want to know more about optimizations that are more useful, please click here

Optimization is nice, but has been discussed a lot and not very original anymore. So let's look at some results that indicate logical errors. A pattern that is not PHP-specific is C006, which is an ignored return statement. I have implemented this pattern because OWASP has an article about it. When I inspected the results I learned that Pear actually has functions that will always return 'true'. This situation is probably worth a pattern, but it also generates extra detections because the result of these functions are ignored.
But there also was an ignore that was not legit and this resulted in the first bug-report coming from PHP-Sat. Let's hope that it will get accepted and is fixed in the near future.

When I looked at the results from the C006 pattern I also read the comments in some of the files in Pear. The following piece of comment caught my eye:

.. constructor
@param ....
@param ....
@return mixed True on success else PEAR error class.

How can a constructor return either True or an other object? A constructor is supposed to return the newly created object, so a return statement in a constructor is useless. (PHP4 allows you to return something else from a constructor by assigning a value to the '$this'-variable, but this is not compatible with PHP5 or good practice.) I have also found a thread that shows that these patterns should be checked.
So I added the patterns C007 and C008 to PHP-Sat. I could not find any assignments to '$this' in the stable packages of Pear, but I found 13(!) occurrences of a return value within a constructor. Good luck for me, bad luck for the package maintainers?

And the answer is...

Did you have enjoy thinking about the problem? I definitely had fun during this weekend, pictures will be put on the website soon.

But back to the question off last friday, what will be the result? It would not have surprised me if PHP would raise a NOTICE, but this is not the case. It might have surprised me if PHP used the last statement in the definition as the result, but this is also not the case. The variable $result is just null, not initialized, nothing happens! This is probably not the intention of a programmer, either the function is misspelled or the function is not correctly implemented. So this pattern is added to the correctness-category and has id C005.

I have also worked on the constant-propagation again. There is now support for superglobals and the list-statement. The support for superglobals also implies that every variable is also accessible through the $GLOBALS array.

Getting through the weekend

My upcoming weekend will be filled with a lot of fun and excitement, I have another World Jamboree troop-weekend. Check out this site to see the other people of my troop.

For the people that can not spend a weekend without wondering about PHP I have a little exercise. What is the output of the following piece of code:

<?php 
  function foo($param1, $param2){
    $param1 + $param2;
  }
  
  $result = foo(1,2);

  echo $result;

?>

Will it be an error, the result of the function or nothing at all? Don't spoil the fun by running it in PHP without thinking about it! The answer was a little surprising for me.

I proudly present: the logo

The PHP-Sat logo is made by Robert van Geenhuizen.
I am very grateful that he took the time to develop this logo. I think it is 'compleet hip', which means so much as 'very trendy'.

Robert made this logo with a concept in mind. The following quote describes this concept:


The creation of the PHP-Sat logo is based 
upon the following:

Develop a conceptually strong logo that 
uses modern illustration techniques to 
make a simple, yet strong ideograph.
This results in a "Bug" stopped by a 
imaginary (debug)-filter, the conceptual 
base principal behind PHP-Sat. 

The logo consists of many complex items 
with gradient mesh and three tints, but 
together they still form the basis for 
this strong and modern ideograph.

The statement is very hard to translate, so this is the statement in Dutch. If anyone can provide me with a better translation, please let me know.


Bij de creatie van het PHP-SAT logo is 
uitgegaan van het volgende:

Een conceptueel sterk logo neerzetten 
dat door middel van moderne 
illustratietechnieken een modern maar 
toch simpel en sterk beeldmerk is. 
Dit resulteert in een Bug die door een 
denkbeeldig (debug)-filter vliegt, het 
conceptuele basisprincipe achter PHP-SAT.

Het logo bestaat uit veel complexe items 
met verloopnetten en 3 kleurtinten, maar 
toch vormen zij samen de basis voor dit 
sterke en aanwezige, moderne beeldmerk.

Here it comes:

php-sat logo

What is going on?

If you have been wondering about what is going on, here is the answer. I have attended a conference of the NLUUG last Thursday. I actually worked there to check badges, but I could also attend the talks that where given. The main reason for being at the conference was the change to talk to people that would probably be interested in PHP-sat. I gave a demo to some of the partners of madison-gurkha, which went well I think. They where introduced to me by Armijn Hemel, who has a brother (Tim) that works there. I already presented the tool to Tim last Monday and he seemed to like it too. He is even learning Stratego to be able to adapt the tool.

Armijn also helped me out with some typo's in the documentation and with testing the tool. He found some things that where not support by the SDF. Some of them where rather easy to fix, others require an update of Stratego or a post-processor for the parsing.

If you have checked the issue tracker recently you might have noticed things are a bit more organized now. I have added some milestones to be able to plan more. So you can check out the roadmap to see what is going on.

Another thing that is going on is the creation of a paper. I have to write a paper for the STC and Martin suggested that it would be nice to write for a real conference as well. I will try to get it published so that the project will get an even more solid foundation. But I will certainly talk about the project at the STC and maybe also at the Stratego User Days.

To conclude this entry with a cliffhanger, the logo for the project should be ready very soon!

Anything new?

Sorry for the late update, college is really getting started so I have to get used to going to lectures and meetings again. But my php-*-projects are still going strong. The evaluation strategy can handle construction of, assigning to and reading entries from arrays now. An array in the interpreter is modeled as a map which maps keys to values. Some useful strategies have been implemented to easily add and retrieve values from the arrays. The documentation about arrays tells us that PHP also sees an array as a map, so the implementation of the semantics was not that hard. The only thing that it does not support (yet) is references. These will be added later.

The second thing that has been added is the fix for the long lasting psat-2 bug-entry. This states that HereDoc should be parsed as a sequence of literals and escapes, just like double-quoted strings. This turns out to be very tricky. Double-quoted strings have a very straight-forward start and end-point, HereDoc does not have this. The end of a HereDoc is marked with a newline-label-newline sequence. So the newline was added as an escape in a literal, we have to treat this newline in a special way if it is followed by a label. The problem that arises here is that we cannot express this within SDF. This should be written down as a follow-restriction on the newline, but then we would have to know the length of the label. This length is not fixed, so we can not do this. By not setting the follow restriction a literal became ambiguous when there was a variable in front of it. The solution here was to write out the definition of 'a list of literals or escapes'. This definition was extended by 'where there are no two literals in a row'. This solved all the problems for that moment and everybody was happy!
This happiness lasted until I tried to parse a file with a statement after the HereDoc, this failed. So a statement after a HereDoc caused the parser to keep parsing until the end of the file, something that we definitely do not want. So the problem here was that the end of a HereDoc was not strict enough. I tried several things, but nothing seemed to work. The optional semicolon was causing big problems. It turns out that this optional semicolon was a misinterpretation of the documentation. The documentation mentions this semicolon and I thought that it was part of the end of the HereDoc. But this semicolon is only allowed because the HereDoc is probably part of a bigger expression that needs to be closed by the semicolon. So I simplified the HereDoc-end and this allowed a nice follow-restriction. All the problems are over and everybody is happy again :)
So there is now a more expressive support for HereDoc, with a restriction. This was already true with the first implementation, but never really mentioned. The problem is that one can not express the fact that the labels of the HereDoc should match within SDF. So HereDoc is parsed from the first open label until the last closing label, even if there are statements in between. This restriction is unfortunate, but reality for all parsers that are based on SDF. It is simply not possible to solve this within the syntax definition. It can be solved by adding a post-processing step, so an issue for this is already created.

The finish this entry with some good news, there are windows-builds for both php-front and php-sat! It took us some time and some hackery implementation of stratego-routines to get everything right, but we succeeded. The hackery stuff will be moved to the stratego libraries as soon as possible. I have tested the build on my own machine and it seems to work fine. Please try it out if you have a windows machine that you can (ab)use. Windows-builds for stratego programs are relatively new, so there could be some hidden problems.

Thanks and progress

I think that there are many people with great ideas for projects. Most people do not get around to actually starting up these projects. It takes a lot of time which is usually not available. Without the Summer of Code I would not have had the time to start the project, let alone work on it for so many hours. Thank you Google!

The project would not be in his current state without my mentor. He helped me in setting up the development environment and automating the build- and test-process. Our meetings motivated me and helped me in keeping focused. I would recommend all (future) participants of the Summer of Code to meet with his/her mentor face-to-face, or at least in some interactive way. It helped me a lot, thank you Martin!

The following section is taken from my evaluation for the Summer of Code. I think it nicely summarizes the current progress of the project. The project has produced the following two libraries, together with tools to interface with them.

PHP-Front is a library with support for parsing and pretty-printing php, reflection of parsed sources, some generic traversals and a simple evaluation. This part is available as a separate package which provides a solid basis for transforming or inspecting PHP source code. I think that there will be more projects that are going to use this package for this purpose. One of the projects that I am already aware of is StringBorg.

PHP-SAT is the library that actually performs an analysis on the given source code. It tries to detect 7 bug patterns, more will be implemented later. It also check pre-conditions for functions and language construct to detect possible vulnerabilities. This last analysis will be improved over time.