Pimping my environment(s)

You probably already noticed that the blog has been pimped. I have adopted the quote that was mentioned at the SUD, updated the links, added the logo and an overview of all the labels. The blog now represents more of what it already was, a place to write about the projects I am involved in. And yes, me is also a project I am involved in :)

Apart from the blog I also updated the PHP-Sat website. There is now some documentation for PHP-Sat and the bug-patterns. The documentation of PHP-Front will also be updated soon. The last thing I have to do is writing the friendly and catchy welcome page, always a difficult task.

We are traveling further away from the project when I tell you that I also updated my computer configuration. You might recall that I used to work on a virtual machine, which tend to be a bit slow. So my current configuration is a dual-boot system with Windows2000 and Fedora Core 6 (default).
The documentation for setting up the working environment was more or less a guide for myself to get everything working again. It was definitely worth the work of backing-up all my configurations. A complete compilation of PHP-Front and PHP-Sat now takes 10m47s instead of the old 23m37s.

The last environment that I have pimped has nothing to do with any of the projects, except the me-project. I have cleaned my room and moved some stuff around. It is surprising to see how many useless things I had and how much space you gain when you throw them out. Although I am one of those people that believes in: 'everything you throw away will be useful the next day, my trust in this claim is fading away. It has been two days now and I still do not need the four, 10 centimeters long, lightsabers collected from cereal-breakfast-boxes.

String representations

With revision 299 we (again) have a list representation for DoubleQuoted- and HereDoc-strings. So the string "hello \t world" is represented by:

ConstantEncapsedString(
   DoubleQuoted([Literal("hello ")
                ,Escape(116)
                ,Literal(" world")]
))


This not completely new because the first implementation already had this, but this representation had some problems. There was no way to model a hexadecimal escape without making things ambiguous. This problem can be solved by making the order of the internal parts explicit, but we then had a terrible representation of the string:

ConstantEncapsedString(
   DoubleQuoted(
     DQContent(Some("hello ")
              ,Escape(116)
              ,Some(" world"))
))

Note that this string is represented with 1 of those DQContent-thingies, the biggest one has three children. So every string with more then 3 parts has nested DQContents. Let me give you an example, the string "Hello \\\0123" looks like:

ConstantEncapsedString(
   DoubleQuoted(
   DQContent(Some("Hello ")
              ,DQContent(Escape(92)
                        ,None
                        ,OctaChar(48,49,50))
              ,Some("3"))
))

Terribly right?

So this had to be solved by a post-processing step. We have to walk over the tree bottom-up to rewrite these nasty DQContent's to a nice list. It is not nice to have such a post-process-step, but this was already required because of HereDoc-strings.

The problem with the HereDoc is analogous with the problem of the Dangling-else. If you have multiple HereDoc-strings with the same label you will have to choice where the first HereDoc ends. PHP always takes the shortest HereDoc so this piece of code has two variables:

<?php
   $foo = <<<BAR
     foo...
  
BAR;
  
   $bar = <<<BAR
     bar
  
BAR;
?>

As long as HereDoc is ambiguous this is easily solved by choosing the right amb-node.

But after the rewrite to the new internal implementation, HereDoc became unambiguous! This is a bit frustrating because it takes the longest HereDoc, which is wrong. I spend some time in trying to get it right, but I could not make it work (yet). So a new puzzle has entered the project, happy christmas! :)

P.S. As said in the last blog, I can hope, but maybe I should just read.

What's the deal with ... ?

What's the deal with the lack of updating of PHP-Sat?
The last few weeks were filled with all sorts of other interesting activities, at least for me. My days are mostly filled with doing research for my master thesis and writing my proposal. In the past two weeks my free time was filled with the SUD, the visit to Google and the Grammar Engineering Tools. So plenty of projects, but little time to work on PHP-Sat.

What's the deal with this thesis then?
The thesis is the final project/course I will have to finish before I graduate. I will do some research to validate a certain algorithm that will improve feedback in educational programs. A more detailed explanation of the subject/algorithm/approach will be put into a blog soon. It's a totally different subject, but still interesting for people in computer science as in education.

What's the deal with this Grammar Engineering Tools project?
As mentioned before, the paper that was the result of the project was already submitted. Some improvements are made and the tools are (going to be) updated. My work on this project was still in the interest of PHP-Sat though. We have gotten a great insight in which combinations of operators are valid and which are not. This information will be put on the web as soon as we have a reasonable format for it, so stay tuned!

What's the deal with those labels/link underneath the posts?
The new version of blogger allows you to have labels under a blog. I thought is would be nice to order the posts according to certain topics. People that only want to read about my thesis can look here, those that only want to read about PHP-Sat here. I hope that the people behind blogger will add a RSS-feed per label in the future, but I can only hope :)

Operator precedence

So what was the curious remark in the last blog? What is the interesting functionality that I have made? The title gives the area of the functionality away, 'Operator precedence'.

The problem with PHP, and most grammars in general, is that the operator precedence is usually ill-documented. There is some documentation in a table, but the real reference is the implementation itself. Martin has a really nice idea about how you can check two grammars on having the same operator precedence, simple yet elegant.
If you take two definitions of a grammar, for example one defined in YACC and one defined in SDF, you can extract what is allowed and what not. The process of extracting the precedence rules that encode the behavior from these formalisms is written down in the paper that is produced, but I am not sure whether or not this is put on the web before we know if it is accepted to LDTA 2007. I do not even know if I can explain this very clearly in one post. So I can not provide a link to the paper, but this might be an interesting subject to blog about for Martin. (Yes, this is a hint!)
After extracting the precedence-rules you have to rename constructs and filter extensions from the rules. Finding the exact rewrites that are necessary is easy if you first extract the production-rules that are possible. This reduces the set that you are looking at from 3000-5000 to 30-50 rules, which is much easier to examine. After the rewriting you can compare the two sets of precedence-rules by a simple diff, a built-in strategy.

So what did I do exactly? Well, I worked on the actual tools in order to get some result, and we definitely did! We found several (precedence) problems in C-Transformers and the SDF-library, which both target C. This demonstrates the great power of the tools, because I am not that familiar with C myself!

We also found some precedence problems in the PHC, which were fixed very shortly after the report was send in. I was aware of the fact that there are some problems in PHP-front regarding operator precedence. In fact, it was one of the reasons for making these tools. The number of about 400 precedence problems was a still bit overwhelming at first. But most of these problems are due to the same operators and just produce about 49 warnings because all of the other operators report an error on it. We still have a lot of work to do, but we now have tool-support for checking the precedence rules!

Back from Dublin

The trip to Dublin can only be subscribed as really great. My dad and I had a great time and I can really recommend it to everyone to visit this divers city.

We took an Aer Lingus flight from Amsterdam to Dublin and where only 30 minutes late, but this is something you get used to when you use the public transportation in Holland. The cab that we took drove straight through Dublin, the city is really huge! Traffic all over the place (on the wrong side of the road), almost no bikers and crowded streets. After we checked in we spend the afternoon walking around the main streets of Dublin and I went to Trinity to meet Edsko.

There seemed to be some confusion about the time of the meeting, but eventually Edsko and John showed up so we could get something to eat at the Mona Lisa. The food and the conversation was really nice. We talked about parsing PHP, the differences between the internal representations and living in Dublin. We concluded that the projects are not really compatible because the internal representations are really different. These internal representations of the PHP-sources can be made compatible from PHP-Sat to PHC, but not the other way around. The goals of the projects are also quit different, but we could probably learn something from each other. Although this conclusion is a bit of a disappointment I had a really nice time. Edsko also showed me some of the inside of Trinity College, which is definitely worth a visit. Thank you Edsko and John for a fun evening, I hope we meet again sometime.

The visit to Google was on Tuesday at 12 o'clock, so we still had to fill the morning with something useful. We visited to the national gallery which had an interesting exposition about the Irish culture in the last 200 years. They also have a (very large) collection of other paintings which were less interesting to me, but my dad seemed to like them.

And then it was finally time to go to the Google Office. Leslie could not make it because she was sick, but Rob Holland was kind enough to take over the coordination. We started the visit in the game-room which is filled with video-games, a snooker-table and a massage chair. We left our bags there and Rob showed us around the floor and the different teams. Everybody seemed to be busy, but they also took the time to say 'hi'.
After the tour we went downstairs to the restaurant. There was plenty of food (all free), drinks (all free) and ice-cream (again all free). The conversation with the engineers during lunch was very interesting. They have done some fun (and dangerous) experiments with various (expensive) toys, but they also work very hard.
We finished the visit with the lightning presentations of our projects. The topics of the projects were pretty far apart, but it is good to broaden your horizon.
The visit confirmed my expectations that working at Google is pretty cool, but you will still have to work hard. This is not so bad because the people seem to be very nice and intelligent and the atmosphere is great. Thank you Google for making this visit possible!

We flew back on Wednesday after visiting the Guinness-brewery, you just have to visit this brewery when you are in Dublin.

The overall conclusion is that the trip was fun, exciting and really interesting. It is hard to describe everything in words, but it was definitely cool!

But onto the next challenge. After spending about one hour figuring out all the dependencies for some yacc-converting tools I am going to hack some interesting functionality together. Stay tuned for more information about this incomprehensible remark.

SUD 2007 roundup

Warning: rather long story with a lot of my own opinions up ahead!

I attended my first Stratego User Days this week without really knowing what to expect. I had seen the titles of the talks, but some of them still made me wonder about what was going to be presented. So I took the train to Delft with sleepy eyes and a blank mind.

If you think that the SUD is like a conference then you are wrong, it is more of an informal gathering of people that use Stratego. They explain to each other what they do, how they handle problems and what they would like to see in the next release of Stratego. At first this gave me the idea that the room would be filled with _all_ of the users of Stratego, which was probably true for some of the earliest SUD's, but the first presentation of the day already proved that this idea was wrong. Martin Bravenboer started the SUD with a presentation about the current status of Stratego. He did not only explain how hard they worked on the 0.17 release, but also showed a list of papers and projects, including one complete slide about PHP-SAT, that use Stratego. Some of the people on this list use Stratego without any help from 'the core people', Martin and Eelco, so this probably shows that Stratego is catching on. Martin also mentioned that they are several (Phd) positions to fill, so if anybody is interested they should contact him.

Eelco Visser gave a presentation about the new compilation scheme of Stratego.The presentation was a bit too technical for me, but it showed some nice goals and resulted in a (short) discussion. The discussion ended at the moment that Eelco gave the right example by getting himself some coffee.

Martin continued after the break with a presentation about the new library structure of Stratego. He explained why almost all of the functionality is moved to the library in order to target the portability problem. The funny thing is that some of the presentations that where about to come would complain about this problem, which shows that the needs from the community are actually being fulfilled. I liked the fact that Martin gave lot's of examples that used php-front to illustrate the new features, always nice to see your own stuff used.

We had lunch a little late because we already hopelessly behind schedule, but it was still very nice. I have to say that I like the cafeteria in Delft, the money/food ratio is pretty good. The presentation of PHP-SAT started after the lunch and went very well. The people looked interested even after the moment that the laptop shut himself down because the battery was empty. Karl Trygve Kalleberg mentioned that IBM also had a project about static analysis of source code called Wala, but he did not remember whether it supported PHP. So I tried to find the PHP-part, but I could not find it so it is probably not supported.

Benoit Sigoure talked about his project in which he extended PRISM to deal with real life problems. I always enjoy it when somebody talks about projects that are really useful to them. He also mentioned some problems that he encountered and gave a 'wish list' of items he wanted to see in Stratego. I have experienced most of the problems he has and I totally agree with the fact that he wants some more static checking. One of the other wishes was a debugger and I would like to put this on top of the list. Being able to step through my Stratego program is something that would help me a lot, and others as well.

During the coffee break that followed the presentation of Benoit one of the girls in the room next-door asked me what kind of meeting we had. I explained the concepts of the SUD to here and she replayed with the phrase: ..I already thought that it was something with programming, you are all wearing those nerdy-code-t-shirts.., thank you very much indeed.

After I had probably been insulted by the girl, Mikal Ziane and Nicolas Pierron talked about Lutin. They use java-front and dryad to do some kind of code refactoring, but it wasn't completely clear to me. I think I might have understand things better if we weren't interrupted by the fire-alarm. It gave us a change to see some of the campus and the other people in the building, but it didn't help us to keep up with the schedule.

The presentation of Wouter Caarls about embedding Stratego in C showed another thing that can be done with Stratego. The techniques he used where not very complicated, most of them where also covered in program transformation course in Utrecht, but the combination was interesting.

Valentin David his presentation ended the day by giving an overview of the current C++-front-ends. This presentation was not very interesting to me because I (currently) do not use C++, but I can at least find it again when I need to. But during the talk he mentioned semantic designs, which offers support for analysis. They also have a front-end for PHP, so I probably should take the time to take a look at this.

The second day of the SUD started with a presentation about a system that comparable with Stratego, but written in Java. The system is called TOM and it has some very nice properties. They borrowed some features of Stratego and I hope we also will borrow some of their features. The small features, like matching on a sort without specifying the number of children or the not-match, are most likely not difficult to implement but useful additions to Stratego. They also showed an eclipse-plugin for their project and a graphical debugger, great things to have and very useful. Another project that is added to the list of things to checkout.

Another connection between Stratego and Java was presented by Karl Trygve Kalleberg. He showed the Spoofax project, which also holds a plugin for Eclipse. I haven't really thought about looking at this for the syntax highlighter for Context, but it might be a good idea to check how he did this.

Bernd Fischer gave a presentation about what he wanted to get from the Stratego community. Some of the ideas could also be useful for other Stratego developers, but most of his wishes could probably be solved by implementing a separate library for ACI1-terms. Some of the problems that he mentioned are actually handled by MathPert, so he might want to take a look at it.

The talk of Alexandre Borghi about vectorization was a bit to technical for me. I think I understand why you want it, but could not figure out completely how everything worked. I do not have a problem with this, I do not intend to use it in the near future, but other people will certainly find it interesting.

A presentation that was really interesting for myself was the presentation of Bogdan Dumitriu. He did his master thesis on improving support for data-flow transformations for Object-Oriented programs. Many ideas from his thesis and his talk are very useful for me and the PHP-Sat project, so I am definitely going to read the complete thesis. His support for break- and continue-constructs can easily be added to PHP-Sat and his ideas about customized transformations are definitely cool. As soon as a get a copy of the thesis, and some spare time, I will write a blog about these cool subjects.

My own version of the SUD ended with the presentation of Karl Trygve Kalleberg and Valentin David in which they present some ideas for extending Stratego. Some of the extensions, like an attribute grammar system, are already implemented in Transformers and look interesting. Other proposal are still in the 'this-would-be-a-nice-idea'-stage, so we will have to wait and see what the future brings us.

To conclude this roundup I wanted to say that the whole experience was very cool. It was nice to see what other people are doing with Stratego and was a good opportunity to do some feature-requests. I am already looking forward to the SUD of next year.

The SUD also gave raise to a great quote coming from Pierre-Ettienne Moreau which I probably will going to use more often. The quote displays a great sense of a pragmatic attitude which really appeals to me. During the presentation about TOM Pierre-Ettienne explained some side-effect that could occur during the application of a strategy. After someone asked him whether it was pure he replied:

..it is not pure, but it is practical.

Interesting things

I noticed that it was already a week ago since I wrote my last blog, time flies when one is having fun. So this week should go really fast because all sorts of interesting things are going to happen!

The first interesting event is going to be the Stratego User Days where I will be presenting PHP-Sat. This presentation will be a bit more technical then my STC-presentation, but it will also contain the more theoretical stuff (I think).

The other even that is interesting, certainly for me, is that I am going to Dublin next Monday. I will be staying there for three days and I am going to do at least two interesting things there. I am part of the GSoc-group that is going to visit the Google Office. Someone else who was going to come was Paul Biggar, one of the people behind the Phc. It is to bad that I cannot meet him to talk about some of the problems/choices/gizmos involved in parsing and transforming PHP-code. But I will be meeting the other two authors of Phc. I am really looking forward to this meeting, it's going to be very interesting.

I promise that I will write about all of these events, so don't worry about missing all this interesting stuff. I am even considering to subscribe to something like Flickr so that I can share some pictures with you, although I think that the SUD will be covered by Eelco Visser.

For those people that want to read about something that actually happened already I have a little Doh-anecdote. I spend a day working on PHP-Sat together with Martin to fix the last DoubleQuoted string-ambiguities. The problem was that there where some literals that kept breaking apart within strings that where used for regular expressions. The reason for this was a missing follow restriction, so we fixed it and where very happy with ourselves. So I took another look at the problem yesterday and noticed that there where still more situations that showed this behavior. After about an hour I realized that we already fixed this problem for HereDoc by writing out the allowed order of literals and escapes. But I can not remember why we didn't do this for DoubleQuoted strings, isn't that interesting?

Who's idea is it anyway?

I started with the research for my thesis proposal this week. For those who
wonder about the subject: please read on. For everybody else: skip to the next paragraph. My master thesis will explore an idea of Johan Jeuring
and Harrie Passier. The eventual goal is to provide feedback in educational tools (well actually educational tools that allow you to rewrite a certain input step by step to an answer), as if the feedback comes from a teacher. One of the hurdles that has to be taken is the guessing of which step the student wanted to take when a faulty answer is inserted in the program.
The idea for guessing this step is that you take the last correct tree and then generates all the trees that can be obtained by rewrite-rules that are allowed. You then calculate the difference between these trees and the faulty answer to find out which tree is nearest to the faulty answer. The rewrite-rule used to get this nearest tree is probably the rule that the student wanted to apply.
This simple and short explanation does not expose all the problems very well, but I will probably come back to that in a later blog. For more information about the idea you can take a look at this paper.

Last Monday I was in Leusden to meet the people from 'TeamInternet' again. Apart from going through some new code of the scoutshop, I also showed PHP-Sat to the people there. Bjarni van Berkum came up with a new idea for a bugpattern and this resulted in another idea. He told me that he was interested in the results of PHP-Sat on the code base because it would probably find the first pattern a few times. I told him that it shouldn't be hard to implement, but it turns out that this is not completely true.
I was right about the fact that it is easy to implement, for normal and static function calls that is. The retrieving of these functions are pretty straightforward because they are directly available in the environment. The problems lays in the fact that most of the codebase of TI is ObjectOriented, which means that there are lots of objects passed to lots of functions. But we currently do not descent into functions, nor do we know about objects. This poses somewhat of a problem if you want to find out which objects are used within functions. So unfortunately, there is a lot of work to be done in order to support this.

But there where more ideas that came from people at TI. Frits Zwegers talked about an idea for a tool that can give an overview of which files are needed for a given file. This can be done by collecting all directly included files and all the files that declare a class or a function that is used within the file. The implementation of this tool is also delayed because of the problem mentioned above, but I am pretty sure that it will be available some day :)

The people mentioned above already came up with great ideas, so if you also have a great idea: share it!

The first presentation

The last three days where filled with only 1 thought:
                  Prepare Presentation

Every student that follows the master program at the Center For Software Technology has to give a talk as part of the Software Technology Colloquium. Students usually pick a topic that is researched by other people and present projects, ideas or tools that are the result of this research.

Since the ideas and concepts of PHP-Sat could be interesting for others as well, and since I already know some things about it, I started working on the slides and preparing the presentation. On Tuesday I practiced the talk and got some pointers on how to improve the structure of the talk. I gave the presentation today and it was actually fun to do. I was a bit nervous before I started, but after a few slides the story just came out very smoothly. So the first presentation of PHP-Sat was a success, on to the next one?

(For those who are interested, the announcement and the slides can be found here)

Something to celebrate

This is the 50th post in this blog, a number to celebrate because it is a nice round number. It is also the first blog-entry that contains some real-world results, another thing to celebrate. It also comes at the time that I am reading the book The best software writing, thank you Michiel, which might help me to improve my blog-entries. Something that is definitely worth celebrating.

So let's look at some results. I found out that there 91 detections of pattern O000 in the PhpDocumentor-project. Since this suggests that you should always pre-calculate the size of an array I decided to do a little test. I first used the original sources of PhpDocumentor to extract the documentation of 'phorum', 'phpBB2' and 'phpMyAdmin'. After this I modified all the for-loops of PhpDocumentor that where flagged by PHP-Sat and extracted the documentation again.
The results where rather disappointing, maybe a single second somewhere, but nothing really substantial. When I showed this my girlfriend I realized that none of the projects have phpdoc-comments. So I found out that this optimization does not matter for projects that you would not process. Anyone else want to state the obvious?
So I repeated the experiment with the sources of PhpDocumenter itself, which are well-documented. The results for these sources are less disappointing. It turns out that pre-computing the size of the array saves 6 seconds in the frames/default configuration, and up to 9 seconds in the DOM/default configuration. These number are not very large, but it seems that this simple optimization really saves some time. If you want to know more about optimizations that are more useful, please click here

Optimization is nice, but has been discussed a lot and not very original anymore. So let's look at some results that indicate logical errors. A pattern that is not PHP-specific is C006, which is an ignored return statement. I have implemented this pattern because OWASP has an article about it. When I inspected the results I learned that Pear actually has functions that will always return 'true'. This situation is probably worth a pattern, but it also generates extra detections because the result of these functions are ignored.
But there also was an ignore that was not legit and this resulted in the first bug-report coming from PHP-Sat. Let's hope that it will get accepted and is fixed in the near future.

When I looked at the results from the C006 pattern I also read the comments in some of the files in Pear. The following piece of comment caught my eye:

.. constructor
@param ....
@param ....
@return mixed True on success else PEAR error class.

How can a constructor return either True or an other object? A constructor is supposed to return the newly created object, so a return statement in a constructor is useless. (PHP4 allows you to return something else from a constructor by assigning a value to the '$this'-variable, but this is not compatible with PHP5 or good practice.) I have also found a thread that shows that these patterns should be checked.
So I added the patterns C007 and C008 to PHP-Sat. I could not find any assignments to '$this' in the stable packages of Pear, but I found 13(!) occurrences of a return value within a constructor. Good luck for me, bad luck for the package maintainers?

A Doh-situation

I had a real 'Doh'-situation yesterday, it happened when I was trying out php-sat on several PHP-projects. I began with trying php-sat on the sources of Phorum, phpBB, phpMyAdmin and Joomla.

The results for these projects are very nice, considering the current status of the tool. I was surprised to see that the pattern C003, which is very simple and even documented in the PHP-documentation, was actually flagged in both phpMyAdmin and Joomla. The functions to blame can be found in Table.class.php (generateAlter and generateFieldSpec) from the phpMyAdmin-source, and database.php (database) from the Joomla-source.

A pattern that was found quiet often in phpMyAdmin is pattern O001.
I have done some simple tests and it turns out that passing multiple parameters is roughly twice as fast as using concatenation. This blog-entry confirms this by explaining what happens behind the scenes in both cases.
So the pattern describes a possible optimization that is not very useful if this patterns occurs only once. But if it happens in 32 cases in a single source file that is called common.lib.php, which is probably used commonly in the project, then it might be useful to take a look at it.

But back to the 'Doh'-situation. After trying php-sat in the projects I wanted to run it on a large php-codebase. A great resource of PHP-code can be found in the PEAR repository. The PHP-code for PEAR itself is about 30000 LOC so I gave this to php-sat first. I was not surprised that it took a while because the code is pretty large. But I got a little suspicious after about 10 minutes of just seeing files being parsed. So I put some more log-messages in the php-front-library and tried again. It turned out that php-sat was including 3 files in a cycle, but the includes happened because of a 'require_once'. I was pretty sure that I had fixed this problem by checking whether a file was previously included, and this was indeed the case. The only problem was that I put the files that where included in the environment after going into a file. The solution is pretty obvious, but such situations just call for a slap on your own forehead.

Propagation

So this week was still filled with constant propagation techniques, but also with software metrics. I have given a presentation about functional software metrics which was based on this PhD thesis. It is very hard to find papers about software metrics for functional languages, but it is even harder to find them for scripting languages! Please let me know if you know a website that covers metrics specifically for scripting languages, I don't expect there to be any papers about this subject.

I am looking for software metrics for scripting languages because I think that metrics can help in finding the weak spots in a program, but there has to be evidence to support this. Validation of software metrics is rather time-consuming, so hopefully done by somebody else :) It shouldn't be hard to make a tool based on php-front to generate a rapport with software metrics about your program, any volunteers?

But back to the constant propagation. I have added support for the propagation within loop-constructs, if-statements and switch-statements. This was not very hard because I could use the techniques described in this paper. The construct that was not discusses in this paper, or on any slide of the pt course (which is not given anymore, unfortunately), is the if-elseif-else construct. It is not very hard to come up with the (hopefully correct) semantics for this, but I could be wrong of course. I will explain my interpretation in the next section, but it might be hard to follow. Feel free to ask for a better explanation!

The if-elseif-else construct is evaluated as a list of list of statements. The first statement is used as an unit-element in a foldr over this list. This foldr applies a given strategy to every element, which is to be expected. But it also performs an split-and-intersect of the dynamic-rule-set between every two elements. The result of this is that the rule-set will only contain the dynamic rules that are stable in all branches of the if-elseif-else statement. The implementation of the switch-construct was pretty easy after I made the infra-structure for this "split-and-intersect"-foldr.

So the constant propagation is coming along very fine and is also added to the php-sat-tool. The new flag --complex-inclusion triggers the tool to perform constant-propagation and include the files that can be included in that way. The constant propagation is not completely finished, to consider this an 'alpha-feature in this 'beta'-tool.

Not the only one

It is always nice to know that you belong to a group. I discovered this week that I am part of a large group of people that still works on their SoC project. It turns out that many people still try to work on their project, but at a slower pace. I am one of those people :) I am also one of the people that can walk around in a Summer of Code 2006 t-shirt (front,back), it arrived this week. But enough about bragging, let's talk about PHP-Sat again..

You will probably know already that Google has released Code Search, a search engine specifically for code. It turns out that it can be used to find vulnerabilities in PHP sources. The regexp for XSS due to echoing raw input, which is heavenly quoted on the Internet, gives around 7000 results. There are off course false positives in these results, but many of the results would be 'useful'. Could there be a better way to show that PHP-Sat is needed?

So I have been working on PHP-Sat and, more specifically, on the completion of the constant-propagation. I have rearranged the tests for the simple-evaluation. There used to be one big testsuite with over 450 tests on almost 3000 LOC. There are now separate testsuites for expressions, operators, primitives and so on.

Functionality is added in the sense that constants can now be defined and actually used. I have also begun with the implementation of references. The manual says that references are just pointers to the same memory location. I have mimicked this behavior by adding an extra step between a variable and its value. Every variable is rewritten to a 'variable-identifier', this 'variable-identifier' is rewritten to the actual value. Introducing a reference is now a simply introducing a rewrite from a variable to a 'variable-identifier'.

Oh, and I have also updated the implementation of pattern C002 to handle static function calls. Mmmm ... I should really put more time into those patterns.

New and Improved inclusion

This week was not only filled with going to the university and playing around with PHP-SAT, it was also filled with (a lot of) preparation for the Flitsjacht, a game for scouts in our region. Organizing a game that is played by over 150 people in 11 teams in an area of about 10 square kilometers takes a lot of time, but the photo's are priceless.

Organizing this game does not mean that there is no progress on PHP-SAT, revision 256 holds a few new things. It includes some new tests, a few extra lines of source code and a lot of new functionality! The basis for the '--complex-inclusion'-flag is finished with the evaluation of the internal functions 'include','include_once', 'require' and 'require_once'.

The implementation of this evaluation was not that easy because of a small detail in the strategy 'find-in-path'. This strategy takes a list of paths and tries to find a file in those paths with a given file name. It does this by checking the existence of each 'path + file name'. This would always return the full 'path + file name', the next paragraph explains why this is useful, and worked like a charm. Until I tried it with a file name that was available at the current-directory! I did not get back the absolute path, but the original given file name. It took me a while to figure out that the strategy first checks for a file with only the given file name without taking the paths into account. I solved this by just looping over the list with directories and now everything is fine.

But why do we need the absolute path to include files? This makes it a lot easier to implement the logic of the 'include_once' and 'require_once' functions. Before we include a file we check if the file was already included by using the absolute path. The absolute path is necessary to prevent false positives in this case. False negatives are prevented by normalizing every path after a file is found. This normalization removes all '.' and '..' steps in a path and returns the direct and absolute path.

Next step: evaluate some more internal functions, especially the 'define' function that defines constants.

I love my test suite

When I started to implement the evaluation of control-statements for the constant propagation I realized that I made a wrong decision. This occurred to me when I implemented the control-flow with a single if statement. One of the side-effects of such a statement is that it should undefine variables that are changed within the body when the condition can not be evaluated.
For example:
<?php
$foo = 'bar';
if($bar == 1){
$foo = 'foo';
}
//Value of $foo is unknown
?>
Because we do not know the value of $bar we do not know the value of $foo after the condition. So the dynamic rule that maps $foo to the string-value "bar" should be undefined. Stratego has special syntax for these kind of situations and this makes the implementation very easy.

When I wanted to do this for arrays I realized that arrays where implemented as a key-value mapping in a hashtable. This is correct, useful and it works, but undefining mappings takes a lot of work because the special notation would not work with this implementation. So I had to change this implementation and use dynamic rules.

My test suite really helped me in the refactoring of this implementation. It showed which part where updated correctly and I could use the tests to analyze why the rest still failed. It also made me aware of the fact the when a dynamic rule is undefined the key still exists. The application to this undefined key will always fail, as expected, but the key still remains in the set of all keys of a dynamic rule. The test suite also identified some hidden assumptions in the old implementation, which where immediately removed.

All this information would not have been available to me without the test suite. To quote a large company: "I'm loving it!"

And the answer is...

Did you have enjoy thinking about the problem? I definitely had fun during this weekend, pictures will be put on the website soon.

But back to the question off last friday, what will be the result? It would not have surprised me if PHP would raise a NOTICE, but this is not the case. It might have surprised me if PHP used the last statement in the definition as the result, but this is also not the case. The variable $result is just null, not initialized, nothing happens! This is probably not the intention of a programmer, either the function is misspelled or the function is not correctly implemented. So this pattern is added to the correctness-category and has id C005.

I have also worked on the constant-propagation again. There is now support for superglobals and the list-statement. The support for superglobals also implies that every variable is also accessible through the $GLOBALS array.

Getting through the weekend

My upcoming weekend will be filled with a lot of fun and excitement, I have another World Jamboree troop-weekend. Check out this site to see the other people of my troop.

For the people that can not spend a weekend without wondering about PHP I have a little exercise. What is the output of the following piece of code:
<?php 
function foo($param1, $param2){
$param1 + $param2;
}

$result = foo(1,2);

echo $result;

?>

Will it be an error, the result of the function or nothing at all? Don't spoil the fun by running it in PHP without thinking about it! The answer was a little surprising for me.

I proudly present: the logo

The PHP-Sat logo is made by Robert van Geenhuizen.
I am very grateful that he took the time to develop this logo. I think it is 'compleet hip', which means so much as 'very trendy'.

Robert made this logo with a concept in mind. The following quote describes this concept:

The creation of the PHP-Sat logo is based
upon the following:

Develop a conceptually strong logo that
uses modern illustration techniques to
make a simple, yet strong ideograph.
This results in a "Bug" stopped by a
imaginary (debug)-filter, the conceptual
base principal behind PHP-Sat.

The logo consists of many complex items
with gradient mesh and three tints, but
together they still form the basis for
this strong and modern ideograph.


The statement is very hard to translate, so this is the statement in Dutch. If anyone can provide me with a better translation, please let me know.

Bij de creatie van het PHP-SAT logo is
uitgegaan van het volgende:

Een conceptueel sterk logo neerzetten
dat door middel van moderne
illustratietechnieken een modern maar
toch simpel en sterk beeldmerk is.
Dit resulteert in een Bug die door een
denkbeeldig (debug)-filter vliegt, het
conceptuele basisprincipe achter PHP-SAT.

Het logo bestaat uit veel complexe items
met verloopnetten en 3 kleurtinten, maar
toch vormen zij samen de basis voor dit
sterke en aanwezige, moderne beeldmerk.


Here it comes:

php-sat logo

What is going on?

If you have been wondering about what is going on, here is the answer. I have attended a conference of the NLUUG last Thursday. I actually worked there to check badges, but I could also attend the talks that where given. The main reason for being at the conference was the change to talk to people that would probably be interested in PHP-sat. I gave a demo to some of the partners of madison-gurkha, which went well I think. They where introduced to me by Armijn Hemel, who has a brother (Tim) that works there. I already presented the tool to Tim last Monday and he seemed to like it too. He is even learning Stratego to be able to adapt the tool.

Armijn also helped me out with some typo's in the documentation and with testing the tool. He found some things that where not support by the SDF. Some of them where rather easy to fix, others require an update of Stratego or a post-processor for the parsing.

If you have checked the issue tracker recently you might have noticed things are a bit more organized now. I have added some milestones to be able to plan more. So you can check out the roadmap to see what is going on.

Another thing that is going on is the creation of a paper. I have to write a paper for the STC and Martin suggested that it would be nice to write for a real conference as well. I will try to get it published so that the project will get an even more solid foundation. But I will certainly talk about the project at the STC and maybe also at the Stratego User Days.

To conclude this entry with a cliffhanger, the logo for the project should be ready very soon!

Anything new?

Sorry for the late update, college is really getting started so I have to get used to going to lectures and meetings again. But my php-*-projects are still going strong. The evaluation strategy can handle construction of, assigning to and reading entries from arrays now. An array in the interpreter is modeled as a map which maps keys to values. Some useful strategies have been implemented to easily add and retrieve values from the arrays. The documentation about arrays tells us that PHP also sees an array as a map, so the implementation of the semantics was not that hard. The only thing that it does not support (yet) is references. These will be added later.

The second thing that has been added is the fix for the long lasting psat-2 bug-entry. This states that HereDoc should be parsed as a sequence of literals and escapes, just like double-quoted strings. This turns out to be very tricky. Double-quoted strings have a very straight-forward start and end-point, HereDoc does not have this. The end of a HereDoc is marked with a newline-label-newline sequence. So the newline was added as an escape in a literal, we have to treat this newline in a special way if it is followed by a label. The problem that arises here is that we cannot express this within SDF. This should be written down as a follow-restriction on the newline, but then we would have to know the length of the label. This length is not fixed, so we can not do this. By not setting the follow restriction a literal became ambiguous when there was a variable in front of it. The solution here was to write out the definition of 'a list of literals or escapes'. This definition was extended by 'where there are no two literals in a row'. This solved all the problems for that moment and everybody was happy!
This happiness lasted until I tried to parse a file with a statement after the HereDoc, this failed. So a statement after a HereDoc caused the parser to keep parsing until the end of the file, something that we definitely do not want. So the problem here was that the end of a HereDoc was not strict enough. I tried several things, but nothing seemed to work. The optional semicolon was causing big problems. It turns out that this optional semicolon was a misinterpretation of the documentation. The documentation mentions this semicolon and I thought that it was part of the end of the HereDoc. But this semicolon is only allowed because the HereDoc is probably part of a bigger expression that needs to be closed by the semicolon. So I simplified the HereDoc-end and this allowed a nice follow-restriction. All the problems are over and everybody is happy again :)
So there is now a more expressive support for HereDoc, with a restriction. This was already true with the first implementation, but never really mentioned. The problem is that one can not express the fact that the labels of the HereDoc should match within SDF. So HereDoc is parsed from the first open label until the last closing label, even if there are statements in between. This restriction is unfortunate, but reality for all parsers that are based on SDF. It is simply not possible to solve this within the syntax definition. It can be solved by adding a post-processing step, so an issue for this is already created.

The finish this entry with some good news, there are windows-builds for both php-front and php-sat! It took us some time and some hackery implementation of stratego-routines to get everything right, but we succeeded. The hackery stuff will be moved to the stratego libraries as soon as possible. I have tested the build on my own machine and it seems to work fine. Please try it out if you have a windows machine that you can (ab)use. Windows-builds for stratego programs are relatively new, so there could be some hidden problems.

Finding differences

Here is another example that shows that testing is useful. I had implemented the type-juggling from Integer- to String-value and written some tests for this. All the tests succeeded on my machine so I could commit. I was surprised when I got a mail about a failing build. If you look at the page then you will notice that the check passed for i686-linux, but failed for i686-darwin. The tests that fail have in common that they handle a non-empty string that does not start with a number. Thus my code applied string-to-int to an empty string. This strategy is implemented in C and calls strtol. It turns out that the result of this function is not the same for both platforms if the input is an empty-string. I am not a C-expert, but i686-darwin seems to give an error and i686-linux does not do this. The problem is solved and tests are added to the Stratego-libraries for this. But it might be useful to know for others.

These implementation details might be interesting for some, others prefer an update. So what has been added to PHP-SAT? It is now possible to include files in a simple way. This means that all files are included without looking at the context. They can only be included if the file names are coded in Strings, so no concatenation. They should also be on the include-path. But this is the same for PHP.

It is also possible to print the included files. All files are printed to a file with the same file name and a post-fix '.psat'. The bug patterns within PHP-SAT are also applied to the included files, so these are checked automatically.

I am currently improving the simple evaluation to be able to handle more complex file inclusion. It should also improve file-inclusion by having the normal semantics for include_- and require_once.

The last cool thing is that almost all calls to the stratego-xtc-library are gone. This means that we could make windows-builds soon!

Thanks and progress

I think that there are many people with great ideas for projects. Most people do not get around to actually starting up these projects. It takes a lot of time which is usually not available. Without the Summer of Code I would not have had the time to start the project, let alone work on it for so many hours. Thank you Google!

The project would not be in his current state without my mentor. He helped me in setting up the development environment and automating the build- and test-process. Our meetings motivated me and helped me in keeping focused. I would recommend all (future) participants of the Summer of Code to meet with his/her mentor face-to-face, or at least in some interactive way. It helped me a lot, thank you Martin!

The following section is taken from my evaluation for the Summer of Code. I think it nicely summarizes the current progress of the project. The project has produced the following two libraries, together with tools to interface with them.

PHP-Front is a library with support for parsing and pretty-printing php, reflection of parsed sources, some generic traversals and a simple evaluation. This part is available as a separate package which provides a solid basis for transforming or inspecting PHP source code. I think that there will be more projects that are going to use this package for this purpose. One of the projects that I am already aware of is StringBorg.

PHP-SAT is the library that actually performs an analysis on the given source code. It tries to detect 7 bug patterns, more will be implemented later. It also check pre-conditions for functions and language construct to detect possible vulnerabilities. This last analysis will be improved over time.

Introducing eval-php

After spending a weekend in the woods, hearing a lot of information about the World Jamboree, I dived into the operators of PHP. Some of them could already be evaluated, but I wanted to add support for other operators as well. I have made a tool to check the evaluation and named it 'eval-php'. This tool is capable of evaluation a subset of PHP. It cannot handle control-flow (yet?), but it can support several operators on the integer, string, null, Boolean and float type. The following list of operators are supported: Small note about the arithmetic operator 'mod'. It has no support for floats until issue STR-630 is fixed.

This also means that the bitwise operators, error control operators and execution operators are not supported at this moment. But as said before, it supports a subset. Supporting the Error Control operator when there are no errors generated is a bit strange anyway.

It is fun to be able to evaluate thing, but it is even more fun to see things that are evaluated. So the tool produces evaluated output. This means that everything that is not within PHP-tags is outputted. There is also support for the 'echo' statement, so this produces output too. I have put a small test-file online here. The output of eval-php is available here, it can be compared with the real output of PHP itself here, take a look at the source of the page.
I think it is all pretty fancy :)

Working on evaluation

I have been working on evaluation the past few days. Not the evaluation of the SoC, this is something I am going to do that next Monday, but that of PHP. The reason for this evaluation can be found in the first blog of Monday.

There is currently support for arithmetic operators on the following types: integer, float, Boolean, null and string. Types can also be juggled to an other type. It took some time to get this juggling right. Especially the juggling from String to numbers. The string is now sort of parsed within Stratego to get the part that can be recognized as a integer / float. But it works :)

Speeding things up

There has been some major performance boost implemented today. The build and tests of the package took about 7 hours in the build farm, which is a lot. Even for Stratego standards and even if it is build on 3 different platforms. The problem was the processing of the test-input. A lot of small inputs are processed by writing it to a file, parsing the file and reading the result. This involves a lot of IO actions. So the inputs are not written to files anymore. This helps a little, but the major boost came with caching the parse tables. The parse-table is now opened once for every process instead of every test / input.

Splitting up

The project holds two libraries in it's source. The first library is php-front and holds a parser, pretty-printer, a reflection part and a set of common strategies. The other library holds the strategies that are used within PSAT. The PSAT-part depends on php-front. The base library can be useful for more libraries and will be branched from psat soon. It will be a separate package called, how surprising, php-front. It will be available here. You can also find a link there to the page that will be the home for php-sat. They will both be filled soon.
Note that the name has slightly changed. This is to prevent confusion with possible tools that will statically analyze Perl, Phyton, Pascal, PL or Prolog. The other reason is that psat.org was already taken. All the names will hopefully be updated as soon as the packages are separated from each other.

Project deadline

The project deadline is today at 17.00 hours, at least here in Holland. This means that the project is due in a few minutes. My mentor needs to fill in the evaluation based on the code that I have now.

This deadline does not mean that the project will stop. I am satisfied with my results up until now, but I will continue to improve the code and the tool. A more elaborate report about the SoC and the progress of the project will probably follow in a few days.

Including files and hidden tasks

File inclusion within PHP can really work on your nerves. I can remember an occasion when it took me about an hour to find out why a file was not included. This was because PHP first looks in the working-directory and then in the current directory. If there is a file with the same name in both directories the first one is included. If I had taken a look at the documentation then this would not have come as a surprise to me.

Way this paragraph about including files? Because this feature is added to php-front. There is now a strategy available that uses the 'require' and 'include' statements within a PHP-file to find and parse other PHP-files. These PHP-files are added to the environment and can be accessed anywhere in the traversal.

The implementation of this inclusion strategy included the modeling of the different paths that PHP used. As mentioned above, PHP uses two different directories to search for files. The working directory, where the process was initiated, and the current directory, where the current file resides. These directories are the same if there is only one file to be processed.

I did not want to mess with the current-directory of the tool when including files, so the paths are kept internally. This requires some string manipulation to make up the right include-path, but it works. By keeping track of the include-path and working-directory within the application there is even support for the functions ini-set and chdir!

All this could not have been done without moving things around and adding some small but useful strategies. This included: a separate section is added to the library that holds common strategies, the parsing is moved to the library and the environment can now store complete AST's. All of these things are small and easily done. But unfortunately, many small things take up a lot of time.

The mechanism of including files can be improved by adding partial evaluation of PHP-code to the library. This will add annotations with known values to terms. By using this, and some constant-propagation, the following will succeed:
<?php
$path = './some/path/';

require_once $path.'filename.php';
?>
This is commonly used within projects so this should be supported. So on to the simple evaluation and constant-propagation!

One week later

I flew back from Spain yesterday evening, no problems with my luggage, and spend this day getting everything organized again. I had planned on going to Lowlands this weekend, but I have to skip it this year. But I will join the party again next year I hope.

But what have we done the past week. I have added a few patterns to the code:
The last one needs some work though, the rest of the function-names should be added to the list.

I have also spend some time on improving the pretty-printer. I have added unit-tests for the different parts (expressions, statements, documents) and made sure that things are consistent. An example of this is that every construct that has a list of expressions is printed in the same way. The expressions in the list are separated with a comma and a space. A small example:
 //used to be:
array($foo,$bar,$fred);

//is now
array($foo, $bar, $fred);

I have also made the strategy that extracts the safety-type build a default value of 'Safe()'. This makes the application more conservative because if it has no knowledge about a variable it will consider it to be safe. This should prevent a lot of false positives.

The last item of interest is a small document with some considerations about constant propagation and including files. The response to the user should show the code that the user has entered. So constant propagation should not be done on the current ATerm, but the information should be added to it. The only thing I could come up with is to add the information in an annotation. But if you have a better idea don't hesitate to comment.

Relaxed working

It is not easy to get some work done while there is a private pool nearby, but I managed to get some things done. I have added the syntactic rules that make sure that an end-tag ends a line-comment. This involved another rule that recognizes nothing as a constructor. The follow-restrictions are really important in this case, but I think I managed to get them right.

I have also added the right syntax for backticks. This syntax is quit similar to the syntax for double-quoted strings. So the productions that make up the double-quoted strings could be reused to make the syntax for backticks.

A small note

I have added two more patterns to the correctness category. They have the codes C002 and C003. The last one is also mentioned in the PHP manual, but it can't hurt to point this out to the programmer. The other thing that I have added is the test-suite-file for the tests of phase 4. This suite contains only two tests for now, but more will be added during my vacation.
"Until next time, take care of yourself and each other."

working on other things

I have done interesting things today, but they did not all involve psat. What I did for psat was the adding of a strategy "analyse-safetylevel" to replace the "annotate-sources" strategy. This strategy is set up in a different way to make it easier to add support for other constructs.

What I have also done for psat is changing the way safety types are combined. My first thought was to combine safety types if they have the same level. Variables can then have the safety types "escaped-html" and "escaped-slashes" at the same time. This is something one would expect.
However, combining safety types of other levels would results in variables that can have both the safety types "formatted-string" and "encoded-string", or even "object-type" and "integer-type"! This is definitely not desirable. So the only types that can be combined from now on are the types that are on the "escaped-something"-level. Writing this makes me realize that I will have to add some more terminology.

You might want to know what other activities I did today. If you are not interested you might want to skip this paragraph. I have had my first experience with compiling and installing PHP on my linux machine, a nice little laptop. Quit nice, but the real cool activity was a pair-programming session with my mentor. The commit-message of the result can be found here. And the module with my name in it is located over here. Always nice to have your name in a software package that you are using a lot.

I have also finished setting up my laptop as a development environment. This had to be done because I will be flying to Spain on Saturday. One and a half week of nice weather, a private pool and a lot of relaxing. I will be working on some things during my vacation, so don't panic :)

straight to phase 4

The tool has upgraded from phase 2 to phase 4. This is the current phase that needs to be implemented. I will give a short sketch of what the tool supports at this moment, besides a hand-full of common patterns.

The tool is configured with a text-file of which some sections where explained yesterday. A default configuration file can be found here. This is definitely not complete, but it already produces useful results.
Two features are added to the configuration file. The first feature makes it possible to define a precondition for the language-constructs 'echo', 'die', 'exit' and 'print'. The syntax for this is:
 construct: construct-name (  precondition  )
The second feature is the possibility to define functions with a default level. This can be done in the '[function result]' section. The functions that are specified there are assumed to always return values with the specified safety-type. This is not limited to build-in functions, user-defined functions can also be assigned a default safety-type result.

But which results are produced? The following example shows two things that are supported right now:
< ?php

echo "hello ", $_GET['name']; //is flagged

print $_POST['param']; //is flagged
?>

Keep in mind that the results depend on the configuration. These results will appear when the default configuration is used. A more precise configuration will give more precise results.

From phase one into phase two

Major leap today. The implementation of phase one is ready. This phase consisted of parsing a configure-file and using this configuration to identify tainted sources. A typical configuration file for tainted sources will look like this:
 [tainted sources]
array: _GET
array: _POST
function: file_get_contents

But then there will be a lot more entries off course.
The configuration file is used because people should be able to easily tweak the tool. This is why I have also included the option to specify a certain level for each source. A little example:
 [tainted sources]
array: _GET level: escaped-slashes,escaped-shell
This means that the variables coming from the $_GET-array return values in which both slashes and shell-characters are escaped. These safety-levels are used in the static analysis.

That was phase one, on to phase two. This is also a configuration file-issue, namely the identification of sensitive sinks. They can be specified in almost the same way as tainted-sources.
[sensitive sinks]
function: foo ( safe )
function: bar ( int-type, esc-h )
This configuration file means that the function "foo" expects one argument and this argument should have the safety-type "safe". Function "bar" has two parameters which should be of "integer-type" and "escaped-html" level. For each parameter the safety-type should be specified. Parameters in which the safety-type does not matter can be assigned the type 'unsafe'.
One can also specify that a certain parameter needs to have one of more types, or it must have multiple types.
[sensitive sinks]
function: foo ( esc-h || esc-sh , esc-h && esc-sh)
This means that the first parameter needs to have either the type escaped-html, or the type escaped-shell. The second parameter needs to have both types, and thus levels, of safety.
There are two things to consider when writing such a configuration file. The first one is that it does not allow functions without parameters. This is not that strange because what is the use of a sink in which nothing can be thrown? The other thing is that the and- and or-operator should be used with care. Since the level that are specified represents the minimal level of safety, the safety-types that are used in operators should be of the same level. The following represents something that must be both safe and unsafe. Which can only be true when the variable given has type 'safe'.
[sensitive sinks]
function: foo ( safe && unsafe )


Tomorrow I will be adding some more configuration details for the sensitive sinks. Some language constructs must be configurable and it does not take very long to add this. I will also be working on checking the safety-types of parameters given to sensitive sinks. So the first useful results should appear soon.

Reflection support

Sometimes you have to reflect on something, and it is now even possible in php-front. The inspection of functions during a traversal was already possible, but the first implementation also retrieved all the functions within classes. Since these functions are not in the global scope this could result in some serious errors. So this is fixed and support is added for classes and interfaces.

To be able to do reflection in a analysis you have to create an environment and initialize it so that it contains all the functions. This is off course captured in some strategies so it is very easy. The only thing that you have to do is make a choice between an environment for PHP4 or PHP5. The grammar is split up in this way and I found it more cleaner to also separate the
environment. This is also use full when the internal functions are added to the environments. PHP5 has a few more of those.

The reflection library is constructed in the same way as the dryad-library. It is actually set-up in a OO-way. So there is a notion of an PHPEnvironment, which is abstract, and a PHPFunction, which is also abstract. Each version has his own kind of environment and function which makes it easier to reason about.

The only problem is the implementation of all this. It requires very careful thinking and precise notation. It is very easy to confuse some strategies and this can result in some long debug sessions. But luckily each part of the reflection is carefully tested, but that is normal right ;)

Results from the meeting

So this afternoon I talked to my mentor. Several things where discussed and all of them where positive. The structure of the code for the project was good and organized. Only some miner points about the formatting of the code. Really not a bad result. He also has a cool idea about the organization of the patterns. We might be able to organize it in such a way that patterns can be plugged in at will, which will really boost the extensibility of the tool. But we will have a look at that later.

Then there was an issue of the type-states. After the response of Christian yesterday I have come up with a set of safety levels. Each variable in the program will be assigned such a safety level. When a computation is made the result will get a level of safety which is normally the minimum of the levels of the variables involved. The sensitive sinks will get a precondition for each parameter. This precondition describes the minimal level of safety needed for that parameter. When a variable does not meet the precondition the function will be flagged by the tool. Simple isn't it :)
Since the preconditions will not be equal for all applications, and some functions can be trusted sometimes, the configuration regarding sensitive sinks and tainted-data sources needs to be configurable. I will be writing a small syntax for three ini-files that allow everyone to tweak the application.

The last thing that we did was making a start with the reflection-library. When traversing the tree of a program it is use full to have access to the functions and classes that are defined. Since some of the functions are defined by the user there should be a way to access the implementation of the functions. This can be done by traversing the tree every time a function is needed. You can imagine that this is expensive. So the trick is to traverse the tree once and build up hash-tables for things that one wants to access frequently.
So it is now possible to get the AST of a function by it's name. So when a function call is made the implementation of the called function can be retrieved on the spot. I will have to add support for classes and some more strategies to get interesting properties of the AST's, but I now know how to do it. A lot of new and fun stuff to do!

Adding patterns

Today I have added the documentation of eleven bug patterns: C001, O000, O001, O002, O003, S001, S002, S003, S004, S005 and S006. You can find the description of each of the patterns here. Remember that each category has it's own prefix, so you will have to look within the directories.

I have also added the tests and implementation for patterns C001, O000 and O001. The patterns that are not implemented yet have [Not implemented] in their description. They will be added (hopefully soon).

Their was also some disappointment today. I received a reply from Christian who I had mailed yesterday. They can not share the research that they have done regarding the list with preconditions and type states. The algorithm has been put in the products of Armorize Technologies. Sharing the crucial part of your products is not a good idea for a commercial company it seems :)

But he gave me some pointers on how to start, so that is what I will be doing tomorrow morning. I will be having a meeting with my mentor about the set-up off the tool, but I think it is okay.

Testing 3 4

I have added support for SUnit tests to the project today. This also includes some tests for the two patterns that where already in the tool. I have also added tests and the implementation for a style-pattern. This one has the code, you can probably guess it, S000.

It took me some time to get all the tests working and the compilation of the tools is not that fast. So I have the feeling that I am not making that much progress the last few days. But the adding of patterns an such should go a lot faster now because there is a method to do it. So I can focus on the implementation of the patterns instead of how to add them.

I have also send an email to the authors of the paper I mentioned yesterday. I hope to hear from them soon because I am reluctant to start the implementation of the phases. They should have a lot of use full information which I can use. I will wait until Friday and otherwise I will just have to start with the phases. So I will be adding patterns to the tool as fast as possible until Friday :)

A long one

After a whole day of doing research I just had to do some coding. So that is what I did today. I have added the basic stuff to add 'bug patterns' to psat. This should not have taken long because the detection of these patterns can be very simple. So I could have done this by simply dumping the rule in a file be done with it. The rule is certainly put into a file, but I have also added a directory structure within the library to make sure that the patterns stay organised.

As my mentor mentioned, the patterns should be organised. It is natural to put each pattern into a certain category. The following categories seem to be use full to start with:
  1. Style
  2. Correctness
  3. Performance
  4. Malicious code vulnerability
  5. Information leak
The first three are straightforward. The fourth category is mostly dedicated to the initial goal of PSAT. The last category will hold bug patterns that can expose data which is not intended for normal users and should not be shown in a production environment. Each category will get his own directory and each pattern will get his own file. Each pattern will also get an unique code which holds the category and the follow number.

The first bug pattern to be in PSAT is using the If-construct to check the existence of a variable. This pattern falls into the category of Correctness and therefore has the code C000. I know it is a bit optimistic, reserving space for 999 bug patterns in one category, but you can never be sure :)

I also wanted to write something about the categories of the functions. While I was reviewing them I thought about how different functions need different escapes. A query can hold HMTL-characters but needs to have escapes for all the quotes to prevent SQL-injection. Data that is send to the user should escape slashes, but also HTML-tags. So I will have to go over the collected functions and give each sensitive sink a certain level of safety needed. This is the same approach as the one taken in [1] and it seems to work there. They have implemented this in a tool called WebSSARI, but the site of this tool seems to be deleted. So I will try to get in contact with the authors of the tool, but it probably means a few hours of research.

1: Yao-Wen Huang, Fang Yu, Christian Hang, Chung-Hung Tsai, D. T. Lee, and Sy-Yen Kuo. Verifying web applications using bounded model checking. In DSN, 2004.

Doing research

I have to do some research before I can start with the different phases of PSAT. Some of that research is relaxing and fun to do, an other part is quit annoying. Let's start with the first part. Since I want to add the possibility of providing feedback about code smells I have to find code smells to give feedback about. Besides the topic on GoT I start reading up on my subscription to PHPArchitect. I still had to read four issues, now two. You might wonder why this is fun to do. Well, I have a balcony with just enough space for a hammock and I have printed the issues. You can figure out the rest yourself.

But there was also a more annoying part. I have to divide the internal functions of PHP into three major categories:
  1. Functions that can return tainted data
  2. Functions that can untaint data
  3. Functions that are sensitive sinks
See the Terminology section of the Phase description for some explanation about the terms.
When I was going over the list I also made a list of functions of which the information should not go to the user. Such as functions that retrieve all kind of information about the system.

It is interesting to check out which functions PHP has, but it becomes less interesting when there are over 3500(!) internal functions. So I spend my day was with reading function-descriptions, but I have at least seen all functions. I will explain some more things about the categories tomorrow. I will also add some stuff to the tool that will make it actually use full :)

And what have we done today?

This morning I worked on the actual psat-tool. It parses a file (using parse-php), applies a strategy to the ATerm and pretty-prints it (using pp-php). The strategy it now applies is:
strategies
dangerous-variables = topdown(try(matchthing))

rules
matchthing: Simple(_) -> Simple("foo")

So what does this do? It matches on a Simple-node and gives it the name 'foo'. This means that every variable ('$name') is renamed to '$foo'. Not really use full, actually quit bad, but it shows the power of what can be done. Two actual lines of code and all variables are renamed.

This rule will be deleted in the next commit because it is not use full. But what is use full then? I have asked the people at GoT and gotten some reactions. Still hoping for more feedback off course.

The other thing I worked on today was the specification of the phases. I had a plan two posts back but I rewrote it a bit after some feedback of my mentor. The result can be read here.

The last thing I did today, besides writing this entry, was making the pretty-printer simpler. I had a strategy that rewrote a list of statements, but the framework could do this for me. The cleaned up about 30 lines of code, which is always nice :)

Adjusting the plan

Today was filled with some bug fixing and new ideas. I have fixed a few small things in the pretty-printer that caused the failure of test-cases test-cases. I admit, I was a bit optimistic yesterday. The results of the round-trip tests is looking very cool now, if you want to see a lot of sunglasses I suggest you take a look.

The new idea came from martin in his reply to the phases-description (which I will rewrite/improve tomorrow). The idea was to add other indicators for bugs to psat. Just simple things that can help developers to improve their code. One example of this is the usage of a variable that is not declared. This can great problems if register globals is on. Another example comes from the PEAR-coding standard for including files. Adding these use full remarks would turn PSAT into a PHP-equivalent of FindBugs, which has a long list with descriptions. Some of them also use full for PHP.

But I could not find lists of specific code smells for PHP. There are some general code smells that can be used, but I think there are more things specifically for PHP. So if you have any ideas about what a PHP-programmer should not do, or tricks that can improve the performance of PHP-scripts, please let me know.

Pretty problems

Today was a wonderful and, for Dutch standards very, hot day. The fan in my room was pointing at my computer instead of me to keep the temperature of the processor at a decent level. So what has been done in this heat for psat? The pretty printer is somewhat finished! If you take a look at the parse-results you will see a lot of cool smilies. There are just some minor problems to be solved. A trailing slash at the end of inlineHtml and some diff that do not go as planned. Some of the constructors could also be printed prettier, but this is something that can always be done.

I decided to split the pretty-printer into three different tools. Just as the parser is, so one where you can specify the release and two tools for the specific releases. Consistency is always nice. I have also added constructors and pretty-print rules for the alternative syntax. The reason for this is that the same constructor could have children of different formats. This introduces if-then-clauses in the rewrite rules which are not really intuitive. With specific constructors I could also print the alternative syntax in the same way.

The next phase(s)

The round trip has been added to the test-script. The new smileys should be popping up soon in the parse-results page.

Since the pretty printing is coming along fine, I considered the phases for PSAT itself. I have also posted this to the psat-dev list. But this could also be discussed here off course. Any comments or suggestions are most welcome.

Phase 1:
Identify and annotate the variables that come from the user with 'un-safe'
This includes the research of which variables can be altered by the
user. Some info

Phase 2:
Identify and annotate the functions that can cause vulnerabilities
to occur if a 'un-safe' variable is used.

Phase 3:
Generate a rapport in some form to show the user the possible vulnerability
At the end of this phase the following should give use full info:
 echo $_GET['name'];    
The design of the rapport is also considered in this phase.

Phase 4:
Identify and annotate the functions that make variables safe for use
within the 'un-safe' functions. The annotation should also be added to the variables used and propagated.

Phase 5:
Add support for assignment variables and the (simple) propagation thereof

Phase 6:
Add support for reference variables and the (simple) propagation thereof

Phase 7:
Add support for the propagation within and coming from(user-defined) functions

Phase 8:
Add support for the propagation within and coming from objects

Notice that the phases are build upon each other. The base-foundation
is made in the first four phases, the simplest case. The other phases
each add a layer of complexity. Each phase should result in a working
tool that supports the mentioned constructions.

Back from the woods

Great wheat er, great terrain and nobody has become ill from my cooking, a successful trip. Now that I am back the fun is starting again. The pretty-printer can parse all literals such as variables, strings and operators. There is some support for expressions, but more will be added tomorrow.

The test-script will also be expanded with a pretty-print,parse,pretty-print and diff round trip. The difference between the original pretty-printed and the parsed-pretty-printed version should be totally empty off course. This will show the 'correctness' of the pretty-printer. If everything goes according to plan then this will be also be finished tomorrow. The script that is...

Tools!

What is the thing that every programmer wants? Tools off course! The current revision 74 has a few tools that let you parse php-files to the ATerm-format. 'parse-php' is a common tool with a option to choose the release you want. For everyone who wants to have a separate tool for each version we have 'parse-php4' and 'parse-php5'.

The other tool that is available is 'pp-php'. This tool will pretty-print a php-file in ATerm format to an actual PHP-file. This will also be tested with the distribution files, but it is not yet finished. It supports variables and operators. The rest will be supported as soon as I come back. Yes I'm going away for a couple of days to cook for a group of girl-scouts. So no ATerms, SDF or command-line tools for a few days.

Version 5 support

Well, that did not seem very hard. There where some minor problems with the new features in version 5, but as of revision 63 the files from the distribution of PHP 5 are also parsed correctly. There are no ambiguities and the SDF passes almost 500 unit-tests. If you find anything that should be parsed and fails with the SDF, please let me know.

For the people that are interested, a (short?) list of differences between version 4 and 5, it lists what the version has in contrary to the other version. This only includes the syntactic differences, not the semantic differences.
  • Version 4 has old_function support
  • Version 5 can use results from function as if it was an object
  • Version 5 has __METHOD__ as magic constant
  • Version 5 can access class constants with '::'
  • Version 5 has 'clone' and 'instanceof'
  • Version 5 support references within for each-variables
  • Version 5 has support for final and abstract classes
  • Version 5 has interface support
  • Version 5 has support different types of functions within classes (e.g. private,public,etc)
  • Version 5 has constant class variables
  • Version 5 has try-catch and throw supports (no 'throws' for functions though)
  • Version 5 has type hinting for functions