I noticed that it has already been a week since I wrote my last blog post; time flies when you are having fun. So this week should go really fast, because all sorts of interesting things are going to happen!
The first interesting event is going to be the Stratego User Days, where I will be presenting PHP-Sat. This presentation will be a bit more technical than my STC presentation, but it will also contain the more theoretical stuff (I think).
The other event that is interesting, certainly for me, is that I am going to Dublin next Monday. I will be staying there for three days and I am going to do at least two interesting things there. I am part of the GSoC group that is going to visit the Google office. Someone else who was going to come was Paul Biggar, one of the people behind Phc. It is too bad that I cannot meet him to talk about some of the problems/choices/gizmos involved in parsing and transforming PHP code. But I will be meeting the other two authors of Phc. I am really looking forward to this meeting; it's going to be very interesting.
I promise that I will write about all of these events, so don't worry about missing all this interesting stuff. I am even considering signing up for something like Flickr so that I can share some pictures with you, although I think that the SUD will be covered by Eelco Visser.
For those who want to read about something that has actually happened already, I have a little Doh-anecdote. I spent a day working on PHP-Sat together with Martin to fix the last DoubleQuoted string ambiguities. The problem was that there were some literals that kept breaking apart within strings that were used for regular expressions. The reason for this was a missing follow restriction, so we fixed it and were very happy with ourselves. Then I took another look at the problem yesterday and noticed that there were still more situations that showed this behavior. After about an hour I realized that we had already fixed this problem for HereDoc by writing out the allowed order of literals and escapes. But I cannot remember why we didn't do this for DoubleQuoted strings, isn't that interesting?
Whose idea is it anyway?
I started with the research for my thesis proposal this week. For those who wonder about the subject: please read on. For everybody else: skip to the next paragraph. My master's thesis will explore an idea of Johan Jeuring and Harrie Passier. The eventual goal is to provide feedback in educational tools (well, actually educational tools that let you rewrite a certain input step by step into an answer), as if the feedback came from a teacher. One of the hurdles that has to be taken is guessing which step the student wanted to take when a faulty answer is entered into the program.
The idea for guessing this step is that you take the last correct tree and then generate all the trees that can be obtained by applying the rewrite rules that are allowed. You then calculate the difference between each of these trees and the faulty answer to find out which tree is nearest to the faulty answer. The rewrite rule used to get this nearest tree is probably the rule that the student wanted to apply.
This simple and short explanation does not expose all the problems very well, but I will probably come back to that in a later blog. For more information about the idea you can take a look at this paper.
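To make the nearest-tree idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: expressions are nested tuples, the two rewrite rules (`distribute`, `commute_add`) are made up, and `distance` is a deliberately crude difference measure, not the one proposed in the paper.

```python
def size(tree):
    """Number of nodes in a tree; leaves are plain values, nodes are tuples."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def distance(a, b):
    """Crude tree difference (an assumption): 0 when equal, otherwise the
    summed difference of the children, or the size of the larger subtree
    when the two nodes do not line up at all."""
    if a == b:
        return 0
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        return sum(distance(x, y) for x, y in zip(a[1:], b[1:]))
    return max(size(a), size(b))


def distribute(t):
    """a * (b + c)  ->  a*b + a*c  (hypothetical rule)."""
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "*":
        _, a, rhs = t
        if isinstance(rhs, tuple) and len(rhs) == 3 and rhs[0] == "+":
            _, b, c = rhs
            return ("+", ("*", a, b), ("*", a, c))
    return None


def commute_add(t):
    """a + b  ->  b + a  (hypothetical rule)."""
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "+":
        return ("+", t[2], t[1])
    return None


def rewrites(tree, rule):
    """Yield every tree obtained by applying `rule` once, at any position."""
    result = rule(tree)
    if result is not None:
        yield result
    if isinstance(tree, tuple):
        for i in range(1, len(tree)):
            for new_child in rewrites(tree[i], rule):
                yield tree[:i] + (new_child,) + tree[i + 1:]


def guess_rule(last_correct, faulty, rules):
    """Return (rule name, nearest tree, distance), or None if no rule applies."""
    candidates = [
        (name, candidate, distance(candidate, faulty))
        for name, rule in rules.items()
        for candidate in rewrites(last_correct, rule)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[2])


RULES = {"distribute": distribute, "commute_add": commute_add}
last_correct = ("*", 2, ("+", "x", 3))   # 2 * (x + 3)
faulty = ("+", ("*", 2, "x"), 3)         # the student wrote 2*x + 3
print(guess_rule(last_correct, faulty, RULES))
```

With these toy rules the guess is `distribute`: the student most likely tried to distribute the multiplication and forgot to multiply the 3. A real tool would need a much better distance measure, and that is exactly where the problems hide.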
Last Monday I was in Leusden to meet the people from 'TeamInternet' again. Apart from going through some new code of the scoutshop, I also showed PHP-Sat to the people there. Bjarni van Berkum came up with a new idea for a bug pattern, and this resulted in another idea. He told me that he was interested in the results of PHP-Sat on the code base because it would probably find the first pattern a few times. I told him that it shouldn't be hard to implement, but it turns out that this is not completely true.
I was right about the fact that it is easy to implement, for normal and static function calls that is. Retrieving these functions is pretty straightforward because they are directly available in the environment. The problem lies in the fact that most of the code base of TI is object-oriented, which means that there are lots of objects passed to lots of functions. But we currently do not descend into functions, nor do we know about objects. This poses somewhat of a problem if you want to find out which objects are used within functions. So unfortunately, there is a lot of work to be done in order to support this.
But there were more ideas that came from people at TI. Frits Zwegers talked about an idea for a tool that can give an overview of which files are needed for a given file. This can be done by collecting all directly included files and all the files that declare a class or a function that is used within the file. The implementation of this tool is also delayed because of the problem mentioned above, but I am pretty sure that it will be available some day :)
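The collecting step of such an overview tool could look roughly like the Python sketch below. It assumes that the hard part has already been done: for every file we somehow know what it includes, declares and uses. The `FileInfo` record, the `needed_files` function and the example project are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileInfo:
    """What we assume is already known about one PHP file."""
    includes: set = field(default_factory=set)  # files included directly
    declares: set = field(default_factory=set)  # classes/functions declared here
    uses: set = field(default_factory=set)      # classes/functions referenced here


def needed_files(target, project):
    """Files needed by `target`: its direct includes plus every other file
    that declares a class or function used within `target`."""
    info = project[target]
    needed = set(info.includes)
    for path, other in project.items():
        if path != target and other.declares & info.uses:
            needed.add(path)
    return needed


# Hypothetical project layout, for illustration only.
project = {
    "index.php": FileInfo(includes={"config.php"}, uses={"Cart", "format_price"}),
    "config.php": FileInfo(),
    "cart.php": FileInfo(declares={"Cart"}),
    "util.php": FileInfo(declares={"format_price", "slugify"}),
    "admin.php": FileInfo(declares={"AdminPanel"}),
}

print(sorted(needed_files("index.php", project)))
```

Filling in `uses` for method calls on objects is the part that needs knowledge about objects, which is why the sketch simply takes it as given.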
The people mentioned above already came up with great ideas, so if you also have a great idea: share it!
The first presentation
The last three days were filled with only one thought:
Prepare Presentation
Every student who follows the master's program at the Center For Software Technology has to give a talk as part of the Software Technology Colloquium. Students usually pick a topic that is researched by other people and present projects, ideas or tools that are the result of this research.
Since the ideas and concepts of PHP-Sat could be interesting for others as well, and since I already know some things about it, I started working on the slides and preparing the presentation. On Tuesday I practiced the talk and got some pointers on how to improve the structure of the talk. I gave the presentation today and it was actually fun to do. I was a bit nervous before I started, but after a few slides the story just came out very smoothly. So the first presentation of PHP-Sat was a success, on to the next one?
(For those who are interested, the announcement and the slides can be found here)
Something to celebrate
This is the 50th post in this blog, a number to celebrate because it is a nice round number. It is also the first blog entry that contains some real-world results, another thing to celebrate. It also comes at the time that I am reading the book The Best Software Writing (thank you, Michiel), which might help me to improve my blog entries. Something that is definitely worth celebrating.
So let's look at some results. I found out that there are 91 detections of pattern O000 in the PhpDocumentor project. Since this pattern suggests that you should always pre-calculate the size of an array instead of computing it in the loop condition, I decided to do a little test. I first used the original sources of PhpDocumentor to extract the documentation of 'phorum', 'phpBB2' and 'phpMyAdmin'. After this I modified all the for-loops of PhpDocumentor that were flagged by PHP-Sat and extracted the documentation again.
The results were rather disappointing: maybe a single second somewhere, but nothing really substantial. When I showed this to my girlfriend I realized that none of the projects have phpdoc comments. So I found out that this optimization does not matter for projects that you would not process anyway. Anyone else want to state the obvious?
So I repeated the experiment with the sources of PhpDocumentor itself, which are well-documented. The results for these sources are less disappointing. It turns out that pre-computing the size of the array saves 6 seconds in the frames/default configuration, and up to 9 seconds in the DOM/default configuration. These numbers are not very large, but it seems that this simple optimization really saves some time. If you want to know more about optimizations that are more useful, please click here
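For readers who have never seen the kind of loop that O000 is about, here is a toy detector in Python. This is not how PHP-Sat does it (PHP-Sat works on the parsed tree); it is a line-based regular expression that only approximates the pattern, and the names `FOR_WITH_COUNT`, `flag_loops`, `BEFORE` and `AFTER` are made up for this sketch. PHP re-evaluates the loop condition on every iteration, so a `count()` call there is repeated for each element.

```python
import re

# A `for (init; condition; step)` header whose *condition* calls count()/sizeof().
FOR_WITH_COUNT = re.compile(
    r"for\s*\(([^;]*);([^;]*\b(?:count|sizeof)\s*\([^;]*);",
    re.IGNORECASE,
)


def flag_loops(php_source):
    """Return the 1-based line numbers of for-loops that recompute an array size."""
    return [
        lineno
        for lineno, line in enumerate(php_source.splitlines(), start=1)
        if FOR_WITH_COUNT.search(line)
    ]


# The flagged form: count() is called again on every iteration.
BEFORE = """<?php
for ($i = 0; $i < count($items); $i++) {
    echo $items[$i];
}
"""

# The rewritten form: the size is computed once, before the loop.
AFTER = """<?php
$n = count($items);
for ($i = 0; $i < $n; $i++) {
    echo $items[$i];
}
"""

print(flag_loops(BEFORE))
print(flag_loops(AFTER))
```

The first snippet is flagged on its `for` line, the second one is not. A regular expression like this will miss loops that span several lines, which is one reason to work on a parse tree instead.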
Optimization is nice, but it has been discussed a lot and is not very original anymore. So let's look at some results that indicate logical errors. A pattern that is not PHP-specific is C006, which is an ignored return value. I have implemented this pattern because OWASP has an article about it. When I inspected the results I learned that Pear actually has functions that will always return 'true'. This situation is probably worth a pattern of its own, but it also generates extra detections because the results of these functions are ignored.
But there was also an ignored return value that was not legit, and this resulted in the first bug report coming from PHP-Sat. Let's hope that it will be accepted and fixed in the near future.
When I looked at the results from the C006 pattern I also read the comments in some of the files in Pear. The following piece of comment caught my eye:
.. constructor
@param ....
@param ....
@return mixed True on success else PEAR error class.
How can a constructor return either True or another object? A constructor is supposed to return the newly created object, so a return statement with a value in a constructor is useless. (PHP4 allows you to return something else from a constructor by assigning a value to the '$this'-variable, but this is not compatible with PHP5, nor is it good practice.) I have also found a thread that shows that these patterns should be checked.
So I added the patterns C007 and C008 to PHP-Sat. I could not find any assignments to '$this' in the stable packages of Pear, but I found 13(!) occurrences of a return value within a constructor. Good luck for me, bad luck for the package maintainers?
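To give an impression of what the return-value check looks for, here is a rough Python approximation. Again, this is only a sketch and not the PHP-Sat implementation: it finds the constructor with a regular expression (either `__construct` or the PHP4-style function named after the class), counts braces to find the end of its body, and ignores strings and comments completely. The function `constructor_returns` and the `Mailer` class are hypothetical.

```python
import re


def constructor_returns(php_source, class_name):
    """Return the `return <value>;` statements found in the constructor body."""
    header = re.compile(
        r"function\s+(?:__construct|" + re.escape(class_name) + r")\s*\([^)]*\)\s*\{",
        re.IGNORECASE,
    )
    match = header.search(php_source)
    if match is None:
        return []
    # Walk forward to the brace that closes the constructor body.
    depth, pos = 1, match.end()
    while pos < len(php_source) and depth > 0:
        if php_source[pos] == "{":
            depth += 1
        elif php_source[pos] == "}":
            depth -= 1
        pos += 1
    body = php_source[match.end():pos]
    # A bare `return;` is fine; only returns that carry a value are reported.
    return re.findall(r"\breturn\s+[^;\s][^;]*;", body)


# Hypothetical class in the style of the comment quoted above.
SOURCE = """<?php
class Mailer {
    function Mailer($host) {
        if (!$host) {
            return PEAR::raiseError('no host');
        }
        $this->host = $host;
        return true;
    }
    function send() {
        return false;
    }
}
"""

for statement in constructor_returns(SOURCE, "Mailer"):
    print(statement)
```

Both returns inside the constructor are reported, while the `return false;` in `send()` is left alone.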