It is getting noticed

Within the last two weeks I noticed that PHP-Front and PHP-Sat are being discovered by people who are looking for PHP-specific solutions. It is not that we are flooded with requests, but I am still happy with every question :)

The first question was sent to the psat-dev mailing list and was about the cyclomatic complexity of PHP code. I replied that it would not go into PHP-Sat because it is not a bug-pattern. However, it would be a nice tool for the PHP-Tools project. I made a similar tool for Java for an assignment in the past, so it is probably just a matter of renaming the strategies to use the PHP-Front API. Unfortunately, I didn't get an answer about what the report should look like. If you have any ideas please let me know in the comments, or in the issue.
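
For those unfamiliar with the metric: cyclomatic complexity is roughly the number of independent paths through a piece of code, which can be approximated by counting decision points. The sketch below is not PHP-Front or PHP-Tools code; it is a hypothetical Python approximation that simply counts decision keywords and operators in a piece of PHP source, just to show what such a report would be based on. A real tool would of course walk the AST that PHP-Front provides.

  import re

  # Tokens that introduce a decision point in PHP code; the exact set is an
  # assumption of this sketch, a real tool would inspect the AST instead.
  DECISION_PATTERN = re.compile(
      r"\b(if|elseif|else\s+if|for|foreach|while|case|catch)\b|&&|\|\||\?"
  )

  def cyclomatic_complexity(php_source: str) -> int:
      """Approximate McCabe's cyclomatic complexity: decision points + 1."""
      return len(DECISION_PATTERN.findall(php_source)) + 1

  snippet = """
  function grade($score) {
      if ($score > 90 && $score <= 100) { return 'A'; }
      elseif ($score > 60) { return 'B'; }
      return 'F';
  }
  """
  print(cyclomatic_complexity(snippet))  # 4: the if, the &&, the elseif, plus one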

A second question was about the grammar of PHP-Front, or actually the license of this grammar. The people behind TXL have derived a PHP grammar for TXL from the SDF grammar in PHP-Front. Since our license does not state anything about derived works without common source, we were asked for permission to distribute this new grammar. Naturally, this permission was given very quickly, and we were also allowed to take a peek at the source. I must say that I find it interesting, but I currently do not have time to look into it all. I imagine that the grammar definitions and TXL itself are similar to how things work in Stratego, but I still have to look into that.

The last question in this series is about defined functions. Finding out which functions are defined in a project is easy when classes are ignored: a simple grep on the project will do. When a project also includes classes, it becomes trickier to get all functions defined outside of a class. The question was whether PHP-Front could help with this issue, and the answer is of course yes!
Within the reflection part of the library, the list of defined functions and classes is already available. This makes it possible to write a tool that shows all defined functions in just a few lines of code. Since it also fits nicely into the PHP-Tools project, I added such a tool last Friday.

Another issue that was brought up by the last e-mail is our implementation language. Since Stratego is relatively unknown, the project has a steep learning curve. On the other hand, if I had chosen a different implementation language it would have taken me far longer to implement the current features. And besides, this piece of code is not that hard to understand, right?
  defined-functions-main =
      // parse the given file and the files it includes
      include-files-complex
      // retrieve the environment that php-front collected
    ; get-php-environment
      // extract the list of defined functions
    ; get-functions
      // report a default message, or one line per defined function
    ; if ?[]
      then !"No functions defined."
      else map(transform-to-message); lines
      end

Where you at?

Now that the propagation of safety-types seems to be going smoothly, it was time to dive into another subject: accessing the location of terms. In this case, the location of a term is defined as the location of the text in the original file that was parsed into that term. Since the strategy to annotate the AST with position info is available in the standard libraries nowadays, it should be easy to access these locations and finally solve PSAT-91, right? Let's find out!

The first thing I did was add a separate module to handle the low-level details of getting the location annotations. This module contains several getter-strategies that can retrieve, for example, the number of the start-line. The location info is captured in six different numbers: start-line, end-line, start-column, end-column, offset and length. A getter-strategy is available for each of them. Furthermore, the name of the file in which the term is defined can also be retrieved.

Although these getter-strategies are useful, they are not meant to be called directly. I figured that the most common use of these values would be reporting them in some kind of (formatted) message. In order to capture this kind of behavior the strategy format-location-string(|message) is defined. This strategy takes a message with holes of the form [STARTLINE] as its parameter and fills these holes with values from the current term. A rather useful strategy, if I do say so myself.
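
To give an impression of what this looks like without requiring any Stratego knowledge, here is a hypothetical Python analogue of the idea. Only [STARTLINE] is an actual hole name from the strategy described above; the other hole names, the field names and the example values are assumptions made for this sketch.

  from dataclasses import dataclass

  @dataclass
  class Location:
      """The six numbers plus the file name described above."""
      filename: str
      start_line: int
      end_line: int
      start_column: int
      end_column: int
      offset: int
      length: int

  def format_location_string(message: str, loc: Location) -> str:
      """Fill holes such as [STARTLINE] in the message with location values."""
      holes = {
          "[FILENAME]":  loc.filename,
          "[STARTLINE]": str(loc.start_line),
          "[ENDLINE]":   str(loc.end_line),
          "[STARTCOL]":  str(loc.start_column),
          "[ENDCOL]":    str(loc.end_column),
      }
      for hole, value in holes.items():
          message = message.replace(hole, value)
      return message

  loc = Location("index.php", 12, 12, 5, 17, 230, 12)
  print(format_location_string("access on line [STARTLINE] of [FILENAME]", loc))
  # access on line 12 of index.php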

To practice with this new piece of functionality I have added an extra option to the input-vector tool of the php-tools project. This option allows the user to choose between the normal list, or the same list with the line number printed for each access. More information about this option, and about how to add an option yourself, can be found here.

After this was done I moved to php-sat to make its output more concise. It was actually pretty easy to implement: the algorithm is nothing more than get-terms-with-annotations, make-nice-output. I actually spent more time on creating a test setup for calling php-sat through a shell script than on generating the more concise format. The only problem was that adding position info everywhere interfered with the dynamic rules. A few well-placed rm-annotations were needed to fix this. Please let me know if you like the new output, or whether something should be added.

The next application of the location info is tracking where tainted data enters an application. When a function is called with a parameter $foo which is tainted, it would be nice to show where it became tainted. I think this is not too difficult to add, but bugs always seem to lurk in 'I-think-it-is-easy-to-add' features.

A last remark about locations concerns a small problem without an actual solution yet. Eventually php-sat must support function-calls. The algorithm to analyze function-calls is not complicated, but how should bug-patterns within a function be reported? With a message before each call to this function? Within the file in which the function is defined? And what about cases in which one call is flagged and the other one isn't? And can we handle object-creation in the same way? I haven't figured out how to handle this, so if you have any ideas please let me know.

Fine-tuning the third phase

You might remember an earlier post about the algorithm for the third phase of generating feedback. This algorithm takes two terms and a set of (allowed) rewrite rules. After calculating the rewrite rule that transforms the first term into the other, the algorithm chooses the rule from the set that is 'closest' to the calculated rewrite rule. This chosen rule can then be used to generate better feedback, because it gives some insight into the intentions of the student.

In theory the algorithm functions quite well, and the same goes for the actual implementation. However, during the implementation of the algorithm I ran into a few details that needed to be fixed. This once again shows that theory and practice usually do not match completely :)
The following is a (small?) list of some of the refinements made during the implementation.

The rules(et)

One assumption that was made during the design of the algorithm is that rules must rewrite the current term into a term that is closer to the answer. For example, the rule A + B => B + A is correct from a mathematical point of view, but it does not help a student to get closer to an answer.
A second assumption made for the algorithm is that the rule set is extended with combined rules. For example, the rules A => B and B => C are combined into a single rule A => C. The calculation of the combined rules is not very difficult, but getting the calculation to stop is a bit trickier.
The first experiment, with rules from the domain of fractions, went reasonably well; the only restriction that needed to be added was that restrictions on variables may not depend on themselves. So the restriction ... where B := B is not allowed.
The second experiment, with rules from the domain of logic, went into endless recursion, a clear sign of an additional problem. It turned out that rules were merging with themselves, resulting in an extra restriction to prevent this kind of behavior.
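
To illustrate the kind of checks that turned out to be necessary, here is a small Python sketch, not the actual implementation, that combines rules whose right- and left-hand sides line up. It uses plain equality where the real algorithm matches terms; the guard against merging a rule with itself reflects the restriction just mentioned, the progress check reflects the first assumption, and the loop stops once no new rules are produced.

  def combine_rules(rules):
      """Extend a set of (lhs, rhs) rules with combined rules:
      A => B and B => C give A => C."""
      combined = set(rules)
      changed = True
      while changed:
          changed = False
          for first in list(combined):
              for second in list(combined):
                  # a rule may not be merged with itself, otherwise the
                  # calculation never stops
                  if first == second:
                      continue
                  lhs1, rhs1 = first
                  lhs2, rhs2 = second
                  # rules should make progress, so skip identity results
                  if rhs1 == lhs2 and lhs1 != rhs2:
                      new_rule = (lhs1, rhs2)
                      if new_rule not in combined:
                          combined.add(new_rule)
                          changed = True
      return combined

  print(combine_rules({("A", "B"), ("B", "C")}))
  # also contains ("A", "C")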

Calculating the rule

My first idea for calculating the rewrite rule was a simple bottom-up traversal which returns a meta-variable as long as the (sub)-terms are equal. This naive algorithm worked reasonably well during the first tests, but failed to perform well on larger examples. In these larger examples the calculated rule contained too much noise to see the important parts of the rewrite rule.
In an attempt to cut down the noise, the algorithm was extended to find out whether a sub-tree of a rule can be removed completely. An example of this is the calculation of the rewrite rule between 4 + 5 + 2 and 4 + 6. The naive algorithm would give A + B + C => A + D as an answer. However, inspecting the terms gives the rule B + C => D as a more accurate match. This last rule still contains the same information; we have simply removed the noise of the A + .-sub-tree.
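
The sketch below, plain Python rather than the real implementation, shows the naive part of this calculation: equal subterms on both sides become one shared meta-variable, nodes with the same operator are traversed pairwise, and on a mismatch each side is generalised on its own. How the mismatch case is handled exactly is my own guess, based on the 4 + 5 + 2 versus 4 + 6 example; the noise-removal step is not shown.

  from itertools import count

  def subterms(t):
      """All subterms of a term; a node is a tuple ('op', child, ...)."""
      yield t
      if isinstance(t, tuple):
          for child in t[1:]:
              yield from subterms(child)

  def calc_rule(pt, ct):
      """Calculate a rewrite rule (lhs, rhs) that turns pt into ct."""
      fresh = (f"M{i}" for i in count(1))
      env = {}                        # term -> meta-variable, shared by both sides

      def var_for(term):
          if term not in env:
              env[term] = next(fresh)
          return env[term]

      def generalise(t, other):
          # keep the structure of one side, sharing a meta-variable whenever
          # the subterm also occurs on the other side
          if t in other:
              return var_for(t)
          if isinstance(t, tuple):
              return (t[0],) + tuple(generalise(c, other) for c in t[1:])
          return var_for(t)

      def go(l, r):
          if l == r:                  # equal subterms: one shared meta-variable
              return var_for(l), var_for(l)
          if (isinstance(l, tuple) and isinstance(r, tuple)
                  and l[0] == r[0] and len(l) == len(r)):
              pairs = [go(cl, cr) for cl, cr in zip(l[1:], r[1:])]
              return ((l[0],) + tuple(p[0] for p in pairs),
                      (r[0],) + tuple(p[1] for p in pairs))
          return generalise(l, set(subterms(r))), generalise(r, set(subterms(l)))

      return go(pt, ct)

  # 4 + 5 + 2 versus 4 + 6 gives ('+', ('+', 'M1', 'M2'), 'M3') => ('+', 'M1', 'M4'),
  # in other words A + B + C => A + D
  print(calc_rule(('+', ('+', 4, 5), 2), ('+', 4, 6)))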

The first implementation of this extension only checked operators with an equal number of children, and only with a left-to-right match. Even though this worked in some situations, it was still not good enough to be useful in general. Therefore, the matching part of the extension was improved by taking into account the associativity and commutativity of the operators. The different combinations of these properties result in either an exact-match (left-to-right), an assoc-match (left-to-right, right-to-left) or an assocComm-match (all combinations of children are tried with an assoc-match).
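
My reading of these three kinds of matches, again as a Python sketch over lists of children rather than the actual code:

  from itertools import permutations

  def exact_match(xs, ys, eq):
      """Children have to match pairwise, left-to-right."""
      return len(xs) == len(ys) and all(eq(x, y) for x, y in zip(xs, ys))

  def assoc_match(xs, ys, eq):
      """Associative operators: left-to-right or right-to-left."""
      return exact_match(xs, ys, eq) or exact_match(xs, list(reversed(ys)), eq)

  def assoc_comm_match(xs, ys, eq):
      """Associative and commutative operators: try every combination of the
      children of one side with an assoc-match."""
      return any(assoc_match(xs, list(p), eq) for p in permutations(ys))

  same = lambda a, b: a == b
  print(exact_match([1, 2, 3], [3, 1, 2], same))       # False
  print(assoc_comm_match([1, 2, 3], [3, 1, 2], same))  # True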

Distance between rules

As with the calculation of the rules, the computation of the distance between two rules started with a naive algorithm covering several cases:
  • When we have two equal (sub)-trees we return a distance of 0.
  • When we have two nodes in the trees we compare the names of the nodes and all the children from left-to-right, adding to the distance when the two operators are not equal.
  • When either one of the trees is a leaf we check whether it is a meta-variable. If this is not the case we simply add a distance of 1 (for the leaf) plus the size of the other (sub)-tree. If the leaf is a meta-variable we check whether it is a free variable, and return a distance of 0 if it is. Otherwise the same '1 + size of the (sub)-tree' is returned as the distance. A variable is free when it is not previously bound by a match or a restriction.

Note that the above algorithm does not specify what should happen when two nodes have a different number of children. After some testing I found that it works quite well to simply take the size of the extra children together with a penalty. A sketch covering all of these cases follows after the next paragraph.

Another thing that is not considered in the above algorithm is the 'forgetting' of an operator. An example of this is the distance between the rules ~~A -> A and ~~A -> ~A. With the naive algorithm the distance between these rules can be quite large when the A in the rule is a large subtree. In order to model this 'forgetting', the algorithm checks whether a node has a single child on one side. If this is the case, the algorithm calculates both the distance with the operator and the distance without it. It then takes the smaller of the two, taking into account some extra cost for adding the operator.
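
Here is the promised sketch that puts the distance cases together, including the 'forgetting' of an operator. It is illustrative Python, not the real code: the penalties, the convention that meta-variables are upper-case names, and the set of bound variables being passed around are all assumptions made for this sketch.

  EXTRA_CHILD_PENALTY = 1      # assumed weights, not the tuned values
  FORGET_OPERATOR_COST = 1

  def size(t):
      """Number of nodes and leaves in a term; a node is ('op', child, ...)."""
      return 1 + sum(size(c) for c in t[1:]) if isinstance(t, tuple) else 1

  def is_meta(t):
      return isinstance(t, str) and t.isupper()

  def distance(l, r, bound=frozenset()):
      candidates = [base_distance(l, r, bound)]
      # 'forgetting' an operator: when one side is a node with a single child,
      # also compare against that child alone, at some extra cost
      if isinstance(l, tuple) and len(l) == 2:
          candidates.append(FORGET_OPERATOR_COST + distance(l[1], r, bound))
      if isinstance(r, tuple) and len(r) == 2:
          candidates.append(FORGET_OPERATOR_COST + distance(l, r[1], bound))
      return min(candidates)

  def base_distance(l, r, bound):
      if l == r:                                     # equal (sub)-trees
          return 0
      if isinstance(l, tuple) and isinstance(r, tuple):
          d = 0 if l[0] == r[0] else 1               # unequal operators add distance
          lc, rc = list(l[1:]), list(r[1:])
          n = min(len(lc), len(rc))
          d += sum(distance(a, b, bound) for a, b in zip(lc[:n], rc[:n]))
          extra = lc[n:] + rc[n:]                    # children without a counterpart
          if extra:
              d += sum(size(c) for c in extra) + EXTRA_CHILD_PENALTY
          return d
      leaf, other = (l, r) if not isinstance(l, tuple) else (r, l)
      if is_meta(leaf) and leaf not in bound:
          return 0                                   # a free meta-variable matches anything
      return 1 + size(other)                         # plain leaf or bound variable

  big = ('&', ('|', 'p', 'q'), 'r')
  print(distance('A', big))                # 0: free meta-variable
  print(distance('A', big, bound={'A'}))   # 6: bound variable, 1 + size of the other term
  print(distance(big, ('~', big)))         # 1: only the cost of the forgotten '~'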

Putting it together

Now that all the sub-parts worked correctly, the top-level algorithm needed to be implemented and tweaked. The first thing that was tweaked was the filtering of a list of matches. When more than one rule has the smallest distance, the algorithm initially returned the first element of that list. Since this behavior makes the ordering of the rules important, it needed to be changed. Now the algorithm takes the rule from the list whose LHS is closest to the current Previous Term.

Speaking of Previous Terms, the original Previous Term also influences the choice of a rule. When we go into recursion by applying rules, the current PT changes. The changed PT matches better with rules that have a similar LHS, even though that LHS may differ from the original PT. Assuming that students apply rules that at least partly match the LHS of a rule, we need to take the distance between the original PT and the LHS of the rule into account. Furthermore, we also take the number of rules that have been applied to the PT into account when choosing a rule.
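
To round things off, here is a rough Python sketch of this selection step. The way the three criteria are combined (lexicographically) and the toy string distance used in the example are assumptions of the sketch, not the weighting the real implementation uses.

  def choose_rule(calculated, candidates, original_pt, rules_applied, dist):
      """Pick the candidate (lhs, rhs) rule closest to the calculated rule,
      breaking ties by how close its LHS is to the original Previous Term and
      by how many rules have already been applied."""
      def rule_distance(r1, r2):
          return dist(r1[0], r2[0]) + dist(r1[1], r2[1])

      def score(rule):
          return (rule_distance(rule, calculated),   # closeness to the calculated rule
                  dist(rule[0], original_pt),        # LHS close to the original PT
                  rules_applied)                     # prefer fewer applied rules
      best = min(candidates, key=score)
      return best, score(best)   # the score allows comparing choices made at different depths

  # toy distance on strings, just to show the selection mechanics
  def toy_dist(a, b):
      return 0 if a == b else abs(len(a) - len(b)) + 1

  calculated = ("~~p", "p")
  candidates = [("~~A", "A"), ("~~A", "~A"), ("A & A", "A")]
  print(choose_rule(calculated, candidates, "~~p", rules_applied=0, dist=toy_dist))
  # (('~~A', 'A'), (2, 1, 0))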

Does it work?

After all the tweaking and the bug-fixing the answer to this question is: yes! You can take a look at some of the test-cases, which list a PT, a CT and a desired result from the defined set of rules.
The test-modules also include examples of situations which do not produce the correct result. This has to do with the fact that the virtual student made more than one mistake in the application of the rule. This shows that the algorithm does not work in all situations, but that was also not the goal of this algorithm. Luckily, the other examples show that the algorithm is capable of producing the desired output in a number of situations.