Bouwers > Eric > Blog: June 2006

Solving annoying things

In my previous post I mentioned the nasty things in parsing double quoted strings. One of the problems was that the backslash is interpreted differently based on what follows it. As it turns out, there is an error in that post. The string "\1234" is evaluated to "S4": the first three digits are read as an octal character and the "4" stays a literal. So if an octal character is defined, it will always be recognized as one.
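To make the rule concrete, here is a minimal Python sketch (not part of the PHP parser or the SDF) of how octal escapes are decoded: a backslash followed by one to three octal digits becomes one character, and anything after the third digit stays literal. The function name and the regular expression are made up for illustration; escaped backslashes and the other escape sequences are ignored here.

```python
import re

# A backslash followed by one to three octal digits (assumed rule).
OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")

def decode_octal_escapes(text):
    """Replace every octal escape in text by the character it encodes."""
    return OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), text)

print(decode_octal_escapes(r"Octal: \123"))   # Octal: S
print(decode_octal_escapes(r"Octal: \1234"))  # Octal: S4
```

The greedy {1,3} quantifier is what makes the escape win whenever it is present: the pattern takes as many octal digits as it can, up to three.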

But the problem of the evil backslash is solved! I will explain the solution, of course. The explanation might be a bit detailed and boring for most people, but it might help some.

A double-quoted string is parsed as a list of DoubleQuotedPart elements. A list of elements is not uncommon in an SDF. Each element in the list represents a part of the string. This could be one of the following:

A literal part, e.g. "foo"

An octal character encoding, e.g. "\12"

A hexadecimal character encoding, e.g. "\x2"

One of the special characters, e.g. "\n"

Variables

The last case is to be handled later. We first have to define the basic thing, the literal part.

     (~[\"\\\$] | SlashCharLit | DollarCharLit)+ 
           -> DoubleQuotedLit
     "\\" -> SlashCharLit
     "$"  -> DollarCharLit

It might look a bit funny to say that a literal is everything except a backslash, or a backslash, but this is useful when we define the hexadecimal characters.

  syntax
    "\\" "x" [0-9A-Fa-f]            
           -> HexaCharacterOne {cons("HexaChar")}
    "\\" "x" [0-9A-Fa-f][0-9A-Fa-f]  
           -> HexaCharacterTwo {cons("HexaChar")}

    HexaCharacterOne -> HexaCharacter
    HexaCharacterTwo -> HexaCharacter

  restrictions
    HexaCharacterOne -/- [0-9A-Fa-f]

With these definitions together, the string "\x01" can be parsed in two ways: as a HexaCharacter or as a single literal. We can solve this by defining a follow restriction on the backslash:

    SlashCharLit -/- [x] . [0-9A-Fa-f]
    SlashCharLit -/- [x] . [0-9A-Fa-f] . [0-9A-Fa-f]

The restriction indeed repeats the complete specification of a HexaCharacter, but it makes sure that a HexaCharacter is never parsed as a literal.
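A follow restriction behaves much like a negative lookahead in a regular expression. As a rough sketch outside SDF (Python, with made-up function names; only literals and hexadecimal escapes are modelled, and the exclusion of the double quote is left out), a backslash counts as literal text only when no hexadecimal escape starts at that position:

```python
import re

# A part is either a hexadecimal escape or a run of literal characters.
# The negative lookahead (?!x[0-9A-Fa-f]) plays the role of the follow
# restriction on SlashCharLit: the backslash is only literal when it is
# not followed by "x" and a hexadecimal digit.
PART = re.compile(
    r"(?P<HexaChar>\\x[0-9A-Fa-f]{1,2})"
    r"|(?P<Literal>(?:[^\\]|\\(?!x[0-9A-Fa-f]))+)"
)

def split_parts(body):
    """Split the body of a double-quoted string into (kind, text) pairs."""
    return [(m.lastgroup, m.group()) for m in PART.finditer(body)]

print(split_parts(r"a\x01b"))
print(split_parts(r"foo \ bar"))
```

Without the lookahead, the Literal alternative could also swallow the backslash of "\x01", which is the same ambiguity the restriction removes in the grammar.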

There is only one problem left. The string "foo \ bar", which contains a plain backslash that does not escape anything, can be parsed in two ways with this definition: either Literal("foo \ bar") or [Literal("foo "),Literal("\ bar")]. This is not what we want, so we have to make sure that the shortest list of parts is preferred. We also have to make sure that all escapes are rejected as literals, but that can be done in the same way as shown above for the hexadecimal characters. The ambiguity itself is solved by adding the following line to the SDF:

    DoubleQuotedPart+ DoubleQuotedPart+ 
                      -> DoubleQuotedPart+ {avoid}

It states that lists with more elements should be avoided. Such lists are only accepted if there is no other choice, which solves our backslash problem.
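The effect of this preference can be imitated outside SDF. The Python sketch below is a simplification: it treats every way of cutting a literal into consecutive pieces as a candidate parse (the real grammar decides which candidates exist) and then keeps the candidate with the fewest parts. The function names are made up, and the text is assumed to be non-empty.

```python
from itertools import combinations

def all_splits(text):
    """Yield every way to cut text into consecutive non-empty pieces."""
    n = len(text)
    for k in range(n):  # k is the number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]

def preferred(text):
    """Keep the candidate with the fewest parts."""
    return min(all_splits(text), key=len)

print(preferred(r"foo \ bar"))
```

For "foo \ bar" the candidate list includes both the single literal and the split into "foo " and "\ bar"; choosing the shortest list leaves only the single literal.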

I might have taken some big steps in this explanation. Please ask if something is not clear.

Little annoying things

Working on the syntax definition is fun because you really learn to understand the language. But it can also be frustrating. This evening I was working on the syntax of double quoted strings. Double quoted strings in PHP expand some variables when the string is parsed. See the documentation for more details.

It turns out that parsing a double quoted string is a real challenge. A backslash is recognized differently depending on the context it is used in. I found this out when I tried to correctly parse the following string:

"Octal: \123"

In PHP this will be evaluated to

"Octal: S"

So the "\123" part should be parsed as a separate part. However, if it says "\1234" it is just a part of the string since octal characters can only be encoded by three digits.

Another example is that of the advanced string parsing features. Examine the following strings

$foo = "Hello {$name}";
$bar = "Hello { $name}";

In the variable $foo, the braces are discarded and the variable $name is expanded. In $bar the braces are kept as literal characters, because the "{" is not directly followed by the "$"; the variable $name itself is still expanded.

These examples are a nightmare from the parsing point of view. So I will let this rest for now and move on to the statements of PHP in the following session. This should keep me from throwing my screen out of the window because of frustration.

Making progress

It has been a relatively long time since I updated the blog. I am not really used to showing the world what I have done, but I think this will come over time.

But what has been done lately? I have struggled a lot with Samba and network problems. I did not get it to work the way I wanted it to. I wanted to have a network drive because the text editors in the virtual machine were becoming very slow. But luckily this is all solved by setting up an FTP server and using NetDrive. This is actually a nice tool to use if you are used to Windows 2000. It maps any FTP site as a network drive. So now I can work with my favorite Windows editor and the compilation can take place on the Linux machine. This might not deliver a lot of code, but it sure speeds up the development process from now on.

But has the project made any progress yet? Of course it has! I am beginning to understand the yacc files of the distributions a lot better now. The translation to the SDF is making good progress. There are some difficulties with all this, but they are solvable. More test files from the distribution are parsable already, which indicates that I am on the right track. If you really want to be kept up to date with the latest status, I suggest you subscribe to the psat-commit mailing list.

Testing 1 2

Just a little update to let you know what is going on. Parsing of (some) literals works quite nicely and we already have full support for integers, floats and strings. The heredoc format is a bit tricky and needs some more work. Heredoc blocks with two labels that do not match are currently parsed as well, but these should be rejected.
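The check that is still missing for heredocs, namely that the closing label equals the opening one, can be sketched outside SDF with a backreference. This is a simplified Python illustration, not the parser itself: the function name is made up, and the heredoc is assumed to consist of "<<<LABEL" on its own line, a non-empty body, and the closing label at the start of a line, optionally followed by a semicolon.

```python
import re

# <<<LABEL, a newline, the body, a newline, then the same LABEL again.
# The backreference \1 forces the closing label to equal the opening one.
HEREDOC = re.compile(r"<<<([A-Za-z_][A-Za-z0-9_]*)\n(.*?)\n\1;?\n?", re.DOTALL)

def heredoc_body(source):
    """Return the heredoc body, or None when the labels do not match."""
    match = HEREDOC.fullmatch(source)
    return match.group(2) if match else None

print(heredoc_body("<<<EOT\nhello\nEOT;\n"))  # hello
print(heredoc_body("<<<EOT\nhello\nEND;\n"))  # None
```

A context-free grammar cannot express "the same label twice" directly, which is presumably why this case needs extra work in the syntax definition.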

I say that we have full support for the above things, but there is already more that can be parsed. The literals mentioned above are the ones that are tested by the test suite that runs before the project is accepted at the build farm. Everything else that is being parsed has not yet been tested. An example of a test is:

test hexadecimal integer 
  "0x1A" -> LNumber(Hexa("0x1A"))

So the little things are already tested. This will give a solid basis for the rest of the parser.
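To illustrate what such a test checks, here is a small Python stand-in (not the real test suite or parser; the function name is hypothetical) that recognises a hexadecimal integer literal, builds the expected term as a string, and runs the test shown above against it:

```python
import re

HEX_LITERAL = re.compile(r"0[xX][0-9A-Fa-f]+")

def parse_hex_integer(text):
    """Return an ATerm-like string for a hexadecimal integer, else None."""
    if HEX_LITERAL.fullmatch(text):
        return 'LNumber(Hexa("%s"))' % text
    return None

# (description, input, expected term), mirroring the test shown above
tests = [("hexadecimal integer", "0x1A", 'LNumber(Hexa("0x1A"))')]
for description, source, expected in tests:
    assert parse_hex_integer(source) == expected, description
```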

To see the overall status of the parser, take a look at the output of parsing the test-files included in the distributions. The goal is to get a cool smiley after every file.

Growing trees

After my last post I dived into the wondrous world of PHP syntax and the SDF of StringBorg. From this SDF a parse table is generated that can parse PHP files and produce a representation in the ATerm format. A simple example:

$a = 1 + 3

can be parsed into:

Assign( LitVar("a"),
        Plus(LNumber("1"), 
             LNumber("3")
        )
)
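As a toy illustration of the mapping from source text to such a term, the following Python sketch handles only the shape "$var = int + int" and prints the term in the same notation. It is not how the SDF-generated parser works internally; the helper names are made up.

```python
import re

def term(name, *children):
    """Render a constructor application in ATerm-like notation."""
    return "%s(%s)" % (name, ",".join(children))

def quoted(text):
    return '"%s"' % text

# Only the shape "$var = int + int" is recognised in this sketch.
ASSIGN_PLUS = re.compile(r"\$(\w+)\s*=\s*(\d+)\s*\+\s*(\d+)")

def parse(source):
    match = ASSIGN_PLUS.fullmatch(source.strip())
    if match is None:
        raise ValueError("unsupported input: %r" % source)
    var, left, right = match.groups()
    return term("Assign",
                term("LitVar", quoted(var)),
                term("Plus",
                     term("LNumber", quoted(left)),
                     term("LNumber", quoted(right))))

print(parse("$a = 1 + 3"))
# Assign(LitVar("a"),Plus(LNumber("1"),LNumber("3")))
```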

For small things, like operators, this is easy. But how do you translate a whole script into a tree? And what do you call all the different constructs?

Luckily there is a list of parser tokens used in the actual parser. If you ever made a typo in a PHP script (and you probably did if you ever wrote anything in PHP), you will recognise some of the terms in the list.

The quest for a mentor, success!

I submitted my application for the SoC to both Google and PHP. A few days before the results came in my application from PHP was moved to Google and the one from Google was ranked 'Ineligible'. Luckily, the other one, originally targeted at PHP, was accepted. But this one did not have a mentor. A few days ago my application was directed back to PHP again, but still no sign of a mentor. Yesterday I proposed to Google that someone from my university would mentor me since PHP did not respond to their emails. They accepted the proposal and the mentor.
So I proudly present my mentor 'in name': Martin Bravenboer.
Eelco Visser will also help with the mentoring part, but only one of them could be named as the 'real' mentor.

The really nice things about my mentor(s) are that they know a lot about Stratego, that I can drop in when I am in the neighbourhood, and that they have experience in setting up a nice development / distribution / testing system for a project. So today I dropped in, and this resulted in a real kick start and all sorts of goodies.

So, if you are interested in the commit messages of the repository, you should subscribe to the psat-commit mailing list. If you are only interested in the discussions about the project, subscribe to the psat-dev mailing list. To look at the issues and the bugs to be solved, you can take a look at the Jira project of psat (to be filled any day). But if you are interested in actually getting the source code, take a look at the release page of psat. RPMs are also available for various distributions, nice isn't it :)

Pixy, a java approach

I would like to point to a tool that will be useful for developing my application: Pixy. This tool takes the same approach that I want to take to return feedback about vulnerabilities in PHP applications. If you have read my application, you probably took a look at it already.

For those who did not, here is a tiny overview. The PHP files are parsed by JFlex and Cup to construct a parse tree. This tree is transformed into a linear form resembling three-address code. This is the form on which the flow-sensitive, interprocedural, context-sensitive analysis is conducted. For more details I recommend reading the short paper.
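To give an idea of what such a linear form looks like, here is a generic Python sketch of flattening a nested expression tree into instructions with at most one operator each. It is not Pixy's actual intermediate format; the tuple encoding of the tree and the temporary names t1, t2, ... are assumptions made for this example.

```python
from itertools import count

def to_three_address(target, expr):
    """Flatten a nested expression into single-operator instructions.

    expr is either a string (variable or number) or a tuple
    (operator, left, right).
    """
    code = []
    temps = count(1)

    def emit(node):
        if isinstance(node, str):
            return node
        op, left, right = node
        left_name = emit(left)
        right_name = emit(right)
        temp = "t%d" % next(temps)
        code.append("%s = %s %s %s" % (temp, left_name, op, right_name))
        return temp

    result = emit(expr)
    code.append("%s = %s" % (target, result))
    return code

for line in to_three_address("$a", ("+", "$b", ("*", "$c", "2"))):
    print(line)
# t1 = $c * 2
# t2 = $b + t1
# $a = t2
```

Because every instruction has a fixed, simple shape, an analysis only has to define what happens for a handful of instruction kinds instead of for arbitrarily nested expressions.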

Pixy performs well, but has at least one major limitation: it does not support the object-oriented features of PHP. The online services of the scouts organisation of the Netherlands are almost completely OO, so they cannot be scanned. I hope to solve this limitation with my solution.

The advantage of Pixy is that it uses the original Flex and Bison specifications. Stratego requires an SDF, so let's get started!

Naming the beast

Apart from the environment we need the repository to get things going. The repository will be located at my university since they offer basic svn repositories and that is all I need.

Creating the repository of course involves naming the thing. The name of the repository is not really interesting, but the project should get a name. It should include the fact that it deals with PHP and that it uses static analysis, and it should be catchy. So if anyone can think of a good name, please let me know. The only thing I could think of right now is PSAT, PHP Static Analysis Tool, but anyone could figure that out.

While you think, you could take a look at StringBorg. This project is about SQL injection in different languages, including PHP. The syntax definition of PHP that is used within that project will be the basis for this project.