Solving annoying things

In my previous post I mentioned the nasty things in parsing double-quoted strings. One of the problems was that the backslash is interpreted differently depending on what follows it. As it turns out, there was an error in that post: the string "\0123" is parsed to "S". So if an octal character is defined, it will always be recognized as one.

But the problem of the evil backslash is solved! I will explain the solution, of course. The explanation is fairly detailed and may be boring for most people, but it might help some.

A double-quoted string is parsed as a list of DoubleQuotedPart elements. Lists of elements are common in SDF definitions. Each element in the list represents one part of the string, which can be one of the following:

  • A literal part, e.g. "foo"

  • An octal character encoding, e.g. "\12"

  • A hexadecimal character encoding, e.g. "\x2"

  • One of the special characters, e.g. "\n"

  • Variables


The last case is to be handled later. We first have to define the basic thing, the literal part.
    (~[\"\\\$] | SlashCharLit | DollarCharLit)+
      -> DoubleQuotedLit
    "\\" -> SlashCharLit
    "$"  -> DollarCharLit

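For readers who do not work with SDF every day, the literal production above can be approximated with a regular expression. This is only a rough sketch in Python, not part of the grammar; the names LITERAL and is_literal are made up for illustration.

```python
import re

# Rough analogue of DoubleQuotedLit: one or more of
#   - any character except ", \ and $
#   - a lone backslash (SlashCharLit)
#   - a lone dollar sign (DollarCharLit)
LITERAL = re.compile(r'(?:[^"\\$]|\\|\$)+')

def is_literal(text):
    """Return True when the whole text matches the literal pattern."""
    return LITERAL.fullmatch(text) is not None
```

Note that a text such as \x41 also matches this pattern, because the backslash on its own counts as a literal character. That is the ambiguity the rest of this post deals with.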
It might look a bit odd to say that a literal is everything except a backslash, or a backslash, but this is useful when we define the hexadecimal characters.
    syntax
      "\\" "x" [0-9A-Fa-f]
        -> HexaCharacterOne {cons("HexaChar")}
      "\\" "x" [0-9A-Fa-f][0-9A-Fa-f]
        -> HexaCharacterTwo {cons("HexaChar")}

      HexaCharacterOne -> HexaCharacter
      HexaCharacterTwo -> HexaCharacter

    restrictions
      HexaCharacterOne -/- [0-9A-Fa-f]
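The restriction on HexaCharacterOne forces the two-digit form whenever a second hexadecimal digit is available. A greedy regular expression behaves the same way, so the effect can be sketched in Python. The names HEX_ESCAPE and decode_hex_escapes are hypothetical and only serve this illustration.

```python
import re

# One or two hex digits after \x. The greedy {1,2} takes two digits
# whenever two are available, which mirrors the follow restriction
# on the one-digit form.
HEX_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{1,2})')

def decode_hex_escapes(text):
    """Replace every \\xH or \\xHH escape by the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this sketch, \x41 becomes "A", while \x414 becomes "A" followed by a literal "4", because at most two digits belong to the escape.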

Together, these definitions mean that the string "\x01" can be parsed in two ways: as a HexaCharacter or as a plain literal. We can solve this by defining a follow restriction on the backslash:
    SlashCharLit -/- [x] . [0-9A-Fa-f]
    SlashCharLit -/- [x] . [0-9A-Fa-f] . [0-9A-Fa-f]
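A follow restriction with lookahead is close in spirit to a negative lookahead in a regular expression. The Python sketch below shows the idea under that assumption; LITERAL_BACKSLASH and backslash_is_literal are illustrative names, not something from the grammar.

```python
import re

# A backslash counts as literal only when it does NOT start a hex escape.
# The negative lookahead (?!...) plays the role of the follow restriction.
LITERAL_BACKSLASH = re.compile(r'\\(?!x[0-9A-Fa-f])')

def backslash_is_literal(text, pos):
    """Return True when text[pos] is a backslash that may be read as literal."""
    return LITERAL_BACKSLASH.match(text, pos) is not None
```

The backslash in \x01 is rejected as a literal, while the one in "foo \ bar" and the one in \xg are accepted, since neither is followed by an x and a hexadecimal digit.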

This does repeat the complete specification of a HexaCharacter, but it makes sure that a HexaCharacter is never parsed as a literal.

There is only one problem left. The string "foo \ bar", which contains a plain backslash that does not escape anything, can be parsed in two ways with this definition: either Literal("foo \ bar") or [Literal("foo "),Literal("\ bar")]. This is not what we want, so we have to make sure that the shortest list of parts is preferred. Preferring the shortest list only works if no escape can be read as part of a literal, so all escapes have to be rejected as literals; this can be done in the same way as shown above. The problem is solved by adding the following line to the SDF:
    DoubleQuotedPart+ DoubleQuotedPart+
      -> DoubleQuotedPart+ {avoid}

It states that lists with more elements should be avoided, so they are only accepted if there is no other choice, which solves our backslash problem.
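To make the idea of "prefer the list with the fewest elements" concrete, here is a toy sketch in Python. It is not how SDF implements {avoid}; it simply enumerates the ways a run of literal characters could be cut into pieces and picks the shortest list. The function names splits and preferred are made up for this example.

```python
def splits(text):
    """Yield every way to cut text into one or more non-empty pieces."""
    if not text:
        return
    # The whole text as a single piece.
    yield [text]
    # A first piece of length i, followed by every split of the rest.
    for i in range(1, len(text)):
        for rest in splits(text[i:]):
            yield [text[:i]] + rest

def preferred(text):
    """Pick the candidate with the fewest pieces (text must be non-empty)."""
    return min(splits(text), key=len)
```

For "foo \ bar" the candidates include both the single literal and the two-element list from the example above, and preferred returns the single literal.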

I might have taken some big steps in this explanation. Please ask if something is not clear.
