Infragistics Parsing Framework - Parsing 101

Mike Dour / Monday, October 22, 2012

This is the first in a series of posts I am going to write about the parsing framework shipped as a CTP in the 12.2 release of our WPF and Silverlight products. This framework is used internally by the new XamSyntaxEditor control on those platforms. This post will be an introduction to both parsing in general as well as the type of parser provided by us.

Introduction

The Infragistics parsing framework allows you to define the grammar for a language from which a parser will be created to analyze documents in that language. If you are already familiar with parsing and LR parsers, you can just read the “Supported Languages”, “Ambiguities” and “Performance” sections.

Note: it is helpful to think of a document as a single horizontal sequence of text rather than text which goes from top to bottom as it does in an editor. This is because I will be introducing concepts such as “top-down parsing” and “bottom-up parsing” which relate to the top and bottom of a tree structure and not the top and bottom of a document in an editor.

Lexical Analysis

The first two phases required for analyzing a document in a certain language are lexical analysis and syntax analysis. I am not going to get into too much detail on lexical analysis, but basically, it is the process of scanning a document and grouping characters into meaningful units called tokens. By doing so, lexical analysis imposes a structure on the string of characters of a document to create a string of tokens. For example, the lexical analysis of a C# document might create tokens for keywords, identifiers, integral constants, braces, comments, whitespace, and other elementary units of the C# language. Most C# tokens are significant, but tokens such as whitespace or comments are not. The significance of each token is defined by the grammar. So in some languages (such as VB), whitespace can be significant. The syntax analysis phase only considers significant tokens. This makes the job of the grammar writer much easier because he or she does not need to define all the areas in which comments and whitespace are allowed. Doing so might make a grammar definition overly complicated. From now on, I may refer to lexical analysis as lexing or to the lexical analyzer as the lexer.

Syntax Analysis

Similar to how the lexer imposes a structure on a string of characters, syntax analysis, or parsing, imposes a structure on the string of tokens created by the lexer. The structure created by the parser in this phase is a parse tree, or concrete syntax tree, representing the grammatical structure of the tokens. How that structure is formed depends on the rules defining what is allowed by the language being parsed. Here is a typical parse tree for a C# document:

Parsing is the process of determining how to convert a string of tokens into a parse tree structure. Here is the string of tokens represented by the parse tree above (Note: each token is surrounded by angle brackets and some tokens also have the associated text from which they were formed):

One important aspect of the tree formed by this token stream is that it preserves the order of the original string of tokens if we recursively read a node’s children from left-to-right, as can be seen by visually rearranging the leaf nodes in the above tree:

Supported Languages

The Infragistics parsing framework supports any context-free language, which is defined as a language that can be described by a context-free grammar, or CFG. To understand what a CFG is, some components must be defined:

Terminal Symbol – An elementary unit of the language. In a document without errors, each token produced by the lexical analysis phase will correspond to a single terminal symbol of the grammar. The keywords, braces, and identifiers of the previous example are terminal symbols.
Non-Terminal Symbol – A language element representing a string of zero or more symbols, each of which can be a non-terminal or terminal. Each non-terminal symbol can have one or more variations of symbols it represents. The interior nodes of the previous example, such as “NamespaceBlock” and ClassBody”, are non-terminal symbols.
Production – A single variation of a non-terminal symbol. Productions are typically written in this form: NonTerminal → Symbol1 Symbol2 Symbol3 … The left side, or head, of the production is the non-terminal designated by the production and the right side, or body, is the string of symbols which is represented by the non-terminal symbol in the head.
Start Symbol – One of the non-terminal symbols of the grammar which will be at the root of the parse tree. “Code” in the previous example is the start symbol.

A CFG is defined as having a set of terminal symbols, non-terminal symbols, productions, and one of the non-terminal symbols designated as the start symbol. Grammars can be defined in Extended Backus-Naur Form, or EBNF. In the Infragistics EBNF file format, the terminal symbols and start symbol are defined through XML in an extensible block of the EBNF file (text separated by question marks). Non-terminal symbols and productions are defined simultaneously in the normal EBNF language, like so:

Code = {UsingStatement}, {NamespaceBlock | ClassBlock};
NamespaceBlock = NAMESPACE_KEYWORD, NamespaceName, NamespaceBody;
NamespaceName = IDENTIFIER, {DOT, IDENTIFIER};
NamespaceBody = OPEN_BRACE, {NamespaceBlock | ClassBlock}, CLOSE_BRACE;
…

In EBNF, the comma (,) represents concatenation, the bar (|) represents alternation (the string of symbols on either side of the bar can be used), and the braces ({…}) represent zero or more repetitions of the contained string of symbols, also known as the Kleene closure. Not shown in this example, but also available in EBNF files, are square brackets ([…]) which enclose an optional sequence of symbols.

These grammar rules define both the non-terminal symbols and the productions. For example, the NamespaceBlock rule defines the NamespaceBlock non-terminal symbol and one production:

NamespaceBlock → NAMESPACE_KEYWORD NamespaceName NamespaceBody

Whereas the NamespaceBody rule defines the NamespaceBody non-terminal symbol and an infinite set of productions:

NamespaceBody → OPEN_BRACE CLOSE_BRACE
NamespaceBody → OPEN_BRACE NamespaceBlock CLOSE_BRACE
NamespaceBody → OPEN_BRACE ClassBlock CLOSE_BRACE
NamespaceBody → OPEN_BRACE NamespaceBlock NamespaceBlock CLOSE_BRACE
NamespaceBody → OPEN_BRACE NamespaceBlock ClassBlock CLOSE_BRACE
NamespaceBody → OPEN_BRACE ClassBlock NamespaceBlock CLOSE_BRACE
…

However, we internally create a hidden non-terminal symbol to represent the repetition, so that a limited number of productions can be created.

An important point to note about this grammar is that grammars can be recursive: NamespaceBody, a child of NamespaceBlock, can contain another NamespaceBlock symbol. And non-terminals can even refer to themselves:

Statement → …
Statement → IF OPEN_PAREN Expression CLOSE_PAREN Statement ELSE Statement
Statement → …

Parsing Process

The process of parsing a document can be viewed as finding some way to derive the start symbol into the string of tokens created by lexing the document. A derivation is a sequence of steps where in each step a non-terminal is replaced by the body of one of its productions. Here is the beginning of one such derivation which could be used to parse the sequence of tokens in our example, assuming that “Code” is defined as the start symbol of the grammar (the non-terminal symbols have been highlighted):

Code
use production Code → NamespaceBlock
NamespaceBlock
use production NamespaceBlock → NAMESPACE_KEYWORD NamespaceName NamespaceBody
NAMESPACE_KEYWORD NamespaceName NamespaceBody
use production NamespaceName → IDENTIFIER
NAMESPACE_KEYWORD IDENTIFIER NamespaceBody
use production NamespaceBody → OPEN_BRACE … CLOSE_BRACE
NAMESPACE_KEYWORD IDENTIFIER OPEN_BRACE … CLOSE_BRACE

Note that the syntax tree in our example is a graphical representation of this derivation. Each parent node in the tree represents the head of a production used in the derivation and the children represent the body of that production. It is duplicated here:

Therefore, the productions in a grammar really dictate how the parse tree will be constructed.

Although it is helpful to think of parsing in this top-down fashion, the Infragistics parser is actually a type of LR parser, which is a left-to-right scanning, bottom-up parser, so the derivation steps are performed in reverse. It starts with the string of tokens produced by the lexical analyzer, reads them from left-to-right and replaces the body of a production with the non-terminal head of that production, like so:

NAMESPACE_KEYWORD IDENTIFIER OPEN_BRACE … CLOSE_BRACE
use production NamespaceName → IDENTIFIER
NAMESPACE_KEYWORD NamespaceName OPEN_BRACE ... CLOSE_BRACE
…
use production NamspaceBody → OPEN_BRACE … CLOSE_BRACE
NAMESPACE_KEYWORD NamespaceName NamespaceBody
use production NamespaceBlock → NAMESPACE_KEYWORD NamespaceName NamespaceBody
NamespaceBlock
use production Code → NamespaceBlock
Code
Code is the Start Symbol, so the parse is complete

These reverse derivation steps are called reductions because the string of symbols in the body of a production is reduced to the non-terminal symbol in the head of that production.

Ambiguities

The Infragistics parser is a very powerful type of parser known as a generalized-LR parser, or GLR parser. This means that when parsing a document, it will simultaneously attempt to use all possible reductions allowed by the grammar to construct the parse tree. So, let’s say the following are some of the rules in a (simplified) C# EBNF grammar definition:

…
ClassName = IDENTIFIER, [TypeArgumentList];
TypeArgumentList = LT, TypeArgument, GT;
TypeArgument = IDENTIFIER;
…
ComparisonExpression = VariableReference, (LT | GT), VariableReference;
VariableReference = IDENTIFIER;
…

And within the string of tokens produced by the lexer is the following sequence:

… <IDENTIFIER (X)> <LT> <IDENTIFIER (Y)> <GT> …

After the first IDENTIFIER token, there are two possible actions the parser can take:

Perform a reduction by replacing the production body of VariableReference → IDENTIFIER with the VariableReference non-terminal symbol.
Leave the IDENTIFIER as it is are because it may be a prefix of the ClassName → IDENTIFIER TypeArgumentList production body.

In cases like these, the parser essentially splits into two logical parsers so both options can be attempted. Then when the GT token is eventually encountered, the parser which reduced to the VariableReference non-terminal symbol will see that it cannot go any further because the rules do not allow a greater than sign to follow a ComparisonExpression, and that parser goes away, leaving the parser from option 2 as the only remaining parser.

This ambiguity is a type of local ambiguity. The symbols were ambiguous until the parser processed more of the tokens and the ambiguity was resolved because only one option was valid. But there is another type of ambiguity, known as a global ambiguity. This is one which cannot be resolved, regardless of how far the parser reads through the string of tokens from the lexer. This essentially means that there is more than one way to interpret some portion of the text in the document. Consider the following C# code statement from within a method body:

A(B<C,D>(1+2));

There are two valid parse trees which can be created to represent this statement:

In one interpretation, the ‘A’ method is passed two parameters which are the results of two comparison expressions. In another interpretation, the ‘A’ method is passed a single parameter, which is the result of an invocation of generic method ‘B’. This is actually a known breaking change introduced by generics support in C# 2.0 and described in this post by Eric Lippert: http://blogs.msdn.com/b/ericlippert/archive/2007/08/30/future-breaking-changes-part-one.aspx. The first parse tree is how C# 1.0 parsed the expression and the second parse tree is how C# 2.0 parsed it.

As a general rule, to handle a global ambiguity, a grammar writer must have a way of saying “If something can be interpreted as X, it MUST be interpreted as X.” In the case of generics, they have priority in C# 2.0, so we need to say “If something can be interpreted as a generic, it MUST be interpreted as a generic.” This is done through the use of an extensible block in the EBNF file where the grammar writer indicates that a specific non-terminal has priority. So in this case, the grammar writer would want to give priority to the TypeArgumentList rule:

?<HasPriority>true</HasPriority>?;
TypeArgumentList = LT, TypeArgument, {COMMA, TypeArgument}, GT;

Now when the parse occurs, the two parse trees will still be generated, but the second one will have a priority node marked in the tree:

When the parser finishes generating these trees, it is obvious that reading more symbols will not disambiguate these parses and it has found a global ambiguity. At that point, it walks into each tree looking for a node which is marked to have priority at the lowest depth away from the root. In this tree’s case, the depth is four because the fourth level of descendants contains a priority node. In the other tree, the priority depth is infinite because there is no priority node within it. Then the tree with the lowest priority depth is chosen. If both trees have the same priority depth, the one with more priority nodes at that depth is chosen. If they are still equal after checking the priority count or neither tree contains priority nodes, an unresolved global ambiguity has occurred. In the CTP version of the parsing framework, it will arbitrarily pick one parse tree to use and there is no way to determine that this has happened. In the RTM version of the parsing framework, there will be some event or callback exposed which will help detect when this has happened.

Performance

Normally, GLR parsers are too slow to use in production. But the GLR parser in the Infragistics parsing framework is highly optimized in three different ways:

The parsing performance is exactly that of a non-generalized LR parser when only one option is available for the current token from the lexer.
When two parsers begin parsing the same type of sub-expression for the same tokens, they will merge back into one parser temporarily until the sub-expression has been fully parsed. Then they will split again.
When multiple parsing options are available, all nodes are shared if they are needed by both parsers. In the example of the local ambiguity above, the structure might look something like this after scanning past the second IDENTIFIER token (Note: this example only shows the parsers sharing terminal symbol nodes, but non-terminal symbol nodes can also be shared as portions of the tree are being constructed):

The first optimization is the most important one for the grammar writer to keep in mind. The overall speed of parsing a document is dependent on how unambiguously the grammar rules are defined. So to get the best performance possible from the Infragistics parser, the grammar writer should write an unambiguous grammar, or a deterministic context-free grammar. But for many languages, this is either not possible or doing so would ruin the structure of the resulting parse tree so that it no longer reflects the grammatical structure of the document. In those cases, the grammar writer should try to keep the number of ambiguities to a minimum. I will be writing a post shortly on tips for removing ambiguities from a grammar. And we hope to provide grammar writers with a utility to help disambiguate grammars in the near future.

Other Phases

After lexing and parsing, other phases which can occur are semantic analysis, code generation, code optimization, or others. Currently, the Infragistics parsing framework does not handle these phases, but the developer can perform them by obtaining the finished parse tree and performing various operations on it.