Infragistics Parsing Framework - Syntax Tree Pruning

Mike Dour / Friday, April 12, 2013

This post is part of a series on the parsing framework provided by Infragistics starting in version 12.2. The previous post in the series can be found here and the next post can be found here.

Last time we saw how ambiguities can be removed from grammar rules for binary operator expressions, but the resulting syntax tree was relatively dense and contained a lot of nodes which can be ignored by code consuming the syntax tree in most cases. To fix this problem, you have the ability to prune unimportant nodes out of the syntax tree. This can be done automatically for the operator precedence rules we saw in the last post and/or manually by naming specific nodes which should never appear in the syntax tree.

Note: Before I get into the specifics of tree pruning, I want to briefly talk about the upcoming 13.1 release. During this release, there was a lot of work put into the parsing framework (i.e. syntax parsing engine) to bring it from CTP quality to release quality. As part of that effort, decisions were made which required changes to the API as well as the EBNF format recognized by the framework. From now on the EBNF snippets in my posts will use the new format instead of the older 12.2 format. The changes to the EBNF format were in the special sequences containing XML. The normal EBNF grammar rule syntax did not change. Also, I plan on creating a utility soon which will read in a 12.2 EBNF file and write out an equivalent 13.1 EBNF file. When that utility is finished, I will post it here. Going forward, whenever I discuss the parsing framework API I will refer to the 13.1 object model.

As I mentioned above, there are two ways of pruning the syntax tree to remove superfluous nodes, both of which are enabled by default.
They are known as “based on children” and “based on name” pruning and they can be specified with the Grammar.SyntaxTreePruningMode property.

Pruning Based on Children

The first pruning mode I’ll discuss is the “based on children” mode. This is necessitated by the problem we saw in the previous post. The operator precedence rules do a good job of disambiguating operator expressions, but they produce a relatively dense syntax tree. This could complicate or slow down logic which needs to walk over the tree. But when the “based on children” pruning mode is being used, most of that clutter will be cleared up automatically. This pruning mode will remove any non-terminal symbol node which is the parent of a single node which is also a non-terminal symbol node. So, as an example, let’s look at this tree from the previous post:

[Image: the unpruned syntax tree for x + y * z]

This tree represents the expression x + y * z using the proper operator precedence, but there is a lot of clutter. If the “based on children” pruning mode were used, the following nodes would automatically be removed because they are non-terminal nodes which own a single other non-terminal node:

[Image: the nodes that would be removed]

So what you actually end up with is this tree:

[Image: the syntax tree after “based on children” pruning]

This is much better than what we started with, but it is not quite as good as it can be. Those P1 nodes represent the mechanics of the operator precedence rules, but they don’t really represent anything meaningful in the document’s grammatical structure. We’d like to remove those as well, and we’ll see how that can be done with the other pruning mode.

Pruning Based on Name

The other pruning mode supported by our parsing engine is “based on name” pruning and it allows you to decide which nodes to exclude from the syntax tree by naming them in a certain way. Specifically, any node with an underscore as the first character of its name will be hidden automatically.
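To make the two rules concrete, here is a minimal Python sketch that applies them to a hand-built tree. It is not the Infragistics API: the Node class, the prune function, and the tree shape (reconstructed from the operator precedence rules with helper symbols named P1, P2 and P3) are assumptions made for illustration. The sketch also assumes that a pruned node is replaced by its children and that the "single non-terminal child" test is applied to the node's original children, which is a literal reading of the rule above rather than a statement about how the engine is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical syntax tree node; not a framework type."""
    name: str
    children: list = field(default_factory=list)
    is_terminal: bool = False


def tok(text):
    return Node(text, is_terminal=True)


def nt(name, *children):
    return Node(name, list(children))


def prune(node, by_children=True, by_name=True):
    """Return the list of nodes that takes this node's place after pruning."""
    if node.is_terminal:
        return [node]
    kids = [p for c in node.children for p in prune(c, by_children, by_name)]
    # "Based on children": a non-terminal owning exactly one non-terminal.
    single_nonterminal = (len(node.children) == 1
                          and not node.children[0].is_terminal)
    # "Based on name": the symbol name starts with an underscore.
    hidden_by_name = node.name.startswith("_")
    if (by_children and single_nonterminal) or (by_name and hidden_by_name):
        return kids  # splice the children into the grandparent
    return [Node(node.name, kids)]


def show(node):
    if node.is_terminal:
        return node.name
    return f"{node.name}({' '.join(show(c) for c in node.children)})"


def pruned_text(tree, **modes):
    return " ".join(show(n) for n in prune(tree, **modes))


def expression_tree(p1, p2, p3):
    """Unpruned tree for x + y * z; p1..p3 name the precedence helpers."""
    x = nt(p3, nt(p2, nt(p1, tok("x"))))
    y = nt(p2, nt(p1, tok("y")))
    z = nt(p1, tok("z"))
    product = nt(p2, nt("MultiplicationExpression", y, tok("*"), z))
    total = nt("AdditionExpression", x, tok("+"), product)
    return nt("Expression", nt(p3, total))


print(pruned_text(expression_tree("P1", "P2", "P3")))
# AdditionExpression(P1(x) + MultiplicationExpression(P1(y) * P1(z)))
print(pruned_text(expression_tree("_p1", "_p2", "_p3")))
# AdditionExpression(x + MultiplicationExpression(y * z))
```

With the plain names only the "based on children" rule fires, so the P1 nodes survive. Giving the helper symbols a leading underscore lets the "based on name" rule hide them too.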
For example, if I wanted to re-write the addition and multiplication grammar from the previous post to prevent the “P…” nodes from ever appearing in the syntax tree, I could have prefaced their names with an underscore, like so:

    Expression = _p3;
    _p3 = _p2 | AdditionExpression;
    AdditionExpression = _p3, '+', _p2;
    _p2 = _p1 | MultiplicationExpression;
    MultiplicationExpression = _p2, '*', _p1;
    _p1 = id;

Now the syntax tree for the expression x + y * z will look like this when both pruning modes are used:

[Image: the syntax tree for x + y * z with both pruning modes applied]

And this is as sparse as a concrete syntax tree can get while still representing the original expression. So the “based on name” pruning mode allows you to create non-terminal symbols to represent the mechanics of parsing your grammar without exposing them in the final syntax tree.

When we first added the “based on name” pruning mode, I’ll admit there was a natural tendency to want to create “helper” non-terminal symbols to represent sequences which are repeated over and over again. This is OK to do in most cases, but be careful, because it can cause two problems.

The first is not a huge problem most of the time, but it is still something to keep in mind: performance. Even though these hidden non-terminal symbols are never seen in the syntax tree, the parser still needs to perform a reduce operation, which would normally create a non-terminal node to be the parent of the reduced child nodes. That reduce operation takes time. If common expressions are made to be children of these helper non-terminal symbols, those extra reduce operations could add up and might cause a noticeable slowdown for large documents. However, the parser has been highly optimized, so these reduce operations are done very quickly and in most cases you shouldn’t see a slowdown.
And just to clarify: the reduce operation for a helper non-terminal symbol is no slower than the reduce operation for a normal non-terminal symbol, so if the common expression was already the child of a non-terminal symbol, making that symbol hidden by using the underscore will not add any overhead.

The second potential issue with these helper non-terminal symbols is much more important: they could lead to more local ambiguities occurring during a parse, which could easily lead to a noticeable performance impact. An example will probably help illustrate this potential problem. Let’s say we have the following grammar rules:

    X = a, b, c;
    Y = a, b, d;
    Z = [a], b, c, d;

There are common sequences there which are repeated: each rule contains a followed by b (with the a being optional in Z). It would probably make more sense to define the sequence once using a helper non-terminal symbol and then just reference it in each place needing that sequence:

    X = _commonPrefix, c;
    Y = _commonPrefix, d;
    Z = [a], b, c, d;
    _commonPrefix = a, b;

This cleans up the grammar a bit and it will still produce the exact same syntax tree as the previous grammar when parsing the same document. However, this grammar will produce local ambiguities that the previous grammar will not. We have added a shift/reduce conflict here. Let’s suppose the parser has already seen tokens for symbols ‘a’ and ‘b’. Then if it sees a token for symbol ‘c’, it doesn’t know whether to shift to the next position in the Z non-terminal symbol or reduce the _commonPrefix non-terminal symbol, so it does both. This is essentially the opposite of the substitution strategy we used to remove ambiguities a few posts back. The main problem with this helper non-terminal symbol is that the sequence “a b” could have been part of another non-terminal symbol, but that sequence could not be replaced by the _commonPrefix because the “a” was optional.
So if you are going to use helper non-terminal symbols to organize the grammar and remove duplicated patterns, make sure the helpers do not introduce new conflicts/ambiguities into the grammar.

There is another use for these helper non-terminal symbols in addition to EBNF cleanliness. They can help to lower the complexity of non-terminal symbols (measured by the number of productions with the non-terminal symbol as their head). EBNF allows you to define optional and repetition sequences pretty concisely, but it can also hide the complexity generated by those sequences. The parsing engine always needs to process grammars in terms of their productions, each of which specifies a single sequence of symbols that a non-terminal symbol can represent. So a simple EBNF rule like S = A, B; has a single production for the parsing engine to consider: S → A B. However, if the A is optional, as in S = [A], B;, there are now two productions because the A can either be present or omitted:

    S → B
    S → A B

If both the A and B are optional, as in S = [A], [B];, there are four productions:

    S → (empty)
    S → A
    S → B
    S → A B

This is an exponentially growing problem. A grammar rule with N optional sequences concatenated together will have 2^N productions created. Our parsing engine only allows for 65,536 productions per grammar, so all it would take is a grammar rule with 16 optional symbols concatenated together to hit the maximum number of productions.

So how can helper non-terminals…umm…help here? Well, let’s take a look at this extreme example:

    S = [A], [B], [C], [D], [E], [F], [G], [H], [I], [J], [K], [L], [M], [N], [O], [P];

This single grammar rule has 16 optional sequences concatenated together, and so it produces 2^16, or 65,536 productions. The problem here is that each option compounds on the complexity of the other options before it. So if we split up the options to be in separate non-terminal symbols, they won’t compound on each other as much.
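The counting argument can be checked with a small Python sketch. It does not use the parsing framework: a rule body is modeled as a list of items, each item being the list of alternatives it can contribute, with None standing for "omitted". The function names and the _helper symbol are hypothetical and exist only for this illustration.

```python
from itertools import product
from math import prod

EMPTY = None  # marks the "omitted" alternative of an optional item


def optional(symbol):
    """[symbol] in EBNF: the symbol is either present or omitted."""
    return [symbol, EMPTY]


def required(symbol):
    """A plain symbol contributes exactly one alternative."""
    return [symbol]


def production_count(items):
    """Productions for one rule: the product of alternatives per item."""
    return prod(len(alternatives) for alternatives in items)


def productions(head, items):
    """Spell out every production the rule expands to."""
    for combo in product(*items):
        body = " ".join(s for s in combo if s is not EMPTY)
        yield f"{head} → {body}".rstrip()


print(list(productions("S", [optional("A"), optional("B")])))
# ['S → A B', 'S → A', 'S → B', 'S →']

sixteen = [optional(letter) for letter in "ABCDEFGHIJKLMNOP"]
print(production_count(sixteen))  # 65536

# Move the last eight options into a separate (hypothetical) helper symbol.
first_rule = sixteen[:8] + [required("_helper")]
helper_rule = sixteen[8:]
print(production_count(first_rule) + production_count(helper_rule))  # 512
```

Within a single rule the alternatives multiply, while the totals of separate rules are only added together, which is why splitting the options across separate non-terminal symbols keeps the count down.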
We can do this and still produce an identical syntax tree by using a pruned non-terminal symbol:

    S = [A], [B], [C], [D], [E], [F], [G], [H], _theRest;
    _theRest = [I], [J], [K], [L], [M], [N], [O], [P];

And now the S non-terminal only has 8 optional sequences so it produces 2^8, or 256 productions. But so does _theRest. So together, this grammar produces 512 productions, which is much less than the 65,536 we started with. What we essentially did was take something like this:

    2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 65536

And we replaced one of the multiplications with an addition:

    2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 + 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 512

This problem of complexity, and the assistance that helper non-terminal symbols can offer to reduce it, is not limited to optional rules. Alternation rules have the same problem. So, for example, this rule has 3 choices concatenated with 4 choices, which creates 3 * 4, or 12 productions:

    S = (A | B | C), (D | E | F | G);

And when the choices are in optional brackets, another choice is added to the group, because one choice is to exclude all of them:

    S = (A | B | C), [D | E | F | G];

Now the second group in this rule has 5 choices (D, E, F, G, or nothing). This combined with the 3 choices in the first group results in 15 total productions. Helper non-terminal symbols can reduce complexity here as well if they are used to take the place of some of the more complex groupings:

    S = (A | B | C), _secondGroup;
    _secondGroup = [D | E | F | G];

Now there are 3 productions for S and 5 productions for _secondGroup resulting in 8 productions in the grammar instead of 15. We took 3 * 5 and changed it to 3 + 5.

In 13.1, we’ve added a Grammar.Analyze() method which, among other things, will warn about non-terminal symbols being too complex. By default, it will warn about non-terminal symbols which create over 100 productions, but that threshold can be specified via options to the method.
This method also helps detect shift/reduce and reduce/reduce conflicts, but I’ll go more into detail on that in a future post.

Preventing Nodes from Being Pruned

These pruning modes help in most cases, but not always. You may have certain non-terminal symbols which do have meaning in the grammatical structure of the document, but which own a single non-terminal symbol, or which have an underscore at the start of their name. For example, in a grammar to describe the Visual Basic .NET language, these might be some of the rules that describe method arguments:

    Argument = SimpleArgument | OmittedArgument | NamedArgument;
    SimpleArgument = Expression;
    OmittedArgument = ;
    NamedArgument = IdentifierToken, ColonEqualsToken, Expression;

Based on this definition, Argument and SimpleArgument will never have nodes in the syntax tree if “based on children” pruning is used because they can only be formed by being the parent of another non-terminal node. This may be acceptable, or it may be the case that your logic needs to know when it is processing an argument expression and so you need the SimpleArgument, OmittedArgument, or NamedArgument to be included in the syntax tree. To make sure that SimpleArgument never gets pruned from the syntax tree, you can set the NonTerminalSymbol.PreventPruning property to true on the SimpleArgument symbol. This can also be done in EBNF like so:

    ?<NonTerminalSymbolOptions PreventPruning="true" />?
    SimpleArgument = Expression;

Note the new special sequence syntax for specifying NonTerminalSymbol properties in 13.1.

Now the SimpleArgument nodes will always appear in the syntax tree when they are parsed.

Next time we’ll get back to ambiguities and talk about the dangling else ambiguity and how letting it be ambiguous might be the best solution to produce a better syntax tree structure.
Copyright © 2013 Infragistics, Inc., All Rights Reserved.