Infragistics Parsing Framework - Removing Ambiguities, Part 4

Mike Dour / Thursday, March 21, 2013

This post is part of a series on the parsing framework provided by Infragistics starting in version 12.2. The previous post in the series can be found here and the next post can be found here.

Defining binary operator grammar rules (as well as unary, ternary, and other arities) can be tricky and if not done correctly, can lead to global ambiguities all over the place. Let’s say we want to define a grammar for arithmetic expressions. And in this grammar, we only need to support addition and the only things which can be added are identifiers. Identifiers will be terminal symbols with the name “id”. One way to write the rules for this grammar would be as follows:

Expression =

id

| Expression, '+', Expression;

Now when presented with the text “x + y”, the following syntax tree would be generated:

However, if two or more additions are chained, the text becomes ambiguous. “x + y + z” could be interpreted in two different ways:

Now, other than the performance hit incurred when parsing ambiguities, this isn’t really a problem because addition is associative. The syntax tree on the left has the semantics of “(x + y) + z” and the syntax tree on the right has the semantics of “x + (y + z)”, both of which evaluate to the same result. So it would be ok to let the parser arbitrarily pick one of the trees. However, if we try to add support for multiplication, things get more complicated:

Expression =

id

| Expression, '+', Expression

| Expression, '*', Expression;

Now if the parser tries to analyze the text “x + y * z”, it will again create two ambiguous trees:

Now only one of these trees is valid. The syntax tree on the left has the semantics of “(x + y) * z”, which is incorrect, whereas the syntax tree on the right has the correct semantics of “x + (y * z)”. Therefore, when parsing arithmetic expressions, we always want the lower precedence operations to “own” the higher precedence operations. The nodes representing the lower precedence operations should appear higher in the syntax tree. One way to make sure the syntax tree always has this trait is to separate out the addition operation into its own non-terminal symbol and then mark it with HasPriority:

Expression =

id

| AdditionExpression

| MultiplicationExpression;

?<HasPriority>true</HasPriority>?;

AdditionExpression = Expression, '+', Expression;

MultiplicationExpression = Expression, '*', Expression;

However, this doesn’t prevent a global ambiguity from occurring. This just helps the parser decide which ambiguous tree should be chosen when the ambiguity does occur. So the performance hit is still incurred. Also, this only works with two precedence levels. Once there are three or more, there is no way to specify a precedence level for a non-terminal symbol. A non-terminal symbol either has priority or it doesn’t.

So what is the solution? The better way to do this is to write the rules in such a way so that there are no ambiguities and the lower precedence operations automatically own the higher precedence operations:

Expression =

P3;

P3 =

P2

| AdditionExpression;

AdditionExpression =

P3, '+', P2;

P2 =

P1

| MultiplicationExpression;

MultiplicationExpression =

P2, '*', P1;

P1 = id;

Now when parsing the text “x + y * z”, only one tree can be generated:

Although it is a bit denser than the original trees, this syntax tree represents the correct semantics of the expression. The incorrect tree from before, which had a multiplication expression “owning” an addition expression, can never be generated with this grammar.

To understand why the incorrect tree is impossible, it is important to understand some general rules used here. First of all, each arithmetic expression has as its arguments other expressions at its precedence level or a higher precedence level. This ensures that the lower precedence operations “own” the higher precedence operations. Second, each “P_” rule represents expressions at that precedence level and higher. So P3 could be anything at P3, P2, or P1. P2 could be anything at P2 or P1. And a P1 can only be something at the P1 level, which is only “id” in this grammar. This is why the root “Expression” symbol represents a P3, because an expression can be any arithmetic operation or just an identifier by itself. It can represent something at any precedence level.

Now it should be a bit clearer why an incorrect syntax tree is impossible to generate. The operands of a MultiplicationExpression can only be “P2” or “P1” level expressions, which are higher in precedence than the AdditionExpression (one of the “P3” expressions). There is no way for a P2 or P1 expression to represent an AdditionExpression non-terminal symbol, so an addition expression can never be one of the operands to a multiplication expression.

Another thing to note here is that “x + y + z” also only has one syntax tree which can be created for it. An AdditionExpression is defined as “P3 + P2”. Since the left operand is “P3” and the right operand is “P2”, the additions must always be parsed as “(x + y) + z”. As we already saw, P2 is higher in precedence than the addition expression, so an AdditionExpression can never be the right operand of another AdditionExpression, which would be required to parse this as “x + (y + z)”. So in addition to disambiguating precedence levels, this grammar also ensures that addition and multiplication are always left-associative. If we wanted to make addition right-associative, we could have defined it with “P2” as the left operand and “P3” and the right operand, like this: “P2 + P3”.

This type of thing can be done for any number of precedence levels by using the same general rules. If the binary expression should be left-associative:

LowerPrecedenceExpression =

HigherPrecedenceExpression

| LowerPrecedenceExpression, op, HigherPrecedenceExpression;

And if the binary expression should be right-associative:

LowerPrecedenceExpression =

HigherPrecedenceExpression

| HigherPrecedenceExpression, op, LowerPrecedenceExpression;

And as shown in the example above, each binary expression can be defined as a separate non-terminal symbol and just referred to by the rule which owns all the operations with the same precedence level. I want to show a larger example with more expressions, but before I do, I just want to mention unary operators. Obviously associativity doesn’t need to be disambiguated with unary operators because there is only one way to associate the operator with the operand. However, you still need to decide whether unary operators can be chained. For example, is “-+-+---5” a valid expression in the grammar or not? If unary operators can be chained, the general rule looks like this:

LowerPrecedenceExpression =

HigherPrecedenceExpression

| op, LowerPrecedenceExpression;

And if they cannot be chained, it looks like this:

LowerPrecedenceExpression =

HigherPrecedenceExpression

| op, HigherPrecedenceExpression;

These examples show a right-associative operator, but they work just the same with left-associative operators except the “op” goes after the sub-expression instead of before it.

Also, before I show the larger example, I should probably come up with some name for this general design pattern for grammar rules, because I might discuss it again in the future. It’s been pretty useful in designing grammars for our built in languages, and not just for arithmetic expressions either. It has even been helpful for things like chained join or union operations in the T-SQL language, which will be available in 13.1. I am not aware of this pattern having a name already, but if anyone thinks I’m wrong, please let me know. Since I’m not that creative, I’ve been referring to this pattern as “operator precedence rules”, which is not to be confused with an operator precedence grammar.

So let’s put it all together and look at a grammar which parses various expressions and allows them to be assigned to a variable. The first thing we need to do is define the precedence levels and associativity of our operators:

  1. Identifier, Constant, (…) – Primary expressions
  2. +, -  – Unary plus, minus (right-associative)
  3. ^ – Exponent (right-associative)
  4. *, / – Multiply, divide (left-associative)
  5. +, - – Binary plus, minus (left-associative)

And now we are ready to write our operator precedence rules:

AssignmentExpression =

id, '=', Expression;

Expression =

P5;

(* P5 represents level 5 expressions or higher, the operands of which are level 5 expressions or higher *)

P5 =

P4

| AddExpression

| SubtractExpression;

AddExpression =

P5, '+', P4;

SubtractExpression =

P5, '-', P4;

(* P4 represents level 4 expressions or higher, the operands of which are level 4 expressions or higher *)

P4 =

P3

| MultiplyExpression

| DivideExpression;

MultiplyExpression =

P4, '*', P3;

DivideExpression =

P4, '/', P3;

(* P3 represents level 3 expressions or higher, the operands of which are level 3 expressions or higher *)

P3 =

P2

| ExponentExpression;

ExponentExpression =

P2, '^', P3;     (* Note the right-associativity here *)

(* P2 represents level 2 expressions or higher, the operands of which are level 2 expressions or higher *)

P2 =

P1

| PlusExpression

| MinusExpression;

PlusExpression =

'+', P2;

MinusExpression =

'-', P2;

(* P1 represents level 1 expressions only, however the parenthesized expression could contain any level expression within the parentheses *)

P1 =

id

| constant

| ParenthesizedExpression;

ParenthesizedExpression =

'(', Expression, ')';

The only downside to this approach is that it produces a pretty dense syntax tree for simple expressions such as “x = y”:

Next time we’ll take a short detour from ambiguities and see how we can still use operator precedence rules but create a syntax tree which is sparser for simple expressions.

By Mike Dour

The information contained herein is provided for reference purposes only.

Copyright © 2013 Infragistics, Inc., All Rights Reserved.