Infragistics Parsing Framework - Removing Ambiguities, Part 6

Mike Dour / Monday, July 8, 2013

This post is part of a series on the parsing framework provided by Infragistics starting in version 12.2. The previous post in the series can be found here.

In the previous article, we saw the dangling else ambiguity and how letting the grammar be ambiguous in that case can produce a cleaner syntax tree when parsing a document. But in most cases, you will probably want to remove ambiguities from a grammar. Or, as in the case of the intentionally ambiguous dangling else grammar, you will definitely want to tell the parser which syntax tree to use in the case of a global ambiguity (recall that global ambiguities arise when multiple sub-trees can correctly represent the same portion of textual content). In the dangling else grammar, we were able to do this with the HasPriority setting of a non-terminal symbol, but that may not always be possible. In some cases, you will need to use custom code to determine which sub-tree is preferable.

To facilitate in these tasks, we have added a few helpful APIs which I will discuss in the order you will probably want to use them. The first thing you will want to do while creating your grammar is locate and remove as many ambiguities as possible without overly complicating the grammatical structure of documents being parsed. However, locating these ambiguities by simply reading the grammar can be extremely difficult for any medium or large scale grammar. So to aid in this effort, the Grammar object now has an Analyze method which will identify, among other things, ambiguities in the grammar. This method returns an analysis result object which has a collection of GrammarWarning instances. Each warning has a Message and a Type property, the latter of which will be the value GrammarWarningType.Ambiguity for the warnings I am going to discuss here.

Let’s see what kind of warning is generated when we analyze the dangling else ambiguity from the previous post. I have added basic definitions for IdentifierToken and Expression and I have also moved literal string terminal symbols into formal TerminalSymbol definitions so the warning’s message can use the symbol names and therefore be clearer. In addition, I have removed the HasPriority setting for reasons I will explain later.

?
<Grammar Name="Dangling Else">
     <!-- Include built-in symbols supported in this langauge -->
     <TerminalSymbolReference Name="NewLineToken"/>
     <TerminalSymbolReference Name="WhitespaceToken"/>

     <!-- Define custom symbols -->
     <TerminalSymbol Name="ElseKeyword" Value="else" />
     <TerminalSymbol Name="IfKeyword" Value="if" />

     <TerminalSymbol Name="IdentifierToken"
         Value="(?i)[a-z_][a-z_0-9]*" Comparison="RegularExpression" />
     <TerminalSymbol Name="OpenBraceToken" Value="{" />
     <TerminalSymbol Name="CloseBraceToken" Value="}" />
     <TerminalSymbol Name="OpenParenToken" Value="(" />
     <TerminalSymbol Name="CloseParenToken" Value=")" />
     <TerminalSymbol Name="SemicolonToken" Value=";" />
</Grammar>
?

Statement =
     MethodInvocation
     | Block
     | IfStatement;
MethodInvocation =
     IdentifierToken, OpenParenToken, CloseParenToken, SemicolonToken;
Block =
     OpenBraceToken, {Statement}, CloseBraceToken;

IfStatement =
     _ifStatementWithElse
     | _ifStatementWithoutElse;

_ifStatementWithElse =
     IfKeyword, OpenParenToken, Expression, CloseParenToken, Statement, ElseKeyword, Statement;
_ifStatementWithoutElse =
     IfKeyword, OpenParenToken, Expression, CloseParenToken, Statement;

Expression =
     IdentifierToken;

When we create a Grammar instance with this EBNF content and call the Analyze method on it, a single warning is generated with a Type of Ambiguity and a Message with the following text:

An ambiguity will occur during a specific configuration of the parser when the 'ElseKeyword' symbol is detected.
    The parser configuration occurs when it is at the '@' character during creation of the following productions:
        _ifStatementWithoutElse → IfKeyword OpenParenToken Expression CloseParenToken Statement @
        _ifStatementWithElse → IfKeyword OpenParenToken Expression CloseParenToken Statement @ElseKeyword Statement
    In this configuration, when the 'ElseKeyword' symbol is detected the parser must try to do all of the following:
        Option 1 - (Shift) Continue or start constructing the following productions from the '@':
            _ifStatementWithElse → IfKeyword OpenParenToken Expression CloseParenToken Statement @ElseKeyword Statement
        Option 2 - (Reduce) Create a parent node for the '_ifStatementWithoutElse' symbol because it can be followed by the 'ElseKeyword' symbol.

This is quite a mouthful, but it has all the necessary information to describe why there is a shift/reduce conflict here. The first line tells us what symbol will cause the conflict to occur: the ElseKeyword symbol, which represents the literal text ‘else’. The next few lines describe the state in which the parser must be for the ElseKeyword to cause the conflict. In this case, if the parser has already read the symbols IfKeyword, OpenParenToken, Expression, CloseParenToken, and Statement, in that order, it knows it can either be at the end of the _ifStatementWithoutElse symbol or before the ElseKeyword in the _ifStatementWithElse symbol. And the remaining lines describe why this situation causes a conflict. It lists all the possible choices the parser must try. In this case, the parser has the option to perform a shift operation and move to the next position in _ifStatementWithElse or perform a reduce operation by completing the _ifStatementWithoutElse symbol and making it a parent node of the symbols which have already been read. The reason why the reduce operation is possible is because the _ifStatementWithoutElse symbol can be followed by the ElseKeyword, because Statement can be followed by ElseKeyword and _ifStatementWithoutElse is a type of Statement*.

However, although this warning is reported, we have already decided that letting the ambiguity exist in this grammar is probably a better thing to do to keep the grammatical structure cleaner. So in that case, the next thing you will probably want to do is suppress the warning message so you never see it again for this situation. This can be done by setting SuppressAmbiguityWarnings on the _ifStatementWithoutElse non-terminal symbol to true. This setting tells the Grammar to ignore reduce operations for the non-terminal symbol when looking for ambiguities. It this case, when it ignores the reduce operation, the analysis only thinks that a shift is possible and so a warning is not reported.

Once you disable a warning for a particular ambiguity, you will want to resolve it in some way if it is a global ambiguity. This is so that when the parser finds multiple sub-trees that can represent a portion of text, it doesn’t arbitrarily pick one. As we have already seen, resolving ambiguities can be done with the HasPriority setting on non-terminal symbols, but knowing when the setting is necessary is not always obvious. In addition, sometimes the setting cannot resolve all ambiguities correctly. To help with this, we have added an OnGlobalAmbiguityDetected virtual method on LanguageBase which can be overridden in generated language classes to write code to detect and handle global ambiguities. And if you are using a CustomLanguage instance instead of a generated language class, CustomLanguage has a GlobalAmbiguityDetected event which serves the same purpose as the virtual method. Both the virtual method and the event take a GlobalAmbiguityContext object which describes an unhandled global ambiguity and allows you to specify which sub-tree is preferred in the current ambiguity. So as the parser is analyzing syntax, if it finds it can represent a segment of text in two or more ways, all of which could be valid in the overall document, and none of which have priority over the others via a HasPriority setting, the OnGlobalAmbiguityDetected method will be called to let you determine how to handle it.

The GlobalAmbiguityContext has two properties: SubTreeRoot1 and SubTreeRoot2, which are root nodes of two sub-trees representing the same textual content. If the content can be represented in more than two ways, OnGlobalAmbiguityDetected will be called multiple times until only one alternative remains. The GlobalAmbiguityContext also has a PreferredSubTreeRoot property which can be set to the node from SubTreeRoot1 or SubTreeRoot2. If it is not set, the parser will arbitrarily pick one of the sub-trees to use. So obviously, if any custom code is needed to disambiguate a parse or if you want to just detect when an unhandled global ambiguity has occurred, you must override OnGlobalAmbiguityDetected or hook the associated event.

If you don’t expect any unhandled global ambiguities, you can override OnGlobalAmbiguityDetected and debug assert when it is called. However, if you take this approach, I recommend only debug asserting when the IsErrorHandling property of the GlobalAmbiguityContext is false. If it is true, the parser has encountered a syntax error in the text and could be trying to insert or skip various tokens to recover from the error and continue parsing. These experiments by the parser could cause unhandled ambiguities to occur even in cases where they are never possible in a syntactically correct document. For example, the parser could simultaneously try to parse the text in two different ways by inserting different tokens in the same position of the text. These two resulting modifications could be ambiguous with each other because the parser is now analyzing two different pieces of text, but that ambiguity would never be possible to achieve when parsing a well formed document by itself.

So let’s see how this would work with the dangling else grammar from above, which has had the priority setting removed to allow the notification about an unhandled global ambiguity to come through. I wrote a short program to load the grammar from above (the source code is below) and parse the following ambiguous document:

if (x)

       if (y)

              A();

       else

              B();

The program also hooks into the GlobalAmbiguityDetected event of the CustomLanguage instance made from the grammar definition and prints out a textual representation of the node structure for each sub-tree. Here is how the output looks when running the program:

(Sub-tree 1)
IfStatement
    IfKeyword
    Expression
        IdentifierToken: x
    Statement
        IfStatement
            IfKeyword
            Expression
                IdentifierToken: y
            Statement
                MethodInvocation
                    IdentifierToken: A
            ElseKeyword
            Statement
                MethodInvocation
                    IdentifierToken: B
---------------------------
(Sub-tree 2)
IfStatement
    IfKeyword
    Expression
        IdentifierToken: x
    Statement
        IfStatement
            IfKeyword
            Expression
                IdentifierToken: y
            Statement
                MethodInvocation
                    IdentifierToken: A
    ElseKeyword
    Statement
        MethodInvocation
            IdentifierToken: B

As you can see, these are the two interpretations we saw in the previous post. In in the first sub-tree, the else clause is associated with the nested if statement. In the second sub-tree, it is associated with the outer if statement. Obviously, the first sub-tree is preferred, so we could detect this with code in the event handler and set that tree’s root node as the PreferredSubTreeRoot on the context object.

Here is the source code for the program which parses the document and prints the ambiguous syntax trees:

using System.Diagnostics;
using System.Linq;
using Infragistics.Documents;
using Infragistics.Documents.Parsing;

class Program
{
    static void Main(string [] args)
    {
        var result = Grammar.LoadEbnf(
@"
?
<Grammar Name=""Dangling Else"">
    <!-- Include built-in symbols supported in this langauge -->
    <TerminalSymbolReference Name=""NewLineToken""/>
    <TerminalSymbolReference Name=""WhitespaceToken""/>

    <!-- Define custom symbols -->
    <TerminalSymbol Name=""ElseKeyword"" Value=""else"" />
    <TerminalSymbol Name=""IfKeyword"" Value=""if"" />

    <TerminalSymbol Name=""IdentifierToken""
        Value=""(?i)[a-z_][a-z_0-9]*"" Comparison=""RegularExpression"" />
    <TerminalSymbol Name=""OpenBraceToken"" Value=""{"" />
    <TerminalSymbol Name=""CloseBraceToken"" Value=""}"" />
    <TerminalSymbol Name=""OpenParenToken"" Value=""("" />
    <TerminalSymbol Name=""CloseParenToken"" Value="")"" />
    <TerminalSymbol Name=""SemicolonToken"" Value="";"" />
</Grammar>
?

Statement =
    MethodInvocation
    | Block
    | IfStatement;
MethodInvocation =
    IdentifierToken, OpenParenToken, CloseParenToken, SemicolonToken;
Block =
    OpenBraceToken, {Statement}, CloseBraceToken;

IfStatement =
    _ifStatementWithElse
    | _ifStatementWithoutElse;

_ifStatementWithElse =
    IfKeyword, OpenParenToken, Expression, CloseParenToken, Statement, ElseKeyword, Statement;
_ifStatementWithoutElse =
    IfKeyword, OpenParenToken, Expression, CloseParenToken, Statement;

Expression =
    IdentifierToken;

");

        if (result.Success == false)
        {
            // Look at result.Errors collection to see what the problem is.
            return;
        }

        var grammar = result.Grammar;
        var language = new CustomLanguage(grammar);
        language.GlobalAmbiguityDetected += language_GlobalAmbiguityDetected;

        var document = new TextDocument();
        document.Language = language;
        document.InitializeText(
@"
if(x)
       if (y)
              A();
       else
              B();
");

        document.Parse();
    }

    private static void language_GlobalAmbiguityDetected(
        object sender, GlobalAmbiguityDetectedEventArgs e)
    {
        Debug.WriteLine("(Sub-tree 1)");
        DebugPrintTree(e.Context.SubTreeRoot1);
        Debug.WriteLine("---------------------------");
        Debug.WriteLine("(Sub-tree 2)");
        DebugPrintTree(e.Context.SubTreeRoot2);
    }

    public static void DebugPrintTree(SyntaxNode node)
    {
        if (node.IsTokenNode)
        {
            DebugPrintToken(node);
            return;
        }

        Debug.WriteLine(node.Symbol.Name);

        Debug.Indent();

        // Recursively print all sub-trees
        foreach (var childNode in node.Children())
            DebugPrintTree(childNode);

        Debug.Unindent();
    }

    private static void DebugPrintToken(SyntaxNode node)
    {
        switch (node.Symbol.Name)
        {
            case "OpenBraceToken":
            case "CloseBraceToken":
            case "OpenParenToken":
            case "CloseParenToken":
            case "SemicolonToken":
                return;

            case "ElseKeyword":
            case "IfKeyword":
                Debug.WriteLine(node.Symbol.Name);
                return;
        }

        Debug.WriteLine(node.Symbol.Name + ": " + node.GetText());
    }
}

 

* This is a bit of a simplification. The real reason why _ifStatementWithoutElse can be followed by an ElseKeyword is because _ifStatementWithoutElse can be at the end of a Statement symbol and then Statement can be followed by ElseKeyword. But in this case, _ifStatementWithoutElse would be the entire content of the Statement – it would be at the start and end of it, so it is just easier to think of it as a type of Statement.

By Mike Dour

The information contained herein is provided for reference purposes only.

Copyright © 2013 Infragistics, Inc., All Rights Reserved.