EECS665 – Compiler Construction – Drew Davidson $\newcommand{\bigsqcap}{\mathop ⨅}$

    Syntax for Programming Languages


    Contents
    OverviewL

    While our primary focus in this course is compiler development, we will necessarily touch on some of the aspects of programming language development in the process. A programming language needs to define what constitutes valid syntax for the language (i.e. language definition). The compiler needs to determine whether a given stream of tokens belongs[1] During parsing, we're only interested in capturing the syntax (as opposed to the semantics) of the programming language. The syntax of a programming language is intuitively understood to be the structure of the code, whereas sematics is the meaning of the code. to that language (i.e. language recognition). Conveniently, a sufficiently rigorous definition of the language can be used as the basis of a recognizer: if there is some way to produce a given string according to the definition of the language, then it is necessarily part of the language.

    Context-free Grammars (CFGs) provide a promising basis for defining the syntax of a programming language: For one, the class of languages that CFGs recognize (the context-free languages) are expressive enough to capture popular syntactic constructs. At the same time, CFGs are (relatively) uniform and compact. As such, programmers are expected to be able to read a CFG-like definition of a language and recognize whether or not a given string of input tokens belongs that language.

    For a compiler, however, performing language recognition (making a yes/no determination of whether a string belongs to a language) is not enough to complete syntactic analysis. In order for a context-free grammar to serve as a suitable definition for a programming language, there are a number of refinements that need to be considered. In this section, we will discuss some of these considerations.

    Reality Check!

    We'll caveat the claim that "CFGs are a good way to define programming-language syntax" at the end of this reading, but it's worth noting that there are some compelling arguments to the contrary for practical purposes.

    Ambiguous GrammarsL

    The following grammar (recalled from the prior) defines a simple language of mathematical expressions (with the productions numbered for reference):

    P1: Exp $\longrightarrow$ intlit
    P2: Exp $\longrightarrow$ Exp minus Exp
    P3: Exp $\longrightarrow$ Exp divide Exp
    P4: Exp $\longrightarrow$ lparen Exp rparen

    Consider the string

    1 - 4 / 2

    which would be tokenized as

    intlit minus intlit divide intlit )

    It should be readily apparent to you that the string is recognized by the the language, and can be produced via the (leftmost) derivation sequence:

    Step #1. $Exp$ $\Longrightarrow$ $Exp$ minus $Exp$ (via production P2)
    Step #2.    $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ (via production P1)
    Step #3.    $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ divide $Exp$ (via production P3)
    Step #4.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide $Exp$ (via production P1)
    Step #5.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)

    This derivation sequence corresponds to the parse tree:

    image/svg+xml Exp Exp Exp Exp Exp intlit 1 minus "-" divide "/" intlit 4 intlit 2 Step #1apply P2 Step #3apply P3 Step #5 apply P1 Step #4. apply P1 Step #2apply P1

    However, inspection of the grammar also reveals that there is another leftmost derivation that can recognize the string by choosing to apply the rules differently:

    Step #1. Exp $\Longrightarrow$ Exp divide Exp (via production P3)
    Step #2.    $\;\;\;\;\;\Longrightarrow$ Exp minus Exp divide Exp (via production P2)
    Step #3.    $\;\;\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P1)
    Step #4.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P1)
    Step #5.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)

    Which corresponds to the parse tree:

    image/svg+xml Exp Exp Exp Exp Exp intlit 2 divide "/" minus "-" intlit 1 intlit 4 Step #1apply P3 Step #2apply P2 Step #4 apply P1 Step #3. apply P1 Step #5apply P1

    If for grammar $G$ and string $w$ there is:

    • more than one leftmost derivation of $w$ or,
    • more than one rightmost derivation of $w$, or
    • more than one parse tree for $w$

    then G is called an ambiguous grammar. (Note: the three conditions given above are equivalent; if one is true then all three are true.)

    It is worth noting that ambiguity is not a problem for certain applications of Context-Free Grammars. As discussed in the intro, language recognition is satisfied by finding a one derivation for the string, and the string will certainly maintain membership in the language if there is more than one derivation possible. However, for the task of parsing, ambiguous grammars cause problems, both in the correctness of the parse tree and in the efficiency of building it. We will briefly explore both of these issues.

    A problem of correctness: The major issue with ambiguous parse trees is that the underlying structure of the language is ill-defined. In the above example, subtraction and division can swap precedence; the first parse tree groups 4/2, while the second groups 1-4. These two groupings correspond to expressions with different values: the first tree corresponds to (1-(4/2)) and the second tree corresponds to ((1-4)/2).

    A problem of efficiency: As we will see later, a parser can use significantly more efficient algorithms (in terms of Big-O complexity) by adding constraints to the language that it recognizes. One such assumption of some parsing algorithms is that the language recognized is unambiguous. To some extent, this issue can come up independently of correctness. For example, the following grammar has no issue of grouping (it only ever produces intlit), but production P1 can be applied any number of times before producing the integer literal token.

    P1: Exp $\longrightarrow$ Exp
    P2: Exp $\longrightarrow$ intlit

    Disambiguating GrammarsL

    It's useful to know how to translate an ambiguous grammar into an equivalent unambiguous grammar Since every programming language includes expressions, we will discuss these operations in the context of an expression language. We are particularly interesting in forcing unambiguous precedence and associativies of the operators.

    PrecedenceL

    To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("-" in our case), then write a rule for the operator(s) with the next lowest precedence, etc:

    P1: Exp $\longrightarrow$ Exp  minus  Exp
    P2: Exp $\longrightarrow$ Term
    P3: Term $\longrightarrow$ Term  divide  Term
    P4: Term $\longrightarrow$ Factor
    P5: Factor $\longrightarrow$ intlit
    P6: Factor $\longrightarrow$ lparen  Exp  rparen

    Now let's try using these new rules to build parse trees for

    1 - 4 / 2.

    First, a parse tree that correctly reflects that fact that division has higher precedence than subtraction:

    image/svg+xml Factor intlit 4 Factor intlit 2 #8: apply P5 #10: apply P5 Exp Exp Exp Term Term minus "-" divide "/" #1: apply P3 #6: apply P3 Term Factor intlit 1 Term #2: apply P2 #3: apply P4 #4: apply P5 #5: apply P2 #7: apply P4 #9: apply P4

    Now we'll try to construct a parse tree that shows the wrong precedence:

    image/svg+xml Factor intlit 2 #??: apply P5 Term Term Term divide "/" #2: apply P3 #3: can't get to P1 #??: apply P4 Exp #1: apply P2 Stuck!
    Associativity

    This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression: 5-3-2 is equivalent to: ((5-3)-2) and not to: (5-(3-2)).


    TEST YOURSELF #1

    Draw two parse trees for the expression 5-3-2 using the current expression grammar:

      exp exp MINUS exp | term
      term
      term DIVIDE term | factor
      factor
      INTLITERAL | LPAREN exp RPAREN
    One of your parse trees should correctly group 5-3, and the other should incorrectly group 3-2.

    solution


    To understand how to write expression grammars that correctly reflect the associativity of the operators, you need to understand about recursion in grammars.
    • A grammar is recursive in nonterminal $X$ if: $$X \stackrel{+}\longrightarrow{\Tiny\dots\,}X{\Tiny\,\dots}$$ (in one or more steps, $X$ derives a sequence of symbols that includes an $X$).
    • A grammar is left recursive in $X$ if: $$X \stackrel{+}\longrightarrow X{\Tiny\,\dots}$$ (in one or more steps, $X$ derives a sequence of symbols that starts with an $X$).
    • A grammar is right recursive in $X$ if: $$X \stackrel{+}\longrightarrow {\Tiny\dots\,}X$$ (in one or more steps, $X$ derives a sequence of symbols that ends with an $X$).

    The grammar given above for arithmetic expressions is both left and right recursive in nonterminals Exp and Term (try writing the derivation steps that show this).

    To write a grammar that correctly expresses operator associativity:

    • For left associativity, use left recursion.
    • For right associativity, use right recursion.

    Here's the correct grammar:

    Exp $\longrightarrow$ Exp minus Term  |  Term
    Term $\longrightarrow$ Term divide Factor  |  Factor
    Factor $\longrightarrow$ intlit  |  lparen Exp rparen

    And here's the (one and only) parse tree that can be built for 5 - 3 - 2 using this grammar:

    Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token pow for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written: 2 ** 3, and the corresponding sequence of tokens would be: intlit pow intlit.

    Here's an ambiguous context-free grammar for this language:

    Exp  $\longrightarrow$  Exp plus Exp | Exp minus Exp | Exp times Exp | Exp divide Exp
    | Exp pow Exp | lparen Exp rparen | intlit

    First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:

    exp exp PLUS exp | exp MINUS exp | term
    term term TIMES term | term DIVIDE term | factor
    factor factor POW factor | exponent
    exponent INTLITERAL | LPAREN exp RPAREN

    This grammar is still ambiguous; it does not yet reflect the associativities of the operators. So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to: 2**(3**4)):

    exp exp PLUS term | exp MINUS term | term
    term term TIMES factor | term DIVIDE factor | factor
    factor exponent POW factor | exponent
    exponent INTLITERAL | LPAREN exp RPAREN

    Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., -3**4 is equivalent to: (-3)**4, not to -(3**4). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form: x op y op z.

    exp exp PLUS term | exp MINUS term | term
    term term TIMES factor | term DIVIDE factor | factor
    factor exponent POW factor | exponent
    exponent MINUS exponent | final
    final INTLITERAL | LPAREN exp RPAREN


    TEST YOURSELF #2

    Below is the grammar we used earlier for the language of boolean expressions, with two possible operands: true and false, and three possible operators: and, or, not:

    BExp$\;\longrightarrow\;$true
    BExp$\;\longrightarrow\;$false
    BExp $\;\longrightarrow\;$ BExp or BExp
    BExp $\;\longrightarrow\;$ BExp and BExp
    BExp $\;\longrightarrow\;$ not BExp
    BExp $\;\longrightarrow\;$ lparen BExp rparen

    Question 1: Add nonterminals so that or has lowest precedence, then and, then not. Then change the grammar to reflect the fact that both and and or are left associative.

    Question 2: Draw a parse tree (using your final grammar for Question 1) for the expression: true and not true.

    solution


    List Grammars

    Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms. For each form given below, we provide three different grammars that define the specified list language.

    • One or more PLUSes (without any separator or terminator). (Remember, any of the following three grammars defines this language; you don't need all three lines).
      1. xList $\longrightarrow$ PLUS | xList xList
      2. xList $\longrightarrow$ PLUS | xList PLUS
      3. xList $\longrightarrow$ PLUS | PLUS xList
    • One or more runs of one or more PLUSes, each run separated by commas:
      1. xList $\longrightarrow$ PLUS | xList COMMA xList
      2. xList $\longrightarrow$ PLUS | xList COMMA PLUS
      3. xList $\longrightarrow$ PLUS | PLUS COMMA xList
    • One or more PLUSes, each PLUS terminated by a semi-colon:
      1. xList $\longrightarrow$ PLUS SEMICOLON | xList xList
      2. xList $\longrightarrow$ PLUS SEMICOLON | xList PLUS SEMICOLON
      3. xList $\longrightarrow$ PLUS SEMICOLON | PLUS SEMICOLON xList
    • Zero or more PLUSes (without any separator or terminator):
      1. xList $\longrightarrow$ $\varepsilon$ | PLUS | xList xList
      2. xList $\longrightarrow$ $\varepsilon$ | PLUS | xList PLUS
      3. xList $\longrightarrow$ $\varepsilon$ | PLUS | PLUS xList
    • Zero or more PLUSes, each PLUS terminated by a semi-colon:
      1. xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList xList
      2. xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList PLUS SEMICOLON
      3. xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | PLUS SEMICOLON xList
    • The trickiest kind of list is a list of zero or more x's, separated by commas. To get it right, think of the definition as follows:
        Either an empty list, or a non-empty list of x's separated by commas.
      We already know how to write a grammar for a non-empty list of x's separated by commas, so now it's easy to write the grammar:
      xList $\longrightarrow$ $\varepsilon$ | nonemptyList
      nonemptyList $\longrightarrow$ PLUS | PLUS COMMA nonemptyList

    A Grammar for a Programming Language

    Using CFGs for programming language syntax has the notational advantage that the problem can be broken down into pieces. For example, let us begin sketching out a simple scripting language, that starts with productions for some top-level nonterminals:

    Program $\longrightarrow$ FunctionList
    FunctionList $\longrightarrow$ Function  | Function