EECS665 – Compiler Construction – Drew Davidson $\newcommand{\bigsqcap}{\mathop ⨅}$

Contents
OverviewL

While our primary focus in this course is compiler development, we will necessarily touch on some of the aspects of programming language development in the process. A programming language needs to define what constitutes valid syntax for the language (i.e. language definition). The compiler needs to determine whether a given stream of tokens belongs[1] During parsing, we're only interested in capturing the syntax (as opposed to the semantics) of the programming language. The syntax of a programming language is intuitively understood to be the structure of the code, whereas sematics is the meaning of the code. to that language (i.e. language recognition). Conveniently, a sufficiently rigorous definition of the language can be used as the basis of a recognizer: if there is some way to produce a given string according to the definition of the language, then it is necessarily part of the language.

Context-free Grammars (CFGs) provide a promising basis for defining the syntax of a programming language: For one, the class of languages that CFGs recognize (the context-free languages) are expressive enough to capture popular syntactic constructs. At the same time, CFGs are (relatively) uniform and compact. As such, programmers are expected to be able to read a CFG-like definition of a language and recognize whether or not a given string of input tokens belongs that language.

For a compiler, however, performing language recognition (making a yes/no determination of whether a string belongs to a language) is not enough to complete syntactic analysis. In order for a context-free grammar to serve as a suitable definition for a programming language, there are a number of refinements that need to be considered. In this section, we will discuss some of these considerations.

Reality Check!

We'll caveat the claim that "CFGs are a good way to define programming-language syntax" at the end of this reading, but it's worth noting that there are some compelling arguments to the contrary for practical purposes.

Ambiguous GrammarsL

The following grammar (recalled from the prior) defines a simple language of mathematical expressions (with the productions numbered for reference):

P1: Exp $\longrightarrow$ intlit
P2: Exp $\longrightarrow$ Exp minus Exp
P3: Exp $\longrightarrow$ Exp divide Exp
P4: Exp $\longrightarrow$ lparen Exp rparen

Consider the string

1 - 4 / 2

which would be tokenized as

intlit minus intlit divide intlit )

It should be readily apparent to you that the string is recognized by the the language, and can be produced via the (leftmost) derivation sequence:

Step #1. $Exp$ $\Longrightarrow$ $Exp$ minus $Exp$ (via production P2)
Step #2.    $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ (via production P1)
Step #3.    $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ divide $Exp$ (via production P3)
Step #4.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide $Exp$ (via production P1)
Step #5.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)

This derivation sequence corresponds to the parse tree:

However, inspection of the grammar also reveals that there is another leftmost derivation that can recognize the string by choosing to apply the rules differently:

Step #1. Exp $\Longrightarrow$ Exp divide Exp (via production P3)
Step #2.    $\;\;\;\;\;\Longrightarrow$ Exp minus Exp divide Exp (via production P2)
Step #3.    $\;\;\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P1)
Step #4.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P1)
Step #5.    $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)

Which corresponds to the parse tree:

If for grammar $G$ and string $w$ there is:

• more than one leftmost derivation of $w$ or,
• more than one rightmost derivation of $w$, or
• more than one parse tree for $w$

then G is called an ambiguous grammar. (Note: the three conditions given above are equivalent; if one is true then all three are true.)

It is worth noting that ambiguity is not a problem for certain applications of Context-Free Grammars. As discussed in the intro, language recognition is satisfied by finding a one derivation for the string, and the string will certainly maintain membership in the language if there is more than one derivation possible. However, for the task of parsing, ambiguous grammars cause problems, both in the correctness of the parse tree and in the efficiency of building it. We will briefly explore both of these issues.

A problem of correctness: The major issue with ambiguous parse trees is that the underlying structure of the language is ill-defined. In the above example, subtraction and division can swap precedence; the first parse tree groups 4/2, while the second groups 1-4. These two groupings correspond to expressions with different values: the first tree corresponds to (1-(4/2)) and the second tree corresponds to ((1-4)/2).

A problem of efficiency: As we will see later, a parser can use significantly more efficient algorithms (in terms of Big-O complexity) by adding constraints to the language that it recognizes. One such assumption of some parsing algorithms is that the language recognized is unambiguous. To some extent, this issue can come up independently of correctness. For example, the following grammar has no issue of grouping (it only ever produces intlit), but production P1 can be applied any number of times before producing the integer literal token.

P1: Exp $\longrightarrow$ Exp
P2: Exp $\longrightarrow$ intlit

Disambiguating GrammarsL

It's useful to know how to translate an ambiguous grammar into an equivalent unambiguous grammar Since every programming language includes expressions, we will discuss these operations in the context of an expression language. We are particularly interesting in forcing unambiguous precedence and associativies of the operators.

PrecedenceL

To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("-" in our case), then write a rule for the operator(s) with the next lowest precedence, etc:

P1: Exp $\longrightarrow$ Exp  minus  Exp
P2: Exp $\longrightarrow$ Term
P3: Term $\longrightarrow$ Term  divide  Term
P4: Term $\longrightarrow$ Factor
P5: Factor $\longrightarrow$ intlit
P6: Factor $\longrightarrow$ lparen  Exp  rparen

Now let's try using these new rules to build parse trees for

1 - 4 / 2.

First, a parse tree that correctly reflects that fact that division has higher precedence than subtraction:

Now we'll try to construct a parse tree that shows the wrong precedence:

Associativity

This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression: 5-3-2 is equivalent to: ((5-3)-2) and not to: (5-(3-2)).

TEST YOURSELF #1

Draw two parse trees for the expression 5-3-2 using the current expression grammar:

exp exp MINUS exp | term
term
term DIVIDE term | factor
factor
INTLITERAL | LPAREN exp RPAREN
One of your parse trees should correctly group 5-3, and the other should incorrectly group 3-2.

solution

To understand how to write expression grammars that correctly reflect the associativity of the operators, you need to understand about recursion in grammars.
• A grammar is recursive in nonterminal $X$ if: $$X \stackrel{+}\longrightarrow{\Tiny\dots\,}X{\Tiny\,\dots}$$ (in one or more steps, $X$ derives a sequence of symbols that includes an $X$).
• A grammar is left recursive in $X$ if: $$X \stackrel{+}\longrightarrow X{\Tiny\,\dots}$$ (in one or more steps, $X$ derives a sequence of symbols that starts with an $X$).
• A grammar is right recursive in $X$ if: $$X \stackrel{+}\longrightarrow {\Tiny\dots\,}X$$ (in one or more steps, $X$ derives a sequence of symbols that ends with an $X$).

The grammar given above for arithmetic expressions is both left and right recursive in nonterminals Exp and Term (try writing the derivation steps that show this).

To write a grammar that correctly expresses operator associativity:

• For left associativity, use left recursion.
• For right associativity, use right recursion.

Here's the correct grammar:

Exp $\longrightarrow$ Exp minus Term  |  Term
Term $\longrightarrow$ Term divide Factor  |  Factor
Factor $\longrightarrow$ intlit  |  lparen Exp rparen

And here's the (one and only) parse tree that can be built for 5 - 3 - 2 using this grammar:

Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token pow for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written: 2 ** 3, and the corresponding sequence of tokens would be: intlit pow intlit.

Here's an ambiguous context-free grammar for this language:

 Exp $\longrightarrow$ Exp plus Exp | Exp minus Exp | Exp times Exp | Exp divide Exp | Exp pow Exp | lparen Exp rparen | intlit

First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:

 exp → exp PLUS exp | exp MINUS exp | term term → term TIMES term | term DIVIDE term | factor factor → factor POW factor | exponent exponent → INTLITERAL | LPAREN exp RPAREN

This grammar is still ambiguous; it does not yet reflect the associativities of the operators. So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to: 2**(3**4)):

 exp → exp PLUS term | exp MINUS term | term term → term TIMES factor | term DIVIDE factor | factor factor → exponent POW factor | exponent exponent → INTLITERAL | LPAREN exp RPAREN

Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., -3**4 is equivalent to: (-3)**4, not to -(3**4). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form: x op y op z.

 exp → exp PLUS term | exp MINUS term | term term → term TIMES factor | term DIVIDE factor | factor factor → exponent POW factor | exponent exponent → MINUS exponent | final final → INTLITERAL | LPAREN exp RPAREN

TEST YOURSELF #2

Below is the grammar we used earlier for the language of boolean expressions, with two possible operands: true and false, and three possible operators: and, or, not:

BExp$\;\longrightarrow\;$true
BExp$\;\longrightarrow\;$false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen

Question 1: Add nonterminals so that or has lowest precedence, then and, then not. Then change the grammar to reflect the fact that both and and or are left associative.

Question 2: Draw a parse tree (using your final grammar for Question 1) for the expression: true and not true.

solution

Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms. For each form given below, we provide three different grammars that define the specified list language.

• One or more PLUSes (without any separator or terminator). (Remember, any of the following three grammars defines this language; you don't need all three lines).
1. xList $\longrightarrow$ PLUS | xList xList
2. xList $\longrightarrow$ PLUS | xList PLUS
3. xList $\longrightarrow$ PLUS | PLUS xList
• One or more runs of one or more PLUSes, each run separated by commas:
1. xList $\longrightarrow$ PLUS | xList COMMA xList
2. xList $\longrightarrow$ PLUS | xList COMMA PLUS
3. xList $\longrightarrow$ PLUS | PLUS COMMA xList
• One or more PLUSes, each PLUS terminated by a semi-colon:
1. xList $\longrightarrow$ PLUS SEMICOLON | xList xList
2. xList $\longrightarrow$ PLUS SEMICOLON | xList PLUS SEMICOLON
3. xList $\longrightarrow$ PLUS SEMICOLON | PLUS SEMICOLON xList
• Zero or more PLUSes (without any separator or terminator):
1. xList $\longrightarrow$ $\varepsilon$ | PLUS | xList xList
2. xList $\longrightarrow$ $\varepsilon$ | PLUS | xList PLUS
3. xList $\longrightarrow$ $\varepsilon$ | PLUS | PLUS xList
• Zero or more PLUSes, each PLUS terminated by a semi-colon:
1. xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList xList
2. xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList PLUS SEMICOLON
3. xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | PLUS SEMICOLON xList
• The trickiest kind of list is a list of zero or more x's, separated by commas. To get it right, think of the definition as follows:
Either an empty list, or a non-empty list of x's separated by commas.
We already know how to write a grammar for a non-empty list of x's separated by commas, so now it's easy to write the grammar:
 xList $\longrightarrow$ $\varepsilon$ | nonemptyList nonemptyList $\longrightarrow$ PLUS | PLUS COMMA nonemptyList

Using CFGs for programming language syntax has the notational advantage that the problem can be broken down into pieces. For example, let us begin sketching out a simple scripting language, that starts with productions for some top-level nonterminals:

 Program $\longrightarrow$ FunctionList FunctionList $\longrightarrow$ Function  | Function FunctionList

With these three productions, the basic shape of the program is already taking form, and some important constraints are put in place:

• A program must be a list of functions (and nothing else); no statements may appear outside of a function (similar to C), nor may any variable declarations appear outside of a function (unlike C, which allows global variable declarations outside of functions).
• A program must contain at least one function, as the FunctionList cannot immediately derive $\varepsilon$.

We can now move on to specifying the syntax of a function: A Function is the word "function", optionally preceded by the word "public", followed by an identifier (signifying the name of the function), followed by an open curly brace, followed by the function body, followed by a closing curly brace:

 Function $\longrightarrow$ public function id lcurly FunctionBody rcurly | function id lcurly FunctionBody rcurly

Note that the above definition is hardly unique. Consider this alternative, but equivalent, definition for a function:

 Function $\longrightarrow$ Access function id lcurly FunctionBody rcurly Access $\longrightarrow$ public  |   $\varepsilon$

Note that the syntax of the language does not provide a list of arguments that the function takes (many popular programming languages expect that a function has a set of parenthesis before the function body).

A function body is a list of zero or more statements:

 FunctionBody $\longrightarrow$ $\varepsilon$ | StatementList StatementList $\longrightarrow$ Statement | Statement  StatementList

From here, we can define various statement types, and so on.

A benefit of the Context-Free Grammar is that a "sub-grammar" can be  plugged in  to allow complex parse trees to be grown at the non-terminal at which they are rooted. For example, we could allow the rule:

 Statement $\longrightarrow$ id equals Exp

(where equals is the symbol =, used here for assignment)

This grammar can then make use of the expression grammar that we defined above to build parse trees with arbitrarily complex mathematical expressions on the right hand side of an assignment statement.

Summary

To understand how a parser works, we start by understanding context-free grammars, which are used to define the language recognized by the parser. Important terminology includes:

• terminal symbol
• nonterminal symbol
• grammar rule (or production)
• derivation (leftmost derivation, rightmost derivation)
• parse (or derivation) tree
• the language defined by a grammar
• ambiguous grammar

Two common kinds of grammars are grammars for expression languages, and grammar for lists. It is important to know how to write a grammar for an expression language that expresses operator precedence and associativity. It is also important to know how to write grammars for both non-empty and possibly empty lists, and for lists both with and without separators and terminators.