EECS665 – Compiler Construction – Drew Davidson $\newcommand{\bigsqcap}{\mathop ⨅}$

Contents

Here's a very simple example, using a grammar that defines the language of balanced parentheses or square brackets, and running the parser on the input "( [ ] ) EOF". Note that in the examples on this page we will use sometimes name terminal symbols using single characters (such as: (, ), [, and ]) instead of the token names (lparen, rparen, etc). Also note that in the picture, the top of stack is to the left.

Grammar:

 $S$ $\longrightarrow$ $\varepsilon$ $\;|\;$ ( $S$ ) $\;|\;$ [ $S$ ]

Parse Table:

 ( ) [ ] EOF $S$ ( $S$ ) $\varepsilon$ [ $S$ ] $\varepsilon$ $\varepsilon$

 Input seen so far stack Action ( $S$ EOF pop, push ( $S$ ) ( ( $S$ ) EOF pop, scan ([ $S$ ) EOF pop, push [ $S$ ] ([ [ $S$ ] ) EOF pop, scan ([] $S$ ] ) EOF pop, push nothing ([] ] ) EOF pop, scan ([]) ) EOF pop, scan ([]) EOF EOF pop, scan ([]) EOF empty stack: input accepted

Remember, it is not always possible to build a predictive parser given a CFG; only if the CFG is LL(1). For example, the following grammar is not LL(1) (but it is LL(2)):

 $S$ $\longrightarrow$ ( $S$ ) | [ $S$ ] | ( ) | [ ]

If we try to parse an input that starts with a left paren, we are in trouble! We don't know whether to choose the first production: $S$ $\longrightarrow$ ( $S$ ), or the third one: $S$ $\longrightarrow$ ( ). If the next token is a right paren, we want to push "()". If the next token is a left paren, we want to push the three symbols "($S$)". So here we need two tokens of look-ahead.

Self-Study #1

Draw a picture like the one given above to illustrate what the parser for the grammar:

S $\longrightarrow$ $\varepsilon$ | ( S ) | [ S ]

does on the input: "[[]]".

solution

Grammar Transformations

We need to answer two important questions:

1. How to tell whether a grammar is LL(1).
2. How to build the parse (or selector) table for a predictive parser, given an LL(1) grammar.

It turns out that there is really just one answer: if we build the parse table and no element of the table contains more than one grammar rule right-hand side, then the grammar is LL(1).

Before saying how to build the table we will consider two properties that preclude a context-free grammar from being LL(1): left-recursive grammars and grammars that are not left factored. We will also consider some transformations that can be applied to such grammars to make them LL(1).

First, we will introduce one new definition:

A nonterminal $X$ is useless if either:
1. You can't derive a sequence that includes $X$, or
2. You can't derive a string from $X$ (where "string" means epsilon or a sequence of terminals).
Here are some examples of useless nonterminals :

For case 1:

 $S$ $\longrightarrow$ $A$ $B$ $A$ $\longrightarrow$ + | - | $\varepsilon$ $B$ $\longrightarrow$ digit | $B$ digit $C$ $\longrightarrow$ . $B$

For case 2:

 $S$ $\longrightarrow$ $X$ | $Y$ $X$ $\longrightarrow$ ( ) $Y$ $\longrightarrow$ ( $Y$ $Y$ )
$Y$ just derives more and more nonterminals, so it is useless.

From now on "context-free grammar" means a grammar without useless nonterms.

Left RecursionL

• A grammar $G$ is recursive in a nonterminal $X$ if $X$ can derive a sequence of symbols that includes $X$, in one or more steps: $$X \stackrel{+}\Longrightarrow \alpha X \beta$$ where $\alpha$ and $\beta$ are arbitrary sequences of symbols.

• $G$ is left recursive in nonterminal $X$ if $X$ can derive a sequence of symbols that starts with $X$, in one or more steps: $$X \stackrel{+}\Longrightarrow X \beta$$ where $\beta$ is an arbitrary sequence of symbols.
• G is immediately left recursive in nonterminal $X$ if $X$ can derive a sequence of symbols that starts with X in one step: $$X \Longrightarrow X \beta$$ i.e., the grammar includes the production: $X \longrightarrow X \beta$.

In general, it is not a problem for a grammar to be recursive. However, if a grammar is left recursive, it is not LL(1). Fortunately, we can change a grammar to remove immediate left recursion without changing the language of the grammar. Here is how to do the transformation:

Given two productions of the form:  $A$ $\longrightarrow$ $A$ $\alpha$ $|$ $\beta$
where:
• $A$ is a nonterminal
• $\alpha$ is a sequence of terminals and/or nonterminals
• $\beta$ is a sequence of terminals and/or nonterminals not starting with $A$
Replace those two productions with the following three productions:  $A$ $\longrightarrow$ $\beta$ $A'$ $A'$ $\longrightarrow$ $\alpha$ $A'$ | $\varepsilon$
where $A'$ is a new nonterminal.

Using this rule, we create a new grammar from a grammar with immediate left recursion. The new grammar is equivalent to the original one; i.e., the two grammars derive exactly the same sets of strings, but the new one is not immediately left recursive (and so has a chance of being LL(1)).

To illustrate why the new grammar is equivalent to the original one, consider the parse trees that can be built using the original grammar:

etc.

Note that the derived strings are:

• $\beta$
• $\beta$ $\alpha$
• $\beta$ $\alpha$ $\alpha$
• ...

That is, they are of the form "$\beta$, followed by zero or more $\alpha$s". The new grammar derives the same set of strings, but the parse trees have a different shape (the single $\beta$ is derived right away, and then the zero or more $\alpha$s):

etc.

Consider, for instance, the grammar for arithmetic expressions involving only subtraction:

 Exp $\longrightarrow$ Exp minus Factor | Factor Factor $\longrightarrow$ intliteral | ( Exp )

Notice that the first rule ( Exp$\longrightarrow$Exp minus Factor) has immediate left recursion, so this grammar is not LL(1). (For example, if the first token is intliteral, you don't know whether to choose the production Exp$\longrightarrow$Exp minus Factor, or Exp$\longrightarrow$Factor. If the next token is minus, then you should choose Exp$\longrightarrow$Exp minus Factor, but if the next token is EOF, then you should choose Exp$\longrightarrow$Factor.

Using the transformation defined above, we remove the immediate left recursion, producing the following new grammar:

 Exp $\longrightarrow$ Factor Exp' Exp' $\longrightarrow$ minus   Factor   Exp' $\;|\;$ $\varepsilon$ Factor $\longrightarrow$ intliteral | ( Exp )

Let's consider what the predictive parser built using this grammar does when the input starts with an integer:

• The predictive parser starts by pushing eof, then Exp onto the stack. Regardless of what the first token is, there is only one production with Exp on the left-hand side, so it will always pop the Exp from the stack and push Factor Exp' as its first action.
• Now the top-of-stack symbol is the nonterminal Factor. Since the input is the intliteral token (not the ( token) it will pop the Factor and push intliteral.
• The top-of-stack symbol is now a terminal (intliteral), which does match the current token, so the stack will be popped, and the scanner called to get the next token.
• Now the top-of-stack symbol is nonterminal Exp'. We'll consider two cases:
1. The next token is minus. In this case, we pop Exp' and push minus Factor Exp'.
2. The next token is eof. In this case, we pop Exp' and push $\varepsilon$ (i.e., push nothing).
So with the new grammar, the parser is able to tell (using only one token look-ahead) what action to perform.

Unfortunately, there is a major disadvantage of the new grammar, too. Consider the parse trees for the string 2 - 3 - 4 for the old and the new grammars:

Before eliminating Left Recursion:

After eliminating Left Recursion:

The original parse tree shows the underlying structure of the expression; in particular it groups 2 - 3 in one subtree to reflect the fact that subtraction is left associative. The parse tree for the new grammar is a mess! Its subtrees don't correspond to the sub-expressions of 2 - 3 - 4 at all! Fortunately, we can design a predictive parser to create an abstract-syntax tree that does correctly reflect the structure of the parsed code even though the grammar's parse trees do not.

Note that the rule for removing immediate left recursion given above only handled a somewhat restricted case, where there was only one left-recursive production. Here's a more general rule for removing immediate left recursion:

• For every set of productions of the form: $$A \longrightarrow A \alpha_1 \; | \; A \alpha_2 \; | \ldots | \; A \alpha_m \; | \; \beta_1 \; | \ldots | \; \beta_n$$
• Replace them with the following productions: $$A \longrightarrow \beta_1 A' \; | \; \beta_2 A' \; | \; \ldots \; | \; \beta_n A'$$ $$A' \longrightarrow \alpha_1 A' | \ldots | \alpha_m A' | \varepsilon$$

Note also that there are rules for removing non-immediate left recursion; for example, you can read about how to do that in the compiler textbook by Aho, Sethi & Ullman, on page 177. However, we will not discuss that issue here.

A second property that precludes a grammar from being LL(1) is if it is not left factored, i.e., if a nonterminal has two productions whose right-hand sides have a common prefix. For example, the following grammar is not left factored:

$Exp$ $\longrightarrow$ ( $Exp$ ) | (  )
In this example, the common prefix is "(".

This problem is solved by left-factoring, as follows:

• Given a pair of productions of the form:
$A$ $\longrightarrow$ $\alpha$ $\beta_1$ | $\alpha$ $\beta_2$
where $\alpha$ is a sequence of terminals and/or nonterminals, and $\beta_1$ and $beta_2$ are sequences of terminals and/or nonterminals that do not have a common prefix (and one of the betas could be epsilon),

• Replace those two production with:
$A$ $\longrightarrow$ $\alpha$ $A'$
$A'$ $\longrightarrow$ $\beta_1$ | $\beta_2$

For example, consider the following productions: $Exp$ $\longrightarrow$ ( $Exp$ ) $\;|\;$ (  )

Using the rule defined above, they are replaced by:

$Exp$ $\longrightarrow$ ( $Exp'$
$Exp'$ $\longrightarrow$ $Exp$ ) $\;|\;$ )

Here's the more general algorithm for left factoring (when there may be more than two productions with a common prefix):

For each nonterminal $A$:
• Find the longest non-empty prefix $\alpha$ that is common to 2 or more production right-hand sides.
• Replace the productions:  $A$ $\longrightarrow$ $\alpha$ $\beta_1$ | $\ldots$ | $\alpha$ $\beta_m$ | y1 | .. | yn
with:  $A$ $\longrightarrow$ $\alpha$ $A'$   |   $y_1$   |   $\ldots$   |   $y_n$ $A'$ $\longrightarrow$ $\beta_1$ |   $\ldots$   |   $\beta_m$

Repeat this process until no nonterminal has two productions with a common prefix.

Note that this transformation (like the one for removing immediate left recursion) has the disadvantage of making the grammar much harder to understand. However, it is necessary if you need an LL(1) grammar.

Here's an example that demonstrates both left-factoring and immediate left-recursion removal:

• The original grammar is:  $Exp$ $\longrightarrow$ ( $Exp$ ) $\;|\;$ $Exp$ $Exp$ $\;|\;$ (  )
• After removing immediate left-recursion, the grammar becomes:  $Exp$ $\longrightarrow$ ( $Exp$ ) $Exp'$ $\;|\;$ (  ) $Exp'$ $Exp'$ $\longrightarrow$ $Exp$ $Exp'$ $\;|\;$ $\varepsilon$
• After left-factoring, this new grammar becomes:  $Exp$ $\longrightarrow$ ( $Exp''$ $Exp''$ $\longrightarrow$ $Exp$ ) $Exp'$ $\;|\;$ ) $Exp'$ $Exp'$ $\longrightarrow$ $Exp$ $Exp'$ $\;|\;$ $\varepsilon$

solution

TEST YOURSELF #2

Using the same grammar: exp ::= ( exp ) | exp exp | ( ), do left factoring first, then remove immediate left recursion.

FIRST and FOLLOW Sets

Recall: A predictive parser can only be built for an LL(1) grammar. A grammar is not LL(1) if it is:

• Left recursive, or
• not left factored.
However, grammars that are not left recursive and are left factored may still not be LL(1). As mentioned earlier, to see if a grammar is LL(1), we try building the parse table for the predictive parser. If any element in the table contains more than one grammar rule right-hand side, then the grammar is not LL(1).

To build the table, we must must compute FIRST and FOLLOW sets for the grammar.

FIRST Sets

Ultimately, we want to define FIRST sets for the right-hand sides of each of the grammar's productions. To do that, we define FIRST sets for arbitrary sequences of terminals and/or nonterminals, or $\varepsilon$ (since that's what can be on the right-hand side of a grammar production). The idea is that for sequence $\alpha$, FIRST($\alpha$) is the set of terminals that begin the strings derivable from $\alpha$, and also, if $\alpha$ can derive $\varepsilon$, then $\varepsilon$ is in FIRST($\alpha$). Using derivation notation: $$\mathrm{FIRST}(\alpha) = \left\{ \; t \; \middle| \; \left( t \; \in \Sigma \wedge \alpha \stackrel{*}\Longrightarrow t \beta \right) \vee \left( t = \varepsilon \wedge \alpha \stackrel{*}\Longrightarrow \varepsilon \right) \right\}$$ To define FIRST($\alpha$) for arbitrary $\alpha$, we start by defining FIRST(X), for a single symbol X (a terminal, a nonterminal, or $\varepsilon$):

1. X $\in \Sigma$ (e.g. X is a terminal): FIRST(X) = {X}
2. X $= \varepsilon$: FIRST(X) = {$\varepsilon$}
3. X $\in N$ (e.g. X is a nonterminal): we must look at all grammar productions with X on the left, i.e., productions of the form:
X $\longrightarrow$ Y1 Y2 Y3 ... Yk
where each Yk is a single terminal or nonterminal (or there is just one Y, and it is $\varepsilon$). For each such production, we perform the following actions:
• Put FIRST(Y1) - {$\varepsilon$} into FIRST(X).
• If $\varepsilon$ is in FIRST(Y1), then put FIRST(Y2) - {epsilon} into FIRST(X).
• If epsilon is in FIRST(Y2), then put FIRST(Y3) - {$\varepsilon$} into FIRST(X).
• etc...
• If $\varepsilon$ is in FIRST(Yi) for 1 $\leq$ i $\leq$ k (all production right-hand sides)) then put $\varepsilon$ into FIRST(X).

For example, consider computing FIRST sets for each of the nonterminals in the following grammar:

 Exp $\longrightarrow$ Term Exp' Exp' $\longrightarrow$ minus Term Exp' | $\varepsilon$ Term $\longrightarrow$ Factor Term' Term' $\longrightarrow$ divide Factor Term' | $\varepsilon$ Factor $\longrightarrow$ intlit | lparens Exp rparens

Here are the FIRST sets (starting with nonterminal factor and working up, since we need to know FIRST(factor) to compute FIRST(term), and we need to know FIRST(term) to compute FIRST(exp)):

Once we have computed FIRST(X) for each terminal and nonterminal X, we can compute FIRST($\alpha$) for every production's right-hand-side $\alpha$. In general, $\alpha$ will be of the form:

X1 X2 ... Xn
where each X is a single terminal or nonterminal, or there is just one X and it is $\varepsilon$. The rules for computing FIRST($\alpha$) are essentially the same as the rules for computing the first set of a nonterminal:
1. Put FIRST(X1) - {$\varepsilon$} into FIRST($\alpha$)
2. If $\varepsilon$ is in FIRST(X1) put FIRST(X2) into FIRST($\alpha$).
3. etc...
4. If $\varepsilon$ is in the FIRST set for every Xk, put $\varepsilon$ into FIRST($\alpha$).

For the example grammar above, here are the FIRST sets for each production right-hand side:

Why do we care about the FIRST($\alpha$) sets? During parsing, suppose that there are two productions:

 Production 1: $A$ $\longrightarrow$ $\alpha$ Production 2: $A$ $\longrightarrow$ $\beta$
Consider the situation when the top-of-stack symbol is A and the current token is a. If a $\in$ FIRST($\alpha$), choose production 1: pop, push $\alpha$. If, on the other hand, a $\in$ FIRST($\beta$), choose production 2 : pop, push $\beta$. We haven't yet given the rules for using FIRST and FOLLOW sets to determine whether a grammar is LL(1); however, you might be able to guess based on this discussion, that if a is in both FIRST($\alpha$) and FIRST($\beta$), the grammar is not LL(1).

FOLLOW sets are only defined for single nonterminals. The definition is as follows:

For a nonterminal $A$, FOLLOW($A$) is the set of terminals that can appear immediately to the right of A in some partial derivation; i.e., terminal t is in FOLLOW(A) if $S \stackrel{+}\Longrightarrow\ldots A \; \mathrm{t} \ldots$ where t is a terminal. If $A$ can be the rightmost symbol in a derivation, then EOF is in FOLLOW($A$).
It is worth noting that $\varepsilon$ is never in a FOLLOW set.

Using notation: $$\mathrm{FOLLOW}(A) = \left\{ \; \mathrm{t} \; \middle| \; \left( \mathrm{t} \; \in \Sigma \wedge S \stackrel{+}\Longrightarrow \alpha \; A \; \mathrm{t} \; \beta \right) \vee \left( \mathrm{t} = \mathrm{EOF} \wedge S \stackrel{*}\Longrightarrow \alpha A \right) \right\}$$

Here are some pictures illustrating the conditions under which symbols a, c, and EOF are in the FOLLOW set of nonterminal A:

Figure 1

Figure 2

Figure 2

How to compute FOLLOW(A) for each nonterminal A:
• If A is the start nonterminal, put EOF in FOLLOW(A) (like S in Fig. 3).
• Now find the productions with A on the right-hand-side:
• For each production $X$ $\longrightarrow$ $\alpha$ $A$ $\beta$, put FIRST($\beta$) - {$\varepsilon$} in FOLLOW($A$) -- see Fig. 1.
• If epsilon is in FIRST($\beta$) then put FOLLOW($X$) into FOLLOW(A) -- see Fig. 2.
• For each production $X$ $\longrightarrow$ $\alpha$ $A$, put FOLLOW(X) into FOLLOW(A) -- see Fig. 3, and Fig. 4 below:

Figure 4

It is worth noting that:

• To compute FIRST(A) you must look for A on a production's left-hand side.
• To compute FOLLOW(A) you must look for A on a production's right-hand side.
• FIRST and FOLLOW sets are always sets of terminals (plus, perhaps, $\varepsilon$ for FIRST sets, and EOF for follow sets). Nonterminals are never in a FIRST or a FOLLOW set; $\varepsilon$ is never in a FOLLOW set.

Here's an example of FOLLOW sets (and the FIRST sets we need to compute them). In this example, nonterminals are upper-case letters, and terminals are lower-case letters.

 $S$ $\longrightarrow$ $B$ c $\;|\;$ $D$ $B$ $B$ $\longrightarrow$ a b $\;|\;$ c $S$ $D$ $\longrightarrow$ d $\;|\;$ $\varepsilon$
 $\alpha$ FIRST($\alpha$) FOLLOW($\alpha$) $D$ {d, $\varepsilon$} {a, c} $B$ {a, c} {c, EOF} $S$ {a, c, d} {EOF, c} Note that FOLLOW($S$) always includes EOF

Now let's consider why we care about FOLLOW sets:

• Suppose, during parsing, we have some $X$ at the top-of-stack, and a is the current token.
• We need to replace $X$ on the stack with the right-hand side of a production $X \longrightarrow \alpha$. What if $X$ has an additional production $X \longrightarrow \beta$. Which one should we use?
• We've already said that if a is in FIRST($\alpha$), but not in FIRST($\beta$), then we want to choose $X \longrightarrow \alpha$.
• But what if a is not in FIRST($\alpha$) or FIRST($\beta$)? If $\alpha$ or $\beta$ can derive $\varepsilon$, and a is in FOLLOW($X$), then we still have hope that the input will be accepted: If $\alpha$ can derive $\varepsilon$ (i.e., $\varepsilon \in$ FIRST($\alpha$), then we want to choose $X \longrightarrow \alpha$ (and similarly if $\beta$ can derive $\varepsilon$). The idea is that since $\alpha$ can derive $\varepsilon$, it will eventually be popped from the stack, and if we're lucky, the next symbol down (the one that was under the $X$) will be a.

TEST YOURSELF #3

Here are five grammar productions for (simplified) method headers:

1. methodHeader -> VOID ID LPAREN paramList RPAREN

2. paramList    -> epsilon
3. paramList    -> nonEmptyParamList

4. nonEmptyParamList -> ID ID
5. nonEmptyParamList -> ID ID COMMA nonEmptyParamList


Question 1: Compute the FIRST and FOLLOW sets for the three nonterminals, and the FIRST sets for each production right-hand side.

Question 2: Draw a picture to illustrate what the predictive parser will do, given the input sequence of tokens: "VOID ID LPAREN RPAREN EOF". Include an explanation of how the FIRST and FOLLOW sets are used when there is a nonterminal on the top-of-stack that has more than one production.

solution

How to Build Parse Tables

Recall that the form of the parse table is:

Table entry[$X$,a] is either empty (if having $X$ on top of stack and having a as the current token means a syntax error) or contains the right-hand side of a production whose left-hand-side nonterminal is $X$ -- that right-hand side is what should be pushed.

To build the table, we fill in the rows one at a time for each nonterminal $X$ as follows:

for each production $X \longrightarrow \alpha$:
for each terminal t in FIRST($\alpha$):
put $\alpha$ in Table[$X$,t]

if $\epsilon \in$ FIRST($\alpha$) then:
for each terminal t in FOLLOW($X$):
put $\alpha$ in Table[$X$,t]

The grammar is not LL(1) iff there is more than one entry for any cell in the table.

Let's try building a parse table for the following grammar:
 $S$ $\longrightarrow$ $B$ c $\;|\;$ $D$ $B$ $B$ $\longrightarrow$ a b $\;|\;$ c $S$ $D$ $\longrightarrow$ d $\;|\;$ $\varepsilon$
First we calculate the FIRST and FOLLOW sets:
 $\alpha$ FIRST($\alpha$) FOLLOW($\alpha$) $D$ {d, $\varepsilon$} {a, c} $B$ {a, c} {c, EOF} $S$ {a, c, d} {EOF, c} $B$ c {a, c} - $D$ $B$ {d, a, c} - a b {a} - c $S$ {c} - d {d} - $\varepsilon$ {$\varepsilon$} -
Then we use those sets to start filling in the parse table:
Not all entries have been filled in, but already we can see that this grammar is not LL(1) since there are two entries in table[S,a] and in table[S,c].

Here's how we filled in this much of the table:

1. First, we considered the production S -> B c. FIRST(Bc) = { a, c }, so we put the production's right-hand side (B c) in Table[S, a] and in Table[S, c]. FIRST(Bc) does not include epsilon, so we're done with that production.
2. Next, we considered the production S -> D B . FIRST(DB) = { d, a, c }, so we put the production's right-hand side (D B) in Table[S, d], Table[S, a], and Table[S, c].
3. Next, we considered the production D -> epsilon. FIRST(epsilon) = { epsilon }, so we must look at FOLLOW(D). FOLLOW(D) = { a, c }, so we put the production's right-hand side (epsilon) in Table[D, a] and Table[D, c}.

TEST YOURSELF #4

Finish filling in the parse table given above.

solution

How to Code a Predictive Parser

Now, suppose we actually want to code a predictive parser for a grammar that is LL(1). The simplest idea is to use a table-driven parser with an explicit stack. Here's pseudo-code for a table-driven predictive parser:

   Stack.push(EOF);
Stack.push(start-nonterminal);
currToken = scan();

while (! Stack.empty()) {
topOfStack = Stack.pop();
if (isNonTerm(topOfStack)) {
// top of stack symbol is a nonterminal
p = table[topOfStack, currToken];
if (p is empty) report syntax error and quit
else {
// p is a production's right-hand side
push p, one symbol at a time, from right to left
}
}
else {
// top of stack symbol is a terminal
if (topOfStack == currToken) currToken = scan();
else report syntax error and quit
}
}


TEST YOURSELF #5

Here is a CFG (with rule numbers):

S -> epsilon               (1)
|  X Y Z                 (2)

X -> epsilon               (3)
|  X S                   (4)

Y -> epsilon               (5)
|  a Y b                 (6)

Z -> c Z                   (7)
|  d                     (8)


Question 1(a): Compute the First and Follow Sets for each nonterminal.

Question 1(b): Draw the Parse Table.

solution