- How to tell whether a grammar is LL(1).
- How to build the parse (or selector) table for a predictive parser, given an LL(1) grammar.
- You can't derive a sequence that includes $X$, or
- You can't derive a string from $X$ (where "string" means epsilon or a sequence of terminals).
-
A grammar $G$ is recursive in a nonterminal $X$ if $X$ can derive a sequence of symbols that includes $X$, in one or more steps: $$X \stackrel{+}\Longrightarrow \alpha X \beta$$ where $\alpha$ and $\beta$ are arbitrary sequences of symbols.
- $G$ is left recursive in nonterminal $X$ if $X$ can derive a sequence of symbols that starts with $X$, in one or more steps: $$X \stackrel{+}\Longrightarrow X \beta$$ where $\beta$ is an arbitrary sequence of symbols.
- G is immediately left recursive in nonterminal $X$ if $X$ can derive a sequence of symbols that starts with X in one step: $$X \Longrightarrow X \beta$$ i.e., the grammar includes the production: $X \longrightarrow X \beta$.
- $A$ is a nonterminal
- $\alpha$ is a sequence of terminals and/or nonterminals
- $\beta$ is a sequence of terminals and/or nonterminals not starting with $A$
- $\beta$
- $\beta$ $\alpha$
- $\beta$ $\alpha$ $\alpha$ ...
- The predictive parser starts by pushing eof, then Exp onto the stack. Regardless of what the first token is, there is only one production with Exp on the left-hand side, so it will always pop the Exp from the stack and push Factor Exp' as its first action.
- Now the top-of-stack symbol is the nonterminal Factor. Since the input is the intliteral token (not the ( token) it will pop the Factor and push intliteral.
- The top-of-stack symbol is now a terminal (intliteral), which does match the current token, so the stack will be popped, and the scanner called to get the next token.
- Now the top-of-stack symbol is nonterminal Exp'.
We'll consider two cases:
- The next token is minus. In this case, we pop Exp' and push minus Factor Exp'.
- The next token is eof. In this case, we pop Exp' and push $\varepsilon$ (i.e., push nothing).
- For every set of productions of the form: $$A \longrightarrow A \alpha_1 \; | \; A \alpha_2 \; | \ldots | \; A \alpha_m \; | \; \beta_1 \; | \ldots | \; \beta_n$$
- Replace them with the following productions: $$A \longrightarrow \beta_1 A' \; | \; \beta_2 A' \; | \; \ldots \; | \; \beta_n A'$$ $$A' \longrightarrow \alpha_1 A' | \ldots | \alpha_m A' | \varepsilon$$
- Given a pair of productions of the form:
$A$ $\longrightarrow$ $\alpha$ $\beta_1$ | $\alpha$ $\beta_2$where $\alpha$ is a sequence of terminals and/or nonterminals, and $\beta_1$ and $beta_2$ are sequences of terminals and/or nonterminals that do not have a common prefix (and one of the betas could be epsilon),
- Replace those two production with:
$A$ $\longrightarrow$ $\alpha$ $A'$
$A'$ $\longrightarrow$ $\beta_1$ | $\beta_2$ - Find the longest non-empty prefix $\alpha$ that is common to 2 or more production right-hand sides.
- Replace the productions:
$A$ $\longrightarrow$ $\alpha$ $\beta_1$ | $\ldots$ | $\alpha$ $\beta_m$ | y1 | .. | yn $A$ $\longrightarrow$ $\alpha$ $A'$   |   $y_1$   |   $\ldots$   |   $y_n$ $A'$ $\longrightarrow$ $\beta_1$ |   $\ldots$   |   $\beta_m$ Repeat this process until no nonterminal has two productions with a common prefix.
- The original grammar is:
$Exp$ $\longrightarrow$ ( $Exp$ ) $\;|\;$ $Exp$ $Exp$ $\;|\;$ (  ) - After removing immediate left-recursion, the grammar becomes:
$Exp$ $\longrightarrow$ ( $Exp$ ) $Exp'$ $\;|\;$ (  ) $Exp'$ $Exp'$ $\longrightarrow$ $Exp$ $Exp'$ $\;|\;$ $\varepsilon$ - After left-factoring, this new grammar becomes:
$Exp$ $\longrightarrow$ ( $Exp''$ $Exp''$ $\longrightarrow$ $Exp$ ) $Exp'$ $\;|\;$ ) $Exp'$ $Exp'$ $\longrightarrow$ $Exp$ $Exp'$ $\;|\;$ $\varepsilon$ - Left recursive, or
- not left factored.
- X $\in \Sigma$ (e.g. X is a terminal): FIRST(X) = {X}
- X $= \varepsilon$: FIRST(X) = {$\varepsilon$}
- X $\in N$ (e.g. X is a nonterminal): we must look at all grammar
productions with X on the left, i.e., productions of the form:
X $\longrightarrow$ Y1 Y2 Y3 ... Ykwhere each Yk is a single terminal or nonterminal (or there is just one Y, and it is $\varepsilon$). For each such production, we perform the following actions:
- Put FIRST(Y1) - {$\varepsilon$} into FIRST(X).
- If $\varepsilon$ is in FIRST(Y1), then put FIRST(Y2) - {epsilon} into FIRST(X).
- If epsilon is in FIRST(Y2), then put FIRST(Y3) - {$\varepsilon$} into FIRST(X).
- etc...
- If $\varepsilon$ is in FIRST(Yi) for 1 $\leq$ i $\leq$ k (all production right-hand sides)) then put $\varepsilon$ into FIRST(X).
- Put FIRST(X1) - {$\varepsilon$} into FIRST($\alpha$)
- If $\varepsilon$ is in FIRST(X1) put FIRST(X2) into FIRST($\alpha$).
- etc...
- If $\varepsilon$ is in the FIRST set for every Xk, put $\varepsilon$ into FIRST($\alpha$).
- If A is the start nonterminal, put EOF in FOLLOW(A) (like S in Fig. 3).
- Now find the productions with A on the right-hand-side:
- For each production $X$ $\longrightarrow$ $\alpha$ $A$ $\beta$, put FIRST($\beta$) - {$\varepsilon$} in FOLLOW($A$) -- see Fig. 1.
- If epsilon is in FIRST($\beta$) then put FOLLOW($X$) into FOLLOW(A) -- see Fig. 2.
- For each production $X$ $\longrightarrow$ $\alpha$ $A$, put FOLLOW(X) into
FOLLOW(A) -- see Fig. 3, and Fig. 4 below:
Figure 4
- To compute FIRST(A) you must look for A on a production's left-hand side.
- To compute FOLLOW(A) you must look for A on a production's right-hand side.
- FIRST and FOLLOW sets are always sets of terminals (plus, perhaps, $\varepsilon$ for FIRST sets, and EOF for follow sets). Nonterminals are never in a FIRST or a FOLLOW set; $\varepsilon$ is never in a FOLLOW set.
- Suppose, during parsing, we have some $X$ at the top-of-stack, and a is the current token.
- We need to replace $X$ on the stack with the right-hand side of a production $X \longrightarrow \alpha$. What if $X$ has an additional production $X \longrightarrow \beta$. Which one should we use?
- We've already said that if a is in FIRST($\alpha$), but not in FIRST($\beta$), then we want to choose $X \longrightarrow \alpha$.
- But what if a is not in FIRST($\alpha$) or FIRST($\beta$)? If $\alpha$ or $\beta$ can derive $\varepsilon$, and a is in FOLLOW($X$), then we still have hope that the input will be accepted: If $\alpha$ can derive $\varepsilon$ (i.e., $\varepsilon \in$ FIRST($\alpha$), then we want to choose $X \longrightarrow \alpha$ (and similarly if $\beta$ can derive $\varepsilon$). The idea is that since $\alpha$ can derive $\varepsilon$, it will eventually be popped from the stack, and if we're lucky, the next symbol down (the one that was under the $X$) will be a.
- First, we considered the production S -> B c. FIRST(Bc) = { a, c }, so we put the production's right-hand side (B c) in Table[S, a] and in Table[S, c]. FIRST(Bc) does not include epsilon, so we're done with that production.
- Next, we considered the production S -> D B . FIRST(DB) = { d, a, c }, so we put the production's right-hand side (D B) in Table[S, d], Table[S, a], and Table[S, c].
- Next, we considered the production D -> epsilon. FIRST(epsilon) = { epsilon }, so we must look at FOLLOW(D). FOLLOW(D) = { a, c }, so we put the production's right-hand side (epsilon) in Table[D, a] and Table[D, c}.
LL(1) Parsing
Contents
Here's a very simple example, using a grammar that defines the language of balanced parentheses or square brackets, and running the parser on the input "( [ ] ) EOF". Note that in the examples on this page we will use sometimes name terminal symbols using single characters (such as: (, ), [, and ]) instead of the token names (lparen, rparen, etc). Also note that in the picture, the top of stack is to the left.
Grammar:
$S$ | $\longrightarrow$ | $\varepsilon$ $\;|\;$ ( $S$ ) $\;|\;$ [ $S$ ] |
Parse Table:
( | ) | [ | ] | EOF | ||
$S$ | ( $S$ ) | $\varepsilon$ | [ $S$ ] | $\varepsilon$ | $\varepsilon$ |
Input seen so far |   | stack |   | Action |
( | $S$ EOF | pop, push ( $S$ ) | ||
( | ( $S$ ) EOF | pop, scan | ||
([ | $S$ ) EOF | pop, push [ $S$ ] | ||
([ | [ $S$ ] ) EOF | pop, scan | ||
([] | $S$ ] ) EOF | pop, push nothing | ||
([] | ] ) EOF | pop, scan | ||
([]) | ) EOF | pop, scan | ||
([]) EOF | EOF | pop, scan | ||
([]) EOF | empty stack: input accepted |
Remember, it is not always possible to build a predictive parser given a CFG; only if the CFG is LL(1). For example, the following grammar is not LL(1) (but it is LL(2)):
$S$ | $\longrightarrow$ | ( $S$ ) | [ $S$ ] | ( ) | [ ] |
If we try to parse an input that starts with a left paren, we are in trouble! We don't know whether to choose the first production: $S$ $\longrightarrow$ ( $S$ ), or the third one: $S$ $\longrightarrow$ ( ). If the next token is a right paren, we want to push "()". If the next token is a left paren, we want to push the three symbols "($S$)". So here we need two tokens of look-ahead.
Draw a picture like the one given above to illustrate what the parser for the grammar:
does on the input: "[[]]".
We need to answer two important questions:
It turns out that there is really just one answer: if we build the parse table and no element of the table contains more than one grammar rule right-hand side, then the grammar is LL(1).
Before saying how to build the table we will consider two properties that preclude a context-free grammar from being LL(1): left-recursive grammars and grammars that are not left factored. We will also consider some transformations that can be applied to such grammars to make them LL(1).
First, we will introduce one new definition:
-
A nonterminal $X$ is useless if either:
For case 1:
$S$ | $\longrightarrow$ | $A$ $B$ |
$A$ | $\longrightarrow$ | + | - | $\varepsilon$ |
$B$ | $\longrightarrow$ | digit | $B$ digit |
$C$ | $\longrightarrow$ | . $B$ |
For case 2:
$S$ | $\longrightarrow$ | $X$ | $Y$ |
$X$ | $\longrightarrow$ | ( ) |
$Y$ | $\longrightarrow$ | ( $Y$ $Y$ ) |
From now on "context-free grammar" means a grammar without useless nonterms.
Left RecursionL
In general, it is not a problem for a grammar to be recursive. However, if a grammar is left recursive, it is not LL(1). Fortunately, we can change a grammar to remove immediate left recursion without changing the language of the grammar. Here is how to do the transformation:
-
Given two productions of the form:
$A$ | $\longrightarrow$ | $A$ $\alpha$ |
$|$ | $\beta$ |
$A$ | $\longrightarrow$ | $\beta$ $A'$ |
$A'$ | $\longrightarrow$ | $\alpha$ $A'$ | $\varepsilon$ |
Using this rule, we create a new grammar from a grammar with immediate left recursion. The new grammar is equivalent to the original one; i.e., the two grammars derive exactly the same sets of strings, but the new one is not immediately left recursive (and so has a chance of being LL(1)).
To illustrate why the new grammar is equivalent to the original one, consider the parse trees that can be built using the original grammar:
etc.Note that the derived strings are:
That is, they are of the form "$\beta$, followed by zero or more $\alpha$s". The new grammar derives the same set of strings, but the parse trees have a different shape (the single $\beta$ is derived right away, and then the zero or more $\alpha$s):
etc.Consider, for instance, the grammar for arithmetic expressions involving only subtraction:
Exp | $\longrightarrow$ | Exp minus Factor | Factor |
Factor | $\longrightarrow$ | intliteral | ( Exp ) |
Notice that the first rule ( Exp$\longrightarrow$Exp minus Factor) has immediate left recursion, so this grammar is not LL(1). (For example, if the first token is intliteral, you don't know whether to choose the production Exp$\longrightarrow$Exp minus Factor, or Exp$\longrightarrow$Factor. If the next token is minus, then you should choose Exp$\longrightarrow$Exp minus Factor, but if the next token is EOF, then you should choose Exp$\longrightarrow$Factor.
Using the transformation defined above, we remove the immediate left recursion, producing the following new grammar:
Exp | $\longrightarrow$ | Factor Exp' |
Exp' | $\longrightarrow$ | minus Factor Exp' $\;|\;$ $\varepsilon$ |
Factor | $\longrightarrow$ | intliteral | ( Exp ) |
Let's consider what the predictive parser built using this grammar does when the input starts with an integer:
Unfortunately, there is a major disadvantage of the new grammar, too. Consider the parse trees for the string 2 - 3 - 4 for the old and the new grammars:
Before eliminating Left Recursion:
After eliminating Left Recursion:
The original parse tree shows the underlying structure of the expression; in particular it groups 2 - 3 in one subtree to reflect the fact that subtraction is left associative. The parse tree for the new grammar is a mess! Its subtrees don't correspond to the sub-expressions of 2 - 3 - 4 at all! Fortunately, we can design a predictive parser to create an abstract-syntax tree that does correctly reflect the structure of the parsed code even though the grammar's parse trees do not.
Note that the rule for removing immediate left recursion given above only handled a somewhat restricted case, where there was only one left-recursive production. Here's a more general rule for removing immediate left recursion:
Note also that there are rules for removing non-immediate left recursion; for example, you can read about how to do that in the compiler textbook by Aho, Sethi & Ullman, on page 177. However, we will not discuss that issue here.
A second property that precludes a grammar from being LL(1) is if it is not left factored, i.e., if a nonterminal has two productions whose right-hand sides have a common prefix. For example, the following grammar is not left factored:
This problem is solved by left-factoring, as follows:
For example, consider the following productions: $Exp$ $\longrightarrow$ ( $Exp$ ) $\;|\;$ (  )
Using the rule defined above, they are replaced by:
$Exp'$ $\longrightarrow$ $Exp$ ) $\;|\;$ )
Here's the more general algorithm for left factoring (when there may be more than two productions with a common prefix):
-
For each nonterminal $A$:
Note that this transformation (like the one for removing immediate left recursion) has the disadvantage of making the grammar much harder to understand. However, it is necessary if you need an LL(1) grammar.
Here's an example that demonstrates both left-factoring and immediate left-recursion removal:
Using the same grammar: exp ::= ( exp ) | exp exp | ( ), do left factoring first, then remove immediate left recursion.
FIRST and FOLLOW Sets
Recall: A predictive parser can only be built for an LL(1) grammar.
A grammar is not LL(1) if it is:
To build the table, we must must compute FIRST and
FOLLOW sets for the grammar.
Ultimately, we want to define FIRST sets for the right-hand sides
of each of the grammar's productions.
To do that, we define FIRST sets for arbitrary sequences of terminals
and/or nonterminals, or $\varepsilon$ (since that's what can be on the right-hand
side of a grammar production).
The idea is that for sequence $\alpha$, FIRST($\alpha$) is the set of
terminals that begin the
strings derivable from $\alpha$, and also, if $\alpha$ can derive $\varepsilon$,
then $\varepsilon$ is in FIRST($\alpha$).
Using derivation notation:
$$\mathrm{FIRST}(\alpha) = \left\{ \; t \; \middle| \;
\left(
t \; \in \Sigma
\wedge \alpha \stackrel{*}\Longrightarrow t \beta
\right)
\vee
\left( t = \varepsilon
\wedge \alpha \stackrel{*}\Longrightarrow \varepsilon \right)
\right\}
$$
To define FIRST($\alpha$) for arbitrary $\alpha$, we start by defining FIRST(X),
for a single symbol X (a terminal, a nonterminal, or $\varepsilon$):
For example, consider computing FIRST sets for each of the nonterminals
in the following grammar:
Here are the FIRST sets (starting with nonterminal factor and working up,
since we need to know FIRST(factor) to compute FIRST(term), and we need to
know FIRST(term) to compute FIRST(exp)):
Once we have computed FIRST(X) for each terminal and nonterminal X,
we can compute FIRST($\alpha$) for every
production's right-hand-side $\alpha$.
In general, $\alpha$ will be of the form:
For the example grammar above, here are the FIRST sets for each production
right-hand side:
Why do we care about the FIRST($\alpha$) sets?
During parsing, suppose that there are two productions:
FOLLOW sets are only defined for single nonterminals.
The definition is as follows:
Using notation:
$$\mathrm{FOLLOW}(A) = \left\{ \; \mathrm{t} \; \middle| \;
\left(
\mathrm{t} \; \in \Sigma
\wedge S \stackrel{+}\Longrightarrow \alpha \; A \; \mathrm{t} \; \beta
\right)
\vee
\left( \mathrm{t} = \mathrm{EOF}
\wedge S \stackrel{*}\Longrightarrow \alpha A \right)
\right\}
$$
Here are some pictures illustrating the conditions under which symbols
a, c, and EOF are in the FOLLOW set of nonterminal A:
Figure 1 Figure 2 Figure 2
It is worth noting that:
Here's an example of FOLLOW sets (and the FIRST sets we need to
compute them). In this example, nonterminals are upper-case letters, and
terminals are lower-case letters.
Now let's consider why we care about FOLLOW sets:
Here are five grammar productions for (simplified) method headers:
However, grammars that are not left recursive and are
left factored may still not be LL(1).
As mentioned earlier, to see if a grammar is LL(1), we try building
the parse table for the predictive parser.
If any element in the table contains more than one grammar rule
right-hand side, then the grammar is not LL(1).
Exp
$\longrightarrow$
Term Exp'
Exp'
$\longrightarrow$
minus Term Exp' | $\varepsilon$
Term
$\longrightarrow$
Factor Term'
Term'
$\longrightarrow$
divide Factor Term' | $\varepsilon$
Factor
$\longrightarrow$
intlit | lparens Exp rparens
X1 X2 ... Xn
where each X is a single terminal or nonterminal, or there is just one
X and it is $\varepsilon$.
The rules for computing FIRST($\alpha$) are essentially the same as the rules for
computing the first set of a nonterminal:
Production 1:
$A$
$\longrightarrow$
$\alpha$
Production 2:
$A$
$\longrightarrow$
$\beta$
For a nonterminal $A$, FOLLOW($A$) is the set of terminals
that can appear immediately to the right of A in some
partial derivation; i.e., terminal t is in FOLLOW(A) if
$
S \stackrel{+}\Longrightarrow\ldots A \; \mathrm{t} \ldots
$
where t is a terminal. If $A$ can be the rightmost symbol in a derivation,
then EOF is in FOLLOW($A$).
It is worth noting that $\varepsilon$ is never in a FOLLOW set.
$S$
$\longrightarrow$
$B$ c $\;|\;$ $D$ $B$
$B$
$\longrightarrow$
a b $\;|\;$ c $S$
$D$
$\longrightarrow$
d $\;|\;$ $\varepsilon$
$\alpha$
FIRST($\alpha$)
FOLLOW($\alpha$)
$D$
{d, $\varepsilon$}
{a, c}
$B$
{a, c}
{c, EOF}
$S$
{a, c, d}
{EOF, c}
Note that FOLLOW($S$) always includes EOF
1. methodHeader -> VOID ID LPAREN paramList RPAREN
2. paramList -> epsilon
3. paramList -> nonEmptyParamList
4. nonEmptyParamList -> ID ID
5. nonEmptyParamList -> ID ID COMMA nonEmptyParamList
Question 2: Draw a picture to illustrate what the predictive parser will do, given the input sequence of tokens: "VOID ID LPAREN RPAREN EOF". Include an explanation of how the FIRST and FOLLOW sets are used when there is a nonterminal on the top-of-stack that has more than one production.
How to Build Parse Tables
Recall that the form of the parse table is:
Table entry[$X$,a] is either empty (if having $X$ on top of stack and having a as the current token means a syntax error) or contains the right-hand side of a production whose left-hand-side nonterminal is $X$ -- that right-hand side is what should be pushed.
To build the table, we fill in the rows one at a time for each nonterminal $X$ as follows:
- for each production $X \longrightarrow \alpha$:
- for each terminal t in FIRST($\alpha$):
- put $\alpha$ in Table[$X$,t]
if $\epsilon \in$ FIRST($\alpha$) then:
- for each terminal t in FOLLOW($X$):
- put $\alpha$ in Table[$X$,t]
The grammar is not LL(1) iff there is more than one entry for any cell in the table.
$S$ | $\longrightarrow$ | $B$ c $\;|\;$ $D$ $B$ |
$B$ | $\longrightarrow$ | a b $\;|\;$ c $S$ |
$D$ | $\longrightarrow$ | d $\;|\;$ $\varepsilon$ |
$\alpha$ | FIRST($\alpha$) | FOLLOW($\alpha$) | |||
$D$ | {d, $\varepsilon$} | {a, c} | |||
$B$ | {a, c} | {c, EOF} | |||
$S$ | {a, c, d} | {EOF, c} | |||
$B$ c | {a, c} | - | |||
$D$ $B$ | {d, a, c} | - | |||
a b | {a} | - | |||
c $S$ | {c} | - | |||
d | {d} | - | |||
$\varepsilon$ | {$\varepsilon$} | - |
Here's how we filled in this much of the table:
Finish filling in the parse table given above.
How to Code a Predictive Parser
Now, suppose we actually want to code a predictive parser for a grammar that is LL(1). The simplest idea is to use a table-driven parser with an explicit stack. Here's pseudo-code for a table-driven predictive parser:
Stack.push(EOF); Stack.push(start-nonterminal); currToken = scan(); while (! Stack.empty()) { topOfStack = Stack.pop(); if (isNonTerm(topOfStack)) { // top of stack symbol is a nonterminal p = table[topOfStack, currToken]; if (p is empty) report syntax error and quit else { // p is a production's right-hand side push p, one symbol at a time, from right to left } } else { // top of stack symbol is a terminal if (topOfStack == currToken) currToken = scan(); else report syntax error and quit } }
Here is a CFG (with rule numbers):
S -> epsilon (1) | X Y Z (2) X -> epsilon (3) | X S (4) Y -> epsilon (5) | a Y b (6) Z -> c Z (7) | d (8)
Question 1(a): Compute the First and Follow Sets for each nonterminal.
Question 1(b): Draw the Parse Table.