Table of Contents
 Overview
 Scanning
 Defining Syntax
 Syntax for Programming Languages
 Syntax-Directed Definition
 Abstract-Syntax Trees
 Parsing
 CYK Parsing
 LL(1) Parsing
 Shift/Reduce Parsing
 Syntax-Directed Translation (SDT)
 Symbol Tables
 Semantic Analysis
 Dataflow
 Dataflow Solving
 Types
 Runtime Environments
 Variable Access
 Parameter Passing
 Code Generation
 Welcome
 Introduction
 The Scanner
 The Parser
 The Semantic Analyzer
 The Intermediate Code Generator
 The Optimizer
 The Code Generator
 A recognizer of some source language S.
 A translator (of programs written in S into programs written in some object or target language T).
 A program written in some host language H
 The scanner groups the characters into lexemes (sequences of characters that "go together").
 Each lexeme corresponds to a token; the scanner forwards each token (plus perhaps some additional information, such as the position at which the token was found in the input file) to the parser.
 Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
 Finds syntax errors. For example, the source code
5 position = ;
intlit id assign semicolon
All are legal tokens, but that sequence of tokens is erroneous.
 Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
 Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.
 The interior nodes of the tree are operators.
 A node's children are its operands.
 Each subtree forms a "logical unit", e.g., the subtree with * at its root shows that, because multiplication has higher precedence than addition, this operation must be performed as a unit (rate * 60, not initial + rate).
 The assembler transforms an assembly code file (such as one produced by the compiler) into a relocatable, which is either an object file (typically given the .o extension) or a library (typically given the .so or .a extension, for a dynamically-linked or statically-linked library, respectively).
 The linker merges the various object files and static libraries output by the assembler into a single executable.
 Overview
 Finite-State Machines
 Regular Expressions
 Using Regular Expressions and Finite-State Machines to Define a Scanner
 A compiler recognizes legal programs in some (source) language.
 A finite-state machine recognizes legal strings in some language.
 Nodes are states.
 Edges (arrows) are transitions. Each edge should be labeled with a single character. In this example, we've used a single edge labeled "letter" to stand for 52 edges labeled 'a', 'b', $\ldots$, 'z', 'A', $\ldots$, 'Z'. (Similarly, the label "letter,digit" stands for 62 edges labeled 'a',...'Z','0',...'9'.)
 S is the start state; every FSM has exactly one (a standard convention is to label the start state "S").
 A is a final state. By convention, final states are drawn using a double circle, and non-final states are drawn using single circles. An FSM may have more than one final state.
 The FSM starts in its start state.
 If there is an edge out of the current state whose label matches the current input character, then the FSM moves to the state pointed to by that edge, and "consumes" that character; otherwise, it gets stuck.
 The finite-state machine stops when it gets stuck or when it has consumed all of the input characters.
 The entire string is consumed (the machine did not get stuck), and
 the machine ends in a final state.
x
tmp2
XyZzy
position27
123
a?
13apples
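The acceptance procedure and the identifier examples above can be sketched directly in code. In this minimal Python sketch, the transition table is a dict and the edge-label classification ("letter", "digit") is my own encoding, not part of any formal notation:

```python
# Sketch of the FSM acceptance procedure described above.  The machine here
# is the identifier FSM: a letter, followed by any number of letters/digits.

def fsm_accepts(delta, start, finals, text):
    """Accept iff the entire input is consumed and the FSM ends in a final state."""
    state = start
    for ch in text:
        # classify the character the way the edge labels do
        label = "letter" if ch.isalpha() else "digit" if ch.isdigit() else ch
        key = (state, label)
        if key not in delta:
            return False        # the machine got stuck
        state = delta[key]      # follow the edge, consuming ch
    return state in finals      # all input consumed: accept iff in a final state

# Transition table: S --letter--> A,  A --letter--> A,  A --digit--> A
delta = {("S", "letter"): "A", ("A", "letter"): "A", ("A", "digit"): "A"}

print(fsm_accepts(delta, "S", {"A"}, "tmp2"))      # True
print(fsm_accepts(delta, "S", {"A"}, "13apples"))  # False: stuck on the leading digit
```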
 a sequence of one or more letters and/or digits,
 followed by an atsign,
 followed by one or more letters,
 followed by zero or more extensions.
 An extension is a dot followed by one or more letters.
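This description translates directly into a regular expression. The sketch below assumes ASCII letters; the pattern is just a transcription of the bullets above:

```python
import re

# One or more letters/digits, then an at-sign, then one or more letters,
# then zero or more extensions (a dot followed by one or more letters).
pattern = re.compile(r"[A-Za-z0-9]+@[A-Za-z]+(\.[A-Za-z]+)*")

print(bool(pattern.fullmatch("user1@example.co.uk")))  # True
print(bool(pattern.fullmatch("user@@example.com")))    # False: two at-signs
```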
 $Q$ is a finite set of states ($\{\mathrm{S,A,B}\}$ in the above example).
 $\Sigma$ (an uppercase sigma) is the alphabet of the machine, a finite set of characters that label the edges ($\{+,-,0,1,\ldots,9\}$ in the above example).
 $q$ is the start state, an element of $Q$ ($\mathrm{S}$ in the above example).
 $F$ is the set of final states, a subset of $Q$ ({B} in the above example).
 $\delta$ is the state-transition function: $Q \times \Sigma \rightarrow Q$ (i.e., it is a function that takes two arguments, a state in $Q$ and a character in $\Sigma$, and returns a state in $Q$).
 Deterministic:
 No state has more than one outgoing edge with the same label.
 Non-Deterministic:
 States may have more than one outgoing edge with the same label.
 Edges may be labeled with $\varepsilon$ (epsilon), the empty string. The FSM can take an $\varepsilon$-transition without looking at the current input character.
 Have a variable named state, initialized to S (the start state).
 Repeat:
 read the next character from the input
 use the table to assign a new value to the state variable
 $\mathrm{digit} \; \mid \; \mathrm{letter} \; \mathrm{letter}$
 $\mathrm{digit} \; \mid \; \mathrm{letter} \; \mathrm{letter}$*
 $\mathrm{digit} \; \mid \; \mathrm{letter}$*
 ID + ID
 ID * ID
 ID == ID
 The scanner sometimes needs to look one or more characters
beyond
the last character of the current token, and then needs to "put
back" those characters so that the next time the scanner is called
it will have the correct current character.
For example, when scanning a program written in the simple
assignment-statement language defined above, if the input is
"==", the scanner should return the EQUALS token, not two ASSIGN
tokens. So if the current character is "=", the scanner must
look at the next character to see whether it is another "="
(in which case it will return EQUALS), or is some other character
(in which case it will put that character back and return ASSIGN).
 It is no longer correct to run the FSM program until the machine gets stuck or end-of-input is reached, since in general the input will correspond to many tokens, not just a single token.
 we must modify the machines so that a state can have an associated action to "put back N characters" and/or to "return token XXX",
 we must combine the finite-state machines for all of the tokens into a single machine, and
 we must write a program for the "combined" machine.
 The table will include a column for end-of-file as well as for all possible characters (the end-of-file column is needed, for example, so that the scanner works correctly when an identifier is the last token in the input).
 Each table entry may include an action as well as or instead of a new state.
 Instead of repeating "read a character; update the state variable" until the machine gets stuck or the entire input is read, the code will repeat: "read a character; perform the action and/or update the state variable" (eventually, the action will be to return a value, so the scanner code will stop, and will start again in the start state next time it is called).
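As a concrete illustration, here is a sketch of a scanner whose "combined machine" carries actions (put back a character, return a token) as well as next states. The two-state machine distinguishes ASSIGN ("=") from EQUALS ("=="), as in the put-back example above; the machine is direct-coded rather than table-driven, and the state and token names are illustrative:

```python
# Sketch of a scanner loop whose states carry actions ("return token",
# "put back the character") as well as transitions.  A real generated
# scanner would drive this from a table; this version is hand-coded.

def scan_one(text, pos):
    """Return (token, new_pos) for the token starting at text[pos]."""
    state = "S"
    while True:
        ch = text[pos] if pos < len(text) else "EOF"   # the end-of-file column
        if state == "S":
            if ch == "=":
                state, pos = "SAW_EQ", pos + 1          # consume; no token yet
            else:
                raise ValueError("unexpected character: %r" % ch)
        elif state == "SAW_EQ":
            if ch == "=":
                return ("EQUALS", pos + 1)              # action: return EQUALS
            else:
                return ("ASSIGN", pos)                  # put back ch; return ASSIGN

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        tok, pos = scan_one(text, pos)
        tokens.append(tok)
    return tokens

print(scan("==="))   # ['EQUALS', 'ASSIGN']
print(scan("=="))    # ['EQUALS']
```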
 Overview
 Example: Simple Arithmetic Expressions
 Formal Definition
 Example: Boolean Expressions, Assignment Statements, and If Statements
 Self-Study #1
 The Language Defined by a CFG
 Groups the tokens into "grammatical phrases".
 Discovers the underlying structure of the program.
 Finds syntax errors.
 Perhaps also performs some actions to find other kinds of errors.
 an abstractsyntax tree (maybe + a symbol table),
 or intermediate code,
 or object code.
 An integer is an arithmetic expression.
 If exp_{1} and exp_{2} are arithmetic expressions,
then so are the following:
 exp_{1} - exp_{2}
 exp_{1} / exp_{2}
 ( exp_{1} )
 The grammar has five terminal symbols: intliteral minus divide lparen rparen. The terminals of a grammar used to define a programming language are the tokens returned by the scanner.
 The grammar has one nonterminal: Exp (note that a single name, Exp, is used instead of Exp_{1} and Exp_{2} as in the English definition above).
 The grammar has four productions or rules,
each of the form:
Exp $\longrightarrow$ ...
A production left-hand side is a single nonterminal. A production right-hand side is either the special symbol $\varepsilon$ (the same $\varepsilon$ that can be used in a regular expression) or a sequence of one or more terminals and/or nonterminals (there is no rule with $\varepsilon$ on the right-hand side in the example given above).
 $N$ is a set of nonterminals.
 $\Sigma$ is a set of terminals.
 $P$ is a set of productions (or rules).
 $S$ is the start nonterminal (sometimes called the goal nonterminal) in $N$. If not specified, then it is the nonterminal that appears on the left-hand side of the first production.
 "true" is a boolean expression, recognized by the token true.
 "false" is a boolean expression, recognized by the token false.
 If exp_{1} and exp_{2} are boolean expressions, then so are the following:
 exp_{1} || exp_{2}
 exp_{1} && exp_{2}
 ! exp_{1}
 ( exp_{1} )
 The word "if", followed by a boolean expression in parentheses, followed by a statement, or
 The word "if", followed by a boolean expression in parentheses, followed by a statement, followed by the word "else", followed by a statement.
 Start by setting the "current sequence" to be the start nonterminal.
 Repeat:
 find a nonterminal $X$ in the current sequence;
 find a production in the grammar with $X$ on the left (i.e., of the form $X$ → $\alpha$, where $\alpha$ is either $\varepsilon$ (the empty string) or a sequence of terminals and/or nonterminals);
 create a new "current sequence" in which $\alpha$ replaces the $X$ found above;
 $\Longrightarrow$ means derives in one step
 $\stackrel{+}\Longrightarrow$ means derives in one or more steps
 $\stackrel{*}\Longrightarrow$ means derives in zero or more steps
 $S$ is the start nonterminal of $G$
 $w$ is a sequence of terminals or $\varepsilon$
 Start with the start nonterminal.
 Repeat:
 choose a leaf nonterminal $X$
 choose a production $X \longrightarrow \alpha$
 the symbols in $\alpha$ become the children of $X$ in the tree
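The loop above can be traced mechanically. This sketch carries out a leftmost derivation for the arithmetic-expression grammar from earlier, with the sequence of production choices supplied by hand:

```python
# Sketch: carry out a leftmost derivation for the grammar
#   Exp -> intliteral | Exp minus Exp | Exp divide Exp | lparen Exp rparen
# The sequence of production choices is supplied by hand for this trace.

def derive(start, choices):
    """Apply each chosen production to the leftmost nonterminal; print each step."""
    current = [start]
    print(" ".join(current))
    for rhs in choices:
        i = current.index("Exp")              # leftmost nonterminal (only Exp here)
        current = current[:i] + rhs + current[i + 1:]
        print("=> " + " ".join(current))
    return current

result = derive("Exp", [
    ["Exp", "minus", "Exp"],       # Exp => Exp minus Exp
    ["intliteral"],                # => intliteral minus Exp
    ["intliteral"],                # => intliteral minus intliteral
])
```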
 Overview
 Ambiguous Grammars
 Disambiguating Grammars
 List Grammars
 A Grammar for a Programming Language
 Summary
 more than one leftmost derivation of $w$ or,
 more than one rightmost derivation of $w$, or
 more than one parse tree for $w$
 A grammar is recursive in nonterminal $X$ if: $$X \stackrel{+}\Longrightarrow \ldots X \ldots$$ (in one or more steps, $X$ derives a sequence of symbols that includes an $X$).
 A grammar is left recursive in $X$ if: $$X \stackrel{+}\Longrightarrow X \ldots$$ (in one or more steps, $X$ derives a sequence of symbols that starts with an $X$).
 A grammar is right recursive in $X$ if: $$X \stackrel{+}\Longrightarrow \ldots X$$ (in one or more steps, $X$ derives a sequence of symbols that ends with an $X$).
 For left associativity, use left recursion.
 For right associativity, use right recursion.
 One or more PLUSes (without any separator or terminator).
(Remember, each of the following three grammars defines
this language; you only need one of the three lines.)
 xList $\longrightarrow$ PLUS | xList xList
 xList $\longrightarrow$ PLUS | xList PLUS
 xList $\longrightarrow$ PLUS | PLUS xList
 One or more runs of one or more PLUSes, each run separated by commas:
 xList $\longrightarrow$ PLUS | xList COMMA xList
 xList $\longrightarrow$ PLUS | xList COMMA PLUS
 xList $\longrightarrow$ PLUS | PLUS COMMA xList
 One or more PLUSes, each PLUS terminated by a semicolon:
 xList $\longrightarrow$ PLUS SEMICOLON | xList xList
 xList $\longrightarrow$ PLUS SEMICOLON | xList PLUS SEMICOLON
 xList $\longrightarrow$ PLUS SEMICOLON | PLUS SEMICOLON xList
 Zero or more PLUSes (without any separator or terminator):
 xList $\longrightarrow$ $\varepsilon$ | PLUS | xList xList
 xList $\longrightarrow$ $\varepsilon$ | PLUS | xList PLUS
 xList $\longrightarrow$ $\varepsilon$ | PLUS | PLUS xList
 Zero or more PLUSes, each PLUS terminated by a semicolon:
 xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList xList
 xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList PLUS SEMICOLON
 xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | PLUS SEMICOLON xList

The trickiest kind of list is a list of
zero or more x's, separated by commas.
To get it right, think of the definition as follows:

Either an empty list, or a nonempty list of x's separated
by commas.
xList $\longrightarrow$ $\varepsilon$ | nonemptyList
nonemptyList $\longrightarrow$ PLUS | PLUS COMMA nonemptyList
 A program must be a list of functions (and nothing else); no statements may appear outside of a function (similar to C), nor may any variable declarations appear outside of a function (unlike C, which allows global variable declarations outside of functions).
 A program must contain at least one function, as the FunctionList cannot immediately derive $\varepsilon$.
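One quick sanity check for a list grammar is to transcribe it into a recognizer. This recursive sketch checks token sequences against the trickiest case above (zero or more PLUSes separated by commas); the function names are mine:

```python
# Sketch: recognizer for
#   xList        -> epsilon | nonemptyList
#   nonemptyList -> PLUS | PLUS COMMA nonemptyList
# i.e., zero or more PLUSes separated by commas.

def is_xlist(tokens):
    return tokens == [] or is_nonempty(tokens)

def is_nonempty(tokens):
    if tokens[:1] != ["PLUS"]:
        return False
    if len(tokens) == 1:
        return True                      # nonemptyList -> PLUS
    # nonemptyList -> PLUS COMMA nonemptyList
    return tokens[1] == "COMMA" and is_nonempty(tokens[2:])

print(is_xlist([]))                              # True  (the empty list)
print(is_xlist(["PLUS", "COMMA", "PLUS"]))       # True
print(is_xlist(["PLUS", "COMMA"]))               # False (trailing comma)
```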
 terminal symbol
 nonterminal symbol
 grammar rule (or production)
 derivation (leftmost derivation, rightmost derivation)
 parse (or derivation) tree
 the language defined by a grammar
 ambiguous grammar
 Motivation and Definition
 Example 1: Value of an Arithmetic Expression
 Example 2: Type of an Expression
 Test Yourself #1
 constants
 the right-hand-side nonterminals' translations
 the right-hand-side tokens' values (e.g., the integer value associated with an INTLITERAL token, or the String value associated with an ID token)
 Build the parse tree.
 Use the translation rules to compute the translation of each nonterminal in the tree, working bottom-up (since a nonterminal's value may depend on the values of the symbols on the right-hand side, you need to work bottom-up so that those values are available).
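These two steps can be sketched as follows for Example 1 (the value of an arithmetic expression). The tuple encoding of the tree is an assumption made for brevity:

```python
# Sketch: compute the translation (here, the integer value) of each node
# bottom-up.  A node is either an int (an INTLITERAL's value) or a tuple
# (op, left, right); this representation is just for illustration.

def translate(node):
    if isinstance(node, int):
        return node                      # translation of intliteral = its value
    op, left, right = node
    # the children's translations must be computed first (bottom-up)
    lval, rval = translate(left), translate(right)
    return lval + rval if op == "+" else lval * rval

# 3 * (4 + 2)
tree = ("*", 3, ("+", 4, 2))
print(translate(tree))  # 18
```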
 Both operands of the + operator must be of type $int$.
 Both operands of the && operator must be of type $bool$.
 Both operands of the == operator must have the same (non-$error$) type.
 Operators appear at internal nodes instead of at leaves.
 "Chains" of single productions are collapsed.
 Lists are "flattened".
 Syntactic details (e.g., parentheses, commas, semicolons) are omitted.
 The grouping symbols are necessary to build the particular parse tree in this example (since otherwise the higher precedence of multiplication would have grouped the 3 with the 4), and the grouping symbols must be kept as leaves of the parse tree in order to fulfill the property that reading the frontier of the parse tree yields the complete token stream. However, the grouping is implicit in the AST via the parent-child relationship of AST nodes.

Because the abstract syntax tree omits syntactic details, many parse trees
may yield the same abstract syntax tree. For example, the following input strings would have
different parse trees but the same AST as the one above:
(3 * (4 + 2))
(3) * ((4) + (2))
3 * ((4 + 2))
 More general than LL(1) grammars (every LL(1) grammar is also LALR(1) but not vice versa).
 The class of grammars accepted by the parser generators yacc, bison, and Java Cup.
 Parsed by bottom-up parsers (they construct the derivation tree by going from terminals up to nonterminals, then attaching several nonterminals together to form a new nonterminal, etc., ending with just the start nonterminal).
 Harder to understand than LL(1) parsers (i.e., it is hard to understand how the LALR(1) parser works, and what makes a grammar LALR(1)).
 A general, bottom-up parsing algorithm called the CYK algorithm.
 A technique for parsing LL(1) grammars top-down, called predictive parsing.
 A technique for bottom-up parsing that works for LALR(1) grammars, but since that is rather complicated, we will study a version of the algorithm that only works for a subset of the LALR(1) grammars called SLR(1) grammars.
 $X$ $\longrightarrow$ t
 $X$ $\longrightarrow$ $A$ $B$
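Once every rule has one of these two forms, the CYK table can be filled in bottom-up: each cell holds the nonterminals that derive one substring of the input. A minimal recognizer sketch (the nested-parentheses grammar below is illustrative):

```python
# Sketch of a CYK recognizer for a grammar in the binary form above
# (every rule is X -> t or X -> A B).  The grammar below, for nested
# parentheses { (), (()), ((())), ... }, is illustrative:
#   S -> A B | A T      T -> S B      A -> (      B -> )

def cyk(unit_rules, pair_rules, start, tokens):
    """unit_rules: {terminal: set of X with X -> terminal};
    pair_rules: {(A, B): set of X with X -> A B}."""
    n = len(tokens)
    table = {}                        # table[(i, l)] = nonterminals deriving tokens[i:i+l]
    for i, t in enumerate(tokens):
        table[(i, 1)] = set(unit_rules.get(t, set()))
    for l in range(2, n + 1):         # substring length
        for i in range(n - l + 1):    # substring start
            cell = set()
            for k in range(1, l):     # split point
                for a in table[(i, k)]:
                    for b in table[(i + k, l - k)]:
                        cell |= pair_rules.get((a, b), set())
            table[(i, l)] = cell
    return start in table[(0, n)]

unit_rules = {"(": {"A"}, ")": {"B"}}
pair_rules = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}
print(cyk(unit_rules, pair_rules, "S", list("(())")))  # True
print(cyk(unit_rules, pair_rules, "S", list("(()")))   # False
```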
 Eliminate $\varepsilon$ rules
 Eliminate unit rules (rules with exactly one nonterminal on the right)
 Fix remaining rules so that all rules have either a single terminal or exactly two nonterminals on the right.
 Add a new start nonterminal $S'$
 And new rules:
$S'$ $\longrightarrow$ $S$
$S'$ $\longrightarrow$ $\varepsilon$
 Remove the rule
$S$ $\longrightarrow$ $\varepsilon$
 How to tell whether a grammar is LL(1).
 How to build the parse (or selector) table for a predictive parser, given an LL(1) grammar.
 You can't derive a sequence that includes $X$, or
 You can't derive a string from $X$ (where "string" means epsilon or a sequence of terminals).

A grammar $G$ is recursive in a nonterminal $X$ if $X$ can derive a sequence of symbols that includes $X$, in one or more steps: $$X \stackrel{+}\Longrightarrow \alpha X \beta$$ where $\alpha$ and $\beta$ are arbitrary sequences of symbols.
 $G$ is left recursive in nonterminal $X$ if $X$ can derive a sequence of symbols that starts with $X$, in one or more steps: $$X \stackrel{+}\Longrightarrow X \beta$$ where $\beta$ is an arbitrary sequence of symbols.
 G is immediately left recursive in nonterminal $X$ if $X$ can derive a sequence of symbols that starts with X in one step: $$X \Longrightarrow X \beta$$ i.e., the grammar includes the production: $X \longrightarrow X \beta$.
 $A$ is a nonterminal
 $\alpha$ is a sequence of terminals and/or nonterminals
 $\beta$ is a sequence of terminals and/or nonterminals not starting with $A$
 $\beta$
 $\beta$ $\alpha$
 $\beta$ $\alpha$ $\alpha$ ...
 The predictive parser starts by pushing eof, then Exp onto the stack. Regardless of what the first token is, there is only one production with Exp on the left-hand side, so it will always pop the Exp from the stack and push Factor Exp' as its first action.
 Now the top-of-stack symbol is the nonterminal Factor. Since the input is the intliteral token (not the ( token), it will pop the Factor and push intliteral.
 The top-of-stack symbol is now a terminal (intliteral), which does match the current token, so the stack will be popped, and the scanner called to get the next token.
 Now the top-of-stack symbol is the nonterminal Exp'.
We'll consider two cases:
 The next token is minus. In this case, we pop Exp' and push minus Factor Exp'.
 The next token is eof. In this case, we pop Exp' and push $\varepsilon$ (i.e., push nothing).
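The trace above is exactly what a table-driven predictive parser does. Here is a sketch of the driver loop, with the selector table for the Exp/Factor grammar written out by hand:

```python
# Sketch of the predictive-parser loop traced above.  The selector table
# encodes:  Exp -> Factor Exp' ;  Exp' -> minus Factor Exp' | epsilon ;
#           Factor -> intliteral | lparen Exp rparen

TABLE = {
    ("Exp", "intliteral"):    ["Factor", "Exp'"],
    ("Exp", "lparen"):        ["Factor", "Exp'"],
    ("Exp'", "minus"):        ["minus", "Factor", "Exp'"],
    ("Exp'", "rparen"):       [],                 # Exp' -> epsilon
    ("Exp'", "eof"):          [],                 # Exp' -> epsilon
    ("Factor", "intliteral"): ["intliteral"],
    ("Factor", "lparen"):     ["lparen", "Exp", "rparen"],
}
NONTERMS = {"Exp", "Exp'", "Factor"}

def parse(tokens):
    tokens = tokens + ["eof"]
    stack = ["eof", "Exp"]            # push eof, then the start nonterminal
    pos = 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False          # no applicable production: syntax error
            stack.extend(reversed(rhs))  # push the right-hand side, left end on top
        elif top == tokens[pos]:
            pos += 1                  # terminal matches: consume the token
        else:
            return False              # terminal mismatch
    return pos == len(tokens)

print(parse(["intliteral", "minus", "intliteral"]))  # True
print(parse(["minus", "intliteral"]))                # False
```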
 For every set of productions of the form: $$A \longrightarrow A \alpha_1 \; \mid \; A \alpha_2 \; \mid \ldots \mid \; A \alpha_m \; \mid \; \beta_1 \; \mid \ldots \mid \; \beta_n$$
 Replace them with the following productions: $$A \longrightarrow \beta_1 A' \; \mid \; \beta_2 A' \; \mid \; \ldots \; \mid \; \beta_n A'$$ $$A' \longrightarrow \alpha_1 A' \mid \ldots \mid \alpha_m A' \mid \varepsilon$$
 Given a pair of productions of the form:
$A$ $\longrightarrow$ $\alpha$ $\beta_1$ | $\alpha$ $\beta_2$
where $\alpha$ is a sequence of terminals and/or nonterminals, and $\beta_1$ and $\beta_2$ are sequences of terminals and/or nonterminals that do not have a common prefix (and one of the betas could be epsilon),
 Replace those two productions with:
$A$ $\longrightarrow$ $\alpha$ $A'$
$A'$ $\longrightarrow$ $\beta_1$ | $\beta_2$
 Find the longest nonempty prefix $\alpha$ that is common to 2 or more production right-hand sides.
 Replace the productions:
$A$ $\longrightarrow$ $\alpha$ $\beta_1$ | $\ldots$ | $\alpha$ $\beta_m$ | $y_1$ | $\ldots$ | $y_n$
with:
$A$ $\longrightarrow$ $\alpha$ $A'$ | $y_1$ | $\ldots$ | $y_n$
$A'$ $\longrightarrow$ $\beta_1$ | $\ldots$ | $\beta_m$
Repeat this process until no nonterminal has two productions with a common prefix.
 The original grammar is:
$Exp$ $\longrightarrow$ ( $Exp$ ) | $Exp$ $Exp$ | ( )
 After removing immediate left-recursion, the grammar becomes:
$Exp$ $\longrightarrow$ ( $Exp$ ) $Exp'$ | ( ) $Exp'$
$Exp'$ $\longrightarrow$ $Exp$ $Exp'$ | $\varepsilon$
 After left-factoring, this new grammar becomes:
$Exp$ $\longrightarrow$ ( $Exp''$
$Exp''$ $\longrightarrow$ $Exp$ ) $Exp'$ | ) $Exp'$
$Exp'$ $\longrightarrow$ $Exp$ $Exp'$ | $\varepsilon$
 Left recursive, or
 not left factored.
 X $\in \Sigma$ (i.e., X is a terminal): FIRST(X) = {X}
 X $= \varepsilon$: FIRST(X) = {$\varepsilon$}
 X $\in N$ (i.e., X is a nonterminal): we must look at all grammar
productions with X on the left, i.e., productions of the form:
X $\longrightarrow$ Y_{1} Y_{2} Y_{3} ... Y_{k}
where each Y_{i} is a single terminal or nonterminal (or there is just one Y, and it is $\varepsilon$). For each such production, we perform the following actions:
 Put FIRST(Y_{1}) - {$\varepsilon$} into FIRST(X).
 If $\varepsilon$ is in FIRST(Y_{1}), then put FIRST(Y_{2}) - {$\varepsilon$} into FIRST(X).
 If $\varepsilon$ is in FIRST(Y_{2}), then put FIRST(Y_{3}) - {$\varepsilon$} into FIRST(X).
 etc...
 If $\varepsilon$ is in FIRST(Y_{i}) for all 1 $\leq$ i $\leq$ k (i.e., every symbol on the right-hand side can derive $\varepsilon$), then put $\varepsilon$ into FIRST(X).
 Put FIRST(X_{1}) - {$\varepsilon$} into FIRST($\alpha$)
 If $\varepsilon$ is in FIRST(X_{1}), put FIRST(X_{2}) - {$\varepsilon$} into FIRST($\alpha$).
 etc...
 If $\varepsilon$ is in the FIRST set for every X_{i}, put $\varepsilon$ into FIRST($\alpha$).
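These rules can be computed by iterating until nothing changes (a fixed point). A sketch, using the string "eps" for $\varepsilon$; the small grammar below is illustrative:

```python
# Sketch: compute FIRST sets for each nonterminal by iterating the rules
# above until nothing changes.  EPS stands for epsilon; the grammar
# (S -> B c | D B ;  B -> a | eps ;  D -> d | eps) is illustrative.

EPS = "eps"
grammar = {
    "S": [["B", "c"], ["D", "B"]],
    "B": [["a"], [EPS]],
    "D": [["d"], [EPS]],
}

def first_sets(grammar):
    first = {X: set() for X in grammar}
    changed = True
    while changed:
        changed = False
        for X, prods in grammar.items():
            for rhs in prods:
                for Y in rhs:
                    f = first[Y] if Y in grammar else {Y}   # terminal: FIRST(Y) = {Y}
                    before = len(first[X])
                    first[X] |= f - {EPS}
                    if len(first[X]) != before:
                        changed = True
                    if EPS not in f:
                        break
                else:
                    # every symbol on the right-hand side can derive epsilon
                    if EPS not in first[X]:
                        first[X].add(EPS)
                        changed = True
    return first

print(first_sets(grammar))
```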
 If A is the start nonterminal, put EOF in FOLLOW(A) (like S in Fig. 3).
 Now find the productions with A on the righthandside:
 For each production $X$ $\longrightarrow$ $\alpha$ $A$ $\beta$, put FIRST($\beta$) - {$\varepsilon$} in FOLLOW($A$) (see Fig. 1).
 If $\varepsilon$ is in FIRST($\beta$), then put FOLLOW($X$) into FOLLOW($A$) (see Fig. 2).
 For each production $X$ $\longrightarrow$ $\alpha$ $A$, put FOLLOW($X$) into
FOLLOW($A$) (see Fig. 3, and Fig. 4 below):
Figure 4
 To compute FIRST(A) you must look for A on a production's lefthand side.
 To compute FOLLOW(A) you must look for A on a production's righthand side.
 FIRST and FOLLOW sets are always sets of terminals (plus, perhaps, $\varepsilon$ for FIRST sets, and EOF for FOLLOW sets). Nonterminals are never in a FIRST or a FOLLOW set; $\varepsilon$ is never in a FOLLOW set.
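FOLLOW sets can be computed the same way, iterating the rules until nothing changes. In this sketch the FIRST sets are written out by hand for a small illustrative grammar, so the code stays short:

```python
# Sketch: compute FOLLOW sets by iteration.  FIRST sets are assumed
# precomputed (hand-written here) for the illustrative grammar
#   S -> B c | D B ;  B -> a | eps ;  D -> d | eps

EPS, EOF = "eps", "EOF"
grammar = {
    "S": [["B", "c"], ["D", "B"]],
    "B": [["a"], [EPS]],
    "D": [["d"], [EPS]],
}
FIRST = {"S": {"a", "c", "d", EPS}, "B": {"a", EPS}, "D": {"d", EPS}}

def first_of_seq(seq):
    """FIRST of a sequence of symbols (a terminal is its own FIRST)."""
    out = set()
    for Y in seq:
        f = FIRST[Y] if Y in FIRST else {Y}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                                # every symbol can derive epsilon
    return out

def follow_sets(start):
    follow = {X: set() for X in grammar}
    follow[start].add(EOF)                      # EOF follows the start nonterminal
    changed = True
    while changed:
        changed = False
        for X, prods in grammar.items():
            for rhs in prods:
                for i, A in enumerate(rhs):
                    if A not in grammar:
                        continue                # only nonterminals have FOLLOW sets
                    f = first_of_seq(rhs[i + 1:])
                    new = (f - {EPS}) | (follow[X] if EPS in f else set())
                    if not new <= follow[A]:
                        follow[A] |= new
                        changed = True
    return follow

print(follow_sets("S"))
```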
 Suppose, during parsing, we have some $X$ at the topofstack, and a is the current token.
 We need to replace $X$ on the stack with the right-hand side of a production $X \longrightarrow \alpha$. What if $X$ has an additional production $X \longrightarrow \beta$? Which one should we use?
 We've already said that if a is in FIRST($\alpha$), but not in FIRST($\beta$), then we want to choose $X \longrightarrow \alpha$.
 But what if a is not in FIRST($\alpha$) or FIRST($\beta$)? If $\alpha$ or $\beta$ can derive $\varepsilon$, and a is in FOLLOW($X$), then we still have hope that the input will be accepted: If $\alpha$ can derive $\varepsilon$ (i.e., $\varepsilon \in$ FIRST($\alpha$)), then we want to choose $X \longrightarrow \alpha$ (and similarly if $\beta$ can derive $\varepsilon$). The idea is that since $\alpha$ can derive $\varepsilon$, it will eventually be popped from the stack, and if we're lucky, the next symbol down (the one that was under the $X$) will be a.
 First, we considered the production S $\longrightarrow$ B c. FIRST(Bc) = { a, c }, so we put the production's right-hand side (B c) in Table[S, a] and in Table[S, c]. FIRST(Bc) does not include $\varepsilon$, so we're done with that production.
 Next, we considered the production S $\longrightarrow$ D B. FIRST(DB) = { d, a, c }, so we put the production's right-hand side (D B) in Table[S, d], Table[S, a], and Table[S, c].
 Next, we considered the production D $\longrightarrow$ $\varepsilon$. FIRST($\varepsilon$) = { $\varepsilon$ }, so we must look at FOLLOW(D). FOLLOW(D) = { a, c }, so we put the production's right-hand side ($\varepsilon$) in Table[D, a] and Table[D, c].
 Almost all programming languages have LR grammars.
 LR parsers take time and space linear in the size of the input (with a constant factor determined by the grammar).
 LR is strictly more powerful than LL (for example, every LL(1) grammar is also both LALR(1) and LR(1), but not vice versa).
 LR grammars are more "natural" than LL grammars (e.g., the grammars for expression languages get mangled when we remove the left recursion to make them LL(1), but that isn't necessary for an LALR(1) or an LR(1) grammar).
 Although an LR grammar is usually easier to understand than the corresponding LL grammar, the parser itself is harder to understand and to write (thus, LR parsers are built using parser generators, rather than being written by hand).
 When we use a parser generator, if the grammar that we provide is not LALR(1), it can be difficult to figure out how to fix that.
 Error repair may be more difficult using LR parsing than using LL.
 Table sizes may be larger (about a factor of 2) than those used for LL parsing.
 start with the start symbol (i.e., the current string is the start symbol)
 repeat:
 choose a nonterminal X in the current string
 choose a grammar rule X $\longrightarrow$ $\alpha$
 replace X in the current string with $\alpha$
 $\Longrightarrow$ $E$ + $T$
 $\Longrightarrow$ $E$ + $T$ * $F$
 $\Longrightarrow$ $E$ + $T$ * id
 $\Longrightarrow$ $E$ + $F$ * id
 $\Longrightarrow$ $E$ + id * id
 $\Longrightarrow$ $T$ + id * id
 $\Longrightarrow$ $F$ + id * id
 $\Longrightarrow$ id + id * id
 the topofstack symbol and
 the current input symbol (token) and
 the entry in one of the parse tables (indexed by the topofstack and currentinput symbols)
 shift: push the current input symbol onto the stack, and go on to the next input symbol (i.e., call the scanner again)
 reduce: a grammar rule's right-hand side is on the top of the stack! Pop it off and push the grammar rule's left-hand-side nonterminal.
 accept: accept the input (parsing has finished successfully)
 reject: the input is not syntactically correct
 the definition of an item
 the number of states in the underlying finite-state machine
 the amount of information in a state
 Topic Introduction
 Top-Down SDT
 Example: Counting Parentheses
 Self-Study #1
 Handling Non-LL(1) Grammars
 Self-Study #2
 Bottom-Up SDT
 Summary
 The semantic stack holds nonterminals' translations; when the parse is finished, it will hold just one value: the translation of the root nonterminal (which is the translation of the whole input).
 Values are pushed onto the semantic stack (and popped off) by
adding actions to the grammar rules. The action for one
rule must:
 Pop the translations of all right-hand-side nonterminals.
 Compute and push the translation of the left-hand-side nonterminal.
 The actions themselves are represented by action numbers, which become part of the right-hand sides of the grammar rules. They are pushed onto the (normal) stack along with the terminal and nonterminal symbols. When an action number is the top-of-stack symbol, it is popped and the action is carried out.
 Pop all right-hand-side nonterminals' translations from the semantic stack.
 Compute and push the left-hand-side nonterminal's translation.
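Those two steps can be made concrete with a small sketch. The "parse" below is scripted by hand rather than driven by a real parser, and the rule (Exp $\longrightarrow$ Exp minus Exp, with the value as the translation) is illustrative:

```python
# Sketch: a semantic stack manipulated by per-rule actions.  A real parser
# would trigger these actions via action numbers; here the sequence is
# scripted by hand to compute the value of (3 - 4) - 2.

semantic_stack = []

def push_intlit(value):
    # shifting an INTLITERAL pushes its value (its translation)
    semantic_stack.append(value)

def reduce_minus():
    # action for  Exp -> Exp minus Exp :
    # pop the right-hand-side nonterminals' translations (rightmost first) ...
    right = semantic_stack.pop()
    left = semantic_stack.pop()
    # ... then compute and push the left-hand-side nonterminal's translation
    semantic_stack.append(left - right)

push_intlit(3)
push_intlit(4)
reduce_minus()          # stack now holds [-1]
push_intlit(2)
reduce_minus()          # stack now holds [-3]
print(semantic_stack)   # [-3]: the translation of the whole input
```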
 A variable should not be declared more than once in the same scope.
 A variable should not be used before being declared.
 The type of the left-hand side of an assignment should match the type of the right-hand side.
 Methods should be called with the right number and types of arguments.
 For each scope in the program: Process the declarations, adding new entries to the symbol table and reporting any variables that are multiply declared; process the statements, finding uses of undeclared variables, and updating the "ID" nodes of the abstract-syntax tree to point to the appropriate symbol-table entry.
 Process all of the statements in the program again, using the symboltable information to determine the type of each expression, and finding type errors.
 which kind of name it is
 what its type is
 what its nesting level is
 where it will be found at runtime.
 The outer scope for f includes parameter k, and the variable x that is initialized to 0.
 The next scope is for the body of the while loop, and includes the variable x that is initialized to 1.
 The innermost scope is for the body of the if, and includes the variable x that is initialized to 5.5.
 uses static scoping
 requires that all names be declared before they are used
 does not allow multiple declarations of a name in the same scope (even for different kinds of names)
 does allow the same name to be declared in multiple nested scopes (but only once per scope)
 uses the same scope for a method's parameters and for the local variables declared at the beginning of the method
 The symbol table will be used to answer two questions:
 Given a declaration of a name, is there already a declaration of the same name in the current scope (i.e., is it multiply declared)?
 Given a use of a name, to which declaration does it correspond (using the "most closely nested" rule), or is it undeclared?
 The symbol table is only needed to answer those two questions (i.e., once all declarations have been processed to build the symbol table, and all uses have been processed to link each ID node in the abstract-syntax tree with the corresponding symbol-table entry, the symbol table itself is no longer needed).
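A common implementation that answers exactly these two questions is a list of per-scope hash tables, searched from innermost scope outward. A sketch (the class and method names are mine):

```python
# Sketch of a symbol table as a list of per-scope hash tables, answering
# the two questions above.  Names and structure are illustrative.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                      # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, info):
        """Question 1: is the name already declared in the current scope?"""
        if name in self.scopes[-1]:
            raise ValueError(name + " is multiply declared")
        self.scopes[-1][name] = info

    def lookup(self, name):
        """Question 2: which declaration does a use correspond to?
        Search scopes from innermost to outermost (most-closely-nested rule)."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                             # undeclared

st = SymbolTable()
st.declare("x", "int, level 1")
st.enter_scope()
st.declare("x", "double, level 2")              # legal: a different scope
print(st.lookup("x"))                           # the level-2 declaration
st.exit_scope()
print(st.lookup("x"))                           # back to the level-1 declaration
```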
Overview
Hello, and welcome to the course readings for Drew Davidson's EECS 665, "Compiler Construction". This set of course notes has been shaped by me (Drew) in order to serve as a written companion to the course lectures and discussions offered as part of EECS 665. As such, it's written pretty informally, and will often draw on my personal observations and experience. These readings are not intended to serve as a standalone reference for compiler topics or as a guide for building a compiler. Thus, I'll start these readings with a disclaimer exactly opposite of most textbooks: they are not for a general audience, and they are intended to be read straight through from front to back.
Some additional disclaimers: In the interest of brevity, these readings will occasionally adopt a definition and stick to it without fully exploring alternative definitions, or considering all implications of the definition. If you feel that a superior definition exists, please do let me know. Furthermore, the readings constitute a living document, updated and refined when possible. If you see a typo, mistake, or even just something where you'd like some more writing, please do let me know.
While much of the material in the course readings has been updated, modified, or written by me, it is based originally on a set of readings created by Susan Horwitz at the University of Wisconsin-Madison. In the interest of providing the best possible guide to the subject, I will frequently read or reread classic textbooks on compilers, and whatever course materials from other classes might be floating around on the internet. It is my intention to synthesize this information and provide wholly original material herein. That being said, if you see an example or text in the readings that you feel has not been appropriately credited to an external source, please contact me at your earliest convenience so that we can arrange for the material to be taken down, cited, or come to another arrangement.
Introduction
What is a compiler? A common answer amongst my students is something like "a program that turns code into a program", which is honestly not bad (if a bit imprecise). Without dwelling overly much on the messy details, let's refine this definition and run with it. For this class, we'll say that a compiler is:
Here's a simple pictorial view:
A compiler operates in phases; each phase translates the source program from one representation to another. Different compilers may include different phases, and/or may order them somewhat differently. A typical organization is shown below.
Below, we look at each phase of the compiler, giving a highlevel overview of how
the code snippet position = initial + rate * 60
is represented
throughout the above workflow.
The Scanner
The scanner performs lexical analysis, which means to group the input characters into the fundamental units of the source language. If some character(s) of input fail to correspond to a valid group, the scanner may also report lexical errors.
Some terminology for lexical analysis:
The definitions of what is a lexeme, token, or bad character all depend on the source language.
Example 1
Here are some Clike lexemes and the corresponding tokens:
lexeme:              ;          =       index  tmp  37      102
corresponding token: semicolon  assign  id     id   intlit  intlit
Note that multiple lexemes can correspond to the same token (e.g., the index
and tmp
lexemes both correspond to id tokens).
Example 2 (the running example)
Given the source code:
position = initial + rate * 60 ;
a Clike scanner would return the following sequence of tokens:
id assign id plus id times intlit semicolon
The errors that a scanner might report are usually due to characters that cannot be a valid part of any token.
Erroneous characters for Java source include hash (#) and control-A.
Erroneous characters for C source include the caret (^).
The Parser
The parser performs syntactic analysis, which means to group the tokens of the language into constructs of the programming language (such as loops, expressions, functions, etc). A good analogy is to think of the scanner as building the "words" of the programming language, and the parser as building the "sentences" or "grammatical phrases" of the source program.
Example
source code: position = initial + rate * 60 ;
Abstract syntax tree:
Key features of the Abstract-Syntax Tree (AST):
While the above presentation has implied that syntactic analysis comes strictly after lexical analysis has been completed, the two phases are often interleaved, with the parser serving as the "driver" module: the parser requests one token at a time from the scanner, and each token is then added to the AST or rejected if syntactically invalid. Thus, the AST is built incrementally. One reason for this behavior is that if problems are detected early in the parse, the compiler can emit an error quickly. This "fail-fast" behavior is a major design goal in compiler construction, as it can save users time and compute cycles.
The Semantic Analyzer
The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors. It may also annotate and/or change the abstract-syntax tree (e.g., it might annotate each node that represents an expression with its type).
Example:
Abstract syntax tree before semantic analysis:
Abstract syntax tree after semantic analysis:
The Intermediate Code Generator
The intermediate code generator translates from the abstract-syntax tree to intermediate code. One possibility is three-address code (code in which each instruction involves at most 3 operands). Below is an example of three-address code for the abstract-syntax tree shown above. Note that in this example, the second and third instructions each have exactly three operands (the location where the result of the operation is stored, plus two source operands); the first and fourth instructions have just two operands ("temp1" and "60" for instruction 1, and "position" and "temp3" for instruction 4).
temp1 = inttofloat(60)
temp2 = rate * temp1
temp3 = initial + temp2
position = temp3
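To make the translation concrete, here is a minimal Python sketch of how an intermediate code generator might walk an abstract-syntax tree and emit three-address code. The tuple-based AST shape and the temp-naming scheme are illustrative assumptions, and the int-to-float conversion is omitted for brevity:

```python
# Lowering a small AST to three-address code (a sketch).
# Interior nodes are (operator, left, right) tuples; leaves are strings.

counter = 0

def new_temp():
    """Generate a fresh temporary name: temp1, temp2, ..."""
    global counter
    counter += 1
    return f"temp{counter}"

def gen(node, code):
    """Return the name holding node's value, appending instructions to code."""
    if isinstance(node, str):          # a leaf: variable name or literal
        return node
    op, left, right = node             # interior node
    l = gen(left, code)
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} = {l} {op} {r}")
    return t

# position = initial + rate * 60
code = []
result = gen(('+', 'initial', ('*', 'rate', '60')), code)
code.append(f"position = {result}")
print('\n'.join(code))
# temp1 = rate * 60
# temp2 = initial + temp1
# position = temp2
```

Note how the post-order walk (children before parent) guarantees that each operand's value is computed before the instruction that uses it.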
The Optimizer
The optimizer tries to improve code generated by the intermediate code generator. The goal is usually to make code run faster, but the optimizer may also try to make the code smaller. In the example above, an optimizer might first discover that the conversion of the integer 60 to a floatingpoint number can be done at compile time instead of at run time. Then it might discover that there is no need for "temp1" or "temp3". Here's the optimized code:
temp2 = rate * 60.0
position = initial + temp2
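As a sketch of how these two improvements might be automated, here is a toy Python pass over tuple-encoded three-address instructions. The instruction encoding and the single-pattern copy coalescing are illustrative assumptions, not how production optimizers are organized:

```python
# A toy optimization pass over three-address code. Instructions are
# (dest, op, arg1, ...) tuples -- an illustrative encoding, not a real IR.
# It folds inttofloat of a constant at compile time, propagates the
# folded constant, and coalesces "t = a op b; x = t" into "x = a op b".

def optimize(code):
    out = []
    consts = {}                                   # temp -> folded constant
    for dest, op, *args in code:
        args = [consts.get(a, a) for a in args]   # constant propagation
        if op == 'inttofloat' and isinstance(args[0], int):
            consts[dest] = float(args[0])         # folded: emit nothing
        elif op == 'copy' and out and out[-1][0] == args[0]:
            out[-1] = (dest, *out[-1][1:])        # write result straight to dest
        else:
            out.append((dest, op, *args))
    return out

code = [
    ('temp1', 'inttofloat', 60),
    ('temp2', '*', 'rate', 'temp1'),
    ('temp3', '+', 'initial', 'temp2'),
    ('position', 'copy', 'temp3'),
]
for instr in optimize(code):
    print(instr)
# ('temp2', '*', 'rate', 60.0)
# ('position', '+', 'initial', 'temp2')
```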
The Code Generator
The code generator generates assembly code from (optimized) intermediate code. For example, the following x64 code might be generated for our running example:
movl rate(%rip), %eax
imull $60, %eax, %edx
movl initial(%rip), %eax
addl %edx, %eax
movl %eax, position(%rip)
It's worth noting that the above representation of the running example is still a text file, though one that the compiler proper considers to be the final output. There is still work left to be done before we have an executable program. Usually, a collection of programs, collectively referred to as a compiler toolchain, work in sequence to build the executable:
When it comes time for the program to actually be run, the loader maps the executable file into memory (alongside whatever dynamic libraries may be required) as a process, and then hands control to the program's entry point to begin execution. In modern workstation computing, the loader is integrated into the operating system, but in some legacy, specialized, and embedded systems the loader is still a distinct program.
Students who use compilers like gcc or llvm are often surprised at the existence of distinct "post-compilation" steps to produce an executable. Indeed, the term "compiler" is often used to refer to the full "compiler toolchain".
Nevertheless, the workflow from compilation, to assembly, to linkage is represented as a series of distinct modules within many popular compilers, and each of the tools can be invoked separately. For example, running the gcc toolchain for C can be accomplished using cc (the C compiler proper), as (the GNU assembler), and ld (the GNU linker).
Scanning
Overview
Recall that the job of the scanner is to translate the sequence of characters that is the input to the compiler to a corresponding sequence of tokens. In particular, each time the scanner is called it should find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and should return that token.
It is possible to write a scanner from scratch, but it is often much easier and less error-prone to use a scanner generator like lex or flex (which produce C or C++ code), or jlex (which produces Java code). The input to a scanner generator includes one regular expression for each token (and for each construct that must be recognized and ignored, such as whitespace and comments). Therefore, to use a scanner generator you need to understand regular expressions. To understand how the scanner generator produces code that correctly recognizes the strings defined by the regular expressions, you need to understand finite-state machines (FSMs).
Finite-State Machines
A finite-state machine is similar to a compiler in that:
In both cases, the input (the program or the string) is a sequence of characters.
Here's an example of a finite-state machine that recognizes Pascal identifiers (sequences of one or more letters or digits, starting with a letter):
In this picture:
An FSM is applied to an input (a sequence of characters), and it either accepts or rejects that input. Here's how the FSM works:
An input string is accepted by an FSM if:
The language defined by a FSM is the set of strings accepted by the FSM.
The following strings are in the language of the FSM shown above:
The following strings are not in the language of the FSM shown above:
Write a finitestate machine that accepts email addresses, defined as follows:
Let us consider another example of an FSM:
Example: Integer Literals
The following is a finite-state machine that accepts integer literals with an optional + or - sign:
Formal Definition
An FSM is a 5-tuple: $(Q,\Sigma,\delta,q,F)$
Here's a definition of $\delta$ for the above example, using a state transition table:
      | $+$ | $-$ | $\mathrm{digit}$
$S$   | $A$ | $A$ | $B$
$A$   |     |     | $B$
$B$   |     |     | $B$
Deterministic and Non-Deterministic FSMs
There are two kinds of FSM:
Example
Here is a nondeterministic finite-state machine that recognizes the same language as the second example deterministic FSM above (the language of integer literals with an optional sign):
Sometimes, nondeterministic machines are simpler than deterministic ones, though not in this example.
A string is accepted by a nondeterministic finite-state machine if there exists a sequence of moves starting in the start state, ending in a final state, that consumes the entire string. For example, here's what happens when the above machine is run on the input "+75":
After scanning | Can be in these states
(nothing)      | $S$, $A$
$+$            | $A$, (stuck)
$+7$           | $B$, (stuck)
$+75$          | $B$, (stuck)
It is worth noting that there is a theorem that says:

For every nondeterministic finite-state machine $M$,
there exists a deterministic machine $M'$ such that $M$ and
$M'$ accept the same language.
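The usual proof of this theorem is the subset construction: build a deterministic machine whose states are *sets* of the nondeterministic machine's states. The same idea can be used to simulate an NFA directly, as in this Python sketch (the state names and the epsilon-move from $S$ to $A$ model the sign/digit machine above, and are assumptions for illustration):

```python
# Simulating a nondeterministic FSM by tracking the set of states it
# could be in after each character -- the idea behind the subset
# construction used to prove the NFA -> DFA theorem.

# delta maps (state, character-class) to a set of successor states.
DELTA = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
START = {"S", "A"}        # an epsilon-move from S to A makes both reachable
FINAL = {"B"}

def classify(c):
    return "digit" if c.isdigit() else c

def nfa_accepts(s):
    states = set(START)
    for c in s:
        states = set().union(*(DELTA.get((q, classify(c)), set())
                               for q in states))
        if not states:     # every branch is stuck
            return False
    return bool(states & FINAL)

print(nfa_accepts("+75"))  # True
print(nfa_accepts("75"))   # True
print(nfa_accepts("+"))    # False: ends in non-final state A
```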
How to Implement an FSM
The most straightforward way to program a (deterministic) finite-state machine is to use a table-driven approach. This approach uses a table with one row for each state in the machine, and one column for each possible character. Table[j][k] tells which state to go to from state j on character k. (An empty entry corresponds to the machine getting stuck, which means that the input should be rejected.)
Recall the table for the (deterministic) "integer literal" FSM given above:
      | $+$ | $-$ | $\mathrm{digit}$
$S$   | $A$ | $A$ | $B$
$A$   |     |     | $B$
$B$   |     |     | $B$
The table-driven program for an FSM works as follows:
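Here is a minimal Python sketch of such a table-driven recognizer, using the integer-literal table above. The dictionary stands in for the two-dimensional table, and absent keys model the empty entries:

```python
# Table-driven simulation of the integer-literal FSM.
# States S, A, B mirror the transition table in the text.

ACCEPTING = {"B"}

def char_class(c):
    """Map an input character to a table column."""
    if c.isdigit():
        return "digit"
    if c in "+-":
        return c
    return None  # no column: the machine gets stuck

TABLE = {
    ("S", "+"): "A",
    ("S", "-"): "A",
    ("S", "digit"): "B",
    ("A", "digit"): "B",
    ("B", "digit"): "B",
}

def accepts(s):
    state = "S"
    for c in s:
        nxt = TABLE.get((state, char_class(c)))
        if nxt is None:          # empty table entry: reject
            return False
        state = nxt
    return state in ACCEPTING

print(accepts("+75"))   # True
print(accepts("75"))    # True
print(accepts("+"))     # False: ends in non-final state A
print(accepts("7+5"))   # False: stuck on '+' in state B
```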
Regular Expressions
Regular expressions provide a compact way to define a language that can be accepted by a finite-state machine. Regular expressions are used in the input to a scanner generator to define each token, and to define things like whitespace and comments that do not correspond to tokens, but must be recognized and ignored.
As an example, recall that a Pascal identifier consists of a letter, followed by zero or more letters or digits. The regular expression for the language of Pascal identifiers is:

letter . (letter | digit)*
|  means "or"
.  means "followed by"
*  means zero or more instances of
( )  are used for grouping
Often, the "followed by" dot is omitted, and just writing two things next to each other means that one follows the other. For example:

letter (letter | digit)*
In fact, the operands of a regular expression should be single characters or the special character epsilon, meaning the empty string (just as the labels on the edges of a FSM should be single characters or epsilon). In the above example, "letter" is used as a shorthand for:

a | b | c | ... | z | A | ... | Z
To understand a regular expression, it is necessary to know the precedences of the three operators. They can be understood by analogy with the arithmetic operators for addition, multiplication, and exponentiation:
Regular Expression Operator | Analogous Arithmetic Operator | Precedence
"|" | plus | lowest precedence
"." | times | middle
"*" | exponentiation | highest precedence
So, for example, the regular expression:

$\mathrm{letter}.\mathrm{letter} \mid \mathrm{digit}^*$

is equivalent to:

$(\mathrm{letter}.\mathrm{letter}) \mid (\mathrm{digit}^*)$
Describe (in English) the language defined by each of the following regular expressions:
Example: Integer Literals
Let's reexamine the example of integer literals using regular expressions:
An integer literal with an optional sign can be defined in English as: an optional + or - sign, followed by one or more digits.

The corresponding regular expression is:

("+" | "-" | $\varepsilon$) . digit . digit*

Note that the regular expression for "one or more digits" is:

digit . digit*

i.e., "one digit followed by zero or more digits". Since "one or more" is a common pattern, another operator, +, has been defined to mean "one or more". For example, $\mathrm{digit}$+ means "one or more digits", so another way to define integer literals with optional sign is:

("+" | "-" | $\varepsilon$) . digit+
The Language Defined by a Regular Expression
Every regular expression defines a language: the set of strings that match the expression. We will not give a formal definition here; instead, we'll give some examples:
Regular Expression  Corresponding Set of Strings
$\varepsilon$  $\{$""$\}$
a  $\{$"a"$\}$
a.b.c  $\{$"abc"$\}$
a | b | c  $\{$"a", "b", "c"$\}$
(a | b | c)*  $\{$"", "a", "b", "c", "aa", "ab", $\ldots$, "bccabb", $\ldots$ $\}$
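These languages can be explored directly with a regular-expression engine, for instance Python's re module: fullmatch succeeds exactly when the whole string is in the language. (Python's notation differs slightly from the notation above: juxtaposition is "followed by", and character classes abbreviate long alternations.)

```python
# Checking language membership with Python's re module.
import re

pascal_id = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")   # letter (letter|digit)*
abc_star = re.compile(r"(a|b|c)*")

print(bool(pascal_id.fullmatch("tmp37")))   # True
print(bool(pascal_id.fullmatch("37tmp")))   # False: must start with a letter
print(bool(abc_star.fullmatch("")))         # True: zero repetitions
print(bool(abc_star.fullmatch("bccabb")))   # True
```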
Using Regular Expressions and FSMs to Define a Scanner
There is a theorem that says that for every regular expression, there is a finite-state machine that defines the same language, and vice versa. This is relevant to scanning because it is usually easy to define the tokens of a language using regular expressions, and then those regular expressions can be converted to finite-state machines (which can actually be programmed).
For example, let's consider a very simple language: the language of assignment statements in which the left-hand side is a Pascal identifier (a letter followed by zero or more letters or digits), and the right-hand side is one of the following:
Token | Regular Expression
assign | "="
id | letter (letter | digit)*
plus | "$+$"
times | "$*$"
equals | "="."="
These regular expressions can be converted into the following finitestate machines:
(FSM diagrams for the assign, id, plus, times, and equals tokens appear here.)
Given a FSM for each token, how do we write a scanner? Recall that the goal of a scanner is to find the longest prefix of the current input that corresponds to a token. This has two consequences:
Furthermore, remember that regular expressions are used both to define tokens and to define things that must be recognized and skipped (like whitespace and comments). In the first case a value (the current token) must be returned when the regular expression is matched, but in the second case the scanner should simply start up again trying to match another regular expression.
With all this in mind, to create a scanner from a set of FSMs, we must:
For example, the FSM that recognizes Pascal identifiers must be modified as follows:
with the following table of actions:
F1: put back 1 char, return ID
And here is the combined FSM for the five tokens (with the actions noted below):
with the following table of actions:
F1: put back 1 char; return assign
F2: put back 1 char; return id
F3: return plus
F4: return times
F5: return equals
We can convert this FSM to code using the tabledriven technique described above, with a few small modifications:
      | $+$ | $*$ | $=$ | $\mathrm{letter}$ | $\mathrm{digit}$ | EOF
$S$   | return plus | return times | $B$ | $A$ |  |
$A$   | put back 1 char; return id | put back 1 char; return id | put back 1 char; return id | $A$ | $A$ | return id
$B$   | put back 1 char; return assign | put back 1 char; return assign | return equals | put back 1 char; return assign | put back 1 char; return assign | return assign
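Putting the pieces together, a hand-coded version of this combined machine might look like the following Python sketch. The function and token names are illustrative, and "put back 1 char" is modeled by simply not consuming the lookahead character:

```python
# A hand-coded scanner for the five-token assign/id/plus/times/equals
# language (a sketch, not generated code).

def scan(text):
    """Return the token sequence for the input string."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c.isspace():                  # skip whitespace, restart the FSM
            i += 1
        elif c == '+':
            tokens.append('plus')
            i += 1
        elif c == '*':
            tokens.append('times')
            i += 1
        elif c == '=':                   # state B: one char of lookahead
            if i + 1 < n and text[i + 1] == '=':
                tokens.append('equals')
                i += 2
            else:
                tokens.append('assign')  # "put back 1 char"
                i += 1
        elif c.isalpha():                # state A: longest run of letters/digits
            j = i + 1
            while j < n and text[j].isalnum():
                j += 1
            tokens.append('id')
            i = j                        # the non-alnum char is "put back"
        else:
            raise ValueError(f"lexical error at {c!r}")
    return tokens

print(scan("abc == x2 + y * z"))
# ['id', 'equals', 'id', 'plus', 'id', 'times', 'id']
```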
Suppose we want to extend the very simple language of assignment statements defined above to allow both integer and double literals to occur on the righthand sides of the assignments. For example:
would be a legal assignment.
What new tokens would have to be defined? What are the regular expressions, the finite-state machines, and the modified finite-state machines that define them? How would the "combined" finite-state machine given above have to be augmented?
Defining Syntax
Contents
Recall that the input to the parser is a sequence of tokens (received interactively, via calls to the scanner). The parser:
The output depends on whether the input is a syntactically legal program; if so, then the output is some representation of the program:
We know that we can use regular expressions to define languages (for example, the languages of the tokens to be recognized by the scanner). Can we use them to define the language to be recognized by the parser? Unfortunately, the answer is no. Regular expressions are not powerful enough to define many aspects of a programming language's syntax. For example, a regular expression cannot be used to specify that the parentheses in an expression must be balanced, or that every "else" has a corresponding "if". Furthermore, a regular expression doesn't say anything about underlying structure. For example, the following regular expression defines integer arithmetic involving addition, subtraction, multiplication, and division:
digit+ (("+" | "-" | "*" | "/") digit+)*
Example: Simple Arithmetic Expressions
We can write a context-free grammar (CFG) for the language of (very simple) arithmetic expressions involving only subtraction and division:
Exp$\;\longrightarrow\;$ intlit
Exp$\;\longrightarrow\;$ Exp minus Exp
Exp$\;\longrightarrow\;$ Exp divide Exp
Exp$\;\longrightarrow\;$ lparen Exp rparen
And here is how to understand the grammar:
A more compact way to write this grammar is:
Intuitively, the vertical bar means "or", but do not be fooled into thinking that the right-hand sides of grammar rules can contain regular expression operators! This use of the vertical bar is just shorthand for writing multiple rules with the same left-hand-side nonterminal.
Formal Definition
A CFG is a 4-tuple $\left( N, \Sigma, P, S \right)$ where:
Example: Boolean Expressions, Assignment Statements, and If Statements
The language of boolean expressions can be defined as follows:
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Here is a CFG for a language of very simple assignment statements (only statements that assign a boolean value to an identifier):

Stmt $\;\longrightarrow\;$ id assign BExp semicolon
We can "combine" the two grammars given above, and add two more rules to get a grammar that defines a language of (very simple) if-statements. In words, an if-statement is:
Here is the combined contextfree grammar:
Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt else Stmt
Stmt $\;\longrightarrow\;$ id assign BExp semicolon
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Write a contextfree grammar for the language of very simple while loops (in which the loop body only contains one statement) by adding a new production with nonterminal stmt on the lefthand side.
The language defined by a contextfree grammar is the set of strings (sequences of terminals) that can be derived from the start nonterminal. What does it mean to derive something?
Thus, we arrive either at $\varepsilon$ or at a string of terminals. That is how we derive a string in the language defined by a CFG.
Below is an example derivation, using the 4 productions for the grammar of arithmetic expressions given above. In this derivation, we use the actual lexemes instead of the token names (e.g., we use the symbol "-" instead of minus).

Exp
$\Longrightarrow$ 1 - Exp
$\Longrightarrow$ 1 - Exp / Exp
$\Longrightarrow$ 1 - Exp / 2
$\Longrightarrow$ 1 - 4 / 2
And here is some useful notation:
So, given the above example, we could write: Exp $\stackrel{+}\Longrightarrow$ 1 - Exp / Exp.
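Derivation can also be viewed mechanically as string rewriting, as in this Python sketch. The production numbering here is an assumption for illustration, and each step replaces the leftmost nonterminal:

```python
# Leftmost derivation as string rewriting: repeatedly replace the
# leftmost nonterminal using a chosen production.

PRODUCTIONS = {
    1: ("Exp", ["intlit"]),
    2: ("Exp", ["Exp", "minus", "Exp"]),
    3: ("Exp", ["Exp", "divide", "Exp"]),
    4: ("Exp", ["lparen", "Exp", "rparen"]),
}
NONTERMINALS = {"Exp"}

def derive(steps):
    """Apply a sequence of production numbers, leftmost-first."""
    sentential = ["Exp"]                      # start symbol
    for p in steps:
        lhs, rhs = PRODUCTIONS[p]
        i = next(k for k, sym in enumerate(sentential)
                 if sym in NONTERMINALS)      # leftmost nonterminal
        assert sentential[i] == lhs
        sentential = sentential[:i] + rhs + sentential[i + 1:]
    return ' '.join(sentential)

print(derive([2, 1, 3, 1, 1]))
# intlit minus intlit divide intlit
```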
A more formal definition of what it means for a CFG $G$ to define a language may be stated as follows:
where
There are several kinds of derivations that are important. A derivation is a leftmost derivation if it is always the leftmost nonterminal that is chosen to be replaced. It is a rightmost derivation if it is always the rightmost one.
Parse Trees
Another way to derive things using a context-free grammar is to construct a parse tree (also called a derivation tree) as follows:
Here is the example expression grammar given above:
and, using that grammar, here's a parse tree for the string 1 - 4 / 2:
Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt else Stmt
Stmt $\;\longrightarrow\;$ id assign BExp semicolon
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Question 2: Give a parse tree for the same string.
Syntax for Programming Languages
Contents
While our primary focus in this course is compiler development, we will necessarily touch on some aspects of programming language development in the process. A programming language needs to define what constitutes valid syntax for the language (i.e. language definition). The compiler needs to determine whether a given stream of tokens belongs to that language (i.e. language recognition).[1] Conveniently, a sufficiently rigorous definition of the language can be used as the basis of a recognizer: if there is some way to produce a given string according to the definition of the language, then it is necessarily part of the language.

[1] During parsing, we're only interested in capturing the syntax (as opposed to the semantics) of the programming language. The syntax of a programming language is intuitively understood to be the structure of the code, whereas semantics is the meaning of the code.
Context-free Grammars (CFGs) provide a promising basis for defining the syntax of a programming language: for one, the class of languages that CFGs describe (the context-free languages) is expressive enough to capture popular syntactic constructs. At the same time, CFGs are (relatively) uniform and compact. As such, programmers are expected to be able to read a CFG-like definition of a language and recognize whether or not a given string of input tokens belongs to that language.
For a compiler, however, performing language recognition (making a yes/no determination of whether a string belongs to a language) is not enough to complete syntactic analysis. In order for a contextfree grammar to serve as a suitable definition for a programming language, there are a number of refinements that need to be considered. In this section, we will discuss some of these considerations.
We'll caveat the claim that "CFGs are a good way to define programming-language syntax" at the end of this reading, but it's worth noting that there are some compelling arguments to the contrary for practical purposes.
The following grammar (recalled from the prior reading) defines a simple language of mathematical expressions (with the productions numbered for reference):
P_{1}: Exp $\longrightarrow$ intlit
P_{2}: Exp $\longrightarrow$ Exp minus Exp
P_{3}: Exp $\longrightarrow$ Exp divide Exp
P_{4}: Exp $\longrightarrow$ lparen Exp rparen
Consider the string
1 - 4 / 2
which would be tokenized as
intlit minus intlit divide intlit
It should be readily apparent that the string is in the language, and can be produced via the (leftmost) derivation sequence:
Step #1. Exp $\Longrightarrow$ Exp minus Exp (via production P_{2})
Step #2. $\;\;\;\;\;\Longrightarrow$ intlit minus Exp (via production P_{1})
Step #3. $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ divide $Exp$ (via production P_{3})
Step #4. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide $Exp$ (via production P_{1})
Step #5. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P_{1})
This derivation sequence corresponds to the parse tree:
However, inspection of the grammar also reveals that there is another leftmost derivation that can recognize the string by choosing to apply the rules differently:
Step #1. Exp $\Longrightarrow$ Exp divide Exp (via production P_{3})
Step #2. $\;\;\;\;\;\Longrightarrow$ Exp minus Exp divide Exp (via production P_{2})
Step #3. $\;\;\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P_{1})
Step #4. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P_{1})
Step #5. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P_{1})
Which corresponds to the parse tree:
If for grammar $G$ and string $w$ there is:
more than one parse tree, or
more than one leftmost derivation, or
more than one rightmost derivation,
then G is called an ambiguous grammar. (Note: the three conditions given above are equivalent; if one is true then all three are true.)
It is worth noting that ambiguity is not a problem for certain applications of Context-Free Grammars. As discussed in the intro, language recognition is satisfied by finding a single derivation for the string, and the string certainly maintains membership in the language if more than one derivation is possible. However, for the task of parsing, ambiguous grammars cause problems, both in the correctness of the parse tree and in the efficiency of building it. We will briefly explore both of these issues.
A problem of correctness: The major issue with ambiguous parse trees is that the underlying structure of the language is ill-defined. In the above example, subtraction and division can swap precedence; the first parse tree groups 4/2, while the second groups 1-4. These two groupings correspond to expressions with different values: the first tree corresponds to (1-(4/2)) and the second tree corresponds to ((1-4)/2).
A problem of efficiency: As we will see later, a parser can use significantly more efficient algorithms (in terms of Big-O complexity) by adding constraints to the grammar that it recognizes. One such constraint imposed by some parsing algorithms is that the grammar is unambiguous. To some extent, this issue can come up independently of correctness. For example, the following grammar has no issue of grouping (it only ever produces intlit), but production P_{1} can be applied any number of times before producing the integer literal token.
P_{1}: Exp $\longrightarrow$ Exp
P_{2}: Exp $\longrightarrow$ intlit
It's useful to know how to translate an ambiguous grammar into an equivalent unambiguous grammar. Since every programming language includes expressions, we will discuss these operations in the context of an expression language. We are particularly interested in forcing unambiguous precedence and associativity of the operators.
To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("" in our case), then write a rule for the operator(s) with the next lowest precedence, etc:
P_{1}: Exp $\longrightarrow$ Exp minus Exp
P_{2}: Exp $\longrightarrow$ Term
P_{3}: Term $\longrightarrow$ Term divide Term
P_{4}: Term $\longrightarrow$ Factor
P_{5}: Factor $\longrightarrow$ intlit
P_{6}: Factor $\longrightarrow$ lparen Exp rparen
Now let's try using these new rules to build parse trees for 1 - 4 / 2.
First, a parse tree that correctly reflects that fact that division has higher precedence than subtraction:
Now we'll try to construct a parse tree that shows the wrong precedence:
This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2)).
Draw two parse trees for the expression 5-3-2 using the current expression grammar:

exp → exp MINUS exp | term
term → term DIVIDE term | factor
factor → INTLITERAL | LPAREN exp RPAREN
To understand how to write expression grammars that correctly reflect the associativity of the operators, you need to understand about recursion in grammars.
The grammar given above for arithmetic expressions is both left and right recursive in nonterminals Exp and Term (try writing the derivation steps that show this).
To write a grammar that correctly expresses operator associativity:
Here's the correct grammar:
Exp $\longrightarrow$ Exp minus Term | Term
Term $\longrightarrow$ Term divide Factor | Factor
Factor $\longrightarrow$ intlit | lparen Exp rparen
And here's the (one and only) parse tree that can be built for 5 - 3 - 2 using this grammar:
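The same structure translates directly into code. Here is a recursive-descent evaluator sketched in Python: the left-recursive rules become loops, which is exactly what makes the operators left-associative, and the Exp/Term/Factor levels encode precedence (the helper names are illustrative):

```python
import re

def tokenize(s):
    """Split into integer literals and the -, /, (, ) symbols."""
    return re.findall(r"\d+|[-/()]", s)

def evaluate(s):
    toks = tokenize(s)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def eat():
        nonlocal pos
        t = toks[pos]
        pos += 1
        return t

    def exp():                       # Exp -> Exp minus Term | Term
        v = term()
        while peek() == '-':
            eat()
            v = v - term()           # the loop yields left associativity
        return v

    def term():                      # Term -> Term divide Factor | Factor
        v = factor()
        while peek() == '/':
            eat()
            v = v / factor()
        return v

    def factor():                    # Factor -> intlit | lparen Exp rparen
        if peek() == '(':
            eat()
            v = exp()
            eat()                    # consume rparen
            return v
        return int(eat())

    return exp()

print(evaluate("5 - 3 - 2"))   # 0: parsed as (5 - 3) - 2
print(evaluate("1 - 4 / 2"))   # -1.0: division binds tighter
```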
Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token pow for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written: 2 ** 3, and the corresponding sequence of tokens would be: intlit pow intlit.
Here's an ambiguous contextfree grammar for this language:
Exp $\longrightarrow$ Exp plus Exp | Exp minus Exp | Exp times Exp | Exp divide Exp
  | Exp pow Exp | lparen Exp rparen | intlit
First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:
exp → exp PLUS exp | exp MINUS exp | term
term → term TIMES term | term DIVIDE term | factor
factor → factor POW factor | exponent
exponent → INTLITERAL | LPAREN exp RPAREN
This grammar is still ambiguous; it does not yet reflect the associativities of the operators. So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to: 2**(3**4)):
exp → exp PLUS term | exp MINUS term | term
term → term TIMES factor | term DIVIDE factor | factor
factor → exponent POW factor | exponent
exponent → INTLITERAL | LPAREN exp RPAREN
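Right recursion works the same way in code: recursing on the right operand makes ** group to the right, as in this small Python sketch (the function name and token list encoding are illustrative):

```python
# Right-recursive parsing yields right associativity, as in
# factor -> exponent POW factor | exponent.

def eval_pow(tokens, pos=0):
    """Parse INT (** INT)* right-associatively; return (value, next_pos)."""
    base = int(tokens[pos])
    pos += 1
    if pos < len(tokens) and tokens[pos] == '**':
        rhs, pos = eval_pow(tokens, pos + 1)   # recurse on the right operand
        return base ** rhs, pos
    return base, pos

value, _ = eval_pow(['2', '**', '3', '**', '2'])
print(value)   # 512, i.e. 2 ** (3 ** 2), not (2 ** 3) ** 2 = 64
```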
Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., -3**4 is equivalent to (-3)**4, not to -(3**4)). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form: x op y op z.
exp → exp PLUS term | exp MINUS term | term
term → term TIMES factor | term DIVIDE factor | factor
factor → exponent POW factor | exponent
exponent → MINUS exponent | final
final → INTLITERAL | LPAREN exp RPAREN
Below is the grammar we used earlier for the language of boolean expressions, with two possible operands: true and false, and three possible operators: and, or, not:
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Question 1: Add nonterminals so that or has lowest precedence, then and, then not. Then change the grammar to reflect the fact that both and and or are left associative.
Question 2: Draw a parse tree (using your final grammar for Question 1) for the expression: true and not true.
Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms. For each form given below, we provide three different grammars that define the specified list language.
A Grammar for a Programming Language
Using CFGs for programming-language syntax has the notational advantage that the problem can be broken down into pieces. For example, let us begin sketching out a simple scripting language that starts with productions for some top-level nonterminals:
Program  $\longrightarrow$  FunctionList 
FunctionList  $\longrightarrow$  Function  Function FunctionList 
With these three productions, the basic shape of the program is already taking form, and some important constraints are put in place:
We can now move on to specifying the syntax of a function: A Function is the word "function", optionally preceded by the word "public", followed by an identifier (signifying the name of the function), followed by an open curly brace, followed by the function body, followed by a closing curly brace:
Function $\longrightarrow$ public function id lcurly FunctionBody rcurly
  | function id lcurly FunctionBody rcurly
Note that the above definition is hardly unique. Consider this alternative, but equivalent, definition for a function:
Function $\longrightarrow$ Access function id lcurly FunctionBody rcurly
Access $\longrightarrow$ public | $\varepsilon$
Note that the syntax of the language does not provide a list of arguments that the function takes (many popular programming languages expect a parenthesized parameter list before the function body).
A function body is a list of zero or more statements:
FunctionBody $\longrightarrow$ $\varepsilon$ | StatementList
StatementList $\longrightarrow$ Statement | Statement StatementList
From here, we can define various statement types, and so on.
A benefit of the Context-Free Grammar is that a "subgrammar" can be plugged in, allowing complex parse trees to be grown at the nonterminal at which they are rooted. For example, we could allow the rule:
Statement  $\longrightarrow$  id equals Exp 
(where equals is the symbol =, used here for assignment)
This grammar can then make use of the expression grammar that we defined above to build parse trees with arbitrarily complex mathematical expressions on the right hand side of an assignment statement.
Summary
To understand how a parser works, we start by understanding context-free grammars, which are used to define the language recognized by the parser. Important terminology includes:
Two common kinds of grammars are grammars for expression languages and grammars for lists. It is important to know how to write a grammar for an expression language that expresses operator precedence and associativity. It is also important to know how to write grammars for both nonempty and possibly empty lists, and for lists both with and without separators and terminators.
Syntax-Directed Definition
Contents
The goal of the parser is to produce output that can be consumed by the next phase of the compiler. Thus far, we have considered using a context-free grammar as a convenient mechanism for defining what strings are syntactically valid programs in the language, and we have discussed techniques for altering the context-free grammar such that the parse trees for those strings match the intended structure of the program. In summary, we've explored how to specify parse trees for a given token stream. However, the parse tree is a relatively cumbersome data structure to work with, made worse by the very alterations we introduced to aid in parsing. For our purposes, we will consider the parse tree to be a purely transitional data structure, useful for structuring the token stream, and then consumed to create the final output to be forwarded on to semantic analysis.
This reading introduces a method for specifying how to derive a result from a parse tree. This specification is known as a syntax-directed definition (SDD), while the entire technique is known as syntax-directed translation (SDT). The general technique of SDT is applicable to a variety of parsing tasks. To introduce SDD, we will explore several examples of mapping parse trees to a variety of different results, but the application we are ultimately interested in for compilers is to translate a parse tree into an AST. The name "syntax-directed definition" is apt, as its key feature is that it intermingles the definition of the translation with the definition of the language's syntax.
This involves doing a syntax-directed translation: translating from a sequence of tokens to some other form, based on the underlying syntax.
A syntax-directed translation is defined by augmenting the CFG: a translation rule is defined for each production. A translation rule defines the translation of the left-hand side nonterminal as a function of the translations of the nonterminals on the right-hand side and the values of the tokens on the right-hand side.
Below is the definition of a syntax-directed translation that translates an arithmetic expression to its integer value. When a nonterminal occurs more than once in a grammar rule, the corresponding translation rule uses subscripts to identify a particular instance of that nonterminal. For example, the rule
Exp  $\longrightarrow$  Exp plus Term
has two Exp nonterminals; Exp_{1} means the left-hand-side Exp, and Exp_{2} means the right-hand-side Exp.
Also, the notation xxx.value is used to mean the value associated with token xxx and the notation YYY.trans is used to mean the translation associated with nonterminal YYY.
CFG Production  Translation Rule  
Exp  $\longrightarrow$  Exp plus Term  Exp_{1}.trans = Exp_{2}.trans $+$ Term.trans  
Exp  $\longrightarrow$  Term  Exp.trans = Term.trans  
Term  $\longrightarrow$  Term times Factor  Term_{1}.trans = Term_{2}.trans $\times$ Factor.trans  
Term  $\longrightarrow$  Factor  Term.trans = Factor.trans  
Factor  $\longrightarrow$  intlit  Factor.trans = intlit.value  
Factor  $\longrightarrow$  lparens Exp rparens  Factor.trans = Exp.trans 
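As a concrete sketch of how these rules compute a value bottom-up, here is a small Python rendering. This is not part of the notes: the tuple-based parse-tree encoding and the `trans` helper are made-up illustrations of applying one translation rule per parse-tree node.

```python
# Sketch: each parse-tree node is a (symbol, children) pair; an intlit token
# carries its value. trans(node) applies the translation rule for the
# production used at that node, bottom-up.

def trans(node):
    symbol, children = node
    if symbol == "Exp":
        if len(children) == 3:                 # Exp -> Exp plus Term
            return trans(children[0]) + trans(children[2])
        return trans(children[0])              # Exp -> Term
    if symbol == "Term":
        if len(children) == 3:                 # Term -> Term times Factor
            return trans(children[0]) * trans(children[2])
        return trans(children[0])              # Term -> Factor
    if symbol == "Factor":
        if len(children) == 3:                 # Factor -> lparens Exp rparens
            return trans(children[1])
        return children[0][1]                  # Factor -> intlit (token value)
    raise ValueError(symbol)

# Hand-built parse tree for 2 * (4 + 5):
two = ("Factor", [("intlit", 2)])
four = ("Factor", [("intlit", 4)])
five = ("Factor", [("intlit", 5)])
four_plus_five = ("Exp", [("Exp", [("Term", [four])]),
                          ("plus", None),
                          ("Term", [five])])
tree = ("Exp", [("Term", [("Term", [two]),
                          ("times", None),
                          ("Factor", [("lparens", None),
                                      four_plus_five,
                                      ("rparens", None)])])])

print(trans(tree))  # -> 18
```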
Consider evaluating these rules on the input 2 * (4 + 5). The result is the following annotated parse tree:
Example 2
Consider a language of expressions that use the three operators: +, &&, == using the terminal symbols plus, and, equals, respectively. Integer literals are represented by the same intlit token we've used before, and true and false represent the literals true and false (note that we could have just as well defined a single boollit token that the scanner would populate with either $true$ or $false$).
Let's define a syntax-directed translation that type checks these expressions; i.e., for type-correct expressions, the translation will be the type of the expression (either $int$ or $bool$), and for expressions that involve type errors, the translation will be the special value $error$. We'll use the following type rules:
Here is the CFG and the translation rules:
CFG Production  Translation Rule  
Exp  $\longrightarrow$  Exp plus Term  if $($Exp_{2}.trans == $int$$)$ and $($Term.trans == $int$$)$ then
Exp_{1}.trans = $int$ else Exp_{1}.trans = $error$ 

Exp  $\longrightarrow$  Exp and Term  if $($Exp_{2}.trans == $bool$$)$ and $($Term.trans == $bool$$)$ then
Exp_{1}.trans = $bool$ else Exp_{1}.trans = $error$ 

Exp  $\longrightarrow$  Exp equals Term  if $($Exp_{2}.trans == Term.trans$)$ and $($Term.trans $\neq$ $error$$)$ then
Exp_{1}.trans = $bool$ else Exp_{1}.trans = $error$ 

Exp  $\longrightarrow$  Term  Exp.trans = Term.trans  
Term  $\longrightarrow$  true  Term.trans = $bool$  
Term  $\longrightarrow$  false  Term.trans = $bool$  
Term  $\longrightarrow$  intlit  Term.trans = $int$  
Term  $\longrightarrow$  lparens Exp rparens  Term.trans = Exp.trans 
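The logic of these three type rules can be sketched in Python. The helper names below are hypothetical, added only to make the rules executable; the type values "int", "bool", and "error" mirror the translation values above.

```python
# Sketch of the type-checking translation rules (hypothetical helpers):

def check_plus(t1, t2):
    # Exp -> Exp plus Term: both operands must be int
    return "int" if t1 == "int" and t2 == "int" else "error"

def check_and(t1, t2):
    # Exp -> Exp and Term: both operands must be bool
    return "bool" if t1 == "bool" and t2 == "bool" else "error"

def check_equals(t1, t2):
    # Exp -> Exp equals Term: operands must have the same, non-error type
    return "bool" if t1 == t2 and t1 != "error" else "error"

# (2 + 2) == 4: the parenthesized subexpression is int, so the whole
# expression type checks as bool.
lhs = check_plus("int", "int")
print(check_equals(lhs, "int"))   # -> bool

# Mixing types, as in (true && 3), yields error:
print(check_and("bool", "int"))   # -> error
```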
Here's an annotated parse tree for the input (2 + 2) == 4
The following grammar defines the language of base2 numbers:

B $\longrightarrow$ 0
B $\longrightarrow$ 1
B $\longrightarrow$ B 0
B $\longrightarrow$ B 1
Question #1: Define a syntaxdirected translation so that the translation of a binary number is its base 10 value.
Question #2: Illustrate your translation scheme by drawing the parse tree for 1001 and annotating each nonterminal in the tree with its translation.
Abstract-Syntax Trees
While syntax-directed definition serves as a general technique for specifying translations of trees for many purposes, one particular application of SDD is of special interest to compiler writers: inducing an abstract-syntax tree (AST) from the parse tree. The AST can be thought of as a condensed form of the parse tree, eliding incidental syntactic details of the program input.
Students are occasionally surprised that the conceptual workflow of the compiler is to build the parse tree only to immediately discard it in favor of a different tree data structure. In this chapter we will discuss what the AST is, how the AST differs from the parse tree, and how an SDD may be specified to produce an AST from a parse tree. Hopefully, this treatment communicates why the parse tree is a useful part of syntactic analysis, as well as why the AST is a superior internal representation for later phases of the compiler. It's also worth mentioning that in practice, the parse tree and AST are constructed at the same time in a memory-efficient way that keeps much of the parse tree implicit.
The AST vs. the Parse Tree
First, let's identify the ways in which an abstractsyntax tree (AST) differs from a parse tree:
In general, the AST is a better structure for later stages of the compiler because it omits details having to do with the source language, and just contains information about the essential structure of the program.
Below is an example of the parse tree and the AST for the expression 3 * (4 + 2), using the usual arithmetic-expression grammar that reflects the precedences and associativities of the operators. Note that the parentheses are not needed in the AST because the structure of the AST defines how the subexpressions are grouped.
It's worth noting a couple of relationships between AST and parse tree:
For constructs other than expressions, the compiler writer has some choices when defining the AST, but remember that lists (e.g., lists of declarations, lists of statements, lists of parameters) should be flattened, that operators (e.g., "assign", "while", "if") go at internal nodes, not at leaves, and that syntactic details are omitted.

For example:
Input:
{ x = 0; while (x<10) { x = x+1; } y = x*2; }
Parse Tree:
Translation Rules That Build an AST
To define a syntax-directed translation so that the translation of an input is the corresponding AST, we first need operations that create AST nodes. Let's use some object-oriented pseudocode, and assume that we have the following class hierarchy:
class ExpNode { }
class IntLitNode extends ExpNode {
    public IntLitNode(int val) { ... }
}
class PlusNode extends ExpNode {
    public PlusNode( ExpNode e1, ExpNode e2 ) { ... }
}
class TimesNode extends ExpNode {
    public TimesNode( ExpNode e1, ExpNode e2 ) { ... }
}
Now we can define a syntax-directed translation for simple arithmetic expressions, so that the translation of an expression is its AST:
CFG Production  Translation rules  
exp  $\longrightarrow$  exp PLUS term  exp_{1}.trans = new PlusNode(exp_{2}.trans, term.trans)  
exp  $\longrightarrow$  term  exp.trans = term.trans  
term  $\longrightarrow$  term TIMES factor  term_{1}.trans = new TimesNode(term_{2}.trans, factor.trans)  
term  $\longrightarrow$  factor  term.trans = factor.trans  
factor  $\longrightarrow$  INTLITERAL  factor.trans = new IntLitNode(INTLITERAL.value)  
factor  $\longrightarrow$  LPARENS exp RPARENS  factor.trans = exp.trans 
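A Python analogue of this translation is sketched below. The class names mirror the pseudocode hierarchy above; the `unparse` helper is not part of the notes and is added only to display the grouping that the AST encodes.

```python
# Sketch: AST node classes analogous to the ExpNode hierarchy.

class ExpNode:
    pass

class IntLitNode(ExpNode):
    def __init__(self, val):
        self.val = val

class PlusNode(ExpNode):
    def __init__(self, e1, e2):
        self.e1, self.e2 = e1, e2

class TimesNode(ExpNode):
    def __init__(self, e1, e2):
        self.e1, self.e2 = e1, e2

# Applying the translation rules bottom-up to the parse tree for 2 + 3 * 4:
#   each INTLITERAL's factor.trans is an IntLitNode,
#   term.trans for "3 * 4" is a TimesNode,
#   exp.trans for the whole input is a PlusNode.
ast = PlusNode(IntLitNode(2), TimesNode(IntLitNode(3), IntLitNode(4)))

def unparse(n):
    """Print the AST fully parenthesized, to show how it groups operands."""
    if isinstance(n, IntLitNode):
        return str(n.val)
    op = "+" if isinstance(n, PlusNode) else "*"
    return "(" + unparse(n.e1) + op + unparse(n.e2) + ")"

print(unparse(ast))  # -> (2+(3*4))
```

Note how the AST, with no parenthesis tokens at all, still records that the multiplication is performed before the addition.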
Illustrate the syntax-directed translation defined above by drawing the parse tree for the expression 2 + 3 * 4, and annotating the parse tree with its translation (i.e., each nonterminal in the tree will have a pointer to the AST node that is the root of the subtree of the AST that is the nonterminal's translation).
Parsing
Given a CFG and a string (a sequence of tokens), the goal of parsing is to determine whether the string is in the language of the CFG. The answer is yes iff the string can be derived from the CFG's start nonterminal. Equivalently, the answer is yes iff we can build a parse tree for the string, with the CFG's start nonterminal at the root.
There are a number of algorithms that can answer this question for any CFG, even ambiguous CFGs. Their worst-case running time is O(N^{3}), where N is the length of the input (the number of tokens in the input string). Some algorithms have better running times for unambiguous CFGs. For example, Earley's algorithm has worst-case runtime O(N^{2}) for unambiguous grammars, but even a quadratic-time algorithm may be considered too slow.
Fortunately, there are classes of grammars for which O(N) parsers can be built (and given a grammar, we can quickly test whether it is in such a class). Two such classes are the LL(1) grammars and the LALR(1) grammars.
LALR(1) grammars are:
In this class, we will learn about three parsing techniques:
CYK Parsing
Recall that the goal of parsing is to determine whether the input string is in the language of the CFG, and that there are general algorithms that can answer this question for any CFG, even ambiguous CFGs, with worst-case running time O(N^{3}) (where N is the number of tokens in the input). However, many of those algorithms are rather complicated, so we will look at one that is not so complicated, but always has O(N^{3}) runtime. This algorithm is called the CYK algorithm, for its inventors: Cocke, Younger, and Kasami.
The CYK Algorithm
The CYK algorithm can be used to parse any CFG (i.e., to determine, for any input, whether that input is in the language of a given CFG). It essentially builds all of the possible parse trees for all (consecutive) fragments of the input, bottom-up. It accepts the input iff it is able to build a complete parse tree: one with the CFG's start nonterminal at the root, and the entire input at the leaves.
Chomsky Normal Form
The CYK algorithm requires that the CFG be in CNF: Chomsky normal form. This means that the CFG rules must all be in one of the following forms:
$X$ $\longrightarrow$ t
$X$ $\longrightarrow$ $A$ $B$
where t is a terminal and $A$, $B$ are nonterminals. Rules with $\varepsilon$ on the right-hand side:
$X$ $\longrightarrow$ $\varepsilon$
are not allowed, unless the left nonterminal is the start nonterminal. If there is a rule
$S$ $\longrightarrow$ $\varepsilon$
(where $S$ is the start nonterminal), then $S$ cannot occur on the right-hand side of any rule.
Conversion to CNF
Fortunately, it is easy to convert any CFG to CNF. The conversion involves the following steps: (1) eliminate $\varepsilon$-rules, (2) eliminate unit rules, and (3) fix the remaining rules so that each right-hand side is either a single terminal or exactly two nonterminals.
We'll start with a very simple grammar for function calls (the function name, followed by a list of zero or more commaseparated arguments, where each argument can only be an identifier).
$F$  $\longrightarrow$  id ( $A$ ) 
$A$  $\longrightarrow$  $\varepsilon$ 
$A$  $\longrightarrow$  $N$ 
$N$  $\longrightarrow$  id 
$N$  $\longrightarrow$  id , $N$ 
Step 1 (eliminate $\varepsilon$ rules):
This CFG has one rule with epsilon on the righthand side (the second rule). That rule has $A$ on the lefthand side. We'll remove the $\varepsilon$rule by copying the rule that has $A$ on the right (the first rule in the CFG above), and replacing that $A$ with nothing ($\varepsilon$):
$F$  $\longrightarrow$  id ( ) 
Now we have this CFG:
$F$  $\longrightarrow$  id ( $A$ ) 
$F$  $\longrightarrow$  id ( ) 
$A$  $\longrightarrow$  $N$ 
$N$  $\longrightarrow$  id 
$N$  $\longrightarrow$  id , $N$ 
Note that if $A$ occurred more than once on the right of some rule, we would have to make as many copies as needed to get all combinations of $A$ being there and being replaced by $\varepsilon$. For example, if we had this rule:
$X$  $\longrightarrow$  $A$ x $A$ y $A$ 
we'd have to add all of the following new rules, as well as keeping the original rule:
$X$  $\longrightarrow$  $A$ x $A$ y 
$X$  $\longrightarrow$  $A$ x y $A$ 
$X$  $\longrightarrow$  $A$ x y 
$X$  $\longrightarrow$  x $A$ y $A$ 
$X$  $\longrightarrow$  x $A$ y 
$X$  $\longrightarrow$  x y $A$ 
$X$  $\longrightarrow$  x y 
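The "all combinations" replacement above can be sketched mechanically. The helper below is an illustration, not part of the notes; it assumes a right-hand side is a list of symbols, and it drops the fully-empty combination (which would introduce a new $\varepsilon$-rule).

```python
from itertools import product

def epsilon_variants(rhs, nullable):
    """All right-hand sides obtained by independently keeping or dropping
    each occurrence of the nullable nonterminal in rhs. The empty result is
    excluded, since it would be a new epsilon-rule."""
    positions = [i for i, sym in enumerate(rhs) if sym == nullable]
    variants = set()
    for keep in product([True, False], repeat=len(positions)):
        dropped = {p for p, k in zip(positions, keep) if not k}
        new_rhs = tuple(s for i, s in enumerate(rhs) if i not in dropped)
        if new_rhs:
            variants.add(new_rhs)
    return variants

# X -> A x A y A: three occurrences of A give 2^3 = 8 rules
# (the original plus the seven new ones listed above).
for v in sorted(epsilon_variants(["A", "x", "A", "y", "A"], "A")):
    print(" ".join(v))
```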
Furthermore, if we have a rule
$S$  $\longrightarrow$  $\varepsilon$ 
where S is the start nonterminal, and there is a grammar rule with S on the right, then we need to do the following:
Step 2 (eliminate unit rules):
Our current CFG looks like this:
$F$  $\longrightarrow$  id ( $A$ ) 
$F$  $\longrightarrow$  id ( ) 
$A$  $\longrightarrow$  $N$ 
$N$  $\longrightarrow$  id 
$N$  $\longrightarrow$  id , $N$ 
It includes one unit rule, a rule with exactly 1 nonterminal (and no terminals) on the right:
$A$  $\longrightarrow$  $N$ 
Because there are no other rules with $A$ on the left, we can remove the unit rule by replacing every occurrence of $A$ on the right side of some rule with $N$. (If there were other rules with $A$ on the left, we would instead make copies of the rules with $A$ on the right, and replace $A$ with $N$ in the copies). Here's our CFG after removing the unit rule:
$F$  $\longrightarrow$  id ( $N$ ) 
$F$  $\longrightarrow$  id ( ) 
$N$  $\longrightarrow$  id 
$N$  $\longrightarrow$  id , $N$ 
Step 3 (fix remaining rules):
The last step in conversion to CNF is to fix the remaining rules so that each one has either a single terminal, or exactly two nonterminals on the right.
Handle rules with terminals on the right
First, we find the rules that have a terminal plus some other symbols on the right:
$F$  $\longrightarrow$  id ( $N$ ) 
$F$  $\longrightarrow$  id ( ) 
$N$  $\longrightarrow$  id , $N$ 
For each such terminal t, we add a new rule
$X$  $\longrightarrow$  t 
(where $X$ is a new nonterminal), and we replace t with $X$ in every rule where t appears together with other symbols (a rule whose right-hand side is just the single terminal is already in CNF). For our grammar, using $I$ for id, $L$ for (, $R$ for ), and $C$ for the comma, the result is:
$F$  $\longrightarrow$  $I$ $L$ $N$ $R$ 
$F$  $\longrightarrow$  $I$ $L$ $R$ 
$N$  $\longrightarrow$  id 
$N$  $\longrightarrow$  $I$ $C$ $N$ 
$I$  $\longrightarrow$  id 
$L$  $\longrightarrow$  ( 
$R$  $\longrightarrow$  ) 
$C$  $\longrightarrow$  , 
Handle rules with more than 2 nonterminals on the right
Now, for every righthand side that has more than two nonterminals, we replace all but the first nonterminal on the right with a new nonterminal, and we add a new rule for the new nonterminal. For example, we handle the rule
$F$  $\longrightarrow$  $I$ $L$ $N$ $R$ 
as follows:
$F$  $\longrightarrow$  $I$ $W$ 
$W$  $\longrightarrow$  $L$ $N$ $R$ 
Then we handle the new rule (with W on the left) as follows:
$W$  $\longrightarrow$  $L$ $X$ 
$X$  $\longrightarrow$  $N$ $R$ 
If we take care of the other two "problem" rules:
$F$  $\longrightarrow$  $I$ $L$ $R$ 
$N$  $\longrightarrow$  $I$ $C$ $N$ 
We finish with the following CFG (in CNF):
$F$  $\longrightarrow$  $I$ $W$ 
$F$  $\longrightarrow$  $I$ $Y$ 
$W$  $\longrightarrow$  $L$ $X$ 
$X$  $\longrightarrow$  $N$ $R$ 
$Y$  $\longrightarrow$  $L$ $R$ 
$N$  $\longrightarrow$  id 
$N$  $\longrightarrow$  $I$ $Z$ 
$Z$  $\longrightarrow$  $C$ $N$ 
$I$  $\longrightarrow$  id 
$L$  $\longrightarrow$  ( 
$R$  $\longrightarrow$  ) 
$C$  $\longrightarrow$  , 
The CFG below is for a grammar of very simple statements. The terminal symbols are id, =, (, ), ++, and read. Convert this CFG to CNF.
$S$  $\longrightarrow$  id = id 
$S$  $\longrightarrow$  id ( ) 
$S$  $\longrightarrow$  id ++ 
$S$  $\longrightarrow$  read ( id ) 
$S$  $\longrightarrow$  $S$ $S$ 
Now that we have converted our CFG to Chomsky Normal form, we can use it to try to parse an input. Let's use the input f(x,y) as an example. The actual sequence of tokens for that input would be: id ( id , id )
To parse an input of length N, we use an N-by-N grid. First, we fill in the squares along the diagonal. The k^{th} square contains the nonterminal(s) $X$ such that there is a CFG rule
$X$ $\longrightarrow$ t
and t is the k^{th} token. Here's the 6-by-6 grid for our example, with the diagonal filled in (and with the rows and columns numbered):
Now we fill in each successive diagonal above the one that's already filled in. When filling in a square, we consider what's already in the squares to the left and below it. For example, to fill in the square in position 2,1, we look at the contents of the square to its left ($I$,$N$) and below it ($L$). If there were a grammar rule with "$I$ $L$" or with "$N$ $L$" on the righthand side, we would put the leftside nonterminal into the square in position 2,1. However, for our example, there are no such grammar rules, so square 2,1 stays empty. The first new square that we can fill in is in position 5,4. That square has $C$ to its left, and $I$, $N$ below it. There is a grammar rule $Z$ $\longrightarrow$ $C$ $N$, so we put a $Z$ in square 5,4. If there had also been a grammar rule $H$ $\longrightarrow$ $C$ $I$, we would also have put an $H$ in square 5,4. Note that we also fill in square 5,5 due to the rule $X$ $\longrightarrow$ $N$ $R$. Here is the grid so far:
The squares in the next diagonal each have two squares to their left and two squares below. We look at pairs of squares: first the square all the way to the left and the square immediately below, then the square immediately to the left and the one two squares down. If either square in the current pair is empty, we go on to the next pair. Each time we look at a pair of nonempty squares, we are looking for a nonterminal $A$ in the square to the left and a nonterminal $B$ in the square that is below, such that there is a grammar rule with $A$ $B$ on the right. If there is such a grammar rule, we fill in the current square with the left-side nonterminal. For example, when trying to fill in the square in position 3,5, we see $I$, $N$ in position 3,3, and $Z$ in position 4,5. There is a grammar rule $N$ $\longrightarrow$ $I$ $Z$, so we put an $N$ in position 3,5. Here is the result of filling in the whole grid:
To understand this process, remember that the CYK algorithm works by building up parse trees for longer and longer fragments of the input. Each box in the upperright part of the grid represents the root of one or more parse trees. The nonterminal(s) in a square is/are the root nonterminal(s) of those trees. The children of the root are the nonterminals in the pairs of squares, one to the left, and one below, that caused the nonterminal to be filled in. The five nonterminals in the upperright part of our grid represent these five parse trees:
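The grid-filling procedure can be sketched compactly in Python. This is an illustrative recognizer (not from the notes), run on the CNF grammar built above and the token sequence for f(x,y); the dictionary encoding of rules is a made-up convenience.

```python
# CNF grammar from the conversion above, keyed for fast lookup.
terminal_rules = {            # rules of the form X -> t
    "id": {"N", "I"},
    "(": {"L"},
    ")": {"R"},
    ",": {"C"},
}
binary_rules = {              # rules of the form X -> A B, keyed by (A, B)
    ("I", "W"): {"F"}, ("I", "Y"): {"F"},
    ("L", "X"): {"W"}, ("N", "R"): {"X"},
    ("L", "R"): {"Y"}, ("I", "Z"): {"N"},
    ("C", "N"): {"Z"},
}

def cyk(tokens, start="F"):
    n = len(tokens)
    # grid[i][j] = nonterminals that derive tokens[i..j] (0-based, inclusive)
    grid = [[set() for _ in range(n)] for _ in range(n)]
    for k, t in enumerate(tokens):                  # the diagonal
        grid[k][k] = set(terminal_rules.get(t, ()))
    for length in range(2, n + 1):                  # successive diagonals
        for i in range(n - length + 1):
            j = i + length - 1
            for split in range(i, j):               # left part + lower part
                for a in grid[i][split]:
                    for b in grid[split + 1][j]:
                        grid[i][j] |= binary_rules.get((a, b), set())
    return start in grid[0][n - 1]                  # upper-right corner

print(cyk(["id", "(", "id", ",", "id", ")"]))   # f(x,y) -> True
print(cyk(["id", "(", ",", ")"]))               # f(,)   -> False
```

The triple loop mirrors the description above: `grid[i][j]` is filled by pairing each "square to the left" (`grid[i][split]`) with the corresponding "square below" (`grid[split+1][j]`).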
Use the statement CFG that was converted to CNF for CYK Self-Study #1 to parse the input x ++ y = z y ++ (i.e., build and fill in the grid).
The CYK algorithm discovers that an input is in the language of a CFG iff the CFG's start nonterminal appears in the upper-right corner of the grid. For our example, F is the start nonterminal, and it does appear in square 1,6, so the example input is in the language of the CFG. The rightmost parse tree shown above is the (unique) parse tree for the example input.
LL(1) Parsing
Contents
Here's a very simple example, using a grammar that defines the language of balanced parentheses or square brackets, and running the parser on the input "( [ ] ) EOF". Note that in the examples on this page we will sometimes name terminal symbols using single characters (such as (, ), [, and ]) instead of the token names (lparen, rparen, etc.). Also note that in the picture, the top of stack is to the left.
Grammar:
$S$  $\longrightarrow$  $\varepsilon$  |  ( $S$ )  |  [ $S$ ] 
Parse Table:
(  )  [  ]  EOF  
$S$  ( $S$ )  $\varepsilon$  [ $S$ ]  $\varepsilon$  $\varepsilon$ 
Input seen so far  stack  Action  
(  $S$ EOF  pop, push ( $S$ )  
(  ( $S$ ) EOF  pop, scan  
([  $S$ ) EOF  pop, push [ $S$ ]  
([  [ $S$ ] ) EOF  pop, scan  
([]  $S$ ] ) EOF  pop, push nothing  
([]  ] ) EOF  pop, scan  
([])  ) EOF  pop, scan  
([]) EOF  EOF  pop, scan  
([]) EOF  empty stack: input accepted 
Remember, it is not always possible to build a predictive parser given a CFG; only if the CFG is LL(1). For example, the following grammar is not LL(1) (but it is LL(2)):
$S$  $\longrightarrow$  ( $S$ )  |  [ $S$ ]  |  ( )  |  [ ] 
If we try to parse an input that starts with a left paren, we are in trouble! We don't know whether to choose the first production: $S$ $\longrightarrow$ ( $S$ ), or the third one: $S$ $\longrightarrow$ ( ). If the next token is a right paren, we want to push "()". If the next token is a left paren, we want to push the three symbols "($S$)". So here we need two tokens of lookahead.
Draw a picture like the one given above to illustrate what the parser for the grammar:
does on the input: "[[]]".
We need to answer two important questions:
It turns out that there is really just one answer: if we build the parse table and no element of the table contains more than one grammar rule righthand side, then the grammar is LL(1).
Before saying how to build the table we will consider two properties that preclude a context-free grammar from being LL(1): left-recursive grammars and grammars that are not left factored. We will also consider some transformations that can be applied to such grammars to make them LL(1).
First, we will introduce one new definition:

A nonterminal $X$ is useless if either:
1. $X$ cannot be reached from the start nonterminal (i.e., $X$ never appears in any sequence of symbols derivable from the start nonterminal), or
2. $X$ cannot derive any sequence of terminals.
For case 1:
$S$  $\longrightarrow$  $A$ $B$ 
$A$  $\longrightarrow$  +  |  -  |  $\varepsilon$ 
$B$  $\longrightarrow$  digit  |  $B$ digit 
$C$  $\longrightarrow$  . $B$ 
For case 2:
$S$  $\longrightarrow$  $X$  |  $Y$ 
$X$  $\longrightarrow$  ( ) 
$Y$  $\longrightarrow$  ( $Y$ $Y$ ) 
From now on, "context-free grammar" means a grammar without useless nonterminals.
Left Recursion
In general, it is not a problem for a grammar to be recursive. However, if a grammar is left recursive, it is not LL(1). Fortunately, we can change a grammar to remove immediate left recursion without changing the language of the grammar. Here is how to do the transformation:

Given two productions of the form:
$A$  $\longrightarrow$  $A$ $\alpha$ 
$A$  $\longrightarrow$  $\beta$ 
(where $\alpha$ and $\beta$ are sequences of terminals and/or nonterminals, and $\beta$ does not start with $A$), replace them with:
$A$  $\longrightarrow$  $\beta$ $A'$ 
$A'$  $\longrightarrow$  $\alpha$ $A'$  |  $\varepsilon$ 
Using this rule, we create a new grammar from a grammar with immediate left recursion. The new grammar is equivalent to the original one; i.e., the two grammars derive exactly the same sets of strings, but the new one is not immediately left recursive (and so has a chance of being LL(1)).
To illustrate why the new grammar is equivalent to the original one, consider the parse trees that can be built using the original grammar:
etc.
Note that the derived strings are:
That is, they are of the form "$\beta$, followed by zero or more $\alpha$s". The new grammar derives the same set of strings, but the parse trees have a different shape (the single $\beta$ is derived right away, and then the zero or more $\alpha$s):
etc.
Consider, for instance, the grammar for arithmetic expressions involving only subtraction:
Exp  $\longrightarrow$  Exp minus Factor  |  Factor 
Factor  $\longrightarrow$  intliteral  |  ( Exp ) 
Notice that the first rule (Exp $\longrightarrow$ Exp minus Factor) has immediate left recursion, so this grammar is not LL(1). (For example, if the first token is intliteral, you don't know whether to choose the production Exp $\longrightarrow$ Exp minus Factor, or Exp $\longrightarrow$ Factor. If the next token is minus, then you should choose Exp $\longrightarrow$ Exp minus Factor, but if the next token is EOF, then you should choose Exp $\longrightarrow$ Factor.)
Using the transformation defined above, we remove the immediate left recursion, producing the following new grammar:
Exp  $\longrightarrow$  Factor Exp' 
Exp'  $\longrightarrow$  minus Factor Exp'  |  $\varepsilon$ 
Factor  $\longrightarrow$  intliteral  ( Exp ) 
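The transformation itself can be sketched as a small function over a grammar represented as a dictionary. The encoding is an illustrative choice, not part of the notes: each nonterminal maps to a list of right-hand sides, each a list of symbols, with the empty list standing for $\varepsilon$.

```python
def remove_immediate_left_recursion(grammar, a):
    """Replace A -> A alpha | beta with A -> beta A' and
    A' -> alpha A' | epsilon (epsilon encoded as [])."""
    recursive = [rhs[1:] for rhs in grammar[a] if rhs and rhs[0] == a]  # alphas
    other = [rhs for rhs in grammar[a] if not rhs or rhs[0] != a]       # betas
    if not recursive:
        return grammar                    # nothing to do
    a_prime = a + "'"
    new = dict(grammar)
    new[a] = [beta + [a_prime] for beta in other]
    new[a_prime] = [alpha + [a_prime] for alpha in recursive] + [[]]
    return new

# The subtraction grammar from above:
g = {
    "Exp": [["Exp", "minus", "Factor"], ["Factor"]],
    "Factor": [["intliteral"], ["lparens", "Exp", "rparens"]],
}
for lhs, rhss in remove_immediate_left_recursion(g, "Exp").items():
    for rhs in rhss:
        print(lhs, "->", " ".join(rhs) or "epsilon")
```

Running this reproduces the transformed grammar shown above: Exp $\longrightarrow$ Factor Exp' and Exp' $\longrightarrow$ minus Factor Exp' | $\varepsilon$.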
Let's consider what the predictive parser built using this grammar does when the input starts with an integer:
Unfortunately, there is a major disadvantage of the new grammar, too. Consider the parse trees for the string 2 - 3 - 4 for the old and the new grammars:
Before eliminating Left Recursion:
After eliminating Left Recursion:
The original parse tree shows the underlying structure of the expression; in particular it groups 2 - 3 in one subtree to reflect the fact that subtraction is left associative. The parse tree for the new grammar is a mess! Its subtrees don't correspond to the subexpressions of 2 - 3 - 4 at all! Fortunately, we can design a predictive parser to create an abstract-syntax tree that does correctly reflect the structure of the parsed code even though the grammar's parse trees do not.
Note that the rule for removing immediate left recursion given above only handled a somewhat restricted case, where there was only one leftrecursive production. Here's a more general rule for removing immediate left recursion:
Note also that there are rules for removing nonimmediate left recursion; for example, you can read about how to do that in the compiler textbook by Aho, Sethi & Ullman, on page 177. However, we will not discuss that issue here.
A second property that precludes a grammar from being LL(1) is if it is not left factored, i.e., if a nonterminal has two productions whose righthand sides have a common prefix. For example, the following grammar is not left factored:
This problem is solved by leftfactoring, as follows:
For example, consider the following productions:
$Exp$ $\longrightarrow$ ( $Exp$ )  |  ( )
Using the rule defined above, they are replaced by:
$Exp$ $\longrightarrow$ ( $Exp'$
$Exp'$ $\longrightarrow$ $Exp$ )  |  )
Here's the more general algorithm for left factoring (when there may be more than two productions with a common prefix):

For each nonterminal $A$:
Note that this transformation (like the one for removing immediate left recursion) has the disadvantage of making the grammar much harder to understand. However, it is necessary if you need an LL(1) grammar.
Here's an example that demonstrates both left-factoring and immediate left-recursion removal:
Using the same grammar: exp ::= ( exp )  |  exp exp  |  ( ), do left factoring first, then remove immediate left recursion.
Recall: A predictive parser can only be built for an LL(1) grammar. A grammar is not LL(1) if it is left recursive or is not left factored.
To build the table, we must compute FIRST and FOLLOW sets for the grammar.
Ultimately, we want to define FIRST sets for the right-hand sides of each of the grammar's productions. To do that, we define FIRST sets for arbitrary sequences of terminals and/or nonterminals, or $\varepsilon$ (since that's what can be on the right-hand side of a grammar production). The idea is that for a sequence $\alpha$, FIRST($\alpha$) is the set of terminals that begin the strings derivable from $\alpha$, and also, if $\alpha$ can derive $\varepsilon$, then $\varepsilon$ is in FIRST($\alpha$). Using derivation notation: $$\mathrm{FIRST}(\alpha) = \left\{ \; t \; \middle| \; \left( t \in \Sigma \wedge \alpha \stackrel{*}{\Longrightarrow} t \beta \right) \vee \left( t = \varepsilon \wedge \alpha \stackrel{*}{\Longrightarrow} \varepsilon \right) \right\}$$ To define FIRST($\alpha$) for arbitrary $\alpha$, we start by defining FIRST(X), for a single symbol X (a terminal, a nonterminal, or $\varepsilon$):
For example, consider computing FIRST sets for each of the nonterminals in the following grammar:
Exp  $\longrightarrow$  Term Exp' 
Exp'  $\longrightarrow$  minus Term Exp'  |  $\varepsilon$ 
Term  $\longrightarrow$  Factor Term' 
Term'  $\longrightarrow$  divide Factor Term'  |  $\varepsilon$ 
Factor  $\longrightarrow$  intlit  |  lparens Exp rparens 
Here are the FIRST sets (starting with nonterminal Factor and working up, since we need to know FIRST(Factor) to compute FIRST(Term), and we need to know FIRST(Term) to compute FIRST(Exp)):
Once we have computed FIRST(X) for each terminal and nonterminal X, we can compute FIRST($\alpha$) for every production's righthandside $\alpha$. In general, $\alpha$ will be of the form:
 X_{1} X_{2} ... X_{n}
For the example grammar above, here are the FIRST sets for each production righthand side:
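These FIRST sets can also be computed mechanically, by iterating until no set grows. The sketch below is illustrative (not from the notes); the dictionary encoding of the grammar and the use of "" to stand for $\varepsilon$ are made-up conventions.

```python
# The subtraction/division grammar above; [] encodes an epsilon right-hand side.
grammar = {
    "Exp":    [["Term", "Exp'"]],
    "Exp'":   [["minus", "Term", "Exp'"], []],
    "Term":   [["Factor", "Term'"]],
    "Term'":  [["divide", "Factor", "Term'"], []],
    "Factor": [["intlit"], ["lparens", "Exp", "rparens"]],
}

def first_of_seq(seq, first):
    """FIRST of a sequence of symbols, given FIRST sets for nonterminals."""
    result = set()
    for sym in seq:
        if sym not in grammar:            # terminal: FIRST(t) = {t}
            result.add(sym)
            return result
        result |= first[sym] - {""}
        if "" not in first[sym]:          # sym cannot derive epsilon: stop
            return result
    result.add("")                        # every symbol could derive epsilon
    return result

def compute_first(grammar):
    first = {x: set() for x in grammar}
    changed = True
    while changed:                        # fixed-point iteration
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_seq(rhs, first)
                if not new <= first[x]:
                    first[x] |= new
                    changed = True
    return first

first = compute_first(grammar)
print(sorted(first["Factor"]))   # -> ['intlit', 'lparens']
print(sorted(first["Exp'"]))     # -> ['', 'minus']
```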
Why do we care about the FIRST($\alpha$) sets? During parsing, suppose that there are two productions:
Production 1:  $A$  $\longrightarrow$  $\alpha$ 
Production 2:  $A$  $\longrightarrow$  $\beta$ 
FOLLOW sets are only defined for single nonterminals. The definition is as follows:

For a nonterminal $A$, FOLLOW($A$) is the set of terminals that can appear immediately to the right of $A$ in some partial derivation; i.e., terminal t is in FOLLOW($A$) if $S \stackrel{+}{\Longrightarrow} \ldots A \; \mathrm{t} \ldots$ where t is a terminal. If $A$ can be the rightmost symbol in a derivation, then EOF is in FOLLOW($A$).
Using notation: $$\mathrm{FOLLOW}(A) = \left\{ \; \mathrm{t} \; \middle| \; \left( \mathrm{t} \in \Sigma \wedge S \stackrel{+}{\Longrightarrow} \alpha \; A \; \mathrm{t} \; \beta \right) \vee \left( \mathrm{t} = \mathrm{EOF} \wedge S \stackrel{*}{\Longrightarrow} \alpha A \right) \right\}$$
Here are some pictures illustrating the conditions under which symbols a, c, and EOF are in the FOLLOW set of nonterminal A:
Figure 1
Figure 2
Figure 3
It is worth noting that:
Here's an example of FOLLOW sets (and the FIRST sets we need to compute them). In this example, nonterminals are uppercase letters, and terminals are lowercase letters.
$S$  $\longrightarrow$  $B$ c  |  $D$ $B$ 
$B$  $\longrightarrow$  a b  |  c $S$ 
$D$  $\longrightarrow$  d  |  $\varepsilon$ 
$\alpha$  FIRST($\alpha$)  FOLLOW($\alpha$)  
$D$  {d, $\varepsilon$}  {a, c}  
$B$  {a, c}  {c, EOF}  
$S$  {a, c, d}  {EOF, c}  Note that FOLLOW($S$) always includes EOF 
Now let's consider why we care about FOLLOW sets:
Here are five grammar productions for (simplified) method headers:
1. methodHeader -> VOID ID LPAREN paramList RPAREN
2. paramList -> epsilon
3. paramList -> nonEmptyParamList
4. nonEmptyParamList -> ID ID
5. nonEmptyParamList -> ID ID COMMA nonEmptyParamList
Question 1: Compute the FIRST and FOLLOW sets for the three nonterminals, and the FIRST sets for each production right-hand side.
Question 2: Draw a picture to illustrate what the predictive parser will do, given the input sequence of tokens: "VOID ID LPAREN RPAREN EOF". Include an explanation of how the FIRST and FOLLOW sets are used when there is a nonterminal on the topofstack that has more than one production.
How to Build Parse Tables
Recall that the form of the parse table is:
Table entry[$X$,a] is either empty (if having $X$ on top of the stack with a as the current token means a syntax error) or contains the right-hand side of a production whose left-hand-side nonterminal is $X$; that right-hand side is what should be pushed.
To build the table, we fill in the rows one at a time for each nonterminal $X$ as follows:
for each production $X \longrightarrow \alpha$:
    for each terminal t in FIRST($\alpha$):
        put $\alpha$ in Table[$X$, t]
    if $\varepsilon \in$ FIRST($\alpha$) then:
        for each terminal t in FOLLOW($X$):
            put $\alpha$ in Table[$X$, t]
The grammar is not LL(1) iff there is more than one entry for any cell in the table.
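Here is a sketch of this table-building procedure applied to the subtraction/division grammar from the FIRST-sets discussion above. It is illustrative, not from the notes: the FIRST and FOLLOW sets are hardcoded (as computed for that grammar), and "" stands for $\varepsilon$.

```python
grammar = {
    "Exp":    [["Term", "Exp'"]],
    "Exp'":   [["minus", "Term", "Exp'"], []],
    "Term":   [["Factor", "Term'"]],
    "Term'":  [["divide", "Factor", "Term'"], []],
    "Factor": [["intlit"], ["lparens", "Exp", "rparens"]],
}
FIRST = {"Exp": {"intlit", "lparens"}, "Exp'": {"minus", ""},
         "Term": {"intlit", "lparens"}, "Term'": {"divide", ""},
         "Factor": {"intlit", "lparens"}}
FOLLOW = {"Exp": {"rparens", "EOF"}, "Exp'": {"rparens", "EOF"},
          "Term": {"minus", "rparens", "EOF"},
          "Term'": {"minus", "rparens", "EOF"},
          "Factor": {"divide", "minus", "rparens", "EOF"}}

def first_of_rhs(rhs):
    """FIRST of a production's right-hand side, from the FIRST sets above."""
    result = set()
    for sym in rhs:
        if sym not in grammar:            # terminal
            result.add(sym)
            return result
        result |= FIRST[sym] - {""}
        if "" not in FIRST[sym]:
            return result
    result.add("")
    return result

table = {}   # (nonterminal, terminal) -> list of right-hand sides
for x, rhss in grammar.items():
    for rhs in rhss:
        f = first_of_rhs(rhs)
        for t in f - {""}:                # FIRST-based entries
            table.setdefault((x, t), []).append(rhs)
        if "" in f:                       # epsilon: FOLLOW-based entries
            for t in FOLLOW[x]:
                table.setdefault((x, t), []).append(rhs)

conflicts = [cell for cell, rhss in table.items() if len(rhss) > 1]
print(conflicts)   # -> [] : every cell has one entry, so the grammar is LL(1)
```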
$S$  $\longrightarrow$  $B$ c $\;\;$ $D$ $B$ 
$B$  $\longrightarrow$  a b $\;\;$ c $S$ 
$D$  $\longrightarrow$  d $\;\;$ $\varepsilon$ 
$\alpha$  FIRST($\alpha$)  FOLLOW($\alpha$)  
$D$  {d, $\varepsilon$}  {a, c}  
$B$  {a, c}  {c, EOF}  
$S$  {a, c, d}  {EOF, c}  
$B$ c  {a, c}    
$D$ $B$  {d, a, c}    
a b  {a}    
c $S$  {c}    
d  {d}    
$\varepsilon$  {$\varepsilon$}   
Here's how we filled in this much of the table:
Finish filling in the parse table given above.
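The table-building loop can be sketched directly in Python, which is also a handy way to check your answer to the exercise. In this sketch the FIRST and FOLLOW sets are hard-coded from the tables above (rather than computed), `"eps"` stands for $\varepsilon$, and `"$"` for EOF:

```python
# LL(1) table construction for S -> B c | D B ; B -> a b | c S ; D -> d | eps.
# FIRST sets (per right-hand side) and FOLLOW sets copied from the tables above.
FIRST = {
    ("B", "c"): {"a", "c"},
    ("D", "B"): {"d", "a", "c"},
    ("a", "b"): {"a"},
    ("c", "S"): {"c"},
    ("d",):     {"d"},
    ():         {"eps"},        # the epsilon production
}
FOLLOW = {"S": {"$", "c"}, "B": {"c", "$"}, "D": {"a", "c"}}
PRODUCTIONS = {
    "S": [("B", "c"), ("D", "B")],
    "B": [("a", "b"), ("c", "S")],
    "D": [("d",), ()],
}

def build_table():
    table = {}   # (nonterminal, terminal) -> list of right-hand sides
    for X, rhss in PRODUCTIONS.items():
        for alpha in rhss:
            first = FIRST[alpha]
            # put alpha in Table[X, t] for each terminal t in FIRST(alpha)
            for t in first - {"eps"}:
                table.setdefault((X, t), []).append(alpha)
            # if epsilon is in FIRST(alpha), also use FOLLOW(X)
            if "eps" in first:
                for t in FOLLOW[X]:
                    table.setdefault((X, t), []).append(alpha)
    return table

table = build_table()
conflicts = {cell for cell, rhss in table.items() if len(rhss) > 1}
```

Note that for this particular grammar the cells Table[$S$,a] and Table[$S$,c] each receive two right-hand sides, so the grammar is not LL(1).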
How to Code a Predictive Parser
Now, suppose we actually want to code a predictive parser for a grammar that is LL(1). The simplest idea is to use a table-driven parser with an explicit stack. Here's pseudocode for a table-driven predictive parser:
Stack.push(EOF);
Stack.push(start nonterminal);
currToken = scan();
while (! Stack.empty()) {
    topOfStack = Stack.pop();
    if (isNonTerm(topOfStack)) {
        // top-of-stack symbol is a nonterminal
        p = table[topOfStack, currToken];
        if (p is empty) {
            report syntax error and quit
        } else {
            // p is a production's right-hand side
            push p, one symbol at a time, from right to left
        }
    } else {
        // top-of-stack symbol is a terminal
        if (topOfStack == currToken) currToken = scan();
        else report syntax error and quit
    }
}
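As a concrete sketch, here is the same loop in Python, driving the balanced-parentheses grammar $S \longrightarrow \varepsilon \mid (S) \mid [S]$ used earlier. The parse table is built by hand (the $\varepsilon$ production fills the cells for FOLLOW($S$) = { ), ], EOF }), and `"$"` stands for EOF:

```python
# Table-driven predictive parser for S -> eps | ( S ) | [ S ].
TABLE = {
    ("S", "("): ["(", "S", ")"],
    ("S", "["): ["[", "S", "]"],
    ("S", ")"): [],          # S -> epsilon, since ) is in FOLLOW(S)
    ("S", "]"): [],          # S -> epsilon
    ("S", "$"): [],          # S -> epsilon
}

def parse(tokens):
    tokens = list(tokens) + ["$"]
    pos = 0
    stack = ["$", "S"]               # push EOF, then the start nonterminal
    while stack:
        top = stack.pop()
        if top == "S":               # top of stack is a nonterminal
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False         # empty table cell: syntax error
            stack.extend(reversed(rhs))   # push RHS right to left
        else:                        # top of stack is a terminal (or $)
            if top != tokens[pos]:
                return False         # mismatch: syntax error
            pos += 1
    return pos == len(tokens)
```

For example, `parse("([])")` succeeds while `parse("([)")` fails partway through, exactly as the hand trace of this grammar predicted.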
Here is a CFG (with rule numbers):
S > epsilon (1)
  | X Y Z (2)
X > epsilon (3)
  | X S (4)
Y > epsilon (5)
  | a Y b (6)
Z > c Z (7)
  | d (8)
Question 1(a): Compute the First and Follow Sets for each nonterminal.
Question 1(b): Draw the Parse Table.
Shift/Reduce Parsing
Contents
There are several different kinds of bottom-up parsing. We will discuss an approach called LR parsing, which includes SLR, LALR, and LR parsers. LR means that the input is scanned left-to-right, and that a rightmost derivation, in reverse, is constructed. SLR means "simple" LR, and LALR means "lookahead" LR.
Every SLR(1) grammar is also LALR(1), and every LALR(1) grammar is also LR(1), so SLR is the most limited of the three, and LR is the most general. In practice, it is pretty easy to write an LALR(1) grammar for most programming languages (i.e., the "power" of an LR parser isn't usually needed). A disadvantage of LR parsers is that their tables can be very large. Therefore, parser generators like Yacc and Java Cup produce LALR(1) parsers.
Let's start by considering the advantages and disadvantages of the LR parsing family:
Advantages
Recall that top-down parsers use a stack. The contents of the stack represent a prediction of what the rest of the input should look like. The symbols on the stack, from top to bottom, should "match" the remaining input, from first to last token. For example, earlier, we looked at the example grammar
Grammar:
$S$  $\longrightarrow$  $\varepsilon$ ( $S$ ) [ $S$ ] 
and parsed the input string ([]). At one point during the parse, after the first parenthesis has been consumed, the stack contains
[ S ] ) EOF

(with the top-of-stack at the left). This is a prediction that the remaining input will start with a '[', followed by zero or more tokens matching an S, followed by the tokens ']' and ')', in that order, followed by end-of-file.
Bottom-up parsers also use a stack, but in this case, the stack represents a summary of the input already seen, rather than a prediction about input yet to be seen. For now, we will pretend that the stack symbols are terminals and nonterminals (as they are for predictive parsers). This isn't quite true, but it makes our introduction to bottom-up parsing easier to understand.
A bottom-up parser is also called a "shift-reduce" parser because it performs two kinds of operations, shift operations and reduce operations. A shift operation simply shifts the next input token from the input to the top of the stack. A reduce operation is only possible when the top N symbols on the stack match the right-hand side of a production in the grammar. A reduce operation pops those symbols off the stack and pushes the nonterminal on the left-hand side of the production.
One way to think about LR parsing is that the parse tree for a given input is built, starting at the leaves and working up towards the root. More precisely, a reverse rightmost derivation is constructed.
Recall that a derivation (using a given grammar) is performed as follows:
A rightmost derivation is one in which the rightmost nonterminal is always the one chosen.
Rightmost Derivation Example

CFG
$E$  $\longrightarrow$  $E$ + $T$ 
  $T$  
$T$  $\longrightarrow$  $T$ * $F$ 
  $F$  
$F$  $\longrightarrow$  id 
  ( $E$ ) 
Rightmost derivation (for the input id + id * id; each step expands the rightmost nonterminal):

$E$ $\Rightarrow$ $E$ + $T$
    $\Rightarrow$ $E$ + $T$ * $F$
    $\Rightarrow$ $E$ + $T$ * id
    $\Rightarrow$ $E$ + $F$ * id
    $\Rightarrow$ $E$ + id * id
    $\Rightarrow$ $T$ + id * id
    $\Rightarrow$ $F$ + id * id
    $\Rightarrow$ id + id * id
In this example, the nonterminal that is chosen at each step is in red, and each derivation step is numbered. The corresponding bottomup parse is shown below by showing the parse tree with its edges numbered to show the order in which the tree was built (e.g., the first step was to add the nonterminal F as the parent of the leftmost parsetree leaf "id", and the last step was to combine the three subtrees representing "id", "+", and "id * id" as the children of a new root node E).
Note that both the rightmost derivation and the bottom-up parse have 8 steps. Step 1 of the derivation corresponds to step 8 of the parse; step 2 of the derivation corresponds to step 7 of the parse; etc. Each step of building the parse tree (adding a new nonterminal as the parent of some existing subtrees) is called a reduction (that's where the "reduce" part of "shift-reduce" parsing comes from).
Basic LR Parsing Algorithm
All LR parsers use the same basic algorithm: based on the current token and the state currently on top of the stack, decide whether to shift the token or to reduce by some grammar rule.
The difference between SLR, LALR, and LR parsers is in the tables that they use. Those tables use different techniques to determine when to do a reduce step, and, if there is more than one grammar rule with the same righthand side, which lefthandside nonterminal to push.
Here are the steps the parser would perform using the grammar of arithmetic expressions with + and * given above, if the input is
id + id * id
Stack  Input  Action  
id + id * id  shift (id)  
id  + id * id  reduce by $F$ $\longrightarrow$ id  
$F$  + id * id  reduce by $T$ $\longrightarrow$ $F$  
$T$  + id * id  reduce by $E$ $\longrightarrow$ $T$  
$E$  + id * id  shift(+)  
$E$ +  id * id  shift(id)  
$E$ + id  * id  reduce by $F$ $\longrightarrow$ id  
$E$ + $F$  * id  reduce by $T$ $\longrightarrow$ $F$  
$E$ + $T$  * id  shift(*)  
$E$ + $T$ *  id  shift(id)  
$E$ + $T$ * id  reduce by $F$ $\longrightarrow$ id  
$E$ + $T$ * $F$  reduce by $T$ $\longrightarrow$ $T$ * $F$  
$E$ + $T$  reduce by $E$ $\longrightarrow$ $E$ + $T$  
$E$  accept 
(NOTE: the top of stack is to the right; the reverse rightmost derivation is formed by concatenating the stack with the remaining input at each reduction step)
As mentioned above, the symbols pushed onto the parser's stack are not actually terminals and nonterminals. Instead, they are states that correspond to a finite-state machine representing the parsing process (more on this soon).
All LR Parsers use two tables: the action table and the goto table. The action table is indexed by the topofstack symbol and the current token, and it tells which of the four actions to perform: shift, reduce, accept, or reject. The goto table is used during a reduce action as explained below.
Above we said that a shift action means to push the current token onto the stack. In fact, we actually push a state symbol onto the stack. Each "shift" action in the action table includes the state to be pushed.
Above, we also said that when we reduce using the grammar rule $A \longrightarrow \alpha$, we pop $\alpha$ off the stack and then push $A$. In fact, if $\alpha$ contains N symbols, we pop N states off the stack. We then use the goto table to know what to push: the goto table is indexed by state symbol t and nonterminal $A$, where t is the state symbol that is on top of the stack after popping N times.
Here's pseudo code for the parser's main loop:
push initial state $s_0$
a = scan()
do forever
    t = top-of-stack (state) symbol
    switch action[t, a] {
        case shift s:
            push(s)
            a = scan()
        case reduce by A → alpha:
            for i = 1 to length(alpha) do pop() end
            t = top-of-stack symbol
            push(goto[t, A])
        case accept:
            return( SUCCESS )
        case error:
            call the error handler
            return( FAILURE )
    }
end do
Remember, all LR parsers use this same basic algorithm. As mentioned above, for all LR parsers, the states that are pushed onto the stack represent the states in an underlying finite-state machine. Each state represents "where we might be" in a parse; each "place we might be" is represented (within the state) by an item. What's different for the different LR parsers is:
SyntaxDirected Translation (SDT)
Our journey through syntactic analysis has, thus far, gone as follows: we first showed how to define the parse tree, then explored how to actually build it. Likewise, we can define the AST through a syntax-directed definition, so we now turn to how to actually build the AST using the two parsing strategies (top-down and bottom-up) that we presented. The actual task of producing the AST according to the syntax-directed definition is known as syntax-directed translation (SDT).
Now we consider how to implement a syntax-directed translation using a predictive parser. It is not obvious how to do this, since the predictive parser works by building the parse tree top-down, while the syntax-directed translation needs to be computed bottom-up (since the translation attached to a LHS symbol depends on the translations attached to the RHS symbols). Of course, we could design the parser to actually build the parse tree (top-down), then use the translation rules to build the translation (bottom-up). However, that would not be very efficient.
Instead, we avoid explicitly building the parse tree by giving the parser a second stack called the semantic stack:
So what actually happens is that the action for a grammar rule $X \longrightarrow Y_1 \; Y_2 \; \ldots \; Y_n$ is pushed onto the (normal) stack when the derivation step $X \longrightarrow Y_1 Y_2 \ldots Y_n$ is made, but the action is not actually performed until complete derivations for all of the $Y$s have been carried out.
For example, consider the following syntaxdirected translation for the language of balanced parentheses and square brackets. The translation of a string in the language is the number of parenthesis pairs in the string.
CFG  Translation Rules  
$Exp$  $\longrightarrow$  $\varepsilon$  $Exp$.trans = 0  
  ( $Exp$ )  $Exp_1$.trans = $Exp_2$.trans + 1  
  [ $Exp$ ]  $Exp_1$.trans = $Exp_2$.trans 
The first step is to replace the translation rules with translation actions. Each action must pop all right-hand-side nonterminals' translations from the semantic stack, then compute and push the left-hand-side nonterminal's translation:
CFG  Translation Actions  
$Exp$  $\longrightarrow$  $\varepsilon$  push 0;  
  ( $Exp$ )  exp2trans = pop() ; push(exp2trans + 1)  
  [ $Exp$ ]  exp2trans = pop() ; push(exp2trans) 
Next, each action is represented by a unique action number, and those action numbers become part of the grammar rules:
CFG with Embedded Actions  
$Exp$  $\longrightarrow$  $\varepsilon$ #1 
  ( $Exp$ ) #2  
  [ $Exp$ ] #3 
where
#1 is  push 0; 
#2 is  exp2trans = pop() ; push(exp2trans + 1) 
#3 is  exp2trans = pop() ; push(exp2trans) 
Note that since action #3 just pushes exactly what is popped, that action is redundant, and it is not necessary to have any action associated with the third grammar rule. Here's a picture that illustrates what happens when the input "([])" is parsed (assuming that we have removed action #3):
Input so far  Stack  Semantic Stack  Action  
(  $Exp$  pop, push ( exp ) #2  
(  ( $Exp$ ) 2  pop, scan  
([  $Exp$ ) 2  pop, push "[ exp ]"  
([  [ $Exp$ ] ) 2  pop, scan  
([]  $Exp$ ] ) 2  pop, push $\varepsilon$ 1  
([]  1 ] ) 2  pop, do action 1  
([]  ] ) 2  0  pop, scan  
([]  ) 2  0  pop, scan  
([])  2  0  pop, do action 2  
([])  1  pop, scan  
([])  empty stack, accept 
translation of input = 1
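The trace above can be reproduced by extending the predictive parser with a semantic stack. In this Python sketch, action numbers are pushed onto the (normal) stack as strings (`"#1"`, `"#2"`) alongside grammar symbols, and performed when popped; action #3 is omitted as redundant, as in the notes, and `"$"` stands for EOF:

```python
# Predictive parser with a semantic stack for
# Exp -> eps #1 | ( Exp ) #2 | [ Exp ]  (translation = number of paren pairs).
TABLE = {
    ("Exp", "("): ["(", "Exp", ")", "#2"],
    ("Exp", "["): ["[", "Exp", "]"],
    ("Exp", ")"): ["#1"],    # Exp -> epsilon; ) is in FOLLOW(Exp)
    ("Exp", "]"): ["#1"],
    ("Exp", "$"): ["#1"],
}

def parse_and_translate(tokens):
    tokens = list(tokens) + ["$"]
    pos = 0
    stack = ["$", "Exp"]
    sem = []                         # the semantic stack
    while stack:
        top = stack.pop()
        if top == "#1":              # action for Exp -> eps: push 0
            sem.append(0)
        elif top == "#2":            # action for ( Exp ): pop, add 1, push
            sem.append(sem.pop() + 1)
        elif top == "Exp":           # nonterminal: expand via the table
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                raise SyntaxError(tokens[pos])
            stack.extend(reversed(rhs))
        else:                        # terminal: must match the current token
            if top != tokens[pos]:
                raise SyntaxError(tokens[pos])
            pos += 1
    return sem.pop()                 # translation of the whole input
```

For example, `parse_and_translate("([])")` returns 1, matching the trace above.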
In the example above, there is no grammar rule with more than one nonterminal on the right-hand side. If there were, the translation action for that rule would have to do one pop for each right-hand-side nonterminal. For example, suppose we are using a grammar that includes the rule:
CFG Rule: methodBody > { varDecls stmts }
Translation Rule: methodBody.trans = varDecls.trans + stmts.trans
Translation Action: stmtsTrans = pop(); declsTrans = pop(); push( stmtsTrans + declsTrans );
CFG rule with Action: methodBody > { varDecls stmts } #1
#1: stmtsTrans = pop(); declsTrans = pop(); push( stmtsTrans + declsTrans );
Note that the right-hand-side nonterminals' translations are popped from the semantic stack right-to-left. That is because the predictive parser does a leftmost derivation, so the varDecls nonterminal gets "expanded" first; i.e., its parse tree is created before the parse tree for the stmts nonterminal. This means that the actions that create the translation of the varDecls nonterminal are performed first, and thus its translation is pushed onto the semantic stack first.
Another issue that has not been illustrated yet arises when a left-hand-side nonterminal's translation depends on the value of a right-hand-side terminal. In that case, it is important to put the action number before that terminal symbol when incorporating actions into grammar rules. This is because a terminal symbol's value is available during the parse only when it is the "current token". For example, if the translation of an arithmetic expression is the value of the expression:
CFG Rule:  Factor $\longrightarrow$ intlit 
Translation Rule:  factor.trans = intlit.value 
Translation Action:  push ( intlit.value ) 
CFG rule with Action:  Factor $\longrightarrow$ #1 intlit  (action BEFORE the terminal)
#1: push( currToken.value )
For the following grammar, give (a) translation rules, (b) translation actions with numbers, and (c) a CFG with action numbers, so that the translation of an input expression is the value of the expression. Do not worry about the fact that the grammar is not LL(1).
Exp  $\longrightarrow$  Exp plus Term 
  Exp minus Term  
  Term  
Term  $\longrightarrow$  Term times Factor 
  Term divide Factor  
  Factor  
Factor  $\longrightarrow$  intlit 
  lparen Exp rparen 
Handling NonLL(1) Grammars
Recall that a non-LL(1) grammar must be transformed to an equivalent LL(1) grammar if it is to be parsed using a predictive parser. Recall also that the transformed grammar usually does not reflect the underlying structure the way the original grammar did. For example, when left recursion is removed from the grammar for arithmetic expressions, we get grammar rules like this:
CFG  
$Exp$  $\longrightarrow$  $Term$ $Exp'$ 
$Exp'$  $\longrightarrow$  $\varepsilon$ 
  + $Term$ $Exp'$ 
For example:

Non-LL(1) Grammar Rules With Actions
$Exp$  $\longrightarrow$  $Exp$ + $Term$ #1 
  $Term$  
$Term$  $\longrightarrow$  $Term$ * $Factor$ #2 
  $\mathit{Factor}$ 
#1 is  TTrans = pop() ; ETrans = pop() ; push(ETrans + TTrans); 
#2 is  FTrans = pop() ; TTrans = pop() ; push(TTrans * FTrans); 
LL(1) Grammar Rules With Actions
$Exp$  $\longrightarrow$  $Term$ $Exp'$ 
$Exp'$  $\longrightarrow$  + $Term$ #1 $Exp'$ 
  $\varepsilon$  
$\mathit{Term}$  $\longrightarrow$  $\mathit{Factor}$ $\mathit{Term}'$ 
$\mathit{Term}'$  $\longrightarrow$  * $\mathit{Factor}$ #2 $\mathit{Term}'$ 
  $\varepsilon$ 
Transform the grammar rules with actions that you wrote for the "Test Yourself #1" exercise to LL(1) form. Trace the actions of the predictive parser on the input 2 + 3 * 4.
In contrast to top-down SDT, performing SDT bottom-up is fairly straightforward, since the RHS symbols' translations are naturally available at the time the production is reduced. Consequently, we will not cover the implementation in very much depth. It is sufficient to simply note that when a grammar production has been reduced, we can simply fire the corresponding rule from the SDD, referencing each RHS symbol's translation according to its position in the production.
A syntax-directed translation is used to define the translation of a sequence of tokens to some other value, based on a CFG for the input. A syntax-directed translation is defined by associating a translation rule with each grammar rule. A translation rule defines the translation of the left-hand-side nonterminal as a function of the right-hand-side nonterminals' translations, and the values of the right-hand-side terminals. To compute the translation of a string, build the parse tree, and use the translation rules to compute the translation of each nonterminal in the tree, bottom-up; the translation of the string is the translation of the root nonterminal.
There is no restriction on the type of a translation; it can be a simple type like an integer, or a complex type like an abstract-syntax tree.
To implement a syntax-directed translation using a predictive parser, the translation rules are converted to actions that manipulate the parser's semantic stack. Each action must pop all right-hand-side nonterminals' translations from the semantic stack, then compute and push the left-hand-side nonterminal's translation. Next, the actions are incorporated (as action numbers) into the grammar rules. Finally, the grammar is converted to LL(1) form (treating the action numbers just like terminal or nonterminal symbols).
Symbol Tables
Contents
The parser ensures that the input program is syntactically correct, but there are other kinds of correctness that it cannot (or usually does not) enforce. For example:
The next phase of the compiler after the parser, sometimes called the static semantic analyzer, is in charge of checking for these kinds of errors. The checks can be done in two phases, each of which involves traversing the abstract-syntax tree created by the parser:
The purpose of the symbol table is to keep track of names declared in the program. This includes names of classes, fields, methods, and variables. Each symbol table entry associates a set of attributes with one name; for example:
A language's scoping rules tell you when you're allowed to reuse a name, and how to match uses of a name to the corresponding declaration. In most languages, the same name can be declared multiple times under certain circumstances. In C++ you can use the same name in more than one declaration if the declarations involve different kinds of names. For example, you can use the same name for a class, a field of the class, and a local variable of a member function (this is not recommended, but it is legal):
class Test {
  private:
    int Test;
    void func() {
      double Test; // Could have been declared int
    }
};
In C++ functions (and Java methods), you can also use the same name more than once in a scope as long as the number and/or types of parameters are unique (this is called overloading).
In C and C++, but not in Java, you can declare variables with the same name in different blocks. A block is a piece of code inside curly braces; for example, in an if or a loop. The following is a legal C or C++ function, but not a legal Java method:
void f(int k) {
  int x = 0;         /* x is declared here */
  while (...) {
    int x = 1;       /* another x is declared here */
    ...
    if (...) {
      float x = 5.5; /* and yet another x is declared here! */
      ...
    }
  }
}

As mentioned above, the scoping rules of a language determine which declaration of a named object corresponds to each use. C, C++, and Java use what is called static scoping; that means that the correspondence between uses and declarations is made at compile time. C and C++ use the "most closely nested" rule to match nested declarations to their uses: a use of variable x matches the declaration in the most closely enclosing scope such that the declaration precedes the use. In C and C++, there is one, outermost scope that includes the names of the global variables (the variables that are declared outside the functions) and the names of the functions that are not part of any class. Each function has one or more scopes. Both C and C++ have one scope for the parameters and the "top-level" declarations, plus one for each block in the function (delimited by curly braces). In addition, C++ has a scope for each for loop: in C++ (but not in C) you can declare variables in the for-loop header.
In the example given above, the outermost scope includes just the name "f", and function f itself has three scopes:
Question 1: Consider the names declared in the following code. For each, determine whether it is legal according to the rules used in Java.
class animal {
  // methods
  void attack(int animal) {
    for (int animal=0; animal<10; animal++) {
      int attack;
    }
  }

  int attack(int x) {
    for (int attack=0; attack<10; attack++) {
      int animal;
    }
  }

  void animal() { }

  // fields
  double attack;
  int attack;
  int animal;
}
Question 2: Consider the following C++ code. For each use of a name, determine which declaration it corresponds to (or whether it is a use of an undeclared name).
int k=10, x=20;

void foo(int k) {
  int a = x;
  int x = k;
  int b = x;
  while (...) {
    int x;
    if (x == k) {
      int k, y;
      k = y = x;
    }
    if (x == k) {
      int x = y;
    }
  }
}
Not all languages use static scoping. Lisp, APL, Perl (depending on how a variable is declared), and Snobol use what is called dynamic scoping. A use of a variable that has no corresponding declaration in the same function corresponds to the declaration in the most-recently-called still-active function. For example, consider the following code:
void main() { f1(); f2(); }
void f1() { int x = 10; g(); }
void f2() { String x = "hello"; f3(); g(); }
void f3() { double x = 30.5; }
void g() { print(x); }

Under dynamic scoping this program outputs "10 hello". The first call to g comes from f1, whose copy of x has value 10. The next call to g comes from f2. Although f3 is called by f2 before it calls g, the call to f3 is not active when g is called; therefore, the use of x in g matches the declaration in f2, and "hello" is printed.
Assuming that dynamic scoping is used, what is output by the following program?
void main() { int x = 0; f1(); g(); f2(); }
void f1() { int x = 10; g(); }
void f2() { int x = 20; f1(); g(); }
void g() { print(x); }
It is generally agreed that dynamic scoping is a bad idea; it can make a program very difficult to understand, because a single use of a variable can correspond to many different declarations (with different types)! The languages that use dynamic scoping are all old languages; recently designed languages all use static scoping.
Another issue that is handled differently by different languages is whether names can be used before they are defined. For example, in Java, a method or field name can be used before the definition appears, but this is not true for a variable:
class Test {
  void f() {
    val = 0;  // field val has not yet been declared -- OK
    g();      // method g has not yet been declared -- OK
    x = 1;    // variable x has not yet been declared -- ERROR!
    int x;
  }
  void g() {}
  int val;
}
In what follows, we will assume that we are dealing with a language that:
In addition to the assumptions listed at the end of the previous section, we will assume that:
 Look up a name in the current scope only (to check if it is multiply declared).
 Look up a name in the current and enclosing scopes (to check for a use of an undeclared name, and to link a use with the corresponding symboltable entry).
 Insert a new name into the symbol table with its attributes.
 Do what must be done when a new scope is entered.
 Do what must be done when a scope is exited.
 the symbol name
 its type
 the nesting level of its declaration
Method 1: List of Hashtables
The idea behind this approach is that the symbol table consists of a list of hashtables, one for each currently visible scope. When processing a scope S, the structure of the symbol table is:
void f(int a, int b) {
  double x;
  while (...) {
    int x, y;
    ...
  }
}

void g() { f(); }

After processing the declarations inside the while loop, the symbol table looks like this:
Here are the operations that need to be performed on scope entry/exit, and to process a declaration/use:
 On scope entry: increment the current level number and add a new empty hashtable to the front of the list.
 To process a declaration of x: look up x in the first table in the list. If it is there, then issue a "multiply declared variable" error; otherwise, add x to the first table in the list.
 To process a use of x: look up x starting in the first table in the list; if it is not there, then look up x in each successive table in the list. If it is not in any table then issue an "undeclared variable" error.
 On scope exit, remove the first table from the list and decrement the current level number.
There are several factors involved in the time required for each operation:
 Scope entry: time to initialize a new, empty hashtable; this is probably proportional to the size of the hashtable.
 Process a declaration: using hashing, constant expected time (O(1)).
 Process a use: using hashing to do the lookup in each table in the list, the worstcase time is O(depth of nesting), when every table in the list must be examined.
 Scope exit: time to remove a table from the list, which should be O(1) if garbage collection is ignored.
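The four operations above can be sketched as a small Python class (a minimal sketch: each "hashtable" is a Python dict, and an entry stores just the type and nesting level assumed earlier; real compilers would store more attributes):

```python
# Method 1: the symbol table is a list of hashtables, one per open scope,
# with the innermost scope at the front of the list.
class SymTab:
    def __init__(self):
        self.scopes = []    # front of list = table for the current scope
        self.level = 0      # current nesting level

    def enter_scope(self):
        self.level += 1
        self.scopes.insert(0, {})       # new empty table at the front

    def exit_scope(self):
        self.scopes.pop(0)              # discard the innermost table
        self.level -= 1

    def declare(self, name, typ):
        # look up name in the FIRST table only (multiply-declared check)
        if name in self.scopes[0]:
            raise Exception("multiply declared variable: " + name)
        self.scopes[0][name] = (typ, self.level)

    def lookup(self, name):
        # look in each table, innermost to outermost (undeclared check)
        for table in self.scopes:
            if name in table:
                return table[name]
        raise Exception("undeclared variable: " + name)
```

Note that `lookup` walks the list, so its worst case is proportional to the depth of nesting, matching the cost analysis above.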
For all three questions below, assume that the symbol table is implemented using a list of hashtables.
Question 1: Recall that Java does not allow the same name to be used for a local variable of a method, and for another local variable declared inside a nested scope in the method body. Even with this restriction, it is not a good idea to put all of a method's local variables (whether they are declared at the beginning of the method, or in some nested scope within the method body) in the same table. Why not?
Question 2: C++ does not use exactly the scoping rules that we have been assuming. In particular, C++ does allow a function to have both a parameter and a local variable with the same name (and any uses of the name refer to the local variable).
Consider the following code. Draw the symbol table as it would be after processing the declarations in the body of f under:
 the scoping rules we have been assuming
 C++ scoping rules
void g(int x, int a) { }

void f(int x, int y, int z) {
  int a, b, x;
  ...
}
Question 3: Assume that a symboltable entry includes the "kind" of the declared name as well as the other attributes assumed above (if the same name is declared as two different "kinds" in one scope, there would be one entry with a list of "kinds"). Also assume that we can tell (from context), for each use of a name, what "kind" of name it is supposed to be.
Which of the four operations (scope entry, process a declaration, process a use, scope exit) described above would change (and how would it change) if Java rules for name reuse were used instead of C++ rules (i.e., if the same name can be used within one scope as long as the uses are for different kinds of names, and if the same name cannot be used for more than one variable declaration in nested scopes)?
Method 2: Hashtable of Lists
The idea behind this approach is that when processing a scope S, the structure of the symbol table is:
There is just one big hashtable, containing an entry for each variable for which there is some declaration in scope S or in a scope that encloses S. Associated with each variable is a list of symboltable entries. The first list item corresponds to the most closely enclosing declaration; the other list items correspond to declarations in enclosing scopes.
For example, given this code:
void f(int a) {
  double x;
  while (...) {
    int x, y;
    ...
  }
}

void g() { f(); }

After processing the declarations inside the while loop, the symbol table looks like this:
Note that the levelnumber attribute stored in each list item enables us to determine whether the most closely enclosing declaration was made in the current scope or in an enclosing scope.
Here are the operations that need to be performed on scope entry/exit, and to process a declaration/use:
 On scope entry: increment the current level number.
 To process a declaration of x: look up x in the symbol table. If x is there, fetch the level number from the first list item. If that level number = the current level then issue a "multiply declared variable" error; otherwise, add a new item to the front of the list with the appropriate type and the current level number.
 To process a use of x: look up x in the symbol table. If it is not there, then issue an "undeclared variable" error.
 On scope exit, scan all entries in the symbol table, looking at the first item on each list. If that item's level number = the current level number, then remove it from its list (and if the list becomes empty, remove the entire symboltable entry). Finally, decrement the current level number.
 Scope entry: time to increment the level number, O(1).
 Process a declaration: using hashing, constant expected time (O(1)).
 Process a use: using hashing, constant expected time (O(1)).
 Scope exit: time proportional to the number of names in the symbol table (or perhaps even the size of the hashtable if no auxiliary information is maintained to allow iteration through the nonempty hashtable buckets).
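These operations can also be sketched in Python (same caveats as the earlier sketch: dicts stand in for the hashtable, and entries hold just a type and level):

```python
# Method 2: one big hashtable mapping each name to a list of entries,
# innermost declaration first; levels make scope entry/exit cheap.
class SymTab2:
    def __init__(self):
        self.table = {}     # name -> list of (type, level), innermost first
        self.level = 0

    def enter_scope(self):
        self.level += 1     # nothing else to do on scope entry

    def declare(self, name, typ):
        entries = self.table.setdefault(name, [])
        # an entry at the current level means a multiply-declared variable
        if entries and entries[0][1] == self.level:
            raise Exception("multiply declared variable: " + name)
        entries.insert(0, (typ, self.level))

    def lookup(self, name):
        if name not in self.table:
            raise Exception("undeclared variable: " + name)
        return self.table[name][0]      # most closely enclosing declaration

    def exit_scope(self):
        # scan all entries, removing those declared at the current level
        for name in list(self.table):
            entries = self.table[name]
            if entries and entries[0][1] == self.level:
                entries.pop(0)
                if not entries:
                    del self.table[name]
        self.level -= 1
```

Here `lookup` is a single hashtable probe (constant expected time), while `exit_scope` scans the whole table, matching the cost analysis above.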
Assume that the symbol table is implemented using a hashtable of lists. Draw pictures to show how the symbol table changes as the declarations in each scope in the following code are processed.
void g(int x, int a) {
  double d;
  while (...) {
    int d, w;
    double x, b;
    if (...) {
      int a, b, c;
    }
  }
  while (...) {
    int x, y, z;
  }
}