Table of Contents
- Overview
- Scanning
- Defining Syntax
- Syntax for Programming Languages
- Syntax-Directed Definition
- Abstract-Syntax Trees
- Parsing
- CYK Parsing
- LL(1) Parsing
- Shift/Reduce Parsing
- Syntax-Directed Translation (SDT)
- Symbol Tables
- Semantic Analysis
- Dataflow
- Dataflow Solving
- Types
- Runtime Environments
- Variable Access
- Parameter Passing
- Code Generation
- Welcome
- Introduction
- The Scanner
- The Parser
- The Semantic Analyzer
- The Intermediate Code Generator
- The Optimizer
- The Code Generator
- A recognizer of some source language S.
- A translator (of programs written in S into programs written in some object or target language T).
- A program written in some host language H
- The scanner groups the characters into lexemes (sequences of characters that "go together").
- Each lexeme corresponds to a token; the scanner forwards each token (plus maybe some additional information, such as the position within the input file at which the token was found) to the parser.
- Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
- Finds syntax errors. For example, the source code
5 position = ;
would be tokenized as
intlit id assign semicolon
All are legal tokens, but that sequence of tokens is erroneous.
- Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
- Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.
- The interior nodes of the tree are operators.
- A node's children are its operands.
- Each subtree forms a "logical unit", e.g., the subtree with * at its root shows that because multiplication has higher precedence than addition, this operation must be performed as a unit (not initial+rate).
- The assembler transforms an assembly code file (such as one produced by the compiler) into a relocatable, which is either an object file (typically given the .o extension) or a library (typically given the .so or .a extension, for a dynamically-linked or statically-linked library, respectively).
- The linker merges the various object files and static libraries output by the assembler into a single executable.
- Overview
- Finite-State Machines
- Regular Expressions
- Using Regular Expressions and Finite-State Machines to Define a Scanner
- A compiler recognizes legal programs in some (source) language.
- A finite-state machine recognizes legal strings in some language.
- Nodes are states.
- Edges (arrows) are transitions. Each edge should be labeled with a single character. In this example, we've used a single edge labeled "letter" to stand for 52 edges labeled 'a', 'b', $\ldots$, 'z', 'A', $\ldots$, 'Z'. (Similarly, the label "letter,digit" stands for 62 edges labeled 'a',...'Z','0',...'9'.)
- S is the start state; every FSM has exactly one (a standard convention is to label the start state "S").
- A is a final state. By convention, final states are drawn using a double circle, and non-final states are drawn using single circles. A FSM may have more than one final state.
- The FSM starts in its start state.
- If there is an edge out of the current state whose label matches the current input character, then the FSM moves to the state pointed to by that edge, and "consumes" that character; otherwise, it gets stuck.
- The finite-state machine stops when it gets stuck or when it has consumed all of the input characters.
- The entire string is consumed (the machine did not get stuck), and
- the machine ends in a final state.
x
tmp2
XyZzy
position27
123
a?
13apples
- a sequence of one or more letters and/or digits,
- followed by an at-sign,
- followed by one or more letters,
- followed by zero or more extensions.
- An extension is a dot followed by one or more letters.
- $Q$ is a finite set of states ($\{\mathrm{S,A,B}\}$ in the above example).
- $\Sigma$ (an uppercase sigma) is the alphabet of the machine, a finite set of characters that label the edges ($\{+,-,0,1,...,9\}$ in the above example).
- $q$ is the start state, an element of $Q$ ($\mathrm{S}$ in the above example).
- $F$ is the set of final states, a subset of $Q$ ({B} in the above example).
- $\delta$ is the state transition function: $\delta: Q \times \Sigma \rightarrow Q$ (i.e., it is a function that takes two arguments -- a state in $Q$ and a character in $\Sigma$ -- and returns a state in $Q$).
- Deterministic:
- No state has more than one outgoing edge with the same label.
- Non-Deterministic:
- States may have more than one outgoing edge with same label.
- Edges may be labeled with $\varepsilon$ (epsilon), the empty string. The FSM can take an $\varepsilon$-transition without looking at the current input character.
- Have a variable named state, initialized to S (the start state).
- Repeat:
- read the next character from the input
- use the table to assign a new value to the state variable
- $\mathrm{digit} \; | \; \mathrm{letter} \; \mathrm{letter}$
- $\mathrm{digit} \; | \; \mathrm{letter} \; \mathrm{letter}$*
- $\mathrm{digit} \; | \; \mathrm{letter}$*
- ID + ID
- ID * ID
- ID == ID
- The scanner sometimes needs to look one or more characters
beyond
the last character of the current token, and then needs to "put
back" those characters so that the next time the scanner is called
it will have the correct current character.
For example, when scanning a program written in the simple
assignment-statement language defined above, if the input is
"==", the scanner should return the EQUALS token, not two ASSIGN
tokens. So if the current character is "=", the scanner must
look at the next character to see whether it is another "="
(in which case it will return EQUALS), or is some other character
(in which case it will put that character back and return ASSIGN).
- It is no longer correct to run the FSM program until the machine gets stuck or end-of-input is reached, since in general the input will correspond to many tokens, not just a single token.
- modify the machines so that a state can have an associated action to "put back N characters" and/or to "return token XXX",
- we must combine the finite-state machines for all of the tokens into a single machine, and
- we must write a program for the "combined" machine.
- The table will include a column for end-of-file as well as for all possible characters (the end-of-file column is needed, for example, so that the scanner works correctly when an identifier is the last token in the input).
- Each table entry may include an action as well as or instead of a new state.
- Instead of repeating "read a character; update the state variable" until the machine gets stuck or the entire input is read, the code will repeat: "read a character; perform the action and/or update the state variable" (eventually, the action will be to return a value, so the scanner code will stop, and will start again in the start state next time it is called).
- Overview
- Example: Simple Arithmetic Expressions
- Formal Definition
- Example: Boolean Expressions, Assignment Statements, and If Statements
- Self-Study #1
- The Language Defined by a CFG
- Groups the tokens into "grammatical phrases".
- Discovers the underlying structure of the program.
- Finds syntax errors.
- Perhaps also performs some actions to find other kinds of errors.
- an abstract-syntax tree (maybe + a symbol table),
- or intermediate code,
- or object code.
- An integer is an arithmetic expression.
- If exp1 and exp2 are arithmetic expressions,
then so are the following:
- exp1 - exp2
- exp1 / exp2
- ( exp1 )
- The grammar has five terminal symbols: intliteral minus divide lparen rparen. The terminals of a grammar used to define a programming language are the tokens returned by the scanner.
- The grammar has one nonterminal: Exp (note that a single name, Exp, is used instead of Exp1 and Exp2 as in the English definition above).
- The grammar has four productions or rules,
each of the form:
Exp $\longrightarrow$ ...
A production left-hand side is a single nonterminal. A production right-hand side is either the special symbol $\varepsilon$ (the same $\varepsilon$ that can be used in a regular expression) or a sequence of one or more terminals and/or nonterminals (there is no rule with $\varepsilon$ on the right-hand side in the example given above).
- $N$ is a set of nonterminals.
- $\Sigma$ is a set of terminals.
- $P$ is a set of productions (or rules).
- $S$ is the start nonterminal (sometimes called the goal nonterminal) in $N$. If not specified, then it is the nonterminal that appears on the left-hand side of the first production.
- "true" is a boolean expression, recognized by the token true.
- "false" is a boolean expression, recognized by the token false.
- If exp1 and exp2 are boolean expressions, then so are the following:
- exp1 || exp2
- exp1 && exp2
- ! exp1
- ( exp1 )
- The word "if", followed by a boolean expression in parentheses, followed by a statement, or
- The word "if", followed by a boolean expression in parentheses, followed by a statement, followed by the word "else", followed by a statement.
- Start by setting the "current sequence" to be the start nonterminal.
- Repeat:
- find a nonterminal $X$ in the current sequence;
- find a production in the grammar with $X$ on the left (i.e., of the form $X$ → $\alpha$, where $\alpha$ is either $\varepsilon$ (the empty string) or a sequence of terminals and/or nonterminals);
- Create a new "current sequence" in which $\alpha$ replaces the $X$ found above;
- $\Longrightarrow$ means derives in one step
- $\stackrel{+}\Longrightarrow$ means derives in one or more steps
- $\stackrel{*}\Longrightarrow$ means derives in zero or more steps
- $S$ is the start nonterminal of $G$
- $w$ is a sequence of terminals or $\varepsilon$
- Start with the start nonterminal.
- Repeat:
- choose a leaf nonterminal $X$
- choose a production $X \longrightarrow \alpha$
- the symbols in $\alpha$ become the children of $X$ in the tree
- Overview
- Ambiguous Grammars
- Disambiguating Grammars
- List Grammars
- A Grammar for a Programming Language
- Summary
- more than one leftmost derivation of $w$ or,
- more than one rightmost derivation of $w$, or
- more than one parse tree for $w$
- A grammar is recursive in nonterminal $X$ if: $$X \stackrel{+}{\Longrightarrow} \ldots X \ldots$$ (in one or more steps, $X$ derives a sequence of symbols that includes an $X$).
- A grammar is left recursive in $X$ if: $$X \stackrel{+}{\Longrightarrow} X \ldots$$ (in one or more steps, $X$ derives a sequence of symbols that starts with an $X$).
- A grammar is right recursive in $X$ if: $$X \stackrel{+}{\Longrightarrow} \ldots X$$ (in one or more steps, $X$ derives a sequence of symbols that ends with an $X$).
- For left associativity, use left recursion.
- For right associativity, use right recursion.
- One or more PLUSes (without any separator or terminator).
(Remember, any of the following three grammars defines
this language; you don't need all three lines).
- xList $\longrightarrow$ PLUS | xList xList
- xList $\longrightarrow$ PLUS | xList PLUS
- xList $\longrightarrow$ PLUS | PLUS xList
- One or more runs of one or more PLUSes, each run separated by commas:
- xList $\longrightarrow$ PLUS | xList COMMA xList
- xList $\longrightarrow$ PLUS | xList COMMA PLUS
- xList $\longrightarrow$ PLUS | PLUS COMMA xList
- One or more PLUSes, each PLUS terminated by a semi-colon:
- xList $\longrightarrow$ PLUS SEMICOLON | xList xList
- xList $\longrightarrow$ PLUS SEMICOLON | xList PLUS SEMICOLON
- xList $\longrightarrow$ PLUS SEMICOLON | PLUS SEMICOLON xList
- Zero or more PLUSes (without any separator or terminator):
- xList $\longrightarrow$ $\varepsilon$ | PLUS | xList xList
- xList $\longrightarrow$ $\varepsilon$ | PLUS | xList PLUS
- xList $\longrightarrow$ $\varepsilon$ | PLUS | PLUS xList
- Zero or more PLUSes, each PLUS terminated by a semi-colon:
- xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList xList
- xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | xList PLUS SEMICOLON
- xList $\longrightarrow$ $\varepsilon$ | PLUS SEMICOLON | PLUS SEMICOLON xList
-
The trickiest kind of list is a list of
zero or more x's, separated by commas.
To get it right, think of the definition as follows:
-
Either an empty list, or a non-empty list of x's separated
by commas.
xList $\longrightarrow$ $\varepsilon$ | nonemptyList
nonemptyList $\longrightarrow$ PLUS | PLUS COMMA nonemptyList
- A program must be a list of functions (and nothing else); no statements may appear outside of a function (similar to C), nor may any variable declarations appear outside of a function (unlike C, which allows global variable declarations outside of functions).
- A program must contain at least one function, as the FunctionList cannot immediately derive $\varepsilon$.
- terminal symbol
- nonterminal symbol
- grammar rule (or production)
- derivation (leftmost derivation, rightmost derivation)
- parse (or derivation) tree
- the language defined by a grammar
- ambiguous grammar
- Motivation and Definition
- Example 1: Value of an Arithmetic Expression
- Example 2: Type of an Expression
- Test Yourself #1
- constants
- the right-hand-side nonterminals' translations
- the right-hand-side tokens' values (e.g., the integer value associated with an INTLITERAL token, or the String value associated with an ID token)
- Build the parse tree.
- Use the translation rules to compute the translation of each nonterminal in the tree, working bottom up (since a nonterminal's value may depend on the value of the symbols on the right-hand side, you need to work bottom-up so that those values are available).
- Both operands of the + operator must be of type $int$.
- Both operands of the && operator must be of type $bool$.
- Both operands of the == operator must have the same (non-$error$) type.
- Operators appear at internal nodes instead of at leaves.
- "Chains" of single productions are collapsed.
- Lists are "flattened".
- Syntactic details (e.g., parentheses, commas, semi-colons) are omitted.
- The grouping symbols are necessary to build the particular parse tree in this example (since otherwise the higher precedence of multiplication would have grouped the 3 with the 4), and the grouping symbols must be kept as leaves of the parse tree in order to fulfill the property that reading the frontier of the parse tree yields the complete token stream. However, the grouping is implicit in the AST via the parent-child relationship of AST nodes.
-
Because the abstract syntax tree omits syntactic details, many parse trees
may yield the same abstract syntax tree. For example, the following input strings would have
different parse trees but the same AST as the one above:
(3 * (4 + 2))
(3) * ((4) + (2))
3 * ((4 + 2))
- More general than LL(1) grammars (every LL(1) grammar is also LALR(1) but not vice versa).
- The class of grammars accepted by the parser generators yacc, bison, and Java Cup.
- Parsed by bottom-up parsers (they construct the derivation tree by going from terminals up to nonterminals, then attaching several nonterminals together to form a new nonterminal, etc., ending with just the start nonterminal).
- Harder to understand than LL(1) parsers (i.e., it is hard to understand how the LALR(1) parser works, and what makes a grammar LALR(1)).
- A general, bottom-up parsing algorithm called the CYK algorithm.
- A technique for parsing LL(1) grammars top-down, called predictive parsing.
- A technique for bottom-up parsing that works for LALR(1) grammars, but since that is rather complicated, we will study a version of the algorithm that only works for a subset of the LALR(1) grammars called SLR(1) grammars.
Overview
Hello, and welcome to the course readings for Drew Davidson's EECS 665, "Compiler Construction". This set of course notes has been shaped by me (Drew) in order to serve as a written companion to the course lectures and discussions offered as part of EECS665. As such, it's written pretty informally, and will often draw on my personal observations and experience. These readings are not intended to serve as a standalone reference for compiler topics or as a guide for building a compiler. Thus, I'll start these readings with a disclaimer exactly opposite to that of most textbooks: they are not for a general audience, and they are intended to be read straight through from front to back.
Some additional disclaimers: In the interest of brevity, these readings will occasionally adopt a definition and stick to it without fully exploring alternative definitions, or considering all implications of the definition. If you feel that a superior definition exists, please do let me know. Furthermore, the readings constitute a living document, updated and refined when possible. If you see a typo, mistake, or even just something where you'd like some more writing, please do let me know.
While much of the material in the course readings has been updated, modified, or written by me, it is based originally on a set of readings created by Susan Horwitz at the University of Wisconsin-Madison. In the interest of providing the best possible guide to the subject, I will frequently read or re-read classic textbooks on compilers, and whatever course materials from other classes might be floating around on the internet. It is my intention to synthesize this information and provide wholly original material herein. That being said, if you see an example or text in the readings that you feel has not been appropriately credited to an external source, please contact me at your earliest convenience so that we can arrange for the material to be taken down, cited, or come to another arrangement.
Introduction
What is a compiler? A common answer amongst my students is something like "a program that turns code into a program", which is honestly not bad (if a bit imprecise). Without dwelling overly much on the messy details, let's refine this definition and run with it. For this class, we'll say that a compiler is:
Here's a simple pictorial view:
A compiler operates in phases; each phase translates the source program from one representation to another. Different compilers may include different phases, and/or may order them somewhat differently. A typical organization is shown below.
Below, we look at each phase of the compiler, giving a high-level overview of how
the code snippet position = initial + rate * 60
is represented
throughout the above workflow.
The Scanner
The scanner performs lexical analysis, which means to group the input characters into the fundamental units of the source language. If some characters of the input fail to correspond to any valid group, the scanner may also report lexical errors.
Some terminology for lexical analysis:
The definitions of what is a lexeme, token, or bad character all depend on the source language.
Example 1
Here are some C-like lexemes and the corresponding tokens:
| lexeme | ; | = | index | tmp | 37 | 102 |
|---|---|---|---|---|---|---|
| corresponding token | semicolon | assign | id | id | intlit | intlit |
Note that multiple lexemes can correspond to the same token (e.g., the index and tmp lexemes both correspond to id tokens).
Example 2 (the running example)
Given the source code:
position = initial + rate * 60 ;
a C-like scanner would return the following sequence of tokens:
id assign id plus id times intlit semicolon
The errors that a scanner might report are usually due to characters that cannot be a valid part of any token.
Erroneous characters for Java source include hash (#) and control-a.
Erroneous characters for C source include the at-sign (@) and the backtick (`).
The Parser
The parser performs syntactic analysis, which means to group the tokens of the language into constructs of the programming language (such as loops, expressions, functions, etc). A good analogy is to think of the scanner as building the "words" of the programming language, and the parser as building the "sentences" or "grammatical phrases" of the source program.
Example
source code:   position = initial + rate * 60 ;
Abstract syntax tree:
Key features of the Abstract-Syntax Tree (AST):
While the above presentation has implied that syntactic analysis comes strictly after lexical analysis has been completed, the two phases can often be interleaved, with the parser serving as the "driver" module: the parser asks the lexical analyzer to produce one token at a time, which is then added to the AST or rejected if syntactically invalid. Thus, the AST is built incrementally. One reason for this behavior is that if problems are detected early in the parse, the compiler can emit an error quickly. This "fail-fast" behavior is a major design goal in compiler construction, as it can save users time and compute cycles.
The Semantic Analyzer
The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors. It may also annotate and/or change the abstract syntax tree (e.g., it might annotate each node that represents an expression with its type).
Example:
Abstract syntax tree before semantic analysis:
Abstract syntax tree after semantic analysis:
The Intermediate Code Generator
The intermediate code generator translates from abstract-syntax tree
to intermediate code.
One possibility is 3-address code (code in which each instruction
involves at most 3 operands).
Below is an example of 3-address code for the abstract-syntax tree
shown above.
Note that in this example, the second and third instructions each have
exactly three operands (the location where the result of the
operation is stored, and two operands);
the first and fourth instructions have just two operands ("temp1" and "60"
for instruction 1, and "position" and "temp3" for instruction 4).
The Optimizer
The optimizer tries to improve code generated by the intermediate code
generator.
The goal is usually to make code run faster, but the optimizer may also
try to make the code smaller.
In the example above, an optimizer might first discover that the
conversion of the integer 60 to a floating-point number can be done
at compile time instead of at run time.
Then it might discover that there is no need for "temp1" or "temp3".
Here's the optimized code:
The Code Generator
The code generator generates assembly code from (optimized)
intermediate code. For example, the following x64 code might
be generated for our running example:
It's worth noting that the above representation of the running
example is still a text file, though one that the compiler
proper considers to be the final output. There is still work
left to be done before we have an executable program. Usually,
a collection of programs, collectively referred
to as a compiler toolchain, work in sequence to build the
executable:
When it comes time for the program to
actually be run, the loader maps the executable
file into memory (alongside whatever dynamic libraries
may be required) as a process, and then hands control
to the operating system to actually begin execution
of the program. In modern workstation computing, the loader
is integrated into the Operating System, but in some
legacy, specialized, and embedded systems the loader is still
a distinct program.
Students who use compilers like gcc or llvm are often
surprised at the existence of distinct "post-compilation"
steps to produce an executable. Indeed the term "compiler"
is often used to refer to the full "compiler toolchain".
Nevertheless, the workflow from compilation, to assembly, to
linkage, is represented as a series of distinct
modules within many popular compilers, and each of the tools
can be invoked separately. For example, the stages of the
C toolchain gcc correspond to cc1 (the C
compiler proper), as (the GNU assembler), and
ld (the GNU linker).
temp1 = inttofloat(60)
temp2 = rate * temp1
temp3 = initial + temp2
position = temp3
temp2 = rate * 60.0
position = initial + temp2
movl rate(%rip), %eax
imull $60, %eax, %edx
movl initial(%rip), %eax
addl %edx, %eax
movl %eax, position(%rip)
Scanning
Overview
Recall that the job of the scanner is to translate the sequence of characters that is the input to the compiler to a corresponding sequence of tokens. In particular, each time the scanner is called it should find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and should return that token.
It is possible to write a scanner from scratch, but it is often much easier and less error-prone to use a scanner generator like lex or flex (which produce C or C++ code), or JLex (which produces Java code). The input to a scanner generator includes one regular expression for each token (and for each construct that must be recognized and ignored, such as whitespace and comments). Therefore, to use a scanner generator you need to understand regular expressions. To understand how the scanner generator produces code that correctly recognizes the strings defined by the regular expressions, you need to understand finite-state machines (FSMs).
Finite-State Machines
A finite-state machine is similar to a compiler in that:
In both cases, the input (the program or the string) is a sequence of characters.
Here's an example of a finite-state-machine that recognizes Pascal identifiers (sequences of one or more letters or digits, starting with a letter):
In this picture:
A FSM is applied to an input (a sequence of characters). It either accepts or rejects that input. Here's how the FSM works:
An input string is accepted by a FSM if:
The language defined by a FSM is the set of strings accepted by the FSM.
The following strings are in the language of the FSM shown above:
The following strings are not in the language of the FSM shown above:
Write a finite-state machine that accepts e-mail addresses, defined as follows:
Let us consider another example of an FSM:
Example: Integer Literals
The following is a finite-state machine that accepts integer literals with an optional + or - sign:
Formal Definition
An FSM is a 5-tuple: $(Q,\Sigma,\delta,q,F)$
Here's a definition of $\delta$ for the above example, using a state transition table:
| | $+$ | $-$ | $\mathrm{digit}$ |
|---|---|---|---|
| $S$ | $A$ | $A$ | $B$ |
| $A$ | | | $B$ |
| $B$ | | | $B$ |
Deterministic and Non-Deterministic FSMs
There are two kinds of FSM:
Example
Here is a non-deterministic finite-state machine that recognizes the same language as the second example deterministic FSM above (the language of integer literals with an optional sign):
Sometimes, non-deterministic machines are simpler than deterministic ones, though not in this example.
A string is accepted by a non-deterministic finite-state machine if there exists a sequence of moves starting in the start state, ending in a final state, that consumes the entire string. For example, here's what happens when the above machine is run on the input "+75":
| After scanning | Can be in these states |
|---|---|
| (nothing) | $S$, $A$ |
| $+$ | $A$, (stuck) |
| $+7$ | $B$, (stuck) |
| $+75$ | $B$, (stuck) |
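To make the "can be in these states" idea concrete, here is a small Java sketch (invented for illustration; it is not part of the original notes) that simulates the non-deterministic machine directly by tracking the set of possible states after each character:

import java.util.*;

public class NfaSim {
    // Moves on a single character for the sign-optional integer-literal NFA.
    static Set<String> move(String state, char c) {
        switch (state) {
            case "S": return (c == '+' || c == '-') ? Set.of("A") : Set.of();
            case "A": return Character.isDigit(c) ? Set.of("B") : Set.of();
            case "B": return Character.isDigit(c) ? Set.of("B") : Set.of();
            default:  return Set.of();
        }
    }

    // Epsilon closure: the epsilon edge lets S silently move to A.
    static Set<String> closure(Set<String> states) {
        Set<String> result = new HashSet<>(states);
        if (result.contains("S")) result.add("A");
        return result;
    }

    static boolean accepts(String input) {
        Set<String> current = closure(Set.of("S")); // "can be in these states"
        for (char c : input.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String s : current) next.addAll(move(s, c));
            current = closure(next);
            if (current.isEmpty()) return false;    // every path is stuck
        }
        return current.contains("B"); // accept iff some path ends in a final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("+75")); // true
        System.out.println(accepts("+"));   // false
    }
}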
It is worth noting that there is a theorem that says:
-
For every non-deterministic finite-state machine $M$,
there exists a deterministic machine $M'$ such that $M$ and
$M'$ accept the same language.
How to Implement a FSM
The most straightforward way to program a (deterministic) finite-state machine is to use a table-driven approach. This approach uses a table with one row for each state in the machine, and one column for each possible character. Table[j][k] tells which state to go to from state j on character k. (An empty entry corresponds to the machine getting stuck, which means that the input should be rejected.)
Recall the table for the (deterministic) "integer literal" FSM given above:
| | $+$ | $-$ | $\mathrm{digit}$ |
|---|---|---|---|
| $S$ | $A$ | $A$ | $B$ |
| $A$ | | | $B$ |
| $B$ | | | $B$ |
The table-driven program for a FSM works as follows:
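(Have a state variable initialized to the start state; then repeatedly read a character and use the table to update the state.) As a concrete illustration, here is one way the table-driven program might look in Java for the "integer literal" FSM above. This is a minimal sketch with invented names; the readings themselves don't prescribe an implementation language:

public class IntLitFsm {
    static final int S = 0, A = 1, B = 2, STUCK = -1;

    // table[state][charClass]: columns are '+', '-', digit.
    // STUCK marks an empty entry (the machine gets stuck; reject).
    static final int[][] table = {
        /* S */ { A,     A,     B },
        /* A */ { STUCK, STUCK, B },
        /* B */ { STUCK, STUCK, B },
    };

    static int charClass(char c) {
        if (c == '+') return 0;
        if (c == '-') return 1;
        if (Character.isDigit(c)) return 2;
        return -1; // not in the machine's alphabet
    }

    // Accept iff the whole input is consumed and we end in a final state.
    static boolean accepts(String input) {
        int state = S;                  // start in the start state
        for (char c : input.toCharArray()) {
            int col = charClass(c);
            if (col < 0 || table[state][col] == STUCK) return false; // stuck
            state = table[state][col];  // use the table to update the state
        }
        return state == B;              // B is the only final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("+75")); // true
        System.out.println(accepts("+"));   // false: A is not a final state
    }
}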
Regular Expressions
Regular expressions provide a compact way to define a language that can be accepted by a finite-state machine. Regular expressions are used in the input to a scanner generator to define each token, and to define things like whitespace and comments that do not correspond to tokens, but must be recognized and ignored.
As an example, recall that a Pascal identifier consists of a letter, followed by zero or more letters or digits. The regular expression for the language of Pascal identifiers is:
-
letter . (letter | digit)*
| | means "or" |
. | means "followed by" |
* | means zero or more instances of |
( ) | are used for grouping |
Often, the "followed by" dot is omitted, and just writing two things next to each other means that one follows the other. For example:
-
letter (letter | digit)*
In fact, the operands of a regular expression should be single characters or the special character epsilon, meaning the empty string (just as the labels on the edges of a FSM should be single characters or epsilon). In the above example, "letter" is used as a shorthand for:
-
a | b | c | ... | z | A | ... | Z
To understand a regular expression, it is necessary to know the precedences of the three operators. They can be understood by analogy with the arithmetic operators for addition, multiplication, and exponentiation:
| Regular Expression Operator | Analogous Arithmetic Operator | Precedence |
|---|---|---|
| \| | plus | lowest precedence |
| . | times | middle |
| * | exponentiation | highest precedence |
So, for example, the regular expression:
-
$\mathrm{letter}.\mathrm{letter} | \mathrm{digit}\mathrm{^*}$
is the same as:
-
$(\mathrm{letter}.\mathrm{letter}) | (\mathrm{digit}\mathrm{^*})$
Describe (in English) the language defined by each of the following regular expressions:
Example: Integer Literals
Let's re-examine the integer-literal example in the context of regular expressions.
An integer literal with an optional sign can be defined in English as:
-
"(nothing, +, or -), followed by one or more digits"
The corresponding regular expression is:
-
$(+ | - | \varepsilon).(\mathrm{digit}.\mathrm{digit}^*)$
Note that the regular expression for "one or more digits" is:
-
$\mathrm{digit}.\mathrm{digit}^*$
i.e., "one digit followed by zero or more digits". Since "one or more" is a common pattern, another operator, +, has been defined to mean "one or more". For example, $\mathrm{digit}$+ means "one or more digits", so another way to define integer literals with optional sign is:
-
$(+ | - | \varepsilon).\mathrm{digit}$+
The Language Defined by a Regular Expression
Every regular expression defines a language: the set of strings that match the expression. We will not give a formal definition here; instead, we'll give some examples:
| Regular Expression | Corresponding Set of Strings |
|---|---|
| $\varepsilon$ | {""} |
| a | {"a"} |
| a.b.c | {"abc"} |
| a \| b \| c | {"a", "b", "c"} |
| (a \| b \| c)* | {"", "a", "b", "c", "aa", "ab", $\ldots$, "bccabb", $\ldots$} |
Using Regular Expressions and FSMs to Define a Scanner
There is a theorem that says that for every regular expression, there is a finite-state machine that defines the same language, and vice versa. This is relevant to scanning because it is usually easy to define the tokens of a language using regular expressions, and then those regular expressions can be converted to finite-state machines (which can actually be programmed).
For example, let's consider a very simple language: the language of assignment statements in which the left-hand side is a Pascal identifier (a letter followed by zero or more letters or digits), and the right-hand side is one of the following:
Token | Regular Expression |
assign | "=" |
id | letter (letter | digit)* |
plus | "$+$" |
times | "$*$" |
equals | "="."=" |
These regular expressions can be converted into the following finite-state machines:
[FSM diagrams for the assign, id, plus, times, and equals tokens]
Given a FSM for each token, how do we write a scanner? Recall that the goal of a scanner is to find the longest prefix of the current input that corresponds to a token. This has two consequences:
Furthermore, remember that regular expressions are used both to define tokens and to define things that must be recognized and skipped (like whitespace and comments). In the first case a value (the current token) must be returned when the regular expression is matched, but in the second case the scanner should simply start up again trying to match another regular expression.
With all this in mind, to create a scanner from a set of FSMs, we must:
For example, the FSM that recognizes Pascal identifiers must be
modified as follows:
with the following table of actions:
F1: put back 1 char, return ID
And here is the combined FSM for the five tokens (with the actions noted below):

with the following table of actions:
F1: put back 1 char; return assign
F2: put back 1 char; return id
F3: return plus
F4: return times
F5: return equals
We can convert this FSM to code using the table-driven technique described above, with a few small modifications:
| | $+$ | $*$ | $=$ | $\mathrm{letter}$ | $\mathrm{digit}$ | EOF |
|---|---|---|---|---|---|---|
| $S$ | return plus | return times | $B$ | $A$ | | |
| $A$ | put back 1 char; return id | put back 1 char; return id | put back 1 char; return id | $A$ | $A$ | return id |
| $B$ | put back 1 char; return assign | put back 1 char; return assign | return equals | put back 1 char; return assign | put back 1 char; return assign | return assign |
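To see how the table's actions play out in code, below is a hypothetical hand-written version of the combined machine in Java. The names are invented, and the "put back 1 char" action becomes "don't advance past the offending character":

public class TinyScanner {
    private final String input;
    private int pos = 0;

    TinyScanner(String input) { this.input = input; }

    // Returns the next token name, or "eof" when the input is exhausted.
    String nextToken() {
        // Whitespace is recognized and ignored, like a comment would be.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) return "eof";
        char c = input.charAt(pos++);
        if (c == '+') return "plus";   // action F3
        if (c == '*') return "times";  // action F4
        if (c == '=') {                // state B: decide between assign and equals
            if (pos < input.length() && input.charAt(pos) == '=') {
                pos++;                 // consume the second '='
                return "equals";       // action F5
            }
            return "assign";           // F1: the lookahead char was never consumed
        }
        if (Character.isLetter(c)) {   // state A: scan the rest of an identifier
            while (pos < input.length()
                   && Character.isLetterOrDigit(input.charAt(pos))) pos++;
            return "id";               // F2: stop before the offending character
        }
        throw new IllegalArgumentException("bad character: " + c);
    }

    public static void main(String[] args) {
        TinyScanner s = new TinyScanner("pos == tmp2");
        String t;
        while (!(t = s.nextToken()).equals("eof")) System.out.println(t);
        // prints: id, equals, id
    }
}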
Suppose we want to extend the very simple language of assignment statements defined above to allow both integer and double literals to occur on the right-hand sides of the assignments. For example:
would be a legal assignment.
What new tokens would have to be defined? What are the regular expressions, the finite-state machines, and the modified finite-state machines that define them? How would the "combined" finite-state machine given above have to be augmented?
Defining Syntax
Contents
Recall that the input to the parser is a sequence of tokens (received interactively, via calls to the scanner). The parser:
The output depends on whether the input is a syntactically legal program; if so, then the output is some representation of the program:
We know that we can use regular expressions to define languages (for example, the languages of the tokens to be recognized by the scanner). Can we use them to define the language to be recognized by the parser? Unfortunately, the answer is no. Regular expressions are not powerful enough to define many aspects of a programming language's syntax. For example, a regular expression cannot be used to specify that the parentheses in an expression must be balanced, or that every "else" statement has a corresponding "if". Furthermore, a regular expression doesn't say anything about underlying structure. For example, the following regular expression defines integer arithmetic involving addition, subtraction, multiplication, and division:
digit+ (("+" | "-" | "*" | "/") digit+)*
Example: Simple Arithmetic Expressions
We can write a context-free grammar (CFG) for the language of (very simple) arithmetic expressions involving only subtraction and division. In English:
Exp$\;\longrightarrow\;$ intliteral
Exp$\;\longrightarrow\;$ Exp minus Exp
Exp$\;\longrightarrow\;$ Exp divide Exp
Exp$\;\longrightarrow\;$ lparen Exp rparen
And here is how to understand the grammar:
A more compact way to write this grammar is:
Exp$\;\longrightarrow\;$ intliteral | Exp minus Exp | Exp divide Exp | lparen Exp rparen
Intuitively, the vertical bar means "or", but do not be fooled into thinking that the right-hand sides of grammar rules can contain regular expression operators! This use of the vertical bar is just shorthand for writing multiple rules with the same left-hand-side nonterminal.
Formal Definition
A CFG is a 4-tuple $\left( N, \Sigma, P, S \right)$ where:
Example: Boolean Expressions, Assignment Statements, and If Statements
The language of boolean expressions can be defined in English as follows:
BExp$\;\longrightarrow\;$true
BExp$\;\longrightarrow\;$false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Here is a CFG for a language of very simple assignment statements (only statements that assign a boolean value to an identifier):
-
Stmt $\;\longrightarrow\;$ id assign BExp semicolon
We can "combine" the two grammars given above, and add two more rules to get a grammar that defines a language of (very simple) if-statements. In words, an if-statement is:
Here is the combined context-free grammar:
Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt
Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt else Stmt
Stmt $\;\longrightarrow\;$ id assign BExp semicolon
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Write a context-free grammar for the language of very simple while loops (in which the loop body only contains one statement) by adding a new production with nonterminal stmt on the left-hand side.
The language defined by a context-free grammar is the set of strings (sequences of terminals) that can be derived from the start nonterminal. What does it mean to derive something?
Thus, we arrive either at $\varepsilon$ or at a string of terminals. That is how we derive a string in the language defined by a CFG.
Below is an example derivation, using the 4 productions for the grammar of arithmetic expressions given above. In this derivation, we use the actual lexemes instead of the token names (e.g., we use the symbol "-" instead of minus).
-
Exp
$\Longrightarrow$ Exp - Exp
$\Longrightarrow$ 1 - Exp
$\Longrightarrow$ 1 - Exp / Exp
$\Longrightarrow$ 1 - Exp / 2
$\Longrightarrow$ 1 - 4 / 2
And here is some useful notation:
So, given the above example, we could write: Exp $\stackrel{+}\Longrightarrow$ 1 - Exp / Exp.
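The "replace one nonterminal per step" process is mechanical enough to express directly in code. This Java sketch (invented for illustration) replays the example derivation, where each step splices a production's right-hand side into the current sequence:

import java.util.*;

public class DeriveDemo {
    // Replace the symbol at index i with the right-hand side rhs.
    static void apply(List<String> seq, int i, List<String> rhs) {
        seq.remove(i);
        seq.addAll(i, rhs);
    }

    public static void main(String[] args) {
        List<String> current = new ArrayList<>(List.of("Exp"));
        apply(current, 0, List.of("Exp", "-", "Exp")); // Exp => Exp - Exp
        apply(current, 0, List.of("1"));               // => 1 - Exp
        apply(current, 2, List.of("Exp", "/", "Exp")); // => 1 - Exp / Exp
        apply(current, 4, List.of("2"));               // => 1 - Exp / 2
        apply(current, 2, List.of("4"));               // => 1 - 4 / 2
        System.out.println(String.join(" ", current)); // prints: 1 - 4 / 2
    }
}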
A more formal definition of what it means for a CFG $G$ to define a language may be stated as follows: $$L(G) = \{ w \mid S \stackrel{*}{\Longrightarrow} w \}$$
where
There are several kinds of derivations that are important. A derivation is a leftmost derivation if it is always the leftmost nonterminal that is chosen to be replaced. It is a rightmost derivation if it is always the rightmost one.
Parse Trees
Another way to derive things using a context-free grammar is to construct a parse tree (also called a derivation tree) as follows:
Here is the example expression grammar given above:
-
Exp$\;\longrightarrow\;$ intliteral | Exp minus Exp | Exp divide Exp | lparen Exp rparen
and, using that grammar, here's a parse tree for the string 1 - 4 / 2:

Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt
Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt else Stmt
Stmt $\;\longrightarrow\;$ id assign BExp semicolon
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Question 2: Give a parse tree for the same string.
Syntax for Programming Languages
Contents
While our primary focus in this course is compiler development, we will necessarily touch on some of the aspects of programming language development in the process. A programming language needs to define what constitutes valid syntax for the language (i.e. language definition). The compiler needs to determine whether a given stream of tokens belongs to that language (i.e. language recognition).[1] Conveniently, a sufficiently rigorous definition of the language can be used as the basis of a recognizer: if there is some way to produce a given string according to the definition of the language, then it is necessarily part of the language.
[1] During parsing, we're only interested in capturing the syntax (as opposed to the semantics) of the programming language. The syntax of a programming language is intuitively understood to be the structure of the code, whereas semantics is the meaning of the code.
Context-free Grammars (CFGs) provide a promising basis for defining the syntax of a programming language: for one, the class of languages that CFGs recognize (the context-free languages) is expressive enough to capture popular syntactic constructs. At the same time, CFGs are (relatively) uniform and compact. As such, programmers are expected to be able to read a CFG-like definition of a language and recognize whether or not a given string of input tokens belongs to that language.
For a compiler, however, performing language recognition (making a yes/no determination of whether a string belongs to a language) is not enough to complete syntactic analysis. In order for a context-free grammar to serve as a suitable definition for a programming language, there are a number of refinements that need to be considered. In this section, we will discuss some of these considerations.
We'll caveat the claim that "CFGs are a good way to define programming-language syntax" at the end of this reading, but it's worth noting that there are some compelling arguments to the contrary for practical purposes.
The following grammar (recalled from the prior reading) defines a simple language of mathematical expressions (with the productions numbered for reference):
P1: Exp $\longrightarrow$ intlit
P2: Exp $\longrightarrow$ Exp minus Exp
P3: Exp $\longrightarrow$ Exp divide Exp
P4: Exp $\longrightarrow$ lparen Exp rparen
Consider the string
1 - 4 / 2
which would be tokenized as
intlit minus intlit divide intlit
It should be readily apparent to you that the string belongs to the language, and can be produced via the (leftmost) derivation sequence:
Step #1. Exp $\Longrightarrow$ Exp minus Exp (via production P2)
Step #2. $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ (via production P1)
Step #3. $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ divide $Exp$ (via production P3)
Step #4. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide $Exp$ (via production P1)
Step #5. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)
This derivation sequence corresponds to the parse tree:
However, inspection of the grammar also reveals that there is another leftmost derivation that can recognize the string by choosing to apply the rules differently:
Step #1. Exp $\Longrightarrow$ Exp divide Exp (via production P3)
Step #2. $\;\;\;\;\;\Longrightarrow$ Exp minus Exp divide Exp (via production P2)
Step #3. $\;\;\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P1)
Step #4. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P1)
Step #5. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)
Which corresponds to the parse tree:
If for grammar $G$ and string $w$ there is:
then G is called an ambiguous grammar. (Note: the three conditions given above are equivalent; if one is true then all three are true.)
It is worth noting that ambiguity is not a problem for certain applications of Context-Free Grammars. As discussed in the intro, language recognition is satisfied by finding one derivation for the string, and the string certainly maintains membership in the language if more than one derivation is possible. However, for the task of parsing, ambiguous grammars cause problems, both in the correctness of the parse tree and in the efficiency of building it. We will briefly explore both of these issues.
A problem of correctness: The major issue with ambiguous parse trees is that the underlying structure of the language is ill-defined. In the above example, subtraction and division can swap precedence; the first parse tree groups 4/2, while the second groups 1-4. These two groupings correspond to expressions with different values: the first tree corresponds to (1-(4/2)) and the second tree corresponds to ((1-4)/2).
A problem of efficiency: As we will see later, a parser can use significantly more efficient algorithms (in terms of Big-O complexity) by adding constraints to the grammars that it accepts. One such constraint imposed by some parsing algorithms is that the grammar be unambiguous. To some extent, this issue can come up independently of correctness. For example, the following grammar has no issue of grouping (it only ever produces intlit), but production P1 can be applied any number of times before producing the integer literal token.
P1: Exp $\longrightarrow$ Exp
P2: Exp $\longrightarrow$ intlit
It's useful to know how to translate an ambiguous grammar into an equivalent unambiguous grammar. Since every programming language includes expressions, we will discuss these operations in the context of an expression language. We are particularly interested in forcing unambiguous precedence and associativity of the operators.
To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("-" in our case), then write a rule for the operator(s) with the next lowest precedence, etc:
P1: Exp $\longrightarrow$ Exp minus Exp
P2: Exp $\longrightarrow$ Term
P3: Term $\longrightarrow$ Term divide Term
P4: Term $\longrightarrow$ Factor
P5: Factor $\longrightarrow$ intlit
P6: Factor $\longrightarrow$ lparen Exp rparen
Now let's try using these new rules to build parse trees for 1 - 4 / 2.
First, a parse tree that correctly reflects that fact that division has higher precedence than subtraction:
Now we'll try to construct a parse tree that shows the wrong precedence:

This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression: 5-3-2 is equivalent to: ((5-3)-2) and not to: (5-(3-2)).
Draw two parse trees for the expression 5-3-2 using the current expression grammar:
-
exp → exp MINUS exp | term
term → term DIVIDE term | factor
factor → INTLITERAL | LPAREN exp RPAREN
To understand how to write expression grammars that correctly reflect the associativity of the operators, you need to understand about recursion in grammars.
The grammar given above for arithmetic expressions is both left and right recursive in nonterminals Exp and Term (try writing the derivation steps that show this).
To write a grammar that correctly expresses operator associativity:
Here's the correct grammar:
Exp $\longrightarrow$ Exp minus Term | Term
Term $\longrightarrow$ Term divide Factor | Factor
Factor $\longrightarrow$ intlit | lparen Exp rparen
And here's the (one and only) parse tree that can be built for 5 - 3 - 2 using this grammar:

Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token pow for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written: 2 ** 3, and the corresponding sequence of tokens would be: intlit pow intlit.
Here's an ambiguous context-free grammar for this language:
Exp $\longrightarrow$ Exp plus Exp | Exp minus Exp | Exp times Exp | Exp divide Exp
    | Exp pow Exp | lparen Exp rparen | intlit
First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:
exp → exp PLUS exp | exp MINUS exp | term
term → term TIMES term | term DIVIDE term | factor
factor → factor POW factor | exponent
exponent → INTLITERAL | LPAREN exp RPAREN
This grammar is still ambiguous; it does not yet reflect the associativities of the operators. So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to: 2**(3**4)):
exp → exp PLUS term | exp MINUS term | term
term → term TIMES factor | term DIVIDE factor | factor
factor → exponent POW factor | exponent
exponent → INTLITERAL | LPAREN exp RPAREN
Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., -3**4 is equivalent to: (-3)**4, not to -(3**4)). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form: x op y op z.
exp → exp PLUS term | exp MINUS term | term
term → term TIMES factor | term DIVIDE factor | factor
factor → exponent POW factor | exponent
exponent → MINUS exponent | final
final → INTLITERAL | LPAREN exp RPAREN
Below is the grammar we used earlier for the language of boolean expressions, with two possible operands: true and false, and three possible operators: and, or, not:
BExp$\;\longrightarrow\;$true
BExp$\;\longrightarrow\;$false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Question 1: Add nonterminals so that or has lowest precedence, then and, then not. Then change the grammar to reflect the fact that both and and or are left associative.
Question 2: Draw a parse tree (using your final grammar for Question 1) for the expression: true and not true.
Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms. For each form given below, we provide three different grammars that define the specified list language.
A Grammar for a Programming Language
Using CFGs for programming language syntax has the notational advantage that the problem can be broken down into pieces. For example, let us begin sketching out a simple scripting language that starts with productions for some top-level nonterminals:
Program $\longrightarrow$ FunctionList
FunctionList $\longrightarrow$ Function | Function FunctionList
With these three productions, the basic shape of the program is already taking form, and some important constraints are put in place:
We can now move on to specifying the syntax of a function: A Function is the word "function", optionally preceded by the word "public", followed by an identifier (signifying the name of the function), followed by an open curly brace, followed by the function body, followed by a closing curly brace:
Function $\longrightarrow$ public function id lcurly FunctionBody rcurly
    | function id lcurly FunctionBody rcurly
Note that the above definition is hardly unique. Consider this alternative, but equivalent, definition for a function:
Function $\longrightarrow$ Access function id lcurly FunctionBody rcurly
Access $\longrightarrow$ public | $\varepsilon$
Note that the syntax of the language does not provide a list of arguments that the function takes (many popular programming languages would expect a parenthesized parameter list before the function body).
A function body is a list of zero or more statements:
FunctionBody $\longrightarrow$ $\varepsilon$ | StatementList
StatementList $\longrightarrow$ Statement | Statement StatementList
From here, we can define various statement types, and so on.
A benefit of the Context-Free Grammar is that a "sub-grammar" can be plugged in to allow complex parse trees to be grown at the non-terminal at which they are rooted. For example, we could allow the rule:
Statement $\longrightarrow$ id equals Exp
(where equals is the symbol =, used here for assignment)
This grammar can then make use of the expression grammar that we defined above to build parse trees with arbitrarily complex mathematical expressions on the right hand side of an assignment statement.
Summary
To understand how a parser works, we start by understanding context-free grammars, which are used to define the language recognized by the parser. Important terminology includes:
Two common kinds of grammars are grammars for expression languages, and grammars for lists. It is important to know how to write a grammar for an expression language that expresses operator precedence and associativity. It is also important to know how to write grammars for both non-empty and possibly empty lists, and for lists both with and without separators and terminators.
Syntax-Directed Definition
Contents
The goal of the parser is to produce output that can be consumed by the next phase of the compiler. Thus far, we have considered using a Context-Free Grammar as a convenient mechanism for defining what strings are syntactically valid programs in the language, and we have discussed techniques for altering the Context-Free Grammar such that the parse trees for those strings match the intended structure of the program. In summary, we've explored how to specify parse trees for a given token stream. However, the parse tree is a relatively cumbersome data structure to work with, made worse by the very alterations we introduced to aid in parsing. For our purposes, we will consider the parse tree to be a purely transitional data structure, useful for structuring the token stream, and then consumed to create the final output to be forwarded on to semantic analysis.
This reading introduces a method for specifying how to derive a result from a parse tree. This specification is known as a syntax-directed definition (SDD), while the entire technique is known as syntax-directed translation (SDT). The general technique of SDT is applicable for a variety of parsing tasks. To introduce SDD, we will explore several examples of mapping parse trees to a variety of different results, but the application we are ultimately interested in for compilers is to translate a parse tree into an AST. The name "syntax-directed" is apt: the key feature of a syntax-directed definition is that it intermingles the translation rules with the definition of the syntax itself, augmenting each grammar production with a rule describing its translation.
This involves doing a syntax-directed translation -- translating from a sequence of tokens to some other form, based on the underlying syntax.
A syntax-directed translation is defined by augmenting the CFG: a translation rule is defined for each production. A translation rule defines the translation of the left-hand side nonterminal as a function of:
Below is the definition of a syntax-directed translation that translates an arithmetic expression to its integer value. When a nonterminal occurs more than once in a grammar rule, the corresponding translation rule uses subscripts to identify a particular instance of that nonterminal. For example, the rule
Exp $\longrightarrow$ Exp plus Term
has two Exp nonterminals; Exp1 means the left-hand-side Exp, and Exp2 means the right-hand-side Exp.
Also, the notation xxx.value is used to mean the value associated with token xxx and the notation YYY.trans is used to mean the translation associated with nonterminal YYY.
| CFG Production | Translation Rule |
|---|---|
| Exp $\longrightarrow$ Exp plus Term | Exp1.trans = Exp2.trans $+$ Term.trans |
| Exp $\longrightarrow$ Term | Exp.trans = Term.trans |
| Term $\longrightarrow$ Term times Factor | Term1.trans = Term2.trans $\times$ Factor.trans |
| Term $\longrightarrow$ Factor | Term.trans = Factor.trans |
| Factor $\longrightarrow$ intlit | Factor.trans = intlit.value |
| Factor $\longrightarrow$ lparens Exp rparens | Factor.trans = Exp.trans |
Consider evaluating these rules on the input 2 * (4 + 5). The result is the following annotated parse tree:

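One way to picture how these rules execute is to give each kind of parse-tree node its own rule for computing .trans from its children's translations. The Java sketch below (using records for brevity) is invented for illustration; the "chain" productions (e.g., Exp $\longrightarrow$ Term) and the parentheses rule, which simply copy a child's translation, are elided:

interface Node { int trans(); }

// Factor -> intlit:          Factor.trans = intlit.value
record IntLit(int value) implements Node {
    public int trans() { return value; }
}
// Exp -> Exp plus Term:      Exp1.trans = Exp2.trans + Term.trans
record Plus(Node exp, Node term) implements Node {
    public int trans() { return exp.trans() + term.trans(); }
}
// Term -> Term times Factor: Term1.trans = Term2.trans * Factor.trans
record Times(Node term, Node factor) implements Node {
    public int trans() { return term.trans() * factor.trans(); }
}

public class SddDemo {
    public static void main(String[] args) {
        // 2 * (4 + 5)
        Node root = new Times(new IntLit(2), new Plus(new IntLit(4), new IntLit(5)));
        System.out.println(root.trans()); // prints: 18
    }
}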
Example 2
Consider a language of expressions that use the three operators: +, &&, == using the terminal symbols plus, and, equals, respectively. Integer literals are represented by the same intlit token we've used before, and true and false represent the literals true and false (note that we could have just as well defined a single boollit token that the scanner would populate with either $true$ or $false$).
Let's define a syntax-directed translation that type checks these expressions; i.e., for type-correct expressions, the translation will be the type of the expression (either $int$ or $bool$), and for expressions that involve type errors, the translation will be the special value $error$. We'll use the following type rules:
Here is the CFG and the translation rules:
| CFG Production | Translation Rule |
|---|---|
| Exp $\longrightarrow$ Exp plus Term | if $($Exp2.trans == $int$$)$ and $($Term.trans == $int$$)$ then Exp1.trans = $int$ else Exp1.trans = $error$ |
| Exp $\longrightarrow$ Exp and Term | if $($Exp2.trans == $bool$$)$ and $($Term.trans == $bool$$)$ then Exp1.trans = $bool$ else Exp1.trans = $error$ |
| Exp $\longrightarrow$ Exp equals Term | if $($Exp2.trans == Term.trans$)$ and $($Term.trans $\neq$ $error$$)$ then Exp1.trans = $bool$ else Exp1.trans = $error$ |
| Exp $\longrightarrow$ Term | Exp.trans = Term.trans |
| Term $\longrightarrow$ true | Term.trans = $bool$ |
| Term $\longrightarrow$ false | Term.trans = $bool$ |
| Term $\longrightarrow$ intlit | Term.trans = $int$ |
| Term $\longrightarrow$ lparens Exp rparens | Term.trans = Exp.trans |
Here's an annotated parse tree for the input (2 + 2) == 4

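The same bottom-up evaluation works when the translation is a type rather than a number. Here is a minimal Java sketch of a few of Example 2's rules (names invented; the remaining productions follow the same pattern):

enum Type { INT, BOOL, ERROR }

interface TNode { Type trans(); }

// Term -> intlit / true / false: the translation is the literal's type.
record Lit(Type t) implements TNode {
    public Type trans() { return t; }
}
// Exp -> Exp plus Term: int if both operands are int, otherwise error.
record PlusExp(TNode left, TNode right) implements TNode {
    public Type trans() {
        return (left.trans() == Type.INT && right.trans() == Type.INT)
                ? Type.INT : Type.ERROR;
    }
}
// Exp -> Exp equals Term: bool if both operands have the same non-error type.
record EqExp(TNode left, TNode right) implements TNode {
    public Type trans() {
        Type lt = left.trans(), rt = right.trans();
        return (lt == rt && lt != Type.ERROR) ? Type.BOOL : Type.ERROR;
    }
}

public class TypeSdd {
    public static void main(String[] args) {
        // (2 + 2) == 4
        TNode e = new EqExp(new PlusExp(new Lit(Type.INT), new Lit(Type.INT)),
                            new Lit(Type.INT));
        System.out.println(e.trans()); // prints: BOOL
    }
}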
The following grammar defines the language of base-2 numbers:
-
B $\longrightarrow$ 0
B $\longrightarrow$ 1
B $\longrightarrow$ B 0
B $\longrightarrow$ B 1
Question #1: Define a syntax-directed translation so that the translation of a binary number is its base 10 value.
Question #2: Illustrate your translation scheme by drawing the parse tree for 1001 and annotating each nonterminal in the tree with its translation.
Abstract-Syntax Trees
While Syntax-Directed Definition serves as a general technique for specifying translations of trees for many purposes, one particular application of SDD is of special interest to compiler writers: inducing an Abstract-Syntax Tree (AST) from the Parse Tree. The AST can be thought of as a condensed form of the parse tree, eliding incidental syntactic details of the program input.
Students occasionally are surprised that the conceptual workflow of the compiler is to build the parse tree only to immediately discard it in favor of a different tree data structure. In this chapter we will discuss what the AST is, how the AST differs from the parse tree, and how an SDD may be specified to produce an AST from a parse tree. Hopefully, this treatment communicates why the parse tree is a useful part of syntactic analysis, as well as why the AST is a superior internal representation for later phases of the compiler. It's also worth mentioning that in practice, the parse tree and AST are constructed at the same time in a memory-efficient way that keeps much of the parse tree implicit.
The AST vs the Parse Tree
First, let's identify the ways in which an abstract-syntax tree (AST) differs from a parse tree:
In general, the AST is a better structure for later stages of the compiler because it omits details having to do with the source language, and just contains information about the essential structure of the program.
Below is an example of the parse tree and the AST for the expression
3 * (4 + 2)
using the usual arithmetic-expression grammar that
reflects the precedences and associativities of the operators.
Note that the parentheses are not needed in the AST because the structure of
the AST defines how the subexpressions are grouped.

It's worth noting a couple of relationships between AST and parse tree:
For constructs other than expressions, the compiler writer has some choices when defining the AST -- but remember that lists (e.g., lists of declarations, lists of statements, lists of parameters) should be flattened, that operators (e.g., "assign", "while", "if") go at internal nodes, not at leaves, and that syntactic details are omitted.
-
For example:
Input:
{ x = 0; while (x<10) { x = x+1; } y = x*2; }
Parse Tree:


Translation Rules That Build an AST
To define a syntax-directed translation so that the translation of an input is the corresponding AST, we first need operations that create AST nodes. Let's use some object-oriented pseudocode, and assume that we have the following class hierarchy:
class ExpNode { }

class IntLitNode extends ExpNode {
    private final int val;          // the literal's value
    public IntLitNode(int val) { this.val = val; }
}

class PlusNode extends ExpNode {
    private final ExpNode e1, e2;   // left and right operands
    public PlusNode(ExpNode e1, ExpNode e2) { this.e1 = e1; this.e2 = e2; }
}

class TimesNode extends ExpNode {
    private final ExpNode e1, e2;   // left and right operands
    public TimesNode(ExpNode e1, ExpNode e2) { this.e1 = e1; this.e2 = e2; }
}
Now we can define a syntax-directed translation for simple arithmetic expressions, so that the translation of an expression is its AST:
| CFG Production | Translation Rules |
|---|---|
| exp $\longrightarrow$ exp PLUS term | exp1.trans = new PlusNode(exp2.trans, term.trans) |
| exp $\longrightarrow$ term | exp.trans = term.trans |
| term $\longrightarrow$ term TIMES factor | term1.trans = new TimesNode(term2.trans, factor.trans) |
| term $\longrightarrow$ factor | term.trans = factor.trans |
| factor $\longrightarrow$ INTLITERAL | factor.trans = new IntLitNode(INTLITERAL.value) |
| factor $\longrightarrow$ LPARENS exp RPARENS | factor.trans = exp.trans |
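For instance, the AST shown earlier for 3 * (4 + 2) is exactly what these rules would produce, and could be built by hand with the classes above (a hypothetical snippet for comparison):

public class BuildAstDemo {
    public static void main(String[] args) {
        // The TimesNode is the root; the parentheses leave no trace in the AST.
        ExpNode ast = new TimesNode(
                new IntLitNode(3),
                new PlusNode(new IntLitNode(4), new IntLitNode(2)));
    }
}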
Illustrate the syntax-directed translation defined above by drawing the parse tree for the expression 2 + 3 * 4, and annotating the parse tree with its translation (i.e., each nonterminal in the tree will have a pointer to the AST node that is the root of the subtree of the AST that is the nonterminal's translation).
Parsing
Given a CFG and a string (a sequence of tokens), the goal of parsing is to determine whether the string is in the language of the CFG. The answer is yes iff the string can be derived from the CFG's start nonterminal. Equivalently, the answer is yes iff we can build a parse tree for the string, with the CFG's start nonterminal at the root.
There are a number of algorithms that can answer this question for any CFG, even ambiguous CFGs. Their worst-case running time is $O(N^3)$, where $N$ is the length of the input (the number of tokens in the input string). Some algorithms have better running times for non-ambiguous CFGs. For example, there is an algorithm called Earley's algorithm that has worst-case runtime $O(N^2)$ for non-ambiguous grammars, but even a quadratic-time algorithm may be considered too slow.
Fortunately, there are classes of grammars for which $O(N)$ parsers can be built (and, given a grammar, we can quickly test whether it is in such a class). Two such classes are:
and
LALR(1) grammars are:
In this class, we will learn about three parsing techniques: