$\newcommand{\bigsqcap}{\mathop ⨅}$

1. Overview
2. Scanning
3. Defining Syntax
4. Syntax for Programming Languages
5. Syntax-Directed Definition
6. Abstract-Syntax Trees
7. Parsing
8. CYK Parsing
9. LL(1) Parsing
10. Shift/Reduce Parsing
11. Syntax-Directed Translation (SDT)
12. Symbol Tables
13. Semantic Analysis
14. Dataflow
15. Dataflow Solving
16. Types
17. Runtime Environments
18. Variable Access
19. Parameter Passing
20. Code Generation
21. Sections

Welcome

Hello, and welcome to the course readings for Drew Davidson's EECS 665, "Compiler Construction". This set of course notes has been shaped by me (Drew) in order to serve as a written companion to the course lectures and discussions offered as part of EECS665. As such, it's written pretty informally, and will often draw on my personal observations and experience. These readings are not intended to serve as a standalone reference for compiler topics or as a guide for building a compiler. Thus, I'll start these readings with a disclaimer exactly opposite to that of most textbooks: they are not for a general audience, and they are intended to be read straight through from start to finish.

Some additional disclaimers: In the interest of brevity, these readings will occasionally adopt a definition and stick to it without fully exploring alternative definitions, or considering all implications of the definition. If you feel that a superior definition exists, please do let me know. Furthermore, the readings constitute a living document, updated and refined when possible. If you see a typo, mistake, or even just something where you'd like some more writing, please do let me know.

While much of the material in the course readings has been updated, modified, or written by me, it is based originally on a set of readings created by Susan Horwitz at the University of Wisconsin-Madison. In the interest of providing the best possible guide to the subject, I will frequently read or re-read classic textbooks on compilers, and whatever course materials from other classes might be floating around on the internet. It is my intention to synthesize this information and provide wholly original material herein. That being said, if you see an example or text in the readings that you feel has not been appropriately credited to an external source, please contact me at your earliest convenience so that we can arrange for the material to be taken down, cited, or come to another arrangement.

Introduction

What is a compiler? A common answer amongst my students is something like "A program that turns code into a program", which is honestly not bad (if a bit imprecise). Without dwelling overly much on the messy details, let's refine this definition and run with it. For this class, we'll say that a compiler is:

• A recognizer of some source language S.
• A translator (of programs written in S into programs written in some object or target language T).
• A program written in some host language H.

Here's a simple pictorial view:

A compiler operates in phases; each phase translates the source program from one representation to another. Different compilers may include different phases, and/or may order them somewhat differently. A typical organization is shown below.

Below, we look at each phase of the compiler, giving a high-level overview of how the code snippet position = initial + rate * 60 is represented throughout the above workflow.

The Scanner

The scanner performs lexical analysis, which means to group the input characters into the fundamental units of the source language. If some character(s) of input fail to correspond to a valid group, the scanner may also report lexical errors.

Some terminology for lexical analysis:

• The scanner groups the characters into lexemes (sequences of characters that "go together").
• Each lexeme corresponds to a token; the scanner forwards each token (plus maybe some additional information, such as the position at which the token was found in the input file) to the parser.

The definitions of what counts as a lexeme, token, or bad character all depend on the source language.

### Example 1

Here are some C-like lexemes and the corresponding tokens:

  lexeme:               ;           =        index    tmp    37        102
  corresponding token:  semicolon   assign   id       id     intlit    intlit


Note that multiple lexemes can correspond to the same token (e.g., the index and tmp lexemes both correspond to id tokens).

### Example 2 (the running example)

Given the source code:

position  =  initial +  rate   *    60 ;


a C-like scanner would return the following sequence of tokens:

id assign id plus id times intlit semicolon
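
To make this concrete, here is a minimal sketch of a hand-written scanner for this C-like token set, using Python's regular-expression engine (the token names follow the examples above; everything else is an illustrative assumption, not the course's actual scanner):

```python
import re

# Ordered token specifications; each pair is (token name, pattern).
TOKEN_SPECS = [
    ("intlit",    r"\d+"),
    ("id",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("assign",    r"="),
    ("plus",      r"\+"),
    ("times",     r"\*"),
    ("semicolon", r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def scan(source):
    """Return the sequence of token names for `source`, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():          # whitespace is recognized and ignored
            pos += 1
            continue
        match = MASTER.match(source, pos)
        if match is None:                  # no token matches: a lexical error
            raise ValueError(f"lexical error at position {pos}")
        tokens.append(match.lastgroup)     # name of the pattern that matched
        pos = match.end()
    return tokens

print(scan("position = initial + rate * 60 ;"))
# ['id', 'assign', 'id', 'plus', 'id', 'times', 'intlit', 'semicolon']
```

A real scanner generator builds a single finite-state machine from such patterns; here Python's regex engine plays that role.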

Error Handling in the Scanner

The errors that a scanner might report are usually due to characters that cannot be a valid part of any token.

• Erroneous characters for Java source include hash (#) and control-a.
• Erroneous characters for C source include the caret (^).

The Parser

The parser performs syntactic analysis, which means to group the tokens of the language into constructs of the programming language (such as loops, expressions, functions, etc). A good analogy is to think of the scanner as building the "words" of the programming language, and the parser as building the "sentences" or "grammatical phrases" of the source program.

• Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
• Finds syntax errors. For example, the source code
      5 position = ;
  could correspond to the sequence of tokens:
      intlit id assign semicolon
  All are legal tokens, but that sequence of tokens is erroneous.
• Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
• Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

### Example

source code:   position = initial + rate * 60 ;

Abstract syntax tree:

Key features of the Abstract-Syntax Tree (AST):

• The interior nodes of the tree are operators.
• A node's children are its operands.
• Each subtree forms a "logical unit", e.g., the subtree with * at its root shows that, because multiplication has higher precedence than addition, rate * 60 must be computed as a unit (rather than first computing initial + rate).
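
The tree above can be sketched with a tiny node class (a hypothetical representation for illustration; real compilers typically use one class per kind of node):

```python
class Node:
    def __init__(self, label, *children):
        self.label = label          # operator name, identifier, or literal
        self.children = children    # operand subtrees (empty for leaves)

    def __str__(self):
        if not self.children:
            return self.label
        return f"({self.label} {' '.join(str(c) for c in self.children)})"

# AST for: position = initial + rate * 60
# The * subtree is a child of +, reflecting its higher precedence.
ast = Node("=",
           Node("position"),
           Node("+",
                Node("initial"),
                Node("*", Node("rate"), Node("60"))))

print(ast)   # (= position (+ initial (* rate 60)))
```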

Reality Check!

While the above presentation has implied that syntactic analysis begins strictly after lexical analysis has been completed, the two phases are often interleaved, with the parser serving as the "driver" module: the parser sends requests to the lexical analyzer to produce one token at a time, and each token is then added to the AST or rejected if it is syntactically invalid. Thus, the AST is built incrementally. One reason for this behavior is that if problems are detected early in the parse, the compiler can emit an error quickly. This "fail-fast" behavior is a major design goal in compiler construction, as it can save users time and compute cycles.

The Semantic Analyzer

The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors. It may also annotate and/or change the abstract syntax tree (e.g., it might annotate each node that represents an expression with its type).

### Example:

Abstract syntax tree before semantic analysis:

Abstract syntax tree after semantic analysis:

The Intermediate Code Generator

The intermediate code generator translates from abstract-syntax tree to intermediate code. One possibility is 3-address code (code in which each instruction involves at most 3 operands). Below is an example of 3-address code for the abstract-syntax tree shown above. Note that in this example, the second and third instructions each have exactly three operands (the location where the result of the operation is stored, and the two operands of the operation); the first and fourth instructions have just two operands ("temp1" and "60" for instruction 1, and "position" and "temp3" for instruction 4).

    temp1 = inttofloat(60)
    temp2 = rate * temp1
    temp3 = initial + temp2
    position = temp3
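
The translation from AST to 3-address code can be sketched as a post-order tree walk (an illustration, not the compiler's actual generator; for brevity this sketch omits the inttofloat conversion step, so its temporary numbering differs slightly from the listing above):

```python
class Node:
    def __init__(self, label, *children):
        self.label, self.children = label, children

def gen_3ac(root):
    """Post-order walk of an expression tree, emitting one 3-address
    instruction per operator and inventing fresh temporaries."""
    code, counter = [], [0]
    def walk(node):
        if not node.children:              # leaf: identifier or literal
            return node.label
        args = [walk(c) for c in node.children]
        if node.label == "=":              # assignment needs no fresh temp
            code.append(f"{args[0]} = {args[1]}")
            return args[0]
        counter[0] += 1
        temp = f"temp{counter[0]}"
        code.append(f"{temp} = {args[0]} {node.label} {args[1]}")
        return temp
    walk(root)
    return code

# position = initial + rate * 60
ast = Node("=", Node("position"),
           Node("+", Node("initial"),
                Node("*", Node("rate"), Node("60"))))
for line in gen_3ac(ast):
    print(line)
# temp1 = rate * 60
# temp2 = initial + temp1
# position = temp2
```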


The Optimizer

The optimizer tries to improve code generated by the intermediate code generator. The goal is usually to make code run faster, but the optimizer may also try to make the code smaller. In the example above, an optimizer might first discover that the conversion of the integer 60 to a floating-point number can be done at compile time instead of at run time. Then it might discover that there is no need for "temp1" or "temp3". Here's the optimized code:

    temp2 = rate * 60.0
    position = initial + temp2
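
These two improvements can be sketched as passes over the 3-address instructions (a toy illustration operating on strings; real optimizers work on richer intermediate representations):

```python
def optimize(code):
    """Sketch of the improvements described above, on (dest, expr) pairs:
    fold the compile-time int->float conversion, then remove temporaries
    that exist only to be copied from or into."""
    # Pass 1: constant folding of inttofloat applied to an integer literal.
    folded = []
    for dest, expr in code:
        if expr.startswith("inttofloat(") and expr[11:-1].isdigit():
            expr = str(float(expr[11:-1]))     # inttofloat(60) -> 60.0
        folded.append((dest, expr))
    # Pass 2: forward copy propagation -- a temp defined as a single
    # value is substituted into later uses and its definition dropped.
    copies, out = {}, []
    for dest, expr in folded:
        words = [copies.get(w, w) for w in expr.split()]
        expr = " ".join(words)
        if len(words) == 1 and dest.startswith("temp"):
            copies[dest] = words[0]
        else:
            out.append((dest, expr))
    # Pass 3: if "x = tempN" immediately follows tempN's definition,
    # write the result directly into x and drop the copy.
    final = []
    for dest, expr in out:
        if final and expr == final[-1][0] and expr.startswith("temp"):
            final[-1] = (dest, final[-1][1])
        else:
            final.append((dest, expr))
    return final

code = [("temp1", "inttofloat(60)"),
        ("temp2", "rate * temp1"),
        ("temp3", "initial + temp2"),
        ("position", "temp3")]
for dest, expr in optimize(code):
    print(dest, "=", expr)
# temp2 = rate * 60.0
# position = initial + temp2
```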


The Code Generator

The code generator generates assembly code from (optimized) intermediate code. For example, the following x64 code might be generated for our running example:

        movl    rate(%rip), %eax
        imull   $60, %eax, %edx
        movl    initial(%rip), %eax
        addl    %edx, %eax
        movl    %eax, position(%rip)

It's worth noting that the above representation of the running example is still a text file, though one that the compiler proper considers to be the final output. There is still work left to be done before we have an executable program. Usually, a collection of programs, collectively referred to as a compiler toolchain, works in sequence to build the executable:

• The assembler transforms an assembly code file (such as one produced by the compiler) into a relocatable, which is either an object file (typically given the .o extension) or a library (typically given the .so or .a extension, for a dynamically-linked or statically-linked library, respectively).
• The linker merges the various object files and static libraries output by the assembler into a single executable.

When it comes time for the program to actually be run, the loader maps the executable file into memory (alongside whatever dynamic libraries may be required) as a process, and then hands control to the operating system to begin execution of the program. In modern workstation computing, the loader is integrated into the operating system, but in some legacy, specialized, and embedded systems the loader is still a distinct program.

Students who use compilers like gcc or llvm are often surprised at the existence of distinct "post-compilation" steps to produce an executable. Indeed, the term "compiler" is often used to refer to the full "compiler toolchain". Nevertheless, the workflow from compilation, to assembly, to linkage is represented as a series of distinct modules within many popular compilers, and each of the tools can be invoked separately. For example, running the toolchain for C's gcc can be accomplished using cc (the C compiler proper), as (the GNU assembler), and ld (the GNU linker).
Scanning

Overview

Recall that the job of the scanner is to translate the sequence of characters that is the input to the compiler into a corresponding sequence of tokens. In particular, each time the scanner is called it should find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and should return that token.

It is possible to write a scanner from scratch, but a much easier and less error-prone approach is to use a scanner generator like lex or flex (which produce C or C++ code), or jlex (which produces Java code). The input to a scanner generator includes one regular expression for each token (and for each construct that must be recognized and ignored, such as whitespace and comments). Therefore, to use a scanner generator you need to understand regular expressions. To understand how the scanner generator produces code that correctly recognizes the strings defined by the regular expressions, you need to understand finite-state machines (FSMs).

Finite-State Machines

A finite-state machine is similar to a compiler in that:

• A compiler recognizes legal programs in some (source) language.
• A finite-state machine recognizes legal strings in some language.

In both cases, the input (the program or the string) is a sequence of characters.

### Example: Pascal Identifiers

Here's an example of a finite-state machine that recognizes Pascal identifiers (sequences of one or more letters or digits, starting with a letter):

In this picture:

• Nodes are states.
• Edges (arrows) are transitions. Each edge should be labeled with a single character. In this example, we've used a single edge labeled "letter" to stand for 52 edges labeled 'a', 'b', $\ldots$, 'z', 'A', $\ldots$, 'Z'. (Similarly, the label "letter,digit" stands for 62 edges labeled 'a', $\ldots$, 'Z', '0', $\ldots$, '9'.)
• S is the start state; every FSM has exactly one (a standard convention is to label the start state "S").
• A is a final state.
By convention, final states are drawn using a double circle, and non-final states are drawn using single circles. A FSM may have more than one final state.

A FSM is applied to an input (a sequence of characters). It either accepts or rejects that input. Here's how the FSM works:

• The FSM starts in its start state.
• If there is an edge out of the current state whose label matches the current input character, then the FSM moves to the state pointed to by that edge, and "consumes" that character; otherwise, it gets stuck.
• The finite-state machine stops when it gets stuck or when it has consumed all of the input characters.

An input string is accepted by a FSM if:

• the entire string is consumed (the machine did not get stuck), and
• the machine ends in a final state.

The language defined by a FSM is the set of strings accepted by the FSM.

The following strings are in the language of the FSM shown above:

• x
• tmp2
• XyZzy
• position27

The following strings are not in the language of the FSM shown above:

• 123
• a?
• 13apples

### Self-Study #1

Write a finite-state machine that accepts e-mail addresses, defined as follows:

• a sequence of one or more letters and/or digits,
• followed by an at-sign,
• followed by one or more letters,
• followed by zero or more extensions.
• An extension is a dot followed by one or more letters.

solution

Let us consider another example of an FSM:

### Example: Integer Literals

The following is a finite-state machine that accepts integer literals with an optional + or - sign:

Formal Definition

An FSM is a 5-tuple: $(Q,\Sigma,\delta,q,F)$

• $Q$ is a finite set of states ($\{\mathrm{S,A,B}\}$ in the above example).
• $\Sigma$ (an uppercase sigma) is the alphabet of the machine, a finite set of characters that label the edges ($\{+,-,0,1,\ldots,9\}$ in the above example).
• $q$ is the start state, an element of $Q$ ($\mathrm{S}$ in the above example).
• $F$ is the set of final states, a subset of $Q$ ($\{\mathrm{B}\}$ in the above example).
• $\delta$ is the state transition relation: $Q \times \Sigma \rightarrow Q$ (i.e., it is a function that takes two arguments -- a state in $Q$ and a character in $\Sigma$ -- and returns a state in $Q$).

Here's a definition of $\delta$ for the above example, using a state transition table:

        +    -    digit
    S   A    A    B
    A             B
    B             B

Deterministic and Non-Deterministic FSMs

There are two kinds of FSM:

1. Deterministic:
   • No state has more than one outgoing edge with the same label.
2. Non-deterministic:
   • States may have more than one outgoing edge with the same label.
   • Edges may be labeled with $\varepsilon$ (epsilon), the empty string. The FSM can take an $\varepsilon$-transition without looking at the current input character.

### Example

Here is a non-deterministic finite-state machine that recognizes the same language as the second example deterministic FSM above (the language of integer literals with an optional sign):

Sometimes, non-deterministic machines are simpler than deterministic ones, though not in this example.

A string is accepted by a non-deterministic finite-state machine if there exists a sequence of moves starting in the start state, ending in a final state, that consumes the entire string. For example, here's what happens when the above machine is run on the input "+75":

    After scanning    Can be in these states
    (nothing)         S, A
    +                 A, (stuck)
    +7                B, (stuck)
    +75               B, (stuck)

There is one sequence of moves that consumes the entire input and ends in a final state (state B), so this input is accepted by this machine.

It is worth noting that there is a theorem that says: For every non-deterministic finite-state machine $M$, there exists a deterministic machine $M'$ such that $M$ and $M'$ accept the same language.

## How to Implement a FSM

The most straightforward way to program a (deterministic) finite-state machine is to use a table-driven approach. This approach uses a table with one row for each state in the machine, and one column for each possible character.
Table[j][k] tells which state to go to from state j on character k. (An empty entry corresponds to the machine getting stuck, which means that the input should be rejected.)

Recall the table for the (deterministic) "integer literal" FSM given above:

        +    -    digit
    S   A    A    B
    A             B
    B             B

The table-driven program for a FSM works as follows:

• Have a variable named state, initialized to S (the start state).
• Repeat:
  • read the next character from the input
  • use the table to assign a new value to the state variable
  until the machine gets stuck (the table entry is empty) or the entire input is read.
• If the machine gets stuck, reject the input. Otherwise, if the current state is a final state, accept the input; otherwise, reject it.

## Regular Expressions

Regular expressions provide a compact way to define a language that can be accepted by a finite-state machine. Regular expressions are used in the input to a scanner generator to define each token, and to define things like whitespace and comments that do not correspond to tokens, but must be recognized and ignored.

As an example, recall that a Pascal identifier consists of a letter, followed by zero or more letters or digits. The regular expression for the language of Pascal identifiers is:

    letter . (letter | digit)*

The following table explains the symbols used in this regular expression:

    |      means "or"
    .      means "followed by"
    *      means zero or more instances of
    ( )    are used for grouping

Often, the "followed by" dot is omitted, and just writing two things next to each other means that one follows the other. For example:

    letter (letter | digit)*

In fact, the operands of a regular expression should be single characters or the special character epsilon, meaning the empty string (just as the labels on the edges of a FSM should be single characters or epsilon). In the above example, "letter" is used as a shorthand for:

    a | b | c | ... | z | A | ... | Z

and similarly for "digit".
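
As a quick check on the Pascal-identifier expression, the same language can be matched with a modern regex engine (Python's `re` writes concatenation implicitly and uses character classes for the letter and digit shorthands):

```python
import re

# [A-Za-z][A-Za-z0-9]* is letter . (letter | digit)* in the notation above.
pascal_id = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(s):
    """True iff the ENTIRE string s is a Pascal identifier."""
    return pascal_id.fullmatch(s) is not None

# Echoes the accepted/rejected strings from the FSM example earlier:
print([w for w in ["x", "tmp2", "XyZzy", "123", "13apples"] if is_identifier(w)])
# ['x', 'tmp2', 'XyZzy']
```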
Also, we sometimes put the characters in quotes (this is necessary if you want to use a vertical bar, a dot, or a star character).

To understand a regular expression, it is necessary to know the precedences of the three operators. They can be understood by analogy with the arithmetic operators for addition, multiplication, and exponentiation:

    Regular Expression Operator    Analogous Arithmetic Operator    Precedence
    |                              plus                             lowest
    .                              times                            middle
    *                              exponentiation                   highest

So, for example, the regular expression $\mathrm{letter}.\mathrm{letter} \mid \mathrm{digit}^*$ does not define the same language as the expression given above. Since the dot operator has higher precedence than the | operator (and the * operator has the highest precedence of all), this expression is the same as $(\mathrm{letter}.\mathrm{letter}) \mid (\mathrm{digit}^*)$ and it means "either two letters, or zero or more digits".

### Self-Study #2

Describe (in English) the language defined by each of the following regular expressions:

1. $\mathrm{digit} \mid \mathrm{letter} \; \mathrm{letter}$
2. $\mathrm{digit} \mid \mathrm{letter} \; \mathrm{letter}^*$
3. $\mathrm{digit} \mid \mathrm{letter}^*$

solution

### Example: Integer Literals

Let's re-examine the example of integer literals for regular expressions. An integer literal with an optional + or - sign can be defined in English as:

    "(nothing or + or -) followed by one or more digits"

The corresponding regular expression is:

$(+|-|\varepsilon).(\mathrm{digit}.\mathrm{digit}^*)$

Note that the regular expression for "one or more digits" is $\mathrm{digit}.\mathrm{digit}^*$, i.e., "one digit followed by zero or more digits". Since "one or more" is a common pattern, another operator, +, has been defined to mean "one or more".
For example, $\mathrm{digit}^+$ means "one or more digits", so another way to define integer literals with optional sign is:

$({\textbf +}|-|\varepsilon).\mathrm{digit}^+$

The Language Defined by a Regular Expression

Every regular expression defines a language: the set of strings that match the expression. We will not give a formal definition here; instead, we'll give some examples:

    Regular Expression    Corresponding Set of Strings
    $\varepsilon$         {""}
    a                     {"a"}
    a.b.c                 {"abc"}
    a | b | c             {"a", "b", "c"}
    (a | b | c)*          {"", "a", "b", "c", "aa", "ab", $\ldots$, "bccabb", $\ldots$}

Using Regular Expressions and FSMs to Define a Scanner

There is a theorem that says that for every regular expression, there is a finite-state machine that defines the same language, and vice versa. This is relevant to scanning because it is usually easy to define the tokens of a language using regular expressions, and then those regular expressions can be converted to finite-state machines (which can actually be programmed).

For example, let's consider a very simple language: the language of assignment statements in which the left-hand side is a Pascal identifier (a letter followed by zero or more letters or digits), and the right-hand side is one of the following:

• ID + ID
• ID * ID
• ID == ID

This language has five tokens, which can be defined by the following five regular expressions:

    Token     Regular Expression
    assign    "="
    id        letter (letter | digit)*
    plus      "+"
    times     "*"
    equals    "=" . "="

These regular expressions can be converted into corresponding finite-state machines (one machine per token).

Given a FSM for each token, how do we write a scanner? Recall that the goal of a scanner is to find the longest prefix of the current input that corresponds to a token. This has two consequences:
1. The scanner sometimes needs to look one or more characters beyond the last character of the current token, and then needs to "put back" those characters so that the next time the scanner is called it will have the correct current character. For example, when scanning a program written in the simple assignment-statement language defined above, if the input is "==", the scanner should return the EQUALS token, not two ASSIGN tokens. So if the current character is "=", the scanner must look at the next character to see whether it is another "=" (in which case it will return EQUALS), or is some other character (in which case it will put that character back and return ASSIGN).

2. It is no longer correct to run the FSM program until the machine gets stuck or end-of-input is reached, since in general the input will correspond to many tokens, not just a single token.

Furthermore, remember that regular expressions are used both to define tokens and to define things that must be recognized and skipped (like whitespace and comments). In the first case a value (the current token) must be returned when the regular expression is matched, but in the second case the scanner should simply start up again trying to match another regular expression.

With all this in mind, to create a scanner from a set of FSMs, we must:

• modify the machines so that a state can have an associated action to "put back N characters" and/or to "return token XXX",
• combine the finite-state machines for all of the tokens into a single machine, and
• write a program for the "combined" machine.
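
Here is a sketch of those ideas for the five-token language: a hand-coded recognizer in which the "=" case shows the one-character lookahead with put-back, and identifiers use longest-match (the code structure is an illustrative assumption, not generated table-driven code):

```python
def next_token(text, pos):
    """Return (token, new_pos). On '=' we look one character ahead and
    effectively "put back" the lookahead when it is not a second '='."""
    while pos < len(text) and text[pos].isspace():
        pos += 1                          # skip whitespace between tokens
    if pos == len(text):
        return None, pos                  # end of input
    c = text[pos]
    if c == "+":
        return "plus", pos + 1
    if c == "*":
        return "times", pos + 1
    if c == "=":
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return "equals", pos + 2
        return "assign", pos + 1          # lookahead char stays in the input
    if c.isalpha():
        end = pos + 1
        while end < len(text) and text[end].isalnum():
            end += 1                      # longest match for the identifier
        return "id", end
    raise ValueError(f"lexical error at position {pos}")

def tokenize(text):
    tokens, pos = [], 0
    while True:
        tok, pos = next_token(text, pos)
        if tok is None:
            return tokens
        tokens.append(tok)

print(tokenize("x == y"))    # ['id', 'equals', 'id']
```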
For example, the FSM that recognizes Pascal identifiers must be modified as follows:

with the following table of actions:

    Actions:
    F1: put back 1 char, return ID

And here is the combined FSM for the five tokens (with the actions noted below):

with the following table of actions:

    Actions:
    F1: put back 1 char; return assign
    F2: put back 1 char; return id
    F3: return plus
    F4: return times
    F5: return equals

We can convert this FSM to code using the table-driven technique described above, with a few small modifications:

• The table will include a column for end-of-file as well as for all possible characters (the end-of-file column is needed, for example, so that the scanner works correctly when an identifier is the last token in the input).
• Each table entry may include an action as well as or instead of a new state.
• Instead of repeating "read a character; update the state variable" until the machine gets stuck or the entire input is read, the code will repeat: "read a character; perform the action and/or update the state variable" (eventually, the action will be to return a value, so the scanner code will stop, and will start again in the start state next time it is called).

Here's the table for the above "combined" FSM:

         +                *                =                letter           digit            EOF
    S    return plus      return times     B                A
    A    put back 1 char; put back 1 char; put back 1 char; A                A                return id
         return id        return id        return id
    B    put back 1 char; put back 1 char; return equals    put back 1 char; put back 1 char; return assign
         return assign    return assign                     return assign    return assign

### Test Yourself #3

Suppose we want to extend the very simple language of assignment statements defined above to allow both integer and double literals to occur on the right-hand sides of the assignments. For example:

    x = 23 + 5.5

would be a legal assignment. What new tokens would have to be defined? What are the regular expressions, the finite-state machines, and the modified finite-state machines that define them?
How would the "combined" finite-state machine given above have to be augmented?

solution

Defining Syntax

Overview

Recall that the input to the parser is a sequence of tokens (received interactively, via calls to the scanner). The parser:

• Groups the tokens into "grammatical phrases".
• Discovers the underlying structure of the program.
• Finds syntax errors.
• Perhaps also performs some actions to find other kinds of errors.

The output depends on whether the input is a syntactically legal program; if so, then the output is some representation of the program:

• an abstract-syntax tree (maybe + a symbol table),
• or intermediate code,
• or object code.

We know that we can use regular expressions to define languages (for example, the languages of the tokens to be recognized by the scanner). Can we use them to define the language to be recognized by the parser? Unfortunately, the answer is no. Regular expressions are not powerful enough to define many aspects of a programming language's syntax. For example, a regular expression cannot be used to specify that the parentheses in an expression must be balanced, or that every "else" statement has a corresponding "if". Furthermore, a regular expression doesn't say anything about underlying structure. For example, the following regular expression defines integer arithmetic involving addition, subtraction, multiplication, and division:

    digit+ (("+" | "-" | "*" | "/") digit+)*

but provides no information about the precedence and associativity of the operators. To specify the syntax of a programming language, we use a different formalism, called context-free grammars.

### Example: Simple Arithmetic Expressions

We can write a context-free grammar (CFG) for the language of (very simple) arithmetic expressions involving only subtraction and division. In English:

• An integer is an arithmetic expression.
• If exp1 and exp2 are arithmetic expressions, then so are the following:
  • exp1 - exp2
  • exp1 / exp2
  • ( exp1 )

Here is the corresponding CFG:

Exp $\;\longrightarrow\;$ intliteral
Exp $\;\longrightarrow\;$ Exp minus Exp
Exp $\;\longrightarrow\;$ Exp divide Exp
Exp $\;\longrightarrow\;$ lparen Exp rparen

And here is how to understand the grammar:

• The grammar has five terminal symbols: intliteral, minus, divide, lparen, rparen. The terminals of a grammar used to define a programming language are the tokens returned by the scanner.
• The grammar has one nonterminal: Exp (note that a single name, Exp, is used instead of exp1 and exp2 as in the English definition above).
• The grammar has four productions or rules, each of the form: Exp $\longrightarrow$ ...
• A production's left-hand side is a single nonterminal. A production's right-hand side is either the special symbol $\varepsilon$ (the same $\varepsilon$ that can be used in a regular expression) or a sequence of one or more terminals and/or nonterminals (there is no rule with $\varepsilon$ on the right-hand side in the example given above).

A more compact way to write this grammar is:

Exp $\longrightarrow$ intliteral | Exp minus Exp | Exp divide Exp | lparen Exp rparen

Intuitively, the vertical bar means "or", but do not be fooled into thinking that the right-hand sides of grammar rules can contain regular expression operators! This use of the vertical bar is just shorthand for writing multiple rules with the same left-hand-side nonterminal.

Formal Definition

A CFG is a 4-tuple $\left( N, \Sigma, P, S \right)$ where:

• $N$ is a set of nonterminals.
• $\Sigma$ is a set of terminals.
• $P$ is a set of productions (or rules).
• $S$ is the start nonterminal (sometimes called the goal nonterminal) in $N$. If not specified, then it is the nonterminal that appears on the left-hand side of the first production.
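
To make the 4-tuple concrete, here is a sketch that encodes the expression grammar as plain data and performs derivation steps by replacing the first nonterminal found, scanning left to right (the encoding is a hypothetical illustration, not a standard library format):

```python
# The expression grammar as a 4-tuple sketch: nonterminals N, terminals
# Sigma, productions P (lhs -> list of right-hand sides), start symbol S.
GRAMMAR = {
    "N": {"Exp"},
    "Sigma": {"intliteral", "minus", "divide", "lparen", "rparen"},
    "P": {"Exp": [["intliteral"],
                  ["Exp", "minus", "Exp"],
                  ["Exp", "divide", "Exp"],
                  ["lparen", "Exp", "rparen"]]},
    "S": "Exp",
}

def leftmost_step(sentential, rule_index, grammar=GRAMMAR):
    """Replace the leftmost nonterminal in `sentential` using the
    production numbered `rule_index` for that nonterminal."""
    for i, sym in enumerate(sentential):
        if sym in grammar["N"]:
            rhs = grammar["P"][sym][rule_index]
            return sentential[:i] + rhs + sentential[i + 1:]
    return sentential  # already all terminals

# Derive "intliteral minus intliteral divide intliteral":
form = ["Exp"]                   # start nonterminal
form = leftmost_step(form, 1)    # Exp => Exp minus Exp
form = leftmost_step(form, 0)    # => intliteral minus Exp
form = leftmost_step(form, 2)    # => intliteral minus Exp divide Exp
form = leftmost_step(form, 0)    # => intliteral minus intliteral divide Exp
form = leftmost_step(form, 0)    # => intliteral minus intliteral divide intliteral
print(" ".join(form))
```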
### Example: Boolean Expressions, Assignment Statements, and If Statements

The language of boolean expressions can be defined in English as follows:

• "true" is a boolean expression, recognized by the token true.
• "false" is a boolean expression, recognized by the token false.
• If exp1 and exp2 are boolean expressions, then so are the following:
  • exp1 || exp2
  • exp1 && exp2
  • ! exp1
  • ( exp1 )

Here is the corresponding CFG:

BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen

Here is a CFG for a language of very simple assignment statements (only statements that assign a boolean value to an identifier):

Stmt $\;\longrightarrow\;$ id assign BExp semicolon

We can "combine" the two grammars given above, and add two more rules to get a grammar that defines a language of (very simple) if-statements. In words, an if-statement is:

1. The word "if", followed by a boolean expression in parentheses, followed by a statement, or
2. The word "if", followed by a boolean expression in parentheses, followed by a statement, followed by the word "else", followed by a statement.

Here is the combined context-free grammar:

Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt
Stmt $\;\longrightarrow\;$ if lparen BExp rparen Stmt else Stmt
Stmt $\;\longrightarrow\;$ id assign BExp semicolon
BExp $\;\longrightarrow\;$ true
BExp $\;\longrightarrow\;$ false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen

### Self-Study #1

Write a context-free grammar for the language of very simple while loops (in which the loop body only contains one statement) by adding a new production with nonterminal Stmt on the left-hand side.
solution

The Language Defined by a CFG

The language defined by a context-free grammar is the set of strings (sequences of terminals) that can be derived from the start nonterminal. What does it mean to derive something?

• Start by setting the "current sequence" to be the start nonterminal.
• Repeat:
  • find a nonterminal $X$ in the current sequence;
  • find a production in the grammar with $X$ on the left (i.e., of the form $X \longrightarrow \alpha$, where $\alpha$ is either $\varepsilon$ (the empty string) or a sequence of terminals and/or nonterminals);
  • create a new "current sequence" in which $\alpha$ replaces the $X$ found above;
  until the current sequence contains no nonterminals.

Thus, we arrive either at $\varepsilon$ or at a string of terminals. That is how we derive a string in the language defined by a CFG.

Below is an example derivation, using the 4 productions for the grammar of arithmetic expressions given above. In this derivation, we use the actual lexemes instead of the token names (e.g., we use the symbol "-" instead of minus).

Exp $\Longrightarrow$ Exp - Exp $\Longrightarrow$ 1 - Exp $\Longrightarrow$ 1 - Exp / Exp $\Longrightarrow$ 1 - Exp / 2 $\Longrightarrow$ 1 - 4 / 2

And here is some useful notation:

• $\Longrightarrow$ means derives in one step
• $\stackrel{+}\Longrightarrow$ means derives in one or more steps
• $\stackrel{*}\Longrightarrow$ means derives in zero or more steps

So, given the above example, we could write: Exp $\stackrel{+}\Longrightarrow$ 1 - Exp / Exp.

A more formal definition of what it means for a CFG $G$ to define a language may be stated as follows:

$$\mathcal{L}(G) = \left\{ w \;\middle|\; S \stackrel{+}\Longrightarrow w \right\}$$

where

• $S$ is the start nonterminal of $G$
• $w$ is a sequence of terminals or $\varepsilon$

Leftmost and Rightmost Derivations

There are several kinds of derivations that are important. A derivation is a leftmost derivation if it is always the leftmost nonterminal that is chosen to be replaced.
It is a rightmost derivation if it is always the rightmost one.

Parse Trees

Another way to derive things using a context-free grammar is to construct a parse tree (also called a derivation tree) as follows:
• Start with the start nonterminal.
• Repeat:
  • choose a leaf nonterminal $X$
  • choose a production $X \longrightarrow \alpha$
  • the symbols in $\alpha$ become the children of $X$ in the tree
  until there are no more leaf nonterminals left.
The derived string is formed by reading the leaf nodes from left to right. Here is the example expression grammar given above:
Exp $\longrightarrow$ intliteral | Exp minus Exp | Exp divide Exp | lparen Exp rparen
and, using that grammar, here's a parse tree for the string 1 - 4 / 2:

Self-Study #2

Below is the CFG for very simple if-statements used earlier.
Stmt $\longrightarrow$ if lparen BExp rparen Stmt
Stmt $\longrightarrow$ if lparen BExp rparen Stmt else Stmt
Stmt $\longrightarrow$ id assign BExp semicolon
BExp $\longrightarrow$ true
BExp $\longrightarrow$ false
BExp $\longrightarrow$ BExp or BExp
BExp $\longrightarrow$ BExp and BExp
BExp $\longrightarrow$ not BExp
BExp $\longrightarrow$ lparen BExp rparen
Question 1: Give a derivation for the string: if (! true ) x = false; Is your derivation leftmost, rightmost, or neither?
Question 2: Give a parse tree for the same string.
solution

Contents

Overview

While our primary focus in this course is compiler development, we will necessarily touch on some aspects of programming-language development in the process. A programming language needs to define what constitutes valid syntax for the language (i.e., language definition). The compiler needs to determine whether a given stream of tokens belongs[1]

[1] During parsing, we're only interested in capturing the syntax (as opposed to the semantics) of the programming language. The syntax of a programming language is intuitively understood to be the structure of the code, whereas semantics is the meaning of the code.
to that language (i.e., language recognition). Conveniently, a sufficiently rigorous definition of the language can be used as the basis of a recognizer: if there is some way to produce a given string according to the definition of the language, then the string is necessarily part of the language.
Context-free grammars (CFGs) provide a promising basis for defining the syntax of a programming language. For one, the class of languages that CFGs recognize (the context-free languages) is expressive enough to capture popular syntactic constructs. At the same time, CFGs are (relatively) uniform and compact. As such, programmers can be expected to read a CFG-like definition of a language and recognize whether or not a given string of input tokens belongs to that language. For a compiler, however, performing language recognition (making a yes/no determination of whether a string belongs to a language) is not enough to complete syntactic analysis. In order for a context-free grammar to serve as a suitable definition for a programming language, there are a number of refinements that need to be considered. In this section, we will discuss some of these considerations.
Reality Check! We'll caveat the claim that "CFGs are a good way to define programming-language syntax" at the end of this reading, but it's worth noting that there are some compelling arguments to the contrary for practical purposes.
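As a concrete aside, a CFG like the combined if-statement grammar from earlier is often represented in code as a mapping from each nonterminal to its alternative right-hand sides. Below is a hedged Python sketch of that idea (the `GRAMMAR` name and string encoding are illustrative assumptions, not from the text):

```python
# The combined if-statement grammar, encoded as a mapping from each
# nonterminal to a list of alternative right-hand sides (token names
# and nonterminal names are both plain strings).
GRAMMAR = {
    "Stmt": [
        ["if", "lparen", "BExp", "rparen", "Stmt"],
        ["if", "lparen", "BExp", "rparen", "Stmt", "else", "Stmt"],
        ["id", "assign", "BExp", "semicolon"],
    ],
    "BExp": [
        ["true"], ["false"],
        ["BExp", "or", "BExp"],
        ["BExp", "and", "BExp"],
        ["not", "BExp"],
        ["lparen", "BExp", "rparen"],
    ],
}

# Any symbol that has no rules of its own is a terminal (a token name).
nonterminals = set(GRAMMAR)
terminals = {sym for alts in GRAMMAR.values() for alt in alts
             for sym in alt if sym not in nonterminals}

assert nonterminals == {"Stmt", "BExp"}
assert "semicolon" in terminals and "BExp" not in terminals
```

This representation makes the uniformity of CFGs tangible: tools can walk the mapping mechanically, which is exactly what parser generators do with a grammar file.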
Ambiguous Grammars

The following grammar (recalled from the prior reading) defines a simple language of mathematical expressions (with the productions numbered for reference):
P1: Exp $\longrightarrow$ intlit
P2: Exp $\longrightarrow$ Exp minus Exp
P3: Exp $\longrightarrow$ Exp divide Exp
P4: Exp $\longrightarrow$ lparen Exp rparen
Consider the string 1 - 4 / 2, which would be tokenized as intlit minus intlit divide intlit. It should be readily apparent to you that the string is in the language, and can be produced via the (leftmost) derivation sequence:
Step #1. Exp $\Longrightarrow$ Exp minus Exp (via production P2)
Step #2. $\;\;\;\Longrightarrow$ intlit minus Exp (via production P1)
Step #3. $\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P3)
Step #4. $\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P1)
Step #5. $\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)
This derivation sequence corresponds to the parse tree:
However, inspection of the grammar also reveals that there is another leftmost derivation that produces the string, by choosing to apply the rules differently:
Step #1. Exp $\Longrightarrow$ Exp divide Exp (via production P3)
Step #2. $\;\;\;\Longrightarrow$ Exp minus Exp divide Exp (via production P2)
Step #3. $\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P1)
Step #4. $\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P1)
Step #5. $\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P1)
Which corresponds to the parse tree:
If for grammar $G$ and string $w$ there is:
• more than one leftmost derivation of $w$, or
• more than one rightmost derivation of $w$, or
• more than one parse tree for $w$
then $G$ is called an ambiguous grammar. (Note: the three conditions given above are equivalent; if one is true then all three are true.)
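The two leftmost derivations above can be replayed mechanically, confirming that both production sequences derive the same token string. A hedged Python sketch (the encoding is an illustrative assumption):

```python
# Productions of the ambiguous expression grammar, keyed by the
# numbers used in the text.
PRODS = {
    "P1": ("Exp", ["intlit"]),
    "P2": ("Exp", ["Exp", "minus", "Exp"]),
    "P3": ("Exp", ["Exp", "divide", "Exp"]),
    "P4": ("Exp", ["lparen", "Exp", "rparen"]),
}

def apply_leftmost(seq, prod_name):
    """Replace the leftmost occurrence of the production's left-hand-side
    nonterminal (the only nonterminal here is Exp) with its right-hand side."""
    lhs, rhs = PRODS[prod_name]
    i = seq.index(lhs)
    return seq[:i] + rhs + seq[i + 1:]

def derive(prod_names):
    seq = ["Exp"]
    for p in prod_names:
        seq = apply_leftmost(seq, p)
    return seq

# The two leftmost derivations from the text:
d1 = derive(["P2", "P1", "P3", "P1", "P1"])  # ends up grouping 4 / 2
d2 = derive(["P3", "P2", "P1", "P1", "P1"])  # ends up grouping 1 - 4

# Different derivations (hence different parse trees), same token string:
assert d1 == d2 == ["intlit", "minus", "intlit", "divide", "intlit"]
```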
It is worth noting that ambiguity is not a problem for certain applications of context-free grammars. As discussed in the intro, language recognition is satisfied by finding one derivation for the string, and the string certainly maintains membership in the language if more than one derivation is possible. However, for the task of parsing, ambiguous grammars cause problems, both in the correctness of the parse tree and in the efficiency of building it. We will briefly explore both of these issues.
A problem of correctness: The major issue with ambiguous parse trees is that the underlying structure of the language is ill-defined. In the above example, subtraction and division can swap precedence; the first parse tree groups 4/2, while the second groups 1-4. These two groupings correspond to expressions with different values: the first tree corresponds to (1-(4/2)) and the second tree corresponds to ((1-4)/2).
A problem of efficiency: As we will see later, a parser can use significantly more efficient algorithms (in terms of Big-O complexity) by adding constraints to the language that it recognizes. One such assumption made by some parsing algorithms is that the grammar is unambiguous. To some extent, this issue can come up independently of correctness. For example, the following grammar has no issue of grouping (it only ever produces intlit), but production P1 can be applied any number of times before producing the integer-literal token:
P1: Exp $\longrightarrow$ Exp
P2: Exp $\longrightarrow$ intlit

Disambiguating Grammars

It's useful to know how to translate an ambiguous grammar into an equivalent unambiguous grammar. Since every programming language includes expressions, we will discuss these operations in the context of an expression language. We are particularly interested in forcing unambiguous precedence and associativity of the operators.
Precedence

To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("-" in our case), then write a rule for the operator(s) with the next-lowest precedence, etc.:
P1: Exp $\longrightarrow$ Exp minus Exp
P2: Exp $\longrightarrow$ Term
P3: Term $\longrightarrow$ Term divide Term
P4: Term $\longrightarrow$ Factor
P5: Factor $\longrightarrow$ intlit
P6: Factor $\longrightarrow$ lparen Exp rparen
Now let's try using these new rules to build parse trees for 1 - 4 / 2. First, a parse tree that correctly reflects the fact that division has higher precedence than subtraction:
Now we'll try to construct a parse tree that shows the wrong precedence:

Associativity

This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2)).

TEST YOURSELF #1

Draw two parse trees for the expression 5-3-2 using the current expression grammar:
exp → exp MINUS exp | term
term → term DIVIDE term | factor
factor → INTLITERAL | LPAREN exp RPAREN
One of your parse trees should correctly group 5-3, and the other should incorrectly group 3-2.
solution
To understand how to write expression grammars that correctly reflect the associativity of the operators, you need to understand recursion in grammars.
• A grammar is recursive in nonterminal $X$ if: $$X \stackrel{+}\Longrightarrow \dots\,X\,\dots$$ (in one or more steps, $X$ derives a sequence of symbols that includes an $X$).
• A grammar is left recursive in $X$ if: $$X \stackrel{+}\Longrightarrow X\,\dots$$ (in one or more steps, $X$ derives a sequence of symbols that starts with an $X$).
• A grammar is right recursive in $X$ if: $$X \stackrel{+}\Longrightarrow \dots\,X$$ (in one or more steps, $X$ derives a sequence of symbols that ends with an $X$).
The grammar given above for arithmetic expressions is both left and right recursive in nonterminals Exp and Term (try writing the derivation steps that show this).
To write a grammar that correctly expresses operator associativity:
• For left associativity, use left recursion.
• For right associativity, use right recursion.
Here's the correct grammar:
Exp $\longrightarrow$ Exp minus Term | Term
Term $\longrightarrow$ Term divide Factor | Factor
Factor $\longrightarrow$ intlit | lparen Exp rparen
And here's the (one and only) parse tree that can be built for 5 - 3 - 2 using this grammar:
Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token pow for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written 2 ** 3, and the corresponding sequence of tokens would be intlit pow intlit. Here's an ambiguous context-free grammar for this language:
Exp $\longrightarrow$ Exp plus Exp | Exp minus Exp | Exp times Exp | Exp divide Exp | Exp pow Exp | lparen Exp rparen | intlit
First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:
exp → exp PLUS exp | exp MINUS exp | term
term → term TIMES term | term DIVIDE term | factor
factor → factor POW factor | exponent
exponent → INTLITERAL | LPAREN exp RPAREN
This grammar is still ambiguous; it does not yet reflect the associativities of the operators.
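The earlier minus/divide grammar can be sanity-checked operationally with a small recursive-descent evaluator. This is a hedged sketch, not the text's algorithm: the left-recursive rules are rewritten as loops (a standard transformation), which naturally yields left-associative grouping, and the Exp/Term/Factor layering yields the intended precedence:

```python
# Evaluator for the unambiguous grammar:
#   Exp    -> Exp minus Term | Term
#   Term   -> Term divide Factor | Factor
#   Factor -> intlit | lparen Exp rparen
def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        assert tokens[pos] == expected
        pos += 1

    def exp():                      # Exp -> Exp minus Term | Term
        val = term()
        while peek() == "-":
            eat("-")
            val = val - term()      # fold left as we go: left-associative
        return val

    def term():                     # Term -> Term divide Factor | Factor
        val = factor()
        while peek() == "/":
            eat("/")
            val = val // factor()   # integer division, for illustration
        return val

    def factor():                   # Factor -> intlit | lparen Exp rparen
        nonlocal pos
        if peek() == "(":
            eat("(")
            val = exp()
            eat(")")
            return val
        val = int(tokens[pos])
        pos += 1
        return val

    return exp()

assert evaluate("5 - 3 - 2".split()) == 0    # (5-3)-2, not 5-(3-2)
assert evaluate("1 - 4 / 2".split()) == -1   # division binds tighter
```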
So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to 2**(3**4)):
exp → exp PLUS term | exp MINUS term | term
term → term TIMES factor | term DIVIDE factor | factor
factor → exponent POW factor | exponent
exponent → INTLITERAL | LPAREN exp RPAREN
Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., -3**4 is equivalent to (-3)**4, not to -(3**4)). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form x op y op z.
exp → exp PLUS term | exp MINUS term | term
term → term TIMES factor | term DIVIDE factor | factor
factor → exponent POW factor | exponent
exponent → MINUS exponent | final
final → INTLITERAL | LPAREN exp RPAREN

TEST YOURSELF #2

Below is the grammar we used earlier for the language of boolean expressions, with two possible operands (true and false) and three possible operators (and, or, not):
BExp $\longrightarrow$ true
BExp $\longrightarrow$ false
BExp $\longrightarrow$ BExp or BExp
BExp $\longrightarrow$ BExp and BExp
BExp $\longrightarrow$ not BExp
BExp $\longrightarrow$ lparen BExp rparen
Question 1: Add nonterminals so that or has lowest precedence, then and, then not. Then change the grammar to reflect the fact that both and and or are left associative.
Question 2: Draw a parse tree (using your final grammar for Question 1) for the expression: true and not true.
solution
Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms. For each form given below, we provide three different grammars that define the specified list language.
• One or more PLUSes (without any separator or terminator).
(Remember, any of the following three grammars defines this language; you don't need all three lines.)
1. xList $\longrightarrow$ PLUS | xList xList
2. xList $\longrightarrow$ PLUS | xList PLUS
3. xList $\longrightarrow$ PLUS | PLUS xList
• One or more PLUSes, separated by commas:
1. xList $\longrightarrow$ PLUS | xList COMMA xList
2. xList $\longrightarrow$ PLUS | xList COMMA PLUS
3. xList $\longrightarrow$ PLUS | PLUS COMMA xList
• One or more PLUSes, each PLUS terminated by a semi-colon:
1. xList $\longrightarrow$ PLUS SEMICOLON | xList xList
2. xList $\longrightarrow$ PLUS SEMICOLON | xList PLUS SEMICOLON
3. xList $\longrightarrow$ PLUS SEMICOLON | PLUS SEMICOLON xList
• Zero or more PLUSes (without any separator or terminator):
1. xList $\longrightarrow \varepsilon$ | PLUS | xList xList
2. xList $\longrightarrow \varepsilon$ | PLUS | xList PLUS
3. xList $\longrightarrow \varepsilon$ | PLUS | PLUS xList
• Zero or more PLUSes, each PLUS terminated by a semi-colon:
1. xList $\longrightarrow \varepsilon$ | PLUS SEMICOLON | xList xList
2. xList $\longrightarrow \varepsilon$ | PLUS SEMICOLON | xList PLUS SEMICOLON
3. xList $\longrightarrow \varepsilon$ | PLUS SEMICOLON | PLUS SEMICOLON xList
• The trickiest kind of list is a list of zero or more x's, separated by commas. To get it right, think of the definition as follows: either an empty list, or a non-empty list of x's separated by commas. We already know how to write a grammar for a non-empty list of x's separated by commas, so now it's easy to write the grammar:
xList $\longrightarrow \varepsilon$ | nonemptyList
nonemptyList $\longrightarrow$ PLUS | PLUS COMMA nonemptyList
Using CFGs for programming-language syntax has the notational advantage that the problem can be broken down into pieces.
For example, let us begin sketching out a simple scripting language that starts with productions for some top-level nonterminals:
Program $\longrightarrow$ FunctionList
FunctionList $\longrightarrow$ Function | Function FunctionList
With these three productions, the basic shape of the program is already taking form, and some important constraints are put in place:
• A program must be a list of functions (and nothing else); no statements may appear outside of a function (similar to C), nor may any variable declarations appear outside of a function (unlike C, which allows global variable declarations outside of functions).
• A program must contain at least one function, as the FunctionList cannot immediately derive $\varepsilon$.
We can now move on to specifying the syntax of a function. A Function is the word "function", optionally preceded by the word "public", followed by an identifier (signifying the name of the function), followed by an open curly brace, followed by the function body, followed by a closing curly brace:
Function $\longrightarrow$ public function id lcurly FunctionBody rcurly | function id lcurly FunctionBody rcurly
Note that the above definition is hardly unique. Consider this alternative, but equivalent, definition for a function:
Function $\longrightarrow$ Access function id lcurly FunctionBody rcurly
Access $\longrightarrow$ public | $\varepsilon$
Note that the syntax of the language does not provide a list of arguments that the function takes (many popular programming languages expect a function to have a pair of parentheses before the function body). A function body is a list of zero or more statements:
FunctionBody $\longrightarrow \varepsilon$ | StatementList
StatementList $\longrightarrow$ Statement | Statement StatementList
From here, we can define various statement types, and so on. A benefit of the context-free grammar is that a "sub-grammar" can be plugged in to allow complex parse trees to be grown at the nonterminal at which they are rooted.
For example, we could allow the rule:
Statement $\longrightarrow$ id equals Exp
(where equals is the symbol =, used here for assignment). This grammar can then make use of the expression grammar that we defined above to build parse trees with arbitrarily complex mathematical expressions on the right-hand side of an assignment statement.

Summary

To understand how a parser works, we start by understanding context-free grammars, which are used to define the language recognized by the parser. Important terminology includes:
• terminal symbol
• nonterminal symbol
• grammar rule (or production)
• derivation (leftmost derivation, rightmost derivation)
• parse (or derivation) tree
• the language defined by a grammar
• ambiguous grammar
Two common kinds of grammars are grammars for expression languages and grammars for lists. It is important to know how to write a grammar for an expression language that expresses operator precedence and associativity. It is also important to know how to write grammars for both non-empty and possibly empty lists, and for lists both with and without separators and terminators.

Contents

The goal of the parser is to produce output that can be consumed by the next phase of the compiler. Thus far, we have considered using a context-free grammar as a convenient mechanism for defining what strings are syntactically valid programs in the language, and we have discussed techniques for altering the context-free grammar such that the parse trees for those strings match the intended structure of the program. In summary, we've explored how to specify parse trees for a given token stream. However, the parse tree is a relatively cumbersome data structure to work with, made worse by the very alterations we introduced to aid in parsing. For our purposes, we will consider the parse tree to be a purely transitional data structure, useful for structuring the token stream, and then consumed to create the final output to be forwarded on to semantic analysis.
This reading introduces a method for specifying how to derive a result from a parse tree. This specification is known as a syntax-directed definition (SDD), while the entire technique is known as syntax-directed translation (SDT). The general technique of SDT is applicable to a variety of parsing tasks. To introduce SDD, we will explore several examples of mapping parse trees to a variety of different results, but the application we are ultimately interested in for compilers is to translate a parse tree into an AST. The name "syntax-directed" is apt, as the key feature of a syntax-directed definition is that it intermingles the translation rules with the definition of the syntax itself: we translate from a sequence of tokens to some other form based on the underlying syntax.
A syntax-directed translation is defined by augmenting the CFG: a translation rule is defined for each production. A translation rule defines the translation of the left-hand-side nonterminal as a function of:
• constants
• the right-hand-side nonterminals' translations
• the right-hand-side tokens' values (e.g., the integer value associated with an INTLITERAL token, or the String value associated with an ID token)
To translate an input string:
1. Build the parse tree.
2. Use the translation rules to compute the translation of each nonterminal in the tree, working bottom-up (since a nonterminal's value may depend on the values of the symbols on the right-hand side, you need to work bottom-up so that those values are available).
The translation of the string is the translation of the parse tree's root nonterminal.

Example 1

Below is the definition of a syntax-directed translation that translates an arithmetic expression to its integer value. When a nonterminal occurs more than once in a grammar rule, the corresponding translation rule uses subscripts to identify a particular instance of that nonterminal.
For example, the rule Exp $\longrightarrow$ Exp plus Term has two Exp nonterminals; Exp1 means the left-hand-side Exp, and Exp2 means the right-hand-side Exp. Also, the notation xxx.value is used to mean the value associated with token xxx, and the notation YYY.trans is used to mean the translation associated with nonterminal YYY.
CFG Production and Translation Rule:
• Exp $\longrightarrow$ Exp plus Term: Exp1.trans = Exp2.trans $+$ Term.trans
• Exp $\longrightarrow$ Term: Exp.trans = Term.trans
• Term $\longrightarrow$ Term times Factor: Term1.trans = Term2.trans $\times$ Factor.trans
• Term $\longrightarrow$ Factor: Term.trans = Factor.trans
• Factor $\longrightarrow$ intlit: Factor.trans = intlit.value
• Factor $\longrightarrow$ lparens Exp rparens: Factor.trans = Exp.trans
Consider evaluating these rules on the input 2 * (4 + 5). The result is the following annotated parse tree:

Example 2

Consider a language of expressions that use the three operators +, &&, and ==, using the terminal symbols plus, and, and equals, respectively. Integer literals are represented by the same intlit token we've used before, and true and false represent the literals true and false (note that we could just as well have defined a single boollit token that the scanner would populate with either $true$ or $false$). Let's define a syntax-directed translation that type checks these expressions; i.e., for type-correct expressions, the translation will be the type of the expression (either $int$ or $bool$), and for expressions that involve type errors, the translation will be the special value $error$. We'll use the following type rules:
1. Both operands of the + operator must be of type $int$.
2. Both operands of the && operator must be of type $bool$.
3. Both operands of the == operator must have the same (non-$error$) type.
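These three rules translate almost directly into code. Below is a hedged sketch of a bottom-up checker over hand-built parse-tree fragments; the tuple node encoding is an illustrative assumption, not from the text:

```python
# Compute each node's .trans bottom-up. Expression nodes are encoded as
# (op, left, right) tuples; Python bools and ints stand in for the
# true/false and intlit leaves.
INT, BOOL, ERROR = "int", "bool", "error"

def trans(node):
    if isinstance(node, tuple):
        op, left, right = node
        lt, rt = trans(left), trans(right)   # children first: bottom-up
        if op == "+":
            return INT if lt == rt == INT else ERROR
        if op == "&&":
            return BOOL if lt == rt == BOOL else ERROR
        if op == "==":
            return BOOL if lt == rt and lt != ERROR else ERROR
    if isinstance(node, bool):   # check bool before int: True is an int too
        return BOOL
    return INT                   # an integer literal

assert trans(("==", ("+", 2, 2), 4)) == BOOL   # (2 + 2) == 4 is type-correct
assert trans(("&&", 2, True)) == ERROR          # violates rule 2
```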
Here is the CFG and the translation rules:
• Exp $\longrightarrow$ Exp plus Term: if (Exp2.trans == $int$) and (Term.trans == $int$) then Exp1.trans = $int$; else Exp1.trans = $error$
• Exp $\longrightarrow$ Exp and Term: if (Exp2.trans == $bool$) and (Term.trans == $bool$) then Exp1.trans = $bool$; else Exp1.trans = $error$
• Exp $\longrightarrow$ Exp equals Term: if (Exp2.trans == Term.trans) and (Term.trans $\neq$ $error$) then Exp1.trans = $bool$; else Exp1.trans = $error$
• Exp $\longrightarrow$ Term: Exp.trans = Term.trans
• Term $\longrightarrow$ true: Term.trans = $bool$
• Term $\longrightarrow$ false: Term.trans = $bool$
• Term $\longrightarrow$ intlit: Term.trans = $int$
• Term $\longrightarrow$ lparens Exp rparens: Term.trans = Exp.trans
Here's an annotated parse tree for the input (2 + 2) == 4:

Self-Study #1

The following grammar defines the language of base-2 numbers:
B $\longrightarrow$ 0
B $\longrightarrow$ 1
B $\longrightarrow$ B 0
B $\longrightarrow$ B 1
Question #1: Define a syntax-directed translation so that the translation of a binary number is its base-10 value.
Question #2: Illustrate your translation scheme by drawing the parse tree for 1001 and annotating each nonterminal in the tree with its translation.
solution

Contents

(Part of Syntactic-Analysis)

Topic Introduction

While syntax-directed definition serves as a general technique for specifying translations of trees for many purposes, one particular application of SDD is of special interest to compiler writers: inducing an abstract-syntax tree (AST) from the parse tree. The AST can be thought of as a condensed form of the parse tree, eliding incidental syntactic details of the program input. Students are occasionally surprised that the conceptual workflow of the compiler is to build the parse tree only to immediately discard it in favor of a different tree data structure.
In this chapter we will discuss what the AST is, how the AST differs from the parse tree, and how an SDD may be specified to produce an AST from a parse tree. Hopefully, this treatment communicates why the parse tree is a useful part of syntactic analysis, as well as why the AST is a superior internal representation for later phases of the compiler. It's also worth mentioning that in practice, the parse tree and AST are constructed at the same time in a memory-efficient way that keeps much of the parse tree implicit.

The AST vs the Parse Tree

First, let's identify the ways in which an abstract-syntax tree (AST) differs from a parse tree:
• Operators appear at internal nodes instead of at leaves.
• "Chains" of single productions are collapsed.
• Lists are "flattened".
• Syntactic details (e.g., parentheses, commas, semi-colons) are omitted.
In general, the AST is a better structure for later stages of the compiler because it omits details having to do with the source language, and just contains information about the essential structure of the program. Below is an example of the parse tree and the AST for the expression 3 * (4 + 2), using the usual arithmetic-expression grammar that reflects the precedences and associativities of the operators. Note that the parentheses are not needed in the AST because the structure of the AST defines how the subexpressions are grouped.
It's worth noting a couple of relationships between the AST and the parse tree:
1. The grouping symbols are necessary to build the particular parse tree in this example (since otherwise the higher precedence of multiplication would have grouped the 3 with the 4), and the grouping symbols must be kept as leaves of the parse tree in order to fulfill the property that reading the frontier of the parse tree yields the complete token stream. However, the grouping is implicit in the AST via the parent-child relationship of AST nodes.
2.
Because the abstract syntax tree omits syntactic details, many parse trees may yield the same abstract syntax tree. For example, the following input strings would have different parse trees but the same AST as the one above:
• (3 * (4 + 2))
• (3) * ((4) + (2))
• 3 * ((4 + 2))
The many-to-one nature of parse trees to ASTs is a feature, and it is indicative of the AST being a more condensed representation of the program. It also indicates that some parse tree could be "unparsed" from an AST, but it might not be the original parse tree.

ASTs for Program Constructs

For constructs other than expressions, the compiler writer has some choices when defining the AST. But remember that lists (e.g., lists of declarations, lists of statements, lists of parameters) should be flattened, that operators (e.g., "assign", "while", "if") go at internal nodes, not at leaves, and that syntactic details are omitted. For example:
Input
=====
{
  x = 0;
  while (x<10) {
    x = x+1;
  }
  y = x*2;
}
Parse Tree:
Abstract Syntax Tree:
Note that in the AST there is just one stmtList node, with a list of three children (the list of statements has been "flattened"). Also, the "operators" for the statements (assign and while) have been "moved up" to internal nodes (instead of appearing as tokens at the leaves). And finally, syntactic details (curly braces and semi-colons) have been omitted.

Translation Rules That Build an AST

To define a syntax-directed translation so that the translation of an input is the corresponding AST, we first need operations that create AST nodes. Let's use some object-oriented pseudocode, and assume that we have the following class hierarchy:
class ExpNode { }
class IntLitNode extends ExpNode {
    public IntLitNode(int val) {...}
}
class PlusNode extends ExpNode {
    public PlusNode( ExpNode e1, ExpNode e2 ) { ... }
}
class TimesNode extends ExpNode {
    public TimesNode( ExpNode e1, ExpNode e2 ) { ...
    }
}
Now we can define a syntax-directed translation for simple arithmetic expressions, so that the translation of an expression is its AST:
CFG Production and Translation Rule:
• exp → exp PLUS term: exp1.trans = new PlusNode(exp2.trans, term.trans)
• exp → term: exp.trans = term.trans
• term → term TIMES factor: term1.trans = new TimesNode(term2.trans, factor.trans)
• term → factor: term.trans = factor.trans
• factor → INTLITERAL: factor.trans = new IntLitNode(INTLITERAL.value)
• factor → LPARENS exp RPARENS: factor.trans = exp.trans

Self-Study

Illustrate the syntax-directed translation defined above by drawing the parse tree for the expression 2 + 3 * 4, and annotating the parse tree with its translation (i.e., each nonterminal in the tree will have a pointer to the AST node that is the root of the subtree of the AST that is the nonterminal's translation).
solution

Overview

Given a CFG and a string (a sequence of tokens), the goal of parsing is to determine whether the string is in the language of the CFG. The answer is yes iff the string can be derived from the CFG's start nonterminal. Equivalently, the answer is yes iff we can build a parse tree for the string, with the CFG's start nonterminal at the root.
There are a number of algorithms that can answer this question for any CFG, even ambiguous CFGs. Their worst-case running time is $O(N^3)$, where N is the length of the input (the number of tokens in the input string). Some algorithms have better running times for non-ambiguous CFGs. For example, there is an algorithm called Earley's algorithm that has worst-case runtime $O(N^2)$ for non-ambiguous grammars, but even a quadratic-time algorithm may be considered too slow. Fortunately, there are classes of grammars for which $O(N)$ parsers can be built (and given a grammar, we can quickly test whether it is in such a class).
Two such classes are the LL(1) grammars and the LALR(1) grammars. LALR(1) grammars are:
• More general than LL(1) grammars (every LL(1) grammar is also LALR(1), but not vice versa).
• The class of grammars accepted by the parser generators yacc, bison, and Java Cup.
• Parsed by bottom-up parsers (they construct the derivation tree by going from terminals up to nonterminals, then attaching several nonterminals together to form a new nonterminal, etc., ending with just the start nonterminal).
• Harder to understand than LL(1) parsers (i.e., it is hard to understand how the LALR(1) parser works, and what makes a grammar LALR(1)).

Sample Parsers

In this class, we will learn about three parsing techniques:
1. A general, bottom-up parsing algorithm called the CYK algorithm.
2. A technique for parsing LL(1) grammars top-down, called predictive parsing.
3. A technique for bottom-up parsing that works for LALR(1) grammars; but since that is rather complicated, we will study a version of the algorithm that only works for a subset of the LALR(1) grammars, called SLR(1) grammars.

Contents

Overview

Given a CFG and a string (a sequence of tokens), the goal of parsing is to determine whether the string is in the language of the CFG. The answer is yes iff the string can be derived from the CFG's start nonterminal. Equivalently, the answer is yes iff we can build a parse tree for the string, with the CFG's start nonterminal at the root.
There are a number of algorithms that can answer this question for any CFG, even ambiguous CFGs. Their worst-case running time is $O(N^3)$, where N is the length of the input (the number of tokens in the input string). Some algorithms have better running times for non-ambiguous CFGs. For example, there is an algorithm called Earley's algorithm that has worst-case runtime $O(N^2)$ for non-ambiguous grammars. However, many of these algorithms are rather complicated, so we will look at one that is not so complicated, but always has $O(N^3)$ runtime.
This algorithm is called the CYK algorithm, for its inventors: Cocke, Younger, and Kasami.

The CYK Algorithm

The CYK algorithm can be used to parse any CFG (i.e., to determine, for any input, whether that input is in the language of a given CFG). It essentially builds all of the possible parse trees for all (consecutive) fragments of the input, bottom-up. It accepts the input iff it is able to build a complete parse tree: one with the CFG's start nonterminal at the root, and the entire input at the leaves.

Chomsky Normal Form

The CYK algorithm requires that the CFG be in CNF: Chomsky normal form. This means that the CFG rules must all be in one of the following forms:
1. X $\longrightarrow$ t
2. X $\longrightarrow$ A B
where t is a terminal symbol, and A and B are nonterminal symbols. Also, grammar rules of the form X $\longrightarrow \varepsilon$ are not allowed, unless the left nonterminal is the start nonterminal. If there is a rule S $\longrightarrow \varepsilon$ (where S is the start nonterminal), then S cannot occur on the right-hand side of any rule.

Conversion to CNF

Fortunately, it is easy to convert any CFG to CNF. The conversion involves the following steps:
1. Eliminate $\varepsilon$ rules.
2. Eliminate unit rules (rules with exactly one nonterminal on the right).
3. Fix the remaining rules so that all rules have either a single terminal or exactly two nonterminals on the right.

Example: Conversion to CNF

We'll start with a very simple grammar for function calls (the function name, followed by a list of zero or more comma-separated arguments, where each argument can only be an identifier).
F $\longrightarrow$ id ( A )
A $\longrightarrow \varepsilon$
A $\longrightarrow$ N
N $\longrightarrow$ id
N $\longrightarrow$ id , N
Step 1 (eliminate $\varepsilon$ rules): This CFG has one rule with $\varepsilon$ on the right-hand side (the second rule). That rule has A on the left-hand side.
We'll remove the \varepsilon-rule by copying the rule that has A on the right (the first rule in the CFG above), and replacing that A with nothing (\varepsilon):

 F \longrightarrow id ( )

Now we have this CFG:

 F \longrightarrow id ( A )
 F \longrightarrow id ( )
 A \longrightarrow N
 N \longrightarrow id
 N \longrightarrow id , N

Note that if A occurred more than once on the right of some rule, we would have to make as many copies as needed to get all combinations of A being there and being replaced by \varepsilon. For example, if we had this rule:

 X \longrightarrow A x A y A

we'd have to add all of the following new rules, as well as keeping the original rule:

 X \longrightarrow A x A y
 X \longrightarrow A x y A
 X \longrightarrow A x y
 X \longrightarrow x A y A
 X \longrightarrow x A y
 X \longrightarrow x y A
 X \longrightarrow x y

Furthermore, if we have a rule

 S \longrightarrow \varepsilon

where S is the start nonterminal, and there is a grammar rule with S on the right, then we need to do the following:

• Add a new start nonterminal S'
• Add the new rules

 S' \longrightarrow S
 S' \longrightarrow \varepsilon

• Remove the rule

 S \longrightarrow \varepsilon

### Step 2 (eliminate unit rules):

Our current CFG looks like this:

 F \longrightarrow id ( A )
 F \longrightarrow id ( )
 A \longrightarrow N
 N \longrightarrow id
 N \longrightarrow id , N

It includes one unit rule, a rule with exactly 1 nonterminal (and no terminals) on the right:

 A \longrightarrow N

Because there are no other rules with A on the left, we can remove the unit rule by replacing every occurrence of A on the right side of some rule with N. (If there were other rules with A on the left, we would instead make copies of the rules with A on the right, and replace A with N in the copies.)
Here's our CFG after removing the unit rule:

 F \longrightarrow id ( N )
 F \longrightarrow id ( )
 N \longrightarrow id
 N \longrightarrow id , N

### Step 3 (fix remaining rules):

The last step in conversion to CNF is to fix the remaining rules so that each one has either a single terminal, or exactly two nonterminals, on the right.

#### Handle rules with terminals on the right

First, we find the rules that have a terminal plus some other symbols on the right:

 F \longrightarrow id ( N )
 F \longrightarrow id ( )
 N \longrightarrow id , N

For each such terminal t, we add a new rule

 X \longrightarrow t

Then we replace t with X in the original rules. Here's the result (including the one rule that was already in the correct form):

 F \longrightarrow I L N R
 F \longrightarrow I L R
 N \longrightarrow id
 N \longrightarrow I C N
 I \longrightarrow id
 L \longrightarrow (
 R \longrightarrow )
 C \longrightarrow ,

#### Handle rules with more than 2 nonterminals on the right

Now, for every right-hand side that has more than two nonterminals, we replace all but the first nonterminal on the right with a new nonterminal, and we add a new rule for the new nonterminal. For example, we handle the rule

 F \longrightarrow I L N R

as follows:

 F \longrightarrow I W
 W \longrightarrow L N R

Then we handle the new rule (with W on the left) as follows:

 W \longrightarrow L X
 X \longrightarrow N R

If we take care of the other two "problem" rules:

 F \longrightarrow I L R
 N \longrightarrow I C N

we finish with the following CFG (in CNF):

 F \longrightarrow I W
 F \longrightarrow I Y
 W \longrightarrow L X
 X \longrightarrow N R
 Y \longrightarrow L R
 N \longrightarrow id
 N \longrightarrow I Z
 Z \longrightarrow C N
 I \longrightarrow id
 L \longrightarrow (
 R \longrightarrow )
 C \longrightarrow ,

CYK Self-Study #1

The CFG below is for a grammar of very simple statements. The terminal symbols are id, =, (, ), ++, and read. Convert this CFG to CNF.
 S \longrightarrow id = id
 S \longrightarrow id ( )
 S \longrightarrow id ++
 S \longrightarrow read ( id )
 S \longrightarrow S S

solution

Parsing an Input

Now that we have converted our CFG to Chomsky normal form, we can use it to try to parse an input. Let's use the input f(x,y) as an example. The actual sequence of tokens for that input would be:

 id ( id , id )

To parse an input of length N, we use an N-by-N grid. First, we fill in the squares along the diagonal. The kth square contains the nonterminal(s) X such that there is a CFG rule X \longrightarrow t and t is the kth token. Here's the 6-by-6 grid for our example, with the diagonal filled in (and with the rows and columns numbered):

Now we fill in each successive diagonal above the one that's already filled in. When filling in a square, we consider what's already in the squares to the left of it and below it. For example, to fill in the square in position 2,1, we look at the contents of the square to its left (I, N) and the square below it (L). If there were a grammar rule with "I L" or with "N L" on the right-hand side, we would put the left-side nonterminal into the square in position 2,1. However, for our example, there are no such grammar rules, so square 2,1 stays empty.

The first new square that we can fill in is in position 5,4. That square has C to its left, and I, N below it. There is a grammar rule Z \longrightarrow C N, so we put a Z in square 5,4. If there had also been a grammar rule H \longrightarrow C I, we would also have put an H in square 5,4. Note that we also fill in square 5,5 due to the rule X \longrightarrow N R. Here is the grid so far:

The squares in the next diagonal each have two squares to their left and two squares below. We look at pairs of squares: first the square all the way to the left and the square immediately below, then the square immediately to the left and the one two squares down. If either square in the current pair is empty, we go on to the next pair.
Each time we look at a pair of non-empty squares, we are looking for a nonterminal A in the square to the left and a nonterminal B in the square that is below, such that there is a grammar rule with A B on the right. If there is such a grammar rule, we fill in the current square with the left-side nonterminal. For example, when trying to fill in the square in position 3,5, we see I, N in position 3,3, and Z in position 4,5. There is a grammar rule N \longrightarrow I Z, so we put an N in position 3,5. Here is the result of filling in the whole grid:

To understand this process, remember that the CYK algorithm works by building up parse trees for longer and longer fragments of the input. Each box in the upper-right part of the grid represents the root of one or more parse trees. The nonterminal(s) in a square is/are the root nonterminal(s) of those trees. The children of the root are the nonterminals in the pairs of squares, one to the left and one below, that caused the nonterminal to be filled in. The five nonterminals in the upper-right part of our grid represent these five parse trees:

CYK Self-Study #2

Use the statement CFG that was converted to CNF for CYK Self-Study #1 to parse the input x ++ y = z y ++ (i.e., build and fill in the grid). solution

Accepting an Input

The CYK algorithm discovers that an input is in the language of a CFG iff the CFG's start nonterminal appears in the upper-right corner of the grid. For our example, F is the start nonterminal, and it does appear in square 1,6, so the example input is in the language of the CFG. The rightmost parse tree shown above is the (unique) parse tree for the example input.

CYK Self-Study #3

Is the input x ++ y = z y ++ in the language of the simple statement grammar? (Justify your answer.) solution

Contents

Here's a very simple example, using a grammar that defines the language of balanced parentheses or square brackets, and running the parser on the input "( [ ] ) EOF".
Note that in the examples on this page we will sometimes name terminal symbols using single characters (such as: (, ), [, and ]) instead of the token names (lparen, rparen, etc). Also note that in the picture, the top of the stack is to the left.

Grammar:

 S \longrightarrow \varepsilon \;|\; ( S ) \;|\; [ S ]

Parse Table:

|   | ( | ) | [ | ] | EOF |
|---|---|---|---|---|---|
| S | ( S ) | \varepsilon | [ S ] | \varepsilon | \varepsilon |

| Input seen so far | Stack | Action |
|---|---|---|
| ( | S EOF | pop, push "( S )" |
| ( | ( S ) EOF | pop, scan |
| ([ | S ) EOF | pop, push "[ S ]" |
| ([ | [ S ] ) EOF | pop, scan |
| ([] | S ] ) EOF | pop, push nothing |
| ([] | ] ) EOF | pop, scan |
| ([]) | ) EOF | pop, scan |
| ([]) EOF | EOF | pop, scan |
| ([]) EOF |  | empty stack: input accepted |

Remember, it is not always possible to build a predictive parser given a CFG; it is only possible if the CFG is LL(1). For example, the following grammar is not LL(1) (but it is LL(2)):

 S \longrightarrow ( S ) | [ S ] | ( ) | [ ]

If we try to parse an input that starts with a left paren, we are in trouble! We don't know whether to choose the first production, S \longrightarrow ( S ), or the third one, S \longrightarrow ( ). If the next token is a right paren, we want to push "( )". If the next token is a left paren, we want to push the three symbols "( S )". So here we need two tokens of look-ahead.

Self-Study #1

Draw a picture like the one given above to illustrate what the parser for the grammar:

 S \longrightarrow \varepsilon | ( S ) | [ S ]

does on the input: "[[]]". solution

Grammar Transformations

We need to answer two important questions:

1. How do we tell whether a grammar is LL(1)?
2. How do we build the parse (or selector) table for a predictive parser, given an LL(1) grammar?

It turns out that there is really just one answer: if we build the parse table and no element of the table contains more than one grammar rule right-hand side, then the grammar is LL(1).
Before saying how to build the table, we will consider two properties that preclude a context-free grammar from being LL(1): left-recursive grammars and grammars that are not left factored. We will also consider some transformations that can be applied to such grammars to make them LL(1).

First, we will introduce one new definition: A nonterminal X is useless if either:

1. You can't derive a sequence of symbols that includes X (starting from the start nonterminal), or
2. You can't derive a string from X (where "string" means \varepsilon or a sequence of terminals).

Here are some examples of useless nonterminals:

For case 1:

 S \longrightarrow A B
 A \longrightarrow + | - | \varepsilon
 B \longrightarrow digit | B digit
 C \longrightarrow . B

(No rule's right-hand side mentions C, so C can never appear in any derivation from S.)

For case 2:

 S \longrightarrow X | Y
 X \longrightarrow ( )
 Y \longrightarrow ( Y Y )

Y can never derive a string of terminals, since every use of Y's rule just introduces more Ys, so Y is useless.

From now on, "context-free grammar" means a grammar without useless nonterminals.

Left Recursion

• A grammar G is recursive in a nonterminal X if X can derive a sequence of symbols that includes X, in one or more steps:$$X \stackrel{+}\Longrightarrow \alpha X \beta$$where \alpha and \beta are arbitrary sequences of symbols.
• G is left recursive in nonterminal X if X can derive a sequence of symbols that starts with X, in one or more steps:$$X \stackrel{+}\Longrightarrow X \beta$$where \beta is an arbitrary sequence of symbols.
• G is immediately left recursive in nonterminal X if X can derive a sequence of symbols that starts with X in one step:$$X \Longrightarrow X \beta$$i.e., the grammar includes the production X \longrightarrow X \beta.

In general, it is not a problem for a grammar to be recursive. However, if a grammar is left recursive, it is not LL(1). Fortunately, we can change a grammar to remove immediate left recursion without changing the language of the grammar.
Here is how to do the transformation. Given two productions of the form:

 A \longrightarrow A \alpha | \beta

where:

• A is a nonterminal,
• \alpha is a sequence of terminals and/or nonterminals, and
• \beta is a sequence of terminals and/or nonterminals not starting with A,

replace those two productions with the following three productions:

 A \longrightarrow \beta A'
 A' \longrightarrow \alpha A' | \varepsilon

where A' is a new nonterminal.

Using this rule, we create a new grammar from a grammar with immediate left recursion. The new grammar is equivalent to the original one; i.e., the two grammars derive exactly the same sets of strings, but the new one is not immediately left recursive (and so has a chance of being LL(1)).

To illustrate why the new grammar is equivalent to the original one, consider the parse trees that can be built using the original grammar:

etc. Note that the derived strings are:

• \beta
• \beta \alpha
• \beta \alpha \alpha
• ...

That is, they are of the form "\beta, followed by zero or more \alphas". The new grammar derives the same set of strings, but the parse trees have a different shape (the single \beta is derived right away, and then the zero or more \alphas):

etc. Consider, for instance, the grammar for arithmetic expressions involving only subtraction:

 Exp \longrightarrow Exp minus Factor | Factor
 Factor \longrightarrow intliteral | ( Exp )

Notice that the first rule (Exp \longrightarrow Exp minus Factor) has immediate left recursion, so this grammar is not LL(1). (For example, if the first token is intliteral, you don't know whether to choose the production Exp \longrightarrow Exp minus Factor or Exp \longrightarrow Factor. If the next token is minus, then you should choose Exp \longrightarrow Exp minus Factor, but if the next token is EOF, then you should choose Exp \longrightarrow Factor.)
Using the transformation defined above, we remove the immediate left recursion, producing the following new grammar:

 Exp \longrightarrow Factor Exp'
 Exp' \longrightarrow minus Factor Exp' \;|\; \varepsilon
 Factor \longrightarrow intliteral | ( Exp )

Let's consider what the predictive parser built using this grammar does when the input starts with an integer:

• The predictive parser starts by pushing EOF, then Exp onto the stack. Regardless of what the first token is, there is only one production with Exp on the left-hand side, so it will always pop the Exp from the stack and push Factor Exp' as its first action.
• Now the top-of-stack symbol is the nonterminal Factor. Since the input is the intliteral token (not the ( token), it will pop the Factor and push intliteral.
• The top-of-stack symbol is now a terminal (intliteral), which does match the current token, so the stack will be popped, and the scanner called to get the next token.
• Now the top-of-stack symbol is the nonterminal Exp'. We'll consider two cases:

1. The next token is minus. In this case, we pop Exp' and push minus Factor Exp'.
2. The next token is EOF. In this case, we pop Exp' and push \varepsilon (i.e., push nothing).

So with the new grammar, the parser is able to tell (using only one token of look-ahead) what action to perform.

Unfortunately, there is a major disadvantage of the new grammar, too. Consider the parse trees for the string 2 - 3 - 4 for the old and the new grammars:

Before eliminating left recursion:

After eliminating left recursion:

The original parse tree shows the underlying structure of the expression; in particular, it groups 2 - 3 in one subtree to reflect the fact that subtraction is left associative. The parse tree for the new grammar is a mess! Its subtrees don't correspond to the sub-expressions of 2 - 3 - 4 at all!
Fortunately, we can design a predictive parser to create an abstract-syntax tree that does correctly reflect the structure of the parsed code, even though the grammar's parse trees do not.

Note that the rule for removing immediate left recursion given above only handled a somewhat restricted case, where there was only one left-recursive production. Here's a more general rule for removing immediate left recursion:

• For every set of productions of the form:$$A \longrightarrow A \alpha_1 \; | \; A \alpha_2 \; | \ldots | \; A \alpha_m \; | \; \beta_1 \; | \ldots | \; \beta_n$$
• Replace them with the following productions:$$A \longrightarrow \beta_1 A' \; | \; \beta_2 A' \; | \; \ldots \; | \; \beta_n A'$$$$A' \longrightarrow \alpha_1 A' | \ldots | \alpha_m A' | \varepsilon$$

Note also that there are rules for removing non-immediate left recursion; for example, you can read about how to do that in the compiler textbook by Aho, Sethi & Ullman, on page 177. However, we will not discuss that issue here.

Left Factoring

A second property that precludes a grammar from being LL(1) is if it is not left factored, i.e., if a nonterminal has two productions whose right-hand sides have a common prefix. For example, the following grammar is not left factored:

 Exp \longrightarrow ( Exp ) | ( )

In this example, the common prefix is "(".
This problem is solved by left factoring, as follows:

• Given a pair of productions of the form:

 A \longrightarrow \alpha \beta_1 | \alpha \beta_2

where \alpha is a sequence of terminals and/or nonterminals, and \beta_1 and \beta_2 are sequences of terminals and/or nonterminals that do not have a common prefix (and one of the betas could be \varepsilon),

• Replace those two productions with:

 A \longrightarrow \alpha A'
 A' \longrightarrow \beta_1 | \beta_2

For example, consider the following productions:

 Exp \longrightarrow ( Exp ) \;|\; ( )

Using the rule defined above, they are replaced by:

 Exp \longrightarrow ( Exp'
 Exp' \longrightarrow Exp ) \;|\; )

Here's the more general algorithm for left factoring (when there may be more than two productions with a common prefix). For each nonterminal A:

• Find the longest non-empty prefix \alpha that is common to 2 or more production right-hand sides.
• Replace the productions:

 A \longrightarrow \alpha \beta_1 | \ldots | \alpha \beta_m | \gamma_1 | \ldots | \gamma_n

with:

 A \longrightarrow \alpha A' | \gamma_1 | \ldots | \gamma_n
 A' \longrightarrow \beta_1 | \ldots | \beta_m

Repeat this process until no nonterminal has two productions with a common prefix.

Note that this transformation (like the one for removing immediate left recursion) has the disadvantage of making the grammar much harder to understand. However, it is necessary if you need an LL(1) grammar.
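The immediate left-recursion removal rule can be sketched in code. The Python function below is illustrative only (the tuple encoding of right-hand sides, and all names, are my own); it applies the A \longrightarrow A \alpha | \beta transformation for one nonterminal:

```python
def remove_immediate_left_recursion(nonterm, rhss):
    """Apply the  A -> A alpha | beta  transformation.

    `rhss` is a list of right-hand sides (tuples of symbols) for
    `nonterm`; the empty tuple () stands for epsilon.  Returns a dict
    mapping nonterminals (the original plus a new primed one) to their
    new lists of right-hand sides.
    """
    recursive = [rhs[1:] for rhs in rhss if rhs and rhs[0] == nonterm]
    others = [rhs for rhs in rhss if not rhs or rhs[0] != nonterm]
    if not recursive:                   # no immediate left recursion
        return {nonterm: rhss}
    primed = nonterm + "'"              # the new nonterminal A'
    return {nonterm: [beta + (primed,) for beta in others],
            primed: [alpha + (primed,) for alpha in recursive] + [()]}

# Example:  Exp -> Exp minus Factor | Factor   becomes
#           Exp -> Factor Exp'   and   Exp' -> minus Factor Exp' | epsilon
new_rules = remove_immediate_left_recursion(
    "Exp", [("Exp", "minus", "Factor"), ("Factor",)])
```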
Here's an example that demonstrates both left factoring and immediate left-recursion removal:

• The original grammar is:

 Exp \longrightarrow ( Exp ) \;|\; Exp Exp \;|\; ( )

• After removing immediate left recursion, the grammar becomes:

 Exp \longrightarrow ( Exp ) Exp' \;|\; ( ) Exp'
 Exp' \longrightarrow Exp Exp' \;|\; \varepsilon

• After left factoring, this new grammar becomes:

 Exp \longrightarrow ( Exp''
 Exp'' \longrightarrow Exp ) Exp' \;|\; ) Exp'
 Exp' \longrightarrow Exp Exp' \;|\; \varepsilon

TEST YOURSELF #2

Using the same grammar: exp ::= ( exp ) | exp exp | ( ), do left factoring first, then remove immediate left recursion. solution

FIRST and FOLLOW Sets

Recall: A predictive parser can only be built for an LL(1) grammar. A grammar is not LL(1) if it is:

• left recursive, or
• not left factored.

However, grammars that are not left recursive and are left factored may still not be LL(1). As mentioned earlier, to see if a grammar is LL(1), we try building the parse table for the predictive parser. If any element in the table contains more than one grammar rule right-hand side, then the grammar is not LL(1). To build the table, we must compute FIRST and FOLLOW sets for the grammar.

FIRST Sets

Ultimately, we want to define FIRST sets for the right-hand sides of each of the grammar's productions. To do that, we define FIRST sets for arbitrary sequences of terminals and/or nonterminals, or \varepsilon (since that's what can be on the right-hand side of a grammar production). The idea is that, for a sequence \alpha, FIRST(\alpha) is the set of terminals that begin the strings derivable from \alpha; and if \alpha can derive \varepsilon, then \varepsilon is also in FIRST(\alpha).
Using derivation notation:$$\mathrm{FIRST}(\alpha) = \left\{ \; t \; \middle| \; \left( t \; \in \Sigma \wedge \alpha \stackrel{*}\Longrightarrow t \beta \right) \vee \left( t = \varepsilon \wedge \alpha \stackrel{*}\Longrightarrow \varepsilon \right) \right\}$$

To define FIRST(\alpha) for arbitrary \alpha, we start by defining FIRST(X) for a single symbol X (a terminal, a nonterminal, or \varepsilon):

1. X \in \Sigma (i.e., X is a terminal): FIRST(X) = {X}.
2. X = \varepsilon: FIRST(X) = {\varepsilon}.
3. X \in N (i.e., X is a nonterminal): we must look at all grammar productions with X on the left, i.e., productions of the form:

 X \longrightarrow Y_1 Y_2 Y_3 \ldots Y_k

where each Y_i is a single terminal or nonterminal (or there is just one Y, and it is \varepsilon). For each such production, we perform the following actions:

• Put FIRST(Y_1) - {\varepsilon} into FIRST(X).
• If \varepsilon is in FIRST(Y_1), then put FIRST(Y_2) - {\varepsilon} into FIRST(X).
• If \varepsilon is in FIRST(Y_2), then put FIRST(Y_3) - {\varepsilon} into FIRST(X).
• etc...
• If \varepsilon is in FIRST(Y_i) for all i, 1 \leq i \leq k (i.e., every symbol on the right-hand side can derive \varepsilon), then put \varepsilon into FIRST(X).

For example, consider computing FIRST sets for each of the nonterminals in the following grammar:

 Exp \longrightarrow Term Exp'
 Exp' \longrightarrow minus Term Exp' | \varepsilon
 Term \longrightarrow Factor Term'
 Term' \longrightarrow divide Factor Term' | \varepsilon
 Factor \longrightarrow intlit | lparen Exp rparen

Here are the FIRST sets (starting with nonterminal Factor and working up, since we need to know FIRST(Factor) to compute FIRST(Term), and we need to know FIRST(Term) to compute FIRST(Exp)):

Once we have computed FIRST(X) for each terminal and nonterminal X, we can compute FIRST(\alpha) for every production's right-hand side \alpha. In general, \alpha will be of the form:

 X_1 X_2 \ldots X_n

where each X_i is a single terminal or nonterminal, or there is just one X and it is \varepsilon.
The rules for computing FIRST(\alpha) are essentially the same as the rules for computing the FIRST set of a nonterminal:

1. Put FIRST(X_1) - {\varepsilon} into FIRST(\alpha).
2. If \varepsilon is in FIRST(X_1), put FIRST(X_2) - {\varepsilon} into FIRST(\alpha).
3. etc...
4. If \varepsilon is in the FIRST set for every X_i, put \varepsilon into FIRST(\alpha).

For the example grammar above, here are the FIRST sets for each production right-hand side:

Why do we care about the FIRST(\alpha) sets? During parsing, suppose that there are two productions:

 Production 1: A \longrightarrow \alpha
 Production 2: A \longrightarrow \beta

Consider the situation when the top-of-stack symbol is A and the current token is a. If a \in FIRST(\alpha), choose production 1: pop, push \alpha. If, on the other hand, a \in FIRST(\beta), choose production 2: pop, push \beta. We haven't yet given the rules for using FIRST and FOLLOW sets to determine whether a grammar is LL(1); however, you might be able to guess, based on this discussion, that if a is in both FIRST(\alpha) and FIRST(\beta), the grammar is not LL(1).

FOLLOW Sets

FOLLOW sets are only defined for single nonterminals. The definition is as follows: For a nonterminal A, FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some partial derivation; i.e., terminal t is in FOLLOW(A) if

 S \stackrel{+}\Longrightarrow \ldots A \; \mathrm{t} \ldots

where t is a terminal. If A can be the rightmost symbol in a derivation, then EOF is in FOLLOW(A). It is worth noting that \varepsilon is never in a FOLLOW set. Using notation:$$\mathrm{FOLLOW}(A) = \left\{ \; \mathrm{t} \; \middle| \; \left( \mathrm{t} \; \in \Sigma \wedge S \stackrel{+}\Longrightarrow \alpha \; A \; \mathrm{t} \; \beta \right) \vee \left( \mathrm{t} = \mathrm{EOF} \wedge S \stackrel{*}\Longrightarrow \alpha A \right) \right\}$$

Here are some pictures illustrating the conditions under which symbols a, c, and EOF are in the FOLLOW set of nonterminal A:

Figure 1

Figure 2

Figure 3

How to compute FOLLOW(A) for each nonterminal A:
• If A is the start nonterminal, put EOF in FOLLOW(A) (like S in Fig. 3).
• Now find the productions with A on the right-hand-side:
• For each production $X$ $\longrightarrow$ $\alpha$ $A$ $\beta$, put FIRST($\beta$) - {$\varepsilon$} in FOLLOW($A$) -- see Fig. 1.
• If $\varepsilon$ is in FIRST($\beta$), then put FOLLOW($X$) into FOLLOW($A$) -- see Fig. 2.
• For each production $X$ $\longrightarrow$ $\alpha$ $A$, put FOLLOW(X) into FOLLOW(A) -- see Fig. 3, and Fig. 4 below:

Figure 4

It is worth noting that:

• To compute FIRST(A) you must look for A on a production's left-hand side.
• To compute FOLLOW(A) you must look for A on a production's right-hand side.
• FIRST and FOLLOW sets are always sets of terminals (plus, perhaps, $\varepsilon$ for FIRST sets, and EOF for FOLLOW sets). Nonterminals are never in a FIRST or a FOLLOW set; $\varepsilon$ is never in a FOLLOW set.

Here's an example of FOLLOW sets (and the FIRST sets we need to compute them). In this example, nonterminals are upper-case letters, and terminals are lower-case letters.

 $S$ $\longrightarrow$ $B$ c $\;|\;$ $D$ $B$
 $B$ $\longrightarrow$ a b $\;|\;$ c $S$
 $D$ $\longrightarrow$ d $\;|\;$ $\varepsilon$
| $\alpha$ | FIRST($\alpha$) | FOLLOW($\alpha$) |
|---|---|---|
| $D$ | {d, $\varepsilon$} | {a, c} |
| $B$ | {a, c} | {c, EOF} |
| $S$ | {a, c, d} | {EOF, c} |

Note that FOLLOW($S$) always includes EOF.
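As a cross-check, the FIRST and FOLLOW sets in the table above can be computed mechanically by iterating to a fixed point. The Python sketch below is illustrative only (the dict encoding of the grammar, and the use of "" for $\varepsilon$ and "EOF" for end-of-file, are my own conventions):

```python
# Grammar:  S -> B c | D B,   B -> a b | c S,   D -> d | epsilon
GRAMMAR = {
    "S": [("B", "c"), ("D", "B")],
    "B": [("a", "b"), ("c", "S")],
    "D": [("d",), ()],           # the empty tuple () is the epsilon rule
}
START = "S"
NONTERMS = set(GRAMMAR)

def first_of_seq(seq, first):
    """FIRST of a sequence of symbols; "" stands for epsilon."""
    result = set()
    for sym in seq:
        f = first[sym] if sym in NONTERMS else {sym}
        result |= f - {""}
        if "" not in f:          # this symbol cannot derive epsilon: stop
            return result
    result.add("")               # every symbol in seq can derive epsilon
    return result

first = {x: set() for x in NONTERMS}
follow = {x: set() for x in NONTERMS}
follow[START].add("EOF")         # EOF always follows the start nonterminal
changed = True
while changed:                   # repeat until neither map grows
    changed = False
    for x, rhss in GRAMMAR.items():
        for rhs in rhss:
            f = first_of_seq(rhs, first)
            if not f <= first[x]:
                first[x] |= f
                changed = True
            for i, sym in enumerate(rhs):
                if sym not in NONTERMS:
                    continue
                rest = first_of_seq(rhs[i + 1:], first)
                new = rest - {""}
                if "" in rest:   # what follows x can follow sym too
                    new |= follow[x]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
```

Running this reproduces the table above, e.g. FIRST($S$) = {a, c, d} and FOLLOW($B$) = {c, EOF}.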

• Suppose, during parsing, we have some $X$ at the top-of-stack, and a is the current token.
• We need to replace $X$ on the stack with the right-hand side of a production $X \longrightarrow \alpha$. What if $X$ has an additional production $X \longrightarrow \beta$. Which one should we use?
• We've already said that if a is in FIRST($\alpha$), but not in FIRST($\beta$), then we want to choose $X \longrightarrow \alpha$.
• But what if a is not in FIRST($\alpha$) or FIRST($\beta$)? If $\alpha$ or $\beta$ can derive $\varepsilon$, and a is in FOLLOW($X$), then we still have hope that the input will be accepted: If $\alpha$ can derive $\varepsilon$ (i.e., $\varepsilon \in$ FIRST($\alpha$)), then we want to choose $X \longrightarrow \alpha$ (and similarly if $\beta$ can derive $\varepsilon$). The idea is that since $\alpha$ can derive $\varepsilon$, it will eventually be popped from the stack, and if we're lucky, the next symbol down (the one that was under the $X$) will be a.

TEST YOURSELF #3

Here are five grammar productions for (simplified) method headers:

1. methodHeader -> VOID ID LPAREN paramList RPAREN

2. paramList    -> epsilon
3. paramList    -> nonEmptyParamList

4. nonEmptyParamList -> ID ID
5. nonEmptyParamList -> ID ID COMMA nonEmptyParamList


Question 1: Compute the FIRST and FOLLOW sets for the three nonterminals, and the FIRST sets for each production right-hand side.

Question 2: Draw a picture to illustrate what the predictive parser will do, given the input sequence of tokens: "VOID ID LPAREN RPAREN EOF". Include an explanation of how the FIRST and FOLLOW sets are used when there is a nonterminal on the top-of-stack that has more than one production.

solution

How to Build Parse Tables

Recall that the form of the parse table is:

Table entry[$X$,a] is either empty (if having $X$ on top of stack and having a as the current token means a syntax error) or contains the right-hand side of a production whose left-hand-side nonterminal is $X$ -- that right-hand side is what should be pushed.

To build the table, we fill in the rows one at a time for each nonterminal $X$ as follows:

   for each production $X \longrightarrow \alpha$:
      for each terminal t in FIRST($\alpha$):
         put $\alpha$ in Table[$X$,t]
      if $\varepsilon \in$ FIRST($\alpha$) then:
         for each terminal t in FOLLOW($X$):
            put $\alpha$ in Table[$X$,t]

The grammar is not LL(1) iff there is more than one entry for any cell in the table.
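The pseudo-code above can be turned into a short Python sketch. Everything here is illustrative (the dict encodings and names are mine): the function assumes that FIRST sets for right-hand sides and FOLLOW sets have already been computed, and it uses "" for $\varepsilon$.

```python
def build_ll1_table(grammar, first_of_rhs, follow):
    """Fill in the LL(1) parse table row by row.

    table[(X, t)] is a list of right-hand sides; a cell with more than
    one entry means the grammar is not LL(1).
    """
    table = {}
    for x, rhss in grammar.items():
        for rhs in rhss:
            f = first_of_rhs[rhs]
            terminals = f - {""}
            if "" in f:                    # rhs can derive epsilon:
                terminals |= follow[x]     # use FOLLOW(X) as well
            for t in terminals:
                table.setdefault((x, t), []).append(rhs)
    return table

# The grammar S -> B c | D B, B -> a b | c S, D -> d | epsilon, with the
# FIRST/FOLLOW results computed earlier ("" stands for epsilon).
GRAMMAR = {"S": [("B", "c"), ("D", "B")],
           "B": [("a", "b"), ("c", "S")],
           "D": [("d",), ()]}
FIRST_OF_RHS = {("B", "c"): {"a", "c"}, ("D", "B"): {"d", "a", "c"},
                ("a", "b"): {"a"}, ("c", "S"): {"c"},
                ("d",): {"d"}, (): {""}}
FOLLOW = {"S": {"EOF", "c"}, "B": {"c", "EOF"}, "D": {"a", "c"}}

table = build_ll1_table(GRAMMAR, FIRST_OF_RHS, FOLLOW)
conflicts = [cell for cell, rhss in table.items() if len(rhss) > 1]
# Cells (S, a) and (S, c) each get two right-hand sides, so this
# grammar is not LL(1).
```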

Let's try building a parse table for the following grammar:
 $S$ $\longrightarrow$ $B$ c $\;|\;$ $D$ $B$
 $B$ $\longrightarrow$ a b $\;|\;$ c $S$
 $D$ $\longrightarrow$ d $\;|\;$ $\varepsilon$
First we calculate the FIRST and FOLLOW sets:
| $\alpha$ | FIRST($\alpha$) | FOLLOW($\alpha$) |
|---|---|---|
| $D$ | {d, $\varepsilon$} | {a, c} |
| $B$ | {a, c} | {c, EOF} |
| $S$ | {a, c, d} | {EOF, c} |
| $B$ c | {a, c} | - |
| $D$ $B$ | {d, a, c} | - |
| a b | {a} | - |
| c $S$ | {c} | - |
| d | {d} | - |
| $\varepsilon$ | {$\varepsilon$} | - |
Then we use those sets to start filling in the parse table:
Not all entries have been filled in, but already we can see that this grammar is not LL(1) since there are two entries in table[S,a] and in table[S,c].

Here's how we filled in this much of the table:

1. First, we considered the production S -> B c. FIRST(Bc) = { a, c }, so we put the production's right-hand side (B c) in Table[S, a] and in Table[S, c]. FIRST(Bc) does not include epsilon, so we're done with that production.
2. Next, we considered the production S -> D B . FIRST(DB) = { d, a, c }, so we put the production's right-hand side (D B) in Table[S, d], Table[S, a], and Table[S, c].
3. Next, we considered the production D -> epsilon. FIRST(epsilon) = { epsilon }, so we must look at FOLLOW(D). FOLLOW(D) = { a, c }, so we put the production's right-hand side (epsilon) in Table[D, a] and Table[D, c].

TEST YOURSELF #4

Finish filling in the parse table given above.

solution

How to Code a Predictive Parser

Now, suppose we actually want to code a predictive parser for a grammar that is LL(1). The simplest idea is to use a table-driven parser with an explicit stack. Here's pseudo-code for a table-driven predictive parser:

   Stack.push(EOF);
   Stack.push(start-nonterminal);
   currToken = scan();

   while (! Stack.empty()) {
      topOfStack = Stack.pop();
      if (isNonTerm(topOfStack)) {
         // top-of-stack symbol is a nonterminal
         p = table[topOfStack, currToken];
         if (p is empty) report syntax error and quit
         else {
            // p is a production's right-hand side
            push p, one symbol at a time, from right to left
         }
      }
      else {
         // top-of-stack symbol is a terminal
         if (topOfStack == currToken) currToken = scan();
         else report syntax error and quit
      }
   }
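As a concrete (and purely illustrative) sketch of the pseudo-code, here is a runnable Python version specialized to the balanced-parentheses grammar S \longrightarrow \varepsilon | ( S ) | [ S ] and its parse table from earlier in this section; the dict encoding of the table is my own:

```python
# Parse table for  S -> epsilon | ( S ) | [ S ].  An empty list is an
# epsilon right-hand side; a missing entry is a syntax error.
TABLE = {("S", "("): ["(", "S", ")"],
         ("S", "["): ["[", "S", "]"],
         ("S", ")"): [], ("S", "]"): [], ("S", "EOF"): []}
NONTERMS = {"S"}

def parse(tokens):
    tokens = list(tokens) + ["EOF"]
    stack = ["S", "EOF"]               # front of the list = top of stack
    pos = 0                            # index of the current token
    while stack:
        top = stack.pop(0)
        if top in NONTERMS:
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False           # syntax error: empty table cell
            stack = rhs + stack        # pop X, push the right-hand side
        elif top == tokens[pos]:
            pos += 1                   # terminal matches: scan next token
        else:
            return False               # terminal mismatch: syntax error
    return pos == len(tokens)          # accept iff all input was consumed

print(parse("([])"))                   # the example input from earlier
```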


TEST YOURSELF #5

Here is a CFG (with rule numbers):

S -> epsilon               (1)
|  X Y Z                 (2)

X -> epsilon               (3)
|  X S                   (4)

Y -> epsilon               (5)
|  a Y b                 (6)

Z -> c Z                   (7)
|  d                     (8)


Question 1(a): Compute the First and Follow Sets for each nonterminal.

Question 1(b): Draw the Parse Table.

solution

Contents

LR Parsing Overview

There are several different kinds of bottom-up parsing. We will discuss an approach called LR parsing, which includes SLR, LALR, and LR parsers. LR means that the input is scanned left-to-right, and that a rightmost derivation, in reverse, is constructed. SLR means "simple" LR, and LALR means "look-ahead" LR.

Every SLR(1) grammar is also LALR(1), and every LALR(1) grammar is also LR(1), so SLR is the most limited of the three, and LR is the most general. In practice, it is pretty easy to write an LALR(1) grammar for most programming languages (i.e., the "power" of an LR parser isn't usually needed). A disadvantage of LR parsers is that their tables can be very large. Therefore, parser generators like Yacc and Java Cup produce LALR(1) parsers.

Let's start by considering the advantages and disadvantages of the LR parsing family:

• Almost all programming languages have LR grammars.
• LR parsers take time and space linear in the size of the input (with a constant factor determined by the grammar).
• LR is strictly more powerful than LL (for example, every LL(1) grammar is also both LALR(1) and LR(1), but not vice versa).
• LR grammars are more "natural" than LL grammars (e.g., the grammars for expression languages get mangled when we remove the left recursion to make them LL(1), but that isn't necessary for an LALR(1) or an LR(1) grammar).
• Although an LR grammar is usually easier to understand than the corresponding LL grammar, the parser itself is harder to understand and to write (thus, LR parsers are built using parser generators, rather than being written by hand).
• When we use a parser generator, if the grammar that we provide is not LALR(1), it can be difficult to figure out how to fix that.
• Error repair may be more difficult using LR parsing than using LL.
• Table sizes may be larger (about a factor of 2) than those used for LL parsing.

Recall that top-down parsers use a stack. The contents of the stack represent a prediction of what the rest of the input should look like. The symbols on the stack, from top to bottom, should "match" the remaining input, from first to last token. For example, earlier, we looked at the example grammar

Grammar:

 $S$ $\longrightarrow$ $\varepsilon$ | ( $S$ ) | [ $S$ ]

and parsed the input string ([]). At one point during the parse, after the first parenthesis has been consumed, the stack contains

    [ S ] ) EOF
(with the top-of-stack at the left). This is a prediction that the remaining input will start with a '[', followed by zero or more tokens matching an S, followed by the tokens ']' and ')', in that order, followed by end-of-file.

Bottom-up parsers also use a stack, but in this case, the stack represents a summary of the input already seen, rather than a prediction about input yet to be seen. For now, we will pretend that the stack symbols are terminals and nonterminals (as they are for predictive parsers). This isn't quite true, but it makes our introduction to bottom-up parsing easier to understand.

A bottom-up parser is also called a "shift-reduce" parser because it performs two kinds of operations, shift operations and reduce operations. A shift operation simply shifts the next input token from the input to the top of the stack. A reduce operation is only possible when the top N symbols on the stack match the right-hand side of a production in the grammar. A reduce operation pops those symbols off the stack and pushes the nonterminal on the left-hand side of the production.

One way to think about LR parsing is that the parse tree for a given input is built, starting at the leaves and working up towards the root. More precisely, a reverse rightmost derivation is constructed.

Recall that a derivation (using a given grammar) is performed as follows:

1. start with the start symbol (i.e., the current string is the start symbol)
2. repeat:
• choose a nonterminal X in the current string
• choose a grammar rule X $\longrightarrow$ $\alpha$
• replace X in the current string with $\alpha$
until there are no more nonterminals in the current string

A rightmost derivation is one in which the rightmost nonterminal is always the one chosen.

Rightmost Derivation Example

CFG

 $E$ $\longrightarrow$ $E$ + $T$ | $T$
 $T$ $\longrightarrow$ $T$ * $F$ | $F$
 $F$ $\longrightarrow$ id | ( $E$ )

Rightmost derivation

$E$
1. $\Longrightarrow$ $E$ + $T$
2. $\Longrightarrow$ $E$ + $T$ * $F$
3. $\Longrightarrow$ $E$ + $T$ * id
4. $\Longrightarrow$ $E$ + $F$ * id
5. $\Longrightarrow$ $E$ + id * id
6. $\Longrightarrow$ $T$ + id * id
7. $\Longrightarrow$ $F$ + id * id
8. $\Longrightarrow$ id + id * id

In this example, each derivation step is numbered, and the nonterminal chosen at each step is the rightmost one in the current string. The corresponding bottom-up parse can be pictured as the parse tree with its edges numbered to show the order in which the tree was built (e.g., the first step was to add the nonterminal F as the parent of the leftmost parse-tree leaf "id", and the last step was to combine the three subtrees representing "id", "+", and "id * id" as the children of a new root node E).

Note that both the rightmost derivation and the bottom-up parse have 8 steps. Step 1 of the derivation corresponds to step 8 of the parse; step 2 of the derivation corresponds to step 7 of the parse; etc. Each step of building the parse tree (adding a new nonterminal as the parent of some existing subtrees) is called a reduction (that's where the "reduce" part of "shift-reduce" parsing comes from).
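The derivation procedure is mechanical enough to run. Here is a small Python sketch (my own illustration, not part of the original example) that replays the rightmost derivation above: at each step, the rightmost nonterminal in the current string is replaced using the chosen grammar rule.

```python
# Replay the rightmost derivation of "id + id * id" from the CFG above.
NONTERMS = {"E", "T", "F"}

def replace_rightmost(sentential, lhs, rhs):
    """Replace the rightmost nonterminal (which must be lhs) with rhs."""
    for i in range(len(sentential) - 1, -1, -1):
        if sentential[i] in NONTERMS:
            assert sentential[i] == lhs, "not a rightmost derivation step"
            return sentential[:i] + rhs + sentential[i + 1:]
    raise ValueError("no nonterminal left to expand")

# The 8 steps of the rightmost derivation shown above.
STEPS = [("E", ["E", "+", "T"]), ("T", ["T", "*", "F"]), ("F", ["id"]),
         ("T", ["F"]), ("F", ["id"]), ("E", ["T"]), ("T", ["F"]),
         ("F", ["id"])]

current = ["E"]
for lhs, rhs in STEPS:
    current = replace_rightmost(current, lhs, rhs)
print(" ".join(current))  # id + id * id
```

Reading `STEPS` in reverse gives exactly the sequence of reductions a shift-reduce parser performs.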

Basic LR Parsing Algorithm

All LR parsers use the same basic algorithm:

Based on:
• the top-of-stack symbol and
• the current input symbol (token) and
• the entry in one of the parse tables (indexed by the top-of-stack and current-input symbols)
the parser performs one of the following actions:
1. shift: push the current input symbol onto the stack, and go on to the next input symbol (i.e., call the scanner again)
2. reduce: a grammar rule's right-hand side is on the top of the stack! pop it off and push the grammar rule's left-hand-side nonterminal
3. accept: accept the input (parsing has finished successfully)
4. reject: the input is not syntactically correct

The difference between SLR, LALR, and LR parsers is in the tables that they use. Those tables use different techniques to determine when to do a reduce step, and, if there is more than one grammar rule with the same right-hand side, which left-hand-side nonterminal to push.

Example

Here are the steps the parser would perform using the grammar of arithmetic expressions with + and * given above, if the input is

id + id * id
| Stack | Input | Action |
| --- | --- | --- |
|  | id + id * id | shift(id) |
| id | + id * id | reduce by $F$ $\longrightarrow$ id |
| $F$ | + id * id | reduce by $T$ $\longrightarrow$ $F$ |
| $T$ | + id * id | reduce by $E$ $\longrightarrow$ $T$ |
| $E$ | + id * id | shift(+) |
| $E$ + | id * id | shift(id) |
| $E$ + id | * id | reduce by $F$ $\longrightarrow$ id |
| $E$ + $F$ | * id | reduce by $T$ $\longrightarrow$ $F$ |
| $E$ + $T$ | * id | shift(*) |
| $E$ + $T$ * | id | shift(id) |
| $E$ + $T$ * id |  | reduce by $F$ $\longrightarrow$ id |
| $E$ + $T$ * $F$ |  | reduce by $T$ $\longrightarrow$ $T$ * $F$ |
| $E$ + $T$ |  | reduce by $E$ $\longrightarrow$ $E$ + $T$ |
| $E$ |  | accept |

(NOTE: the top of stack is to the right; the reverse rightmost derivation is formed by concatenating the stack with the remaining input at each reduction step)

Note that at the point where the top of the stack contained E + T and the next input token was *, the stack matched the right-hand side of the first grammar rule. However, the parser does not do a reduce step at that point; instead it does a shift, which is the right thing to do since (as you can see if you look back at the parse tree given above) that particular "E + T" is not grouped together (because the grammar reflects the fact that multiplication has higher precedence than addition).

Parse Tables

As mentioned above, the symbols pushed onto the parser's stack are not actually terminals and nonterminals. Instead, they are states that correspond to a finite-state machine representing the parsing process (more on this soon).

All LR Parsers use two tables: the action table and the goto table. The action table is indexed by the top-of-stack symbol and the current token, and it tells which of the four actions to perform: shift, reduce, accept, or reject. The goto table is used during a reduce action as explained below.

Above we said that a shift action means to push the current token onto the stack. In fact, we actually push a state symbol onto the stack. Each "shift" action in the action table includes the state to be pushed.

Above, we also said that when we reduce using the grammar rule $A \longrightarrow \alpha$, we pop $\alpha$ off of the stack and then push $A$. In fact, if $\alpha$ contains $N$ symbols, we pop $N$ states off of the stack. We then use the goto table to determine what to push: the goto table is indexed by state symbol $t$ and nonterminal $A$, where $t$ is the state symbol on top of the stack after popping $N$ times.

Here's pseudo code for the parser's main loop:

push initial state s0
a = scan()
do forever
    t = top-of-stack (state) symbol
    switch action[t, a] {
        case shift s:
            push(s)
            a = scan()
        case reduce by A → alpha:
            for i = 1 to length(alpha) do pop() end
            t = top-of-stack symbol
            push(goto[t, A])
        case accept:
            return( SUCCESS )
        case error:
            call the error handler
            return( FAILURE )
    }
end do
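To make the main loop concrete, here is a Python sketch of the same driver. The tables are hand-built SLR(1) tables for the toy grammar S → ( S ) | ε (with an added start rule S' → S); the state numbering and table contents are my own assumptions for illustration, not the output of any particular parser generator.

```python
# rule number -> (LHS nonterminal, length of RHS)
RULES = {1: ("S", 3),   # rule 1: S -> ( S )
         2: ("S", 0)}   # rule 2: S -> epsilon

# action[state, token]; "$" marks end-of-input
ACTION = {
    (0, "("): ("shift", 2), (0, ")"): ("reduce", 2), (0, "$"): ("reduce", 2),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, ")"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, ")"): ("shift", 4),
    (4, ")"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 3}

def parse(tokens):
    stack = [0]                          # stack of states; s0 = 0
    toks = iter(list(tokens) + ["$"])
    a = next(toks)
    while True:
        entry = ACTION.get((stack[-1], a))
        if entry is None:
            return False                 # reject: no table entry
        op, arg = entry
        if op == "shift":
            stack.append(arg)            # push the state, not the token
            a = next(toks)
        elif op == "reduce":
            lhs, n = RULES[arg]
            del stack[len(stack) - n:]   # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True                  # accept
```

For example, `parse(list("(())"))` succeeds while `parse(list("(()"))` is rejected.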


Remember, all LR parsers use this same basic algorithm. As mentioned above, for all LR parsers, the states that are pushed onto the stack represent the states in an underlying finite-state machine. Each state represents "where we might be" in a parse; each "place we might be" is represented (within the state) by an item. What's different for the different LR parsers is:

• the definition of an item
• the number of states in the underlying finite-state machine
• the amount of information in a state
Topic Introduction

Our journey through syntactic analysis has, thus far, gone as follows: we first showed how to define the parse tree, then explored how to actually build it. Likewise, we can define the AST through a syntax-directed definition, so we now turn to exploring how to actually build the AST using the two parsing strategies (top-down and bottom-up) that we presented. The task of producing the AST according to the syntax-directed definition is known as syntax-directed translation (SDT).

Top-Down SDT

Now we consider how to implement a syntax-directed translation using a predictive parser. It is not obvious how to do this, since the predictive parser works by building the parse tree top-down, while the syntax-directed translation must be computed bottom-up (the translation attached to a left-hand-side symbol depends on the translations attached to the right-hand-side symbols). Of course, we could design the parser to actually build the parse tree (top-down), then use the translation rules to build the translation (bottom-up). However, that would not be very efficient.

Instead, we avoid explicitly building the parse tree by giving the parser a second stack called the semantic stack:

• The semantic stack holds nonterminals' translations; when the parse is finished, it will hold just one value: the translation of the root nonterminal (which is the translation of the whole input).
• Values are pushed onto the semantic stack (and popped off) by adding actions to the grammar rules. The action for one rule must:
• Pop the translations of all right-hand-side nonterminals.
• Compute and push the translation of the left-hand-side nonterminal.
• The actions themselves are represented by action numbers, which become part of the right-hand sides of the grammar rules. They are pushed onto the (normal) stack along with the terminal and nonterminal symbols. When an action number is the top-of-stack symbol, it is popped and the action is carried out.

So what actually happens is that the action for a grammar rule $X \longrightarrow Y_1 \; Y_2 \; \ldots \; Y_n$ is pushed onto the (normal) stack when the derivation step $X \longrightarrow Y_1 Y_2 \ldots Y_n$ is made, but the action is not actually performed until complete derivations for all of the $Y$s have been carried out.

Counting Parentheses

For example, consider the following syntax-directed translation for the language of balanced parentheses and square brackets. The translation of a string in the language is the number of parenthesis pairs in the string.

| CFG | Translation Rules |
| --- | --- |
| $Exp$ $\longrightarrow$ $\varepsilon$ | $Exp$.trans = 0 |
| $\quad\mid\;$ ( $Exp$ ) | $Exp_1$.trans = $Exp_2$.trans + 1 |
| $\quad\mid\;$ [ $Exp$ ] | $Exp_1$.trans = $Exp_2$.trans |

The first step is to replace the translation rules with translation actions. Each action must:

• Pop all right-hand-side nonterminals' translations from the semantic stack.
• Compute and push the left-hand-side nonterminal's translation.
Here are the translation actions:

| CFG | Translation Actions |
| --- | --- |
| $Exp$ $\longrightarrow$ $\varepsilon$ | push 0; |
| $\quad\mid\;$ ( $Exp$ ) | exp2trans = pop(); push(exp2trans + 1); |
| $\quad\mid\;$ [ $Exp$ ] | exp2trans = pop(); push(exp2trans); |

Next, each action is represented by a unique action number, and those action numbers become part of the grammar rules:

CFG with Embedded Actions:

 $Exp$ $\longrightarrow$ $\varepsilon$ #1 | ( $Exp$ ) #2 | [ $Exp$ ] #3

where

 #1 is push 0;
 #2 is exp2trans = pop(); push(exp2trans + 1);
 #3 is exp2trans = pop(); push(exp2trans);

Note that since action #3 just pushes exactly what is popped, that action is redundant, and it is not necessary to have any action associated with the third grammar rule. Here's a picture that illustrates what happens when the input "([])" is parsed (assuming that we have removed action #3):

| Input so far | Stack | Semantic Stack | Action |
| --- | --- | --- | --- |
| ( | $Exp$ |  | pop, push "( $Exp$ ) #2" |
| ( | ( $Exp$ ) #2 |  | pop, scan |
| ([ | $Exp$ ) #2 |  | pop, push "[ $Exp$ ]" |
| ([ | [ $Exp$ ] ) #2 |  | pop, scan |
| ([] | $Exp$ ] ) #2 |  | pop, push "$\varepsilon$ #1" |
| ([] | #1 ] ) #2 |  | pop, do action #1 |
| ([] | ] ) #2 | 0 | pop, scan |
| ([]) | ) #2 | 0 | pop, scan |
| ([]) | #2 | 0 | pop, do action #2 |
| ([]) |  | 1 | empty stack: accept |

translation of input = 1
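The whole mechanism fits in a few lines of Python. This is my own sketch with illustrative names; action numbers are represented as integers pushed on the (normal) parser stack between the grammar symbols, and action #3 has been dropped as redundant.

```python
def translate(s):
    toks = list(s) + ["$"]
    stack = ["Exp", "$"]             # the normal (prediction) stack
    sem = []                         # the semantic stack
    i = 0                            # index of the current token
    while stack:
        top = stack.pop(0)
        if top == 1:                 # action #1 (for Exp -> epsilon)
            sem.append(0)            # push 0
        elif top == 2:               # action #2 (for Exp -> ( Exp ))
            sem.append(sem.pop() + 1)
        elif top == "Exp":           # expand using the lookahead token
            if toks[i] == "(":
                stack[0:0] = ["(", "Exp", ")", 2]
            elif toks[i] == "[":
                stack[0:0] = ["[", "Exp", "]"]
            else:
                stack.insert(0, 1)   # Exp -> epsilon, then do action #1
        else:                        # a terminal (or $): match and scan
            if top != toks[i]:
                raise SyntaxError("unexpected token")
            i += 1
    return sem.pop()                 # translation of the whole input
```

For instance, `translate("([])")` returns 1, matching the trace above.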

In the example above, there is no grammar rule with more than one nonterminal on the right-hand side. If there were, the translation action for that rule would have to do one pop for each right-hand-side nonterminal. For example, suppose we are using a grammar that includes the rule:

MethodBody $\longrightarrow$ { VarDecls Stmts }
and that the syntax-directed translation is counting the number of declarations and statements in each method body (so the translation of VarDecls is the number of derived declarations, the translation of Stmts is the number of derived statements, and the translation of MethodBody is the number of derived declarations and statements).
CFG Rule:              methodBody -> { varDecls stmts }
Translation Rule:      methodBody.trans = varDecls.trans + stmts.trans
Translation Action:    stmtsTrans = pop(); declsTrans = pop();
push( stmtsTrans + declsTrans );
CFG rule with Action:  methodBody -> { varDecls stmts } #1
#1: stmtsTrans = pop();
declsTrans = pop();
push( stmtsTrans + declsTrans );



Note that the right-hand-side nonterminals' translations are popped from the semantic stack right-to-left. That is because the predictive parser does a leftmost derivation, so the varDecls nonterminal gets "expanded" first; i.e., its parse tree is created before the parse tree for the stmts nonterminal. This means that the actions that create the translation of the varDecls nonterminal are performed first, and thus its translation is pushed onto the semantic stack first.

Another issue that has not been illustrated yet arises when a left-hand-side nonterminal's translation depends on the value of a right-hand-side terminal. In that case, it is important to put the action number before that terminal symbol when incorporating actions into grammar rules. This is because a terminal symbol's value is available during the parse only when it is the "current token". For example, if the translation of an arithmetic expression is the value of the expression:

CFG Rule:             $Factor$ $\longrightarrow$ intlit
Translation Rule:     factor.trans = intlit.value
Translation Action:   push( intlit.value )
CFG rule with Action: $Factor$ $\longrightarrow$ #1 intlit    // action BEFORE terminal
#1: push( currToken.value )
SELF-STUDY #1

For the following grammar, give (a) translation rules, (b) translation actions with numbers, and (c) a CFG with action numbers, so that the translation of an input expression is the value of the expression. Do not worry about the fact that the grammar is not LL(1).

 Exp $\longrightarrow$ Exp plus Term $|$ Exp minus Term $|$ Term
 Term $\longrightarrow$ Term times Factor $|$ Term divide Factor $|$ Factor
 Factor $\longrightarrow$ intlit $|$ lparen Exp rparen

solution

## Handling Non-LL(1) Grammars

Recall that a non-LL(1) grammar must be transformed to an equivalent LL(1) grammar if it is to be parsed using a predictive parser. Recall also that the transformed grammar usually does not reflect the underlying structure the way the original grammar did. For example, when left recursion is removed from the grammar for arithmetic expressions, we get grammar rules like this:

CFG

 $Exp$ $\longrightarrow$ $Term$ $Exp'$
 $Exp'$ $\longrightarrow$ $\varepsilon$ | + $Term$ $Exp'$

It is not at all clear how to define a syntax-directed translation for rules like these. The solution is to define the syntax-directed translation using the original grammar (define translation rules, convert them to actions that push and pop using the semantic stack, and then incorporate the action numbers into the grammar rules). Then convert the grammar to be LL(1), treating the action numbers just like terminal grammar symbols!

For example:

Non-LL(1) Grammar Rules With Actions:

 $Exp$ $\longrightarrow$ $Exp$ + $Term$ #1 | $Term$
 $Term$ $\longrightarrow$ $Term$ * $Factor$ #2 | $Factor$

 #1 is TTrans = pop(); ETrans = pop(); push(ETrans + TTrans);
 #2 is FTrans = pop(); TTrans = pop(); push(TTrans * FTrans);

After Removing Immediate Left Recursion:

 $Exp$ $\longrightarrow$ $Term$ $Exp'$
 $Exp'$ $\longrightarrow$ + $Term$ #1 $Exp'$ | $\varepsilon$
 $Term$ $\longrightarrow$ $Factor$ $Term'$
 $Term'$ $\longrightarrow$ * $Factor$ #2 $Term'$ | $\varepsilon$

SELF-STUDY #2

Transform the grammar rules with actions that you wrote for the "Self-Study #1" exercise to LL(1) form. Trace the actions of the predictive parser on the input 2 + 3 * 4.

solution

Bottom-Up SDT

In contrast to top-down SDT, performing SDT bottom-up is fairly straightforward, since the translations of the right-hand-side symbols are naturally available at the time the production is reduced. Consequently, we will not cover the implementation in much depth. It suffices to note that when a grammar production has been reduced, we can simply fire the corresponding rule from the SDD, referencing each right-hand-side symbol's translation according to its position in the production.

Summary

A syntax-directed translation is used to define the translation of a sequence of tokens to some other value, based on a CFG for the input. A syntax-directed translation is defined by associating a translation rule with each grammar rule. A translation rule defines the translation of the left-hand-side nonterminal as a function of the right-hand-side nonterminals' translations, and the values of the right-hand-side terminals. To compute the translation of a string, build the parse tree, and use the translation rules to compute the translation of each nonterminal in the tree, bottom-up; the translation of the string is the translation of the root nonterminal.

There is no restriction on the type of a translation; it can be a simple type like an integer, or a complex type like an abstract-syntax tree.

To implement a syntax-directed translation using a predictive parser, the translation rules are converted to actions that manipulate the parser's semantic stack. Each action must pop all right-hand-side nonterminals' translations from the semantic stack, then compute and push the left-hand-side nonterminal's translation. Next, the actions are incorporated (as action numbers) into the grammar rules. Finally, the grammar is converted to LL(1) form (treating the action numbers just like terminal or nonterminal symbols).


Symbol Table Introduction

The parser ensures that the input program is syntactically correct, but there are other kinds of correctness that it cannot (or usually does not) enforce. For example:

• A variable should not be declared more than once in the same scope.
• A variable should not be used before being declared.
• The type of the left-hand side of an assignment should match the type of the right-hand side.
• Methods should be called with the right number and types of arguments.

The next phase of the compiler after the parser, sometimes called the static semantic analyzer, is in charge of checking for these kinds of errors. The checks can be done in two phases, each of which involves traversing the abstract-syntax tree created by the parser:

1. For each scope in the program: Process the declarations, adding new entries to the symbol table and reporting any variables that are multiply declared; process the statements, finding uses of undeclared variables, and updating the "ID" nodes of the abstract-syntax tree to point to the appropriate symbol-table entry.
2. Process all of the statements in the program again, using the symbol-table information to determine the type of each expression, and finding type errors.
Below, we will consider how to build symbol tables and how to use them to find multiply-declared and undeclared variables. We will then consider type checking.

Symbol Table Overview

The purpose of the symbol table is to keep track of names declared in the program. This includes names of classes, fields, methods, and variables. Each symbol table entry associates a set of attributes with one name; for example:

• which kind of name it is
• what is its type
• what is its nesting level
• where will it be found at runtime.
One factor that will influence the design of the symbol table is what scoping rules are defined for the language being compiled. Let's consider some different kinds of scoping rules before continuing our discussion of symbol tables.

Scoping

A language's scoping rules tell you when you're allowed to reuse a name, and how to match uses of a name to the corresponding declaration. In most languages, the same name can be declared multiple times under certain circumstances. In C++ you can use the same name in more than one declaration if the declarations involve different kinds of names. For example, you can use the same name for a class, a field of the class, and a local variable of a member function (this is not recommended, but it is legal):

class Test {
private:
    int Test;
    void func() {
        double Test;  // Could have been declared int
    }
};


In C++ functions (and Java methods), you can also use the same name more than once in a scope as long as the number and/or types of parameters are unique (this is called overloading).

In C and C++, but not in Java, you can declare variables with the same name in different blocks. A block is a piece of code inside curly braces; for example, in an if or a loop. The following is a legal C or C++ function, but not a legal Java method:

void f(int k) {
    int x = 0;       /* x is declared here */

    while (...) {
        int x = 1;   /* another x is declared here */
        ...
        if (...) {
            float x = 5.5;  /* and yet another x is declared here! */
            ...
        }
    }
}

As mentioned above, the scoping rules of a language determine which declaration of a named object corresponds to each use. C, C++, and Java use what is called static scoping; that means that the correspondence between uses and declarations is made at compile time. C and C++ use the "most closely nested" rule to match nested declarations to their uses: a use of variable x matches the declaration in the most closely enclosing scope such that the declaration precedes the use. In C and C++, there is one, outermost scope that includes the names of the global variables (the variables that are declared outside the functions) and the names of the functions that are not part of any class. Each function has one or more scopes. Both C and C++ have one scope for the parameters and the "top-level" declarations, plus one for each block in the function (delimited by curly braces). In addition, C++ has a scope for each for loop: in C++ (but not in pre-C99 C) you can declare variables in the for-loop header.

In the example given above, the outermost scope includes just the name "f", and function f itself has three scopes:

1. The outer scope for f includes parameter k, and the variable x that is initialized to 0.
2. The next scope is for the body of the while loop, and includes the variable x that is initialized to 1.
3. The innermost scope is for the body of the if, and includes the variable x that is initialized to 5.5.
So a use of variable x inside the loop, but not inside the if, matches the declaration in the loop (has the value 1). A use of x outside the loop (either before or after the loop) matches the declaration at the beginning of the function (has the value 0).

TEST YOURSELF #1

Question 1: Consider the names declared in the following code. For each, determine whether it is legal according to the rules used in Java.

class animal {
    // methods
    void attack(int animal) {
        for (int animal=0; animal<10; animal++) {
            int attack;
        }
    }

    int attack(int x) {
        for (int attack=0; attack<10; attack++) {
            int animal;
        }
    }

    void animal() { }

    // fields
    double attack;
    int attack;
    int animal;
}


Question 2: Consider the following C++ code. For each use of a name, determine which declaration it corresponds to (or whether it is a use of an undeclared name).

int k=10, x=20;

void foo(int k) {
    int a = x;
    int x = k;
    int b = x;
    while (...) {
        int x;
        if (x == k) {
            int k, y;
            k = y = x;
        }
        if (x == k) {
            int x = y;
        }
    }
}


Not all languages use static scoping. Lisp, APL, Perl (depending on how a variable is declared), and Snobol use what is called dynamic scoping. A use of a variable that has no corresponding declaration in the same function corresponds to the declaration in the most-recently-called still active function. For example, consider the following code:

void main() {
    f1();
    f2();
}

void f1() {
    int x = 10;
    g();
}

void f2() {
    String x = "hello";
    f3();
    g();
}

void f3() {
    double x = 30.5;
}

void g() {
    print(x);
}

Under dynamic scoping this program outputs "10 hello". The first call to g comes from f1, whose copy of x has value 10. The next call to g comes from f2. Although f3 is called by f2 before it calls g, the call to f3 is not active when g is called; therefore, the use of x in g matches the declaration in f2, and "hello" is printed.
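The dynamic-scoping lookup can be simulated directly. Here is a Python sketch (my own simulation; the `frames` list and `lookup` helper are illustrative assumptions) that models each call's locals as a frame on a stack and resolves a use by searching from the most recent frame down:

```python
frames = []   # stack of activation records, most recent last
output = []   # what the program "prints"

def lookup(name):
    # dynamic scoping: search the most-recently-called still-active
    # functions first
    for frame in reversed(frames):
        if name in frame:
            return frame[name]
    raise NameError(name)

def g():
    frames.append({})            # g declares no x of its own
    output.append(lookup("x"))
    frames.pop()

def f1():
    frames.append({"x": 10})
    g()
    frames.pop()

def f3():
    frames.append({"x": 30.5})
    frames.pop()                 # f3 returns before g is called

def f2():
    frames.append({"x": "hello"})
    f3()
    g()
    frames.pop()

def main():
    f1()
    f2()

main()
# output is now [10, "hello"], matching the "10 hello" described above
```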

TEST YOURSELF #2

Assuming that dynamic scoping is used, what is output by the following program?

void main() {
    int x = 0;
    f1();
    g();
    f2();
}

void f1() {
    int x = 10;
    g();
}

void f2() {
    int x = 20;
    f1();
    g();
}

void g() {
    print(x);
}


solution

It is generally agreed that dynamic scoping is a bad idea; it can make a program very difficult to understand, because a single use of a variable can correspond to many different declarations (with different types)! The languages that use dynamic scoping are all old languages; recently designed languages all use static scoping.

Another issue that is handled differently by different languages is whether names can be used before they are defined. For example, in Java, a method or field name can be used before the definition appears, but this is not true for a variable:

class Test {
    void f() {
        val = 0;   // field val has not yet been declared -- OK
        g();       // method g has not yet been declared  -- OK
        x = 1;     // variable x has not yet been declared -- ERROR!
        int x;
    }

    void g() {}
    int val;
}


In what follows, we will assume that we are dealing with a language that:

• uses static scoping
• requires that all names be declared before they are used
• does not allow multiple declarations of a name in the same scope (even for different kinds of names)
• does allow the same name to be declared in multiple nested scopes (but only once per scope)
• uses the same scope for a method's parameters and for the local variables declared at the beginning of the method

Symbol Table Implementations

In addition to the assumptions listed at the end of the previous section, we will assume that:

• The symbol table will be used to answer two questions:
1. Given a declaration of a name, is there already a declaration of the same name in the current scope (i.e., is it multiply declared)?
2. Given a use of a name, to which declaration does it correspond (using the "most closely nested" rule), or is it undeclared?
• The symbol table is only needed to answer those two questions (i.e., once all declarations have been processed to build the symbol table, and all uses have been processed to link each ID node in the abstract-syntax tree with the corresponding symbol-table entry, the symbol table itself is no longer needed).
Given these assumptions, the symbol-table operations we will need are:
1. Look up a name in the current scope only (to check if it is multiply declared).
2. Look up a name in the current and enclosing scopes (to check for a use of an undeclared name, and to link a use with the corresponding symbol-table entry).
3. Insert a new name into the symbol table with its attributes.
4. Do what must be done when a new scope is entered.
5. Do what must be done when a scope is exited.
We will look at two ways to design a symbol table: a list of hashtables, and a hashtable of lists. For each approach, we will consider what must be done when entering and exiting a scope, when processing a declaration, and when processing a use. To keep things simple, we will assume that each symbol-table entry includes only:
• the symbol name
• its type
• the nesting level of its declaration

### Method 1: List of Hashtables

The idea behind this approach is that the symbol table consists of a list of hashtables, one for each currently visible scope. When processing a scope S, the hashtable for S is at the front of the list, followed by the hashtables for the enclosing scopes, ending with the hashtable for the outermost scope.

For example, given this code:
void f(int a, int b) {
    double x;
    while (...) {
        int x, y;
        ...
    }
}

void g() {
    f();
}

After processing the declarations inside the while loop, the symbol table looks like this:
The declaration of method g has not yet been processed, so it has no symbol-table entry yet. Note that because f is a method, its type includes the types of its parameters (int, int), and its return type (void).

Here are the operations that need to be performed on scope entry/exit, and to process a declaration/use:

1. On scope entry: increment the current level number and add a new empty hashtable to the front of the list.
2. To process a declaration of x: look up x in the first table in the list. If it is there, then issue a "multiply declared variable" error; otherwise, add x to the first table in the list.
3. To process a use of x: look up x starting in the first table in the list; if it is not there, then look up x in each successive table in the list. If it is not in any table then issue an "undeclared variable" error.
4. On scope exit, remove the first table from the list and decrement the current level number.
Remember that method names need to go into the hashtable for the outermost scope (not into the same table as the method's variables). For example, in the picture above, method name f is in the symbol table for the outermost scope; name f is not in the same scope as parameters a and b, and variable x. This is so that when the use of name f in method g is processed, the name is found in an enclosing scope's table.

There are several factors involved in the time required for each operation:

1. Scope entry: time to initialize a new, empty hashtable; this is probably proportional to the size of the hashtable.
2. Process a declaration: using hashing, constant expected time (O(1)).
3. Process a use: using hashing to do the lookup in each table in the list, the worst-case time is O(depth of nesting), when every table in the list must be examined.
4. Scope exit: time to remove a table from the list, which should be O(1) if garbage collection is ignored.
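The four operations above can be sketched as a small Python class. This is my own illustration (class and method names are assumptions); per the assumptions earlier, each entry records just the name's type and the nesting level of its declaration.

```python
class SymbolTable:
    """List-of-hashtables symbol table: scopes[0] is the innermost scope."""

    def __init__(self):
        self.scopes = []              # one dict per currently visible scope
        self.level = 0                # current nesting level

    def enter_scope(self):
        self.level += 1
        self.scopes.insert(0, {})     # new empty table at front of list

    def exit_scope(self):
        self.scopes.pop(0)            # discard the innermost table
        self.level -= 1

    def declare(self, name, type_):
        if name in self.scopes[0]:    # check the current scope only
            raise ValueError("multiply declared variable: " + name)
        self.scopes[0][name] = (type_, self.level)

    def lookup(self, name):
        for table in self.scopes:     # innermost to outermost
            if name in table:
                return table[name]
        return None                   # undeclared variable

# Mirror the f example above: f in the outermost scope, then f's
# parameters/locals, then the while-loop scope.
st = SymbolTable()
st.enter_scope()                      # outermost scope (level 1)
st.declare("f", "int,int->void")
st.enter_scope()                      # f's scope (level 2)
for n, t in [("a", "int"), ("b", "int"), ("x", "double")]:
    st.declare(n, t)
st.enter_scope()                      # while-loop scope (level 3)
st.declare("x", "int")                # shadows the double x at level 2
st.declare("y", "int")
```

At this point `st.lookup("x")` finds the `int` declaration at level 3, and after the while scope is exited it finds the `double` declaration at level 2.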

TEST YOURSELF #3

For all three questions below, assume that the symbol table is implemented using a list of hashtables.

Question 1: Recall that Java does not allow the same name to be used for a local variable of a method, and for another local variable declared inside a nested scope in the method body. Even with this restriction, it is not a good idea to put all of a method's local variables (whether they are declared at the beginning of the method, or in some nested scope within the method body) in the same table. Why not?

Question 2: C++ does not use exactly the scoping rules that we have been assuming. In particular, C++ does allow a function to have both a parameter and a local variable with the same name (and any uses of the name refer to the local variable).

Consider the following code. Draw the symbol table as it would be after processing the declarations in the body of f under:

• the scoping rules we have been assuming
• C++ scoping rules
void g(int x, int a) { }

void f(int x, int y, int z) {
    int a, b, x;
    ...
}


Question 3: Assume that a symbol-table entry includes the "kind" of the declared name as well as the other attributes assumed above (if the same name is declared as two different "kinds" in one scope, there would be one entry with a list of "kinds"). Also assume that we can tell (from context), for each use of a name, what "kind" of name it is supposed to be.

Which of the four operations (scope entry, process a declaration, process a use, scope exit) described above would change (and how would it change) if Java rules for name reuse were used instead of C++ rules (i.e., if the same name can be used within one scope as long as the uses are for different kinds of names, and if the same name cannot be used for more than one variable declaration in nested scopes)?

solution

### Method 2: Hashtable of Lists

The idea behind this approach is that when processing a scope S, the structure of the symbol table is:

There is just one big hashtable, containing an entry for each variable for which there is some declaration in scope S or in a scope that encloses S. Associated with each variable is a list of symbol-table entries. The first list item corresponds to the most closely enclosing declaration; the other list items correspond to declarations in enclosing scopes.

For example, given this code:

void f(int a) {
    double x;
    while (...) {
        int x, y;
        ...
    }
}

void g() {
    f();
}

After processing the declarations inside the while loop, the symbol table looks like this:

Note that the level-number attribute stored in each list item enables us to determine whether the most closely enclosing declaration was made in the current scope or in an enclosing scope.

Here are the operations that need to be performed on scope entry/exit, and to process a declaration/use:

1. On scope entry: increment the current level number.
2. To process a declaration of x: look up x in the symbol table. If x is there, fetch the level number from the first list item. If that level number = the current level then issue a "multiply declared variable" error; otherwise, add a new item to the front of the list with the appropriate type and the current level number.
3. To process a use of x: look up x in the symbol table. If it is not there, then issue an "undeclared variable" error.
4. On scope exit, scan all entries in the symbol table, looking at the first item on each list. If that item's level number = the current level number, then remove it from its list (and if the list becomes empty, remove the entire symbol-table entry). Finally, decrement the current level number.
The required times for each operation are:
1. Scope entry: time to increment the level number, O(1).
2. Process a declaration: using hashing, constant expected time (O(1)).
3. Process a use: using hashing, constant expected time (O(1)).
4. Scope exit: time proportional to the number of names in the symbol table (or perhaps even the size of the hashtable if no auxiliary information is maintained to allow iteration through the non-empty hashtable buckets).
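These operations can also be sketched in a few lines of Python; as before, the names here are illustrative only. Each name maps to a list of (level, attributes) entries, innermost declaration first:

```python
# Symbol table as one hashtable mapping each name to a list of entries;
# the front of each list is the most closely enclosing declaration.

class SymbolTable:
    def __init__(self):
        self.table = {}       # name -> list of (level, attrs), innermost first
        self.level = 0

    def enter_scope(self):                # O(1): just bump the level number
        self.level += 1

    def declare(self, name, attrs):       # O(1) expected
        entries = self.table.setdefault(name, [])
        if entries and entries[0][0] == self.level:
            raise Exception("multiply declared variable: " + name)
        entries.insert(0, (self.level, attrs))

    def lookup(self, name):               # O(1) expected
        if name not in self.table:
            raise Exception("undeclared variable: " + name)
        return self.table[name][0]        # entry for the closest declaration

    def exit_scope(self):                 # O(#names in the table)
        for name in list(self.table):
            if self.table[name][0][0] == self.level:
                self.table[name].pop(0)
                if not self.table[name]:
                    del self.table[name]
        self.level -= 1

st = SymbolTable()
st.enter_scope()                  # level 1: f's parameters and locals
st.declare("a", "int")
st.declare("x", "double")
st.enter_scope()                  # level 2: the while loop's body
st.declare("x", "int")
print(st.lookup("x"))             # the level-2 declaration shadows level 1
```

Compared with the list-of-hashtables method, declarations and uses are now O(1) expected time, at the cost of the linear sweep on scope exit.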

TEST YOURSELF #4

Assume that the symbol table is implemented using a hashtable of lists. Draw pictures to show how the symbol table changes as the declarations in each scope in the following code are processed.

void g(int x, int a) {
    double d;
    while (...) {
        int d, w;
        double x, b;
        if (...) {
            int a, b, c;
        }
    }
    while (...) {
        int x, y, z;
    }
}


solution


Topic Introduction

By using context-free grammars to define the syntax of our language, we have achieved two goals:

• We create a (relatively) concise description of our language for users. Programmers can look at the grammar and puzzle out the structure of programs in our language.
• We create a (relatively) precise specification for language implementation. Tools such as Bison can take the grammar as input and automatically generate parsers that ensure compliance with the grammar, and even build (implicit) parse trees.
Nevertheless, it is perhaps obvious that context-sensitive features of a language cannot be captured by a context-free grammar. At best, a context-free grammar can specify an over-approximation of the set of valid programs. In other words, it is possible to write a program that parses, but is nevertheless invalid because it violates some context-sensitive property. For example, programs must be well-named (variables are declared before use, in languages that require it) and well-typed (operations must be compatible with their data values).

There is a distinction made between the syntax of a program (does it have a valid structure in the language?) and the semantics of a program (does it have a valid meaning in the language?). Although this distinction can sometimes be messy, we will simply make the working assumption that any input that violates the context-free grammar specification of the language is syntactic, and any other error is semantic.

Program Meaning

Program semantics is a subject worthy of a full course in its own right (often such courses have titles like "Programming Languages"). Semantic specifications tend to be more complicated than syntactic specifications; the inherent complexity of semantics requires more elaborate notation, so we will stick to prose definitions of familiar semantic properties.

Semantic Analysis

The compiler performs semantic analysis for two purposes:

1. To detect "bad programs"; i.e. to detect if the input violates the language's semantics before it is run
2. To respect "good programs"; i.e. to make sure that the compiler's translation adheres to the semantic properties of the language
It is provably impossible to tell algorithmically if a program could violate even simple semantic properties (a result due to Rice's Theorem, effectively a generalization of the well-known Halting Problem). Despite these harsh realities, the compiler's gotta compile and so must stick to approximate algorithms to determine a program's semantic validity. Fortunately, most compiler-writers intuitively accept the approximate nature of semantic checking, since they themselves have experience dealing with these approximations as compiler users.

For our purposes, we will perform much of the detection of "bad programs" and the analysis of "good programs" by walking over the AST. The tree-based representation of a program is a good match for the compositional nature of many of the properties we'd like to check (the program is well-typed if its constituent functions are well-typed; a function is well-typed if its constituent statements are well-typed; a statement is well-typed if its constituent expressions are well-typed). As such, the foundation of the checks that we will perform is a recursive walk of the AST: the correctness of each node will depend on the correctness of its children. We will also introduce mechanisms that help to propagate information across the AST, in order to handle the context-sensitive nature of semantic properties. One particularly important data structure is the symbol table, which we will discuss in detail in the next reading.
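To make the compositional idea concrete, here is a toy sketch of such a recursive walk over a hypothetical expression AST. The node classes and the two types are invented purely for this example:

```python
# Hypothetical mini-AST: int and bool literals, +, and ==.
# A node is well-typed iff its children are well-typed and compatible.

class IntLit:
    def __init__(self, v): self.v = v
class BoolLit:
    def __init__(self, v): self.v = v
class Add:
    def __init__(self, l, r): self.l, self.r = l, r
class Eq:
    def __init__(self, l, r): self.l, self.r = l, r

def check(node):
    """Return the node's type, or raise on a semantic error."""
    if isinstance(node, IntLit):  return "int"
    if isinstance(node, BoolLit): return "bool"
    if isinstance(node, Add):
        # well-typed iff both children are well-typed ints
        if check(node.l) == "int" and check(node.r) == "int":
            return "int"
        raise TypeError("operands of + must be int")
    if isinstance(node, Eq):
        # well-typed iff both children have the same type
        if check(node.l) == check(node.r):
            return "bool"
        raise TypeError("operands of == must have the same type")

# (1 + 2) == 3  is well-typed, with type bool
print(check(Eq(Add(IntLit(1), IntLit(2)), IntLit(3))))   # bool
```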

DATAFLOW ANALYSIS Adapted from a set of notes by Professor Susan Horwitz, UW-Madison

# Motivation for Dataflow Analysis

A compiler can perform some optimizations based only on local information. For example, consider the following code:

x = a + b;
x = 5 * 2;

It is easy for an optimizer to recognize that:
• The first assignment to x is a "useless" assignment, since the value computed for x is never used (and thus the first statement can be eliminated from the program).
• The expression 5 * 2 can be computed at compile time, simplifying the second assignment statement to x = 10;

Some optimizations, however, require more "global" information. For example, consider the following code:

a = 1;
b = 2;
c = 3;
if (...) x = a + 5;
else x = b + 4;
c = x + 1;

In this example, the initial assignment to c (at line 3) is useless, and the expression x + 1 can be simplified to 7, but it is less obvious how a compiler can discover these facts since they cannot be discovered by looking only at one or two consecutive statements. A more global analysis is needed so that the compiler knows at each point in the program:
• which variables are guaranteed to have constant values, and
• which variables will be used before being redefined.
To discover these kinds of properties, we use dataflow analysis. Dataflow analysis is usually performed on the program's control-flow graph (CFG); the goal is to associate with each program component (each node of the CFG) information that is guaranteed to hold at that point on all executions.

# Examples of constant propagation and live-variable analysis

Below are examples illustrating two dataflow-analysis problems: constant propagation and live-variable analysis. Both examples use the following program:
1.   k = 2;
2.   if (...) {
3.     a = k + 2;
4.     x = 5;
5.   } else {
6.     a = k * 2;
7.     x = 8;
8.   }
9.   k = a;
10.  while (...) {
11.     b = 2;
12.     x = a + k;
13.     y = a * b;
14.     k++;
15.  }
16.  print(a+x);


Constant Propagation

The goal of constant propagation is to determine where in the program variables are guaranteed to have constant values. More specifically, the information computed for each CFG node n is a set of pairs, each of the form (variable, value). If we have the pair (x, v) at node n, that means that x is guaranteed to have value v whenever n is reached during program execution.

Below is the CFG for the example program, annotated with constant-propagation information.

Live-Variable Analysis

The goal of live-variable analysis is to determine which variables are "live" at each point in the program; a variable is live if its current value might be used before being overwritten. The information computed for each CFG node is the set of variables that are live immediately after that node. Below is the CFG for the example program, annotated with live variable information.

TEST YOURSELF #1

Draw the CFG for the following program, and annotate it with the constant-propagation and live-variable information that holds before each node.

N = 10;
k = 1;
prod = 1;
MAX = 9999;
while (k <= N) {
    if (MAX/num < prod) {
        print("cannot compute prod");
        break;
    }
    prod = prod * num;
    k++;
}
print(prod);


solution

# Defining a Dataflow Problem

Before thinking about how to define a dataflow problem, note that there are two kinds of problems:

• Forward problems (like constant propagation) where the information at a node n summarizes what can happen on paths from "enter" to n.
• Backward problems (like live-variable analysis), where the information at a node n summarizes what can happen on paths from n to "exit".
In what follows, we will assume that we're thinking about a forward problem unless otherwise specified.

Another way that many common dataflow problems can be categorized is as may problems or must problems. The solution to a "may" problem tells what may be true at each program point, while the solution to a "must" problem tells what must be true. For example, live-variable analysis is a "may" problem: a variable is considered live after node n if its value may be used before being overwritten. Constant propagation is a "must" problem: the pair (x, v) holds before node n only if x must have the value v at that point.

Now let's think about how to define a dataflow problem so that it's clear what the (best) solution should be. When we do dataflow analysis "by hand", we look at the CFG and think about:

• What information holds at the start of the program.
• When a node n has more than one incoming edge in the CFG, how to combine the incoming information (i.e., given the information that holds after each predecessor of n, how to combine that information to determine what holds before n).
• How the execution of each node changes the information.

This intuition leads to the following definition. An instance of a dataflow problem includes:

• a CFG,
• a domain D of "dataflow facts",
• a dataflow fact "init" (the information true at the start of the program for forward problems, or at the end of the program for backward problems),
• an operator $\sqcap$ (used to combine incoming information from multiple predecessors),
• for each CFG node n, a dataflow function fn : D → D (that defines the effect of executing n).

For constant propagation, an individual dataflow fact is a set of pairs of the form (var, val), so the domain of dataflow facts is the set of all such sets of pairs (the power set). For live-variable analysis, it is the power set of the set of variables in the program.

For both constant propagation and live-variable analysis, the "init" fact is the empty set (no variable starts with a constant value, and no variables are live at the end of the program).

For constant propagation, the combining operation $\sqcap$ is set intersection. This is because if a node n has two predecessors, p1 and p2, then variable x has value v before node n iff it has value v after both p1 and p2. For live-variable analysis, $\sqcap$ is set union: if a node n has two successors, s1 and s2, then the value of x after n may be used before being overwritten iff that holds either before s1 or before s2. In general, for "may" dataflow problems, $\sqcap$ will be some union-like operator, while it will be an intersection-like operator for "must" problems.

For constant propagation, the dataflow function associated with a CFG node that does not assign to any variable (e.g., a predicate) is the identity function. For a node n that assigns to a variable x, there are two possibilities:

1. The right-hand side has a variable that is not constant. In this case, the function result is the same as its input, except that if variable x was constant before n, it is not constant after n.
2. All right-hand-side variables have constant values. In this case, the right-hand side of the assignment is evaluated, producing constant value c, and the dataflow-function result is the same as its input except that it includes the pair (x, c) for variable x (and excludes the pair for x, if any, that was in the input).

For live-variable analysis, the dataflow function for each node n has the form: fn(S) = (S - KILLn) union GENn, where KILLn is the set of variables defined at node n, and GENn is the set of variables used at node n. In other words, for a node that does not assign to any variable, the variables that are live before n are those that are live after n plus those that are used at n; for a node that assigns to variable x, the variables that are live before n are those that are live after n except x, plus those that are used at n (including x if it is used at n as well as being defined there).

An equivalent way of formulating the dataflow functions for live-variable analysis is: fn(S) = (S intersect NOT-KILLn) union GENn, where NOT-KILLn is the set of variables not defined at node n. The advantage of this formulation is that it permits the dataflow facts to be represented using bit vectors, and the dataflow functions to be implemented using simple bit-vector operations (and, or).
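The bit-vector formulation can be sketched as follows. The variable ordering and the example node are made up for illustration; each set of live variables is packed into the bits of an integer:

```python
# Live-variable facts as bit vectors: bit i set <=> variable i is live.

VARS = ["a", "b", "x"]                 # bit 0 = a, bit 1 = b, bit 2 = x

def bits(names):
    return sum(1 << VARS.index(n) for n in names)

# Node n is "x = a + b":  KILL_n = {x}, GEN_n = {a, b}
KILL = bits({"x"})
GEN  = bits({"a", "b"})

def f(S):
    # fn(S) = (S intersect NOT-KILL) union GEN, as bitwise and/or
    return (S & ~KILL) | GEN

after = bits({"x"})                    # suppose only x is live after n
before = f(after)                      # x is killed; a and b become live
print([v for i, v in enumerate(VARS) if before >> i & 1])   # ['a', 'b']
```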

It turns out that a number of interesting dataflow problems have dataflow functions of this same form, where GENn and KILLn are sets whose definitions depend only on n, and the combining operator $\sqcap$ is either union or intersection. These problems are called GEN/KILL problems, or bit-vector problems.

TEST YOURSELF #2

Consider using dataflow analysis to determine which variables might be used before being initialized; i.e., to determine, for each point in the program, for which variables there is a path to that point on which the variable was never defined. Define the "may-be-uninitialized" dataflow problem by specifying:

• the domain D of "dataflow facts",
• the operator $\sqcap$,
• the "init" dataflow fact,
• a dataflow function for each CFG node n.
Annotate the CFG for the example program from TEST YOURSELF #1 with the solution to the problem (the information that holds before each node).

If you did not define the may-be-uninitialized problem as a GEN/KILL problem, go back and do that now (i.e., say what the GEN and KILL sets should be for each kind of CFG node, and whether $\sqcap$ should be union or intersection).

solution

DATAFLOW ANALYSIS Adapted from a set of notes by Professor Susan Horwitz, UW-Madison

## Contents

• Topic Introduction
• MOP Solution
• Solving a set of equations
• Iterative Algorithms
• The Lattice Model of Dataflow Analysis
Topic Introduction

Once a dataflow problem has been set up, we can focus on solving instances of the problem. A solution to an instance of a dataflow problem is a dataflow fact for each node of the given CFG. But what does it mean for a solution to be correct, and if there is more than one correct solution, how can we judge whether one is better than another?

Ideally, we would like the information at a node to reflect what might happen on all possible paths to that node. This ideal solution is called the meet over all paths (MOP) solution, and is discussed below. Unfortunately, it is not always possible to compute the MOP solution; we must sometimes settle for a solution that provides less precise information.

The "Meet Over All Paths" Solution

The MOP solution (for a forward problem) for each CFG node n is defined as follows:

• For every path "enter → ... → n", compute the dataflow fact induced by that path (by applying the dataflow functions associated with the nodes on the path to the initial dataflow fact).
• Combine the computed facts (using the combining operator, $\sqcap$ ).
• The result is the MOP solution for node n.

For instance, in our running example program there are two paths from the start of the program to line 9 (the assignment k = a):

Path                   Constants associated w/ that path
1 → 2 → 3 → 4 → 9      k=2, a=4, x=5
1 → 2 → 6 → 7 → 9      k=2, a=4, x=8

Combining the information from both paths, we see that the MOP solution for node 9 is: k=2 and a=4.

It is worth noting that even the MOP solution can be overly conservative (i.e., may include too much information for a "may" problem, and too little information for a "must" problem), because not all paths in the CFG are executable. For example, a program may include a predicate that always evaluates to false (e.g., a programmer may include a test as a debugging device -- if the program is correct, then the test will always fail, but if the program contains an error then the test might succeed, reporting that error). Another way that non-executable paths can arise is when two predicates on the path are not independent (e.g., whenever the first evaluates to true then so does the second). These situations are illustrated below.

Unfortunately, since most programs include loops, they also have infinitely many paths, and thus it is not possible to compute the MOP solution to a dataflow problem by computing information for every path and combining that information. Fortunately, there are other ways to solve dataflow problems (given certain reasonable assumptions about the dataflow functions associated with the CFG nodes). As we shall see, if those functions are distributive, then the solution that we compute is identical to the MOP solution. If the functions are monotonic, then the solution may not be identical to the MOP solution, but is a conservative approximation.

Solving a Dataflow Problem by Solving a Set of Equations

The alternative to computing the MOP solution directly is to solve a system of equations that essentially specifies that local information must be consistent with the dataflow functions. In particular, we associate two dataflow facts with each node n:

1. n.before: the information that holds before n executes, and
2. n.after: the information that holds after n executes.
These n.befores and n.afters are the variables of our equations, which are defined as follows (two equations for each node n):
1. n.before = $\bigsqcap$(p1.after, p2.after, ...)
where p1, p2, etc are n's predecessors in the CFG (and $\sqcap$ is the combining operator for this dataflow problem).
2. n.after = fn ( n.before )
In addition, we have one equation for the enter node:
• enter.after = init (recall that "init" is part of the specification of a dataflow problem)
These equations make intuitive sense: the dataflow information that holds before node n executes is the combination of the information that holds after each of n's predecessors executes, and the information that holds after n executes is the result of applying n's dataflow function to the information that holds before n executes.

One question is whether, in general, our system of equations will have a unique solution. The answer is that, in the presence of loops, there may be multiple solutions. For example, consider the simple program whose CFG is given below:

The equations for constant propagation are as follows (where $\sqcap$ is the intersection combining operator):

enter.after = empty set
1.before = enter.after
1.after = 1.before - (x, *) union (x, 2)
2.before = 1.after
2.after = if (x, c) is in 2.before then 2.before - (y, *) union (y, c), else 2.before - (y, *)
3.before = $\bigsqcap$(2.after, 4.after )
3.after = 3.before
4.before = 3.after
4.after = 4.before
Because of the cycle in the example CFG, the equations for 3.before, 3.after, 4.before, and 4.after are mutually recursive, which leads to the four solutions shown below (differing on those four values).

| Variable | Solution 1 | Solution 2 | Solution 3 | Solution 4 |
|----------|------------|------------|------------|------------|
| 1.before | { } | { } | { } | { } |
| 1.after  | {(x, 2)} | {(x, 2)} | {(x, 2)} | {(x, 2)} |
| 2.before | {(x, 2)} | {(x, 2)} | {(x, 2)} | {(x, 2)} |
| 2.after  | {(x, 2) (y, 2)} | {(x, 2) (y, 2)} | {(x, 2) (y, 2)} | {(x, 2) (y, 2)} |
| 3.before | { } | {(x, 2)} | {(y, 2)} | {(x, 2) (y, 2)} |
| 3.after  | { } | {(x, 2)} | {(y, 2)} | {(x, 2) (y, 2)} |
| 4.before | { } | {(x, 2)} | {(y, 2)} | {(x, 2) (y, 2)} |
| 4.after  | { } | {(x, 2)} | {(y, 2)} | {(x, 2) (y, 2)} |

The solution we want is solution 4, which includes the most constant information. In general, for a "must" problem the desired solution will be the largest one, while for a "may" problem the desired solution will be the smallest one.

Self-Study #1

Using the simple CFG given above, write the equations for live-variable analysis, as well as the greatest and least solutions. Which is the desired solution, and why?

solution

Iterative Algorithms

Many different algorithms have been designed for solving a dataflow problem's system of equations. For our purposes, we will focus on a type of algorithm called an iterative algorithm.

Most of the iterative algorithms are variations on the following algorithm (this version is for forward problems). It uses a new value $\top$ (called "top"). $\top$ has the property that, for all dataflow facts $d, \top \sqcap d = d$. Also, for all dataflow functions, fn($\top$) = $\top$. (When we consider the lattice model for dataflow analysis we will see that this initial value is the top element of the lattice.)

• Step 1 (initialize n.afters):
  Set enter.after = init. Set all other n.after to $\top$.
• Step 2 (initialize worklist):
  Initialize a worklist to contain all CFG nodes except enter and exit.
• Step 3 (iterate):
  While the worklist is not empty:
      Remove a node n from the worklist.
      Compute n.before by combining all p.after such that p is a predecessor of n in the CFG.
      Compute tmp = fn ( n.before ).
      If tmp != n.after then:
          Set n.after = tmp.
          Put all of n's successors on the worklist.
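A direct transcription of those steps might look like the following Python sketch. The tiny diamond-shaped CFG, the constant-propagation-style facts, and the use of a `None` sentinel for $\top$ are all my own choices for illustration:

```python
# Forward iterative worklist algorithm, transcribed from the steps above.
# TOP is a sentinel with the property TOP meet d = d.

TOP = None

def meet(a, b):                       # combining operator: set intersection
    if a is TOP: return b
    if b is TOP: return a
    return a & b

# CFG (a diamond):  enter -> 1, 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
preds = {1: ["enter"], 2: [1], 3: [1], 4: [2, 3]}
succs = {1: [2, 3], 2: [4], 3: [4], 4: []}

# Dataflow functions: 1: k = 2;  2: x = 5;  3: x = 8;  4: identity
def fn(n, S):
    if S is TOP: return TOP
    if n == 1: return (S - {p for p in S if p[0] == "k"}) | {("k", 2)}
    if n == 2: return (S - {p for p in S if p[0] == "x"}) | {("x", 5)}
    if n == 3: return (S - {p for p in S if p[0] == "x"}) | {("x", 8)}
    return S

after = {"enter": frozenset()}        # Step 1: enter.after = init
for n in preds: after[n] = TOP        #         all other n.after = TOP
worklist = list(preds)                # Step 2: all nodes except enter/exit
while worklist:                       # Step 3: iterate to a fixed point
    n = worklist.pop()
    before = TOP
    for p in preds[n]:
        before = meet(before, after[p])
    tmp = fn(n, before)
    if tmp != after[n]:
        after[n] = tmp
        worklist.extend(succs[n])

print(sorted(after[4]))               # [('k', 2)] -- x is not constant at 4
```

At the merge node 4, the facts {(k,2), (x,5)} and {(k,2), (x,8)} are intersected, so only (k,2) survives, exactly as in the hand-computed examples.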

Self-Study #2

Run this iterative algorithm on the simple CFG given above (the one with a loop) to solve the constant propagation problem. Run the algorithm again on the example CFG from the examples section of the notes.

The above algorithm works regardless of the order in which nodes are removed from the worklist. However, that order can affect the efficiency of the algorithm. A number of variations have been developed that involve different ways of choosing that order. When we consider the lattice model, we will revisit the question of complexity.

The Lattice Model of Dataflow Analysis

Recall that while we would like to compute the meet-over-all-paths (MOP) solution to a dataflow problem, direct computation of that solution (by computing and combining solution for every path) is usually not possible. Therefore, dataflow problems are usually solved by finding a solution to a set of equations that define two dataflow facts (n.before and n.after) for each CFG node n.

Three important questions are:

1. How do we know that a solution to the equations exists?
2. If there is more than one solution, which one do we want?
3. How does the equation solution relate to MOP solution?

The answers are provided by the framework first defined by Kildall. The next section provides background on lattices; the section after that presents Kildall's framework.

### Background

Let us begin by considering some mathematical definitions that will be useful in our discussion of lattices (including the definition of "lattice" itself).

#### Partially ordered sets

Definition:

A partially ordered set (poset) is a set S that has a partial ordering ⊆ , such that the ordering is:
1. Reflexive: for all x in S, x ⊆ x
2. Anti-Symmetric: for all x,y in S, x ⊆ y and y ⊆ x implies x = y
3. Transitive: for all x,y,z in S, x ⊆ y and y ⊆ z implies x ⊆ z

Note: "partial" means it is not necessary that for all x,y in S, either x ⊆ y or y ⊆ x. It is OK for a pair of set elements to be incomparable.

Below are some examples of sets with orderings; some are partially ordered sets and some are not.

Example 1: The set S is the set of English words, and the ordering ⊆ is substring (i.e., w1 ⊆ w2 iff w1 is a substring of w2). Here is a picture of some words and their ordering (having an edge w1 → w2 means w1 > w2).

        candy                  then
        /   \                 /    \
       v     v               v      v
annual  and   can           the    hen
     \   |   /                \    /
      \  |  /                  v  v
       v v v                    he
         an
         |
         v
         a


Note that the "substring" ordering does have the three properties required of a partial order:
1. It is reflexive (since every word is a substring of itself).
2. It is anti-symmetric (since if two words are substrings of each other, then they must be the same).
3. It is transitive (since a substring of a substring of a word is also a substring of the word).

Example 2: S is the set of English words, and the ordering ⊆ is "is shorter than or equal to in length".

        candy
          |
          v
        then
      /  / \  \
     v  v   v  v
   can and the hen
     \  \   /  /
      v  v v  v
        an
         |
         v
         a

Does this ordering have the three properties?
1. Reflexive: yes.
2. Anti-Symmetric: NO (Counter-example: "and" and "the" have the same length, but are not the same word.)
3. Transitive: yes.
Two out of three isn't good enough -- this is not a poset.
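If you want to convince yourself mechanically, a brute-force check of the three properties is easy to write. This sketch uses a made-up word list and shows exactly where anti-symmetry breaks:

```python
# Check the "shorter than or equal in length" ordering: it is reflexive
# and transitive, but distinct words of equal length break anti-symmetry.

words = ["a", "an", "can", "and", "the", "hen", "then", "candy"]
leq = lambda w1, w2: len(w1) <= len(w2)

assert all(leq(w, w) for w in words)          # reflexive: every word <= itself

counterexamples = [(x, y) for x in words for y in words
                   if leq(x, y) and leq(y, x) and x != y]
print(counterexamples[0])                     # a pair that violates anti-symmetry
```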

Example 3: S is the set of integers, and the ordering ⊆ is "less than or equal to". This is a poset (try verifying each of the three properties).

Example 4: S is the set of integers and the ordering ⊆ is "strictly less than". This is not a poset, because the ordering is not reflexive.

Example 5: S is the set of all sets of letters and the ordering is subset. This is a poset.

#### Lattices

Definition:

A lattice is a poset in which every pair of elements has:
1. a Least Upper Bound (the join of the two elements), and
2. a Greatest Lower Bound (the meet of the two elements).
The join of two elements x and y is defined to be the element z such that:
1. x ⊆ z, and
2. y ⊆ z, and
3. for all w such that x ⊆ w and y ⊆ w, w ⊇ z.
The first two rules say that z actually is an upper bound for x and y, while the third rule says that z is the least upper bound. Pictorially:
        z
       / \           z is the least upper bound of x and y
      v   v
      y   x

    z  w
    |\/|
    |/\|             z is NOT the least upper bound of x and y
    vv vv            (they have NO least upper bound)
    x  y

The idea for the meet operation is similar, with the reverse orderings.
##### Examples:
• Example 1 above (English words, ordered by "substring") is not a lattice. (For instance, "a" and "he" have no meet.)
• Example 3 above (integers, ordered by "less than or equal to") is a lattice. The meet operator is "min" and the join operator is "max".
• Example 5 above (sets of letters, ordered by subset) is a lattice. The meet operator is intersection and the join operator is union.

#### Complete lattices

Definition:
A complete lattice is a lattice in which all subsets have a greatest lower bound and a least upper bound (the bounds must be in the lattice, but not necessarily in the subsets themselves).

Note: Every finite lattice (i.e., S is finite) is complete.

#### Examples:

• Example 3 above (integers, ordered by "less than or equal to") is not a complete lattice. The set of all positive numbers is a subset of S; it has a lower bound but no upper bound.
The set of all even numbers is a subset of S; it has neither an upper nor a lower bound.
• The set: {1, 1/2, 1/4, 1/8, ... 0} with the usual numeric ordering is an example of an infinite complete lattice.

Note: Every complete lattice has a greatest element, "Top" (written $\top$), and a least element, "Bottom" (written $\bot$). They are the least upper bound and the greatest lower bound, respectively, of the entire underlying set S.

#### Monotonic and distributive functions

Definition:
A function f: L → L (where L is a lattice) is monotonic iff for all x,y in L: x ⊆ y implies f(x) ⊆ f(y).
Definition:
A function f: L → L (where L is a lattice) is distributive iff for all x,y in L: f(x meet y) = f(x) meet f(y).

Every distributive function is also monotonic (proving that could be good practice!) but not vice versa. For the GEN/KILL dataflow problems, all dataflow functions are distributive. For constant propagation, all functions are monotonic, but not all functions are distributive. For example, the dataflow function f associated with this assignment

x = a + b
is not distributive. To see this, first recall that function f is defined as follows:
 f(S) = if S == T then T
        else if (a, v1) in S and (b, v2) in S then S - (x, *) + (x, v1+v2)
        else S - (x, *)
i.e., if f is applied to the special "top" value, then the result is T; otherwise, if both a and b are constant, then x is set to have the value of a+b (and its old value, if any, is removed from the set of pairs); otherwise, x's old value, if any, is removed from the set of pairs.
Now consider the following two sets:
S1 = { (a,1) (b,2) }        S2 = { (a,2) (b,1) }
The meet of S1 and S2 is { }, so f(S1 meet S2) is { }. However, if we apply f to S1 and S2 separately and then take the meet, we get:
f(S1) = {(a,1) (b,2) (x,3)}
f(S2) = {(a,2) (b,1) (x,3)}
f(S1) meet f(S2) = {(x,3)}
Since f(S1 meet S2) ≠ f(S1) meet f(S2), f is not distributive.
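The failure of distributivity is easy to check directly. Here is a sketch where facts are Python sets of (var, value) pairs and the TOP case is ignored for brevity:

```python
# Constant-propagation function for "x = a + b", showing
# f(S1 meet S2) != f(S1) meet f(S2)  (meet = set intersection).

def f(S):
    vals = dict(S)                             # var -> value lookup
    S = {p for p in S if p[0] != "x"}          # remove any old pair for x
    if "a" in vals and "b" in vals:
        S |= {("x", vals["a"] + vals["b"])}    # both constant: add (x, a+b)
    return S

S1 = {("a", 1), ("b", 2)}
S2 = {("a", 2), ("b", 1)}

print(f(S1 & S2))          # meet first: f({}) = {}  -- nothing known about x
print(f(S1) & f(S2))       # meet after: {('x', 3)}  -- x is 3 on both paths!
```

The meet-first result is what the iterative algorithm computes at a merge point, which is why constant propagation can lose information relative to the MOP solution.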

#### Fixed points

Definition:
x is a fixed point of function f iff f(x) = x.

#### Examples:

Let L be the lattice whose elements are sets of letters (as above). Here are the fixed points for the functions we considered above:
1. f(S) = S union {a}
This function has many fixed points: all sets that contain 'a'.
2. f(S) = if sizeof(S) ≤ 3 then S else empty-set
The fixed points for this function are all sets of size less than or equal to 3
3. f(S) = S - {a}
The fixed points for this function are all sets that do not contain 'a'.
As an example of a function that has no fixed point, consider the lattice of integers with the usual "less than or equal to" ordering. The function: f(x) = x+1 has no fixed point.

Here is an important theorem about lattices and monotonic functions:

Theorem:

If L is a complete lattice and f: L → L is monotonic, then:
• f has at least one fixed point.
• f's fixed points themselves form a (complete) lattice. (i.e. there exist greatest and least fixed points).
• If L has no infinite descending chains, then the greatest fixed point of f can be found by iterating:
f(T), f(f(T)), f(f(f(T))), ...
until a fixed point is found (i.e., two successive values are the same). (Note that "T" means L's top element.)
• Similarly, if L has no infinite ascending chains, then the least fixed point of f can be found by iterating
f(bottom), f(f(bottom)) ...
until a fixed point is found.
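The iteration in the theorem can be carried out directly on the lattice of sets of letters (subset ordering, so join is union and meet is intersection). A sketch, using the monotonic function f(S) = S ∪ {a} from the examples:

```python
# Finding fixed points by iteration on the complete lattice of
# subsets of {'a'..'z'}.

import string

TOP = frozenset(string.ascii_lowercase)   # greatest element: all letters
BOTTOM = frozenset()                      # least element: the empty set

def f(S):                                 # monotonic: S1 <= S2 implies f(S1) <= f(S2)
    return S | {"a"}

def iterate(f, start):
    x = start
    while f(x) != x:                      # stop when two successive values agree
        x = f(x)
    return x

print(iterate(f, TOP) == TOP)             # greatest fixed point is TOP itself
print(sorted(iterate(f, BOTTOM)))         # least fixed point: just {'a'}
```

Since this lattice is finite, it has no infinite chains in either direction, so both iterations are guaranteed to terminate.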

#### Examples

L is our usual lattice of sets of letters, with set union for the join.
1. f(S) = S U {a}
Recall that f is monotonic. The greatest fixed point of f is the set of all letters: {a, b, ..., z}. That is also the top element of the lattice, so we find the greatest fixed point in just one iteration:
f(T) = T
The bottom element is the empty set. If we apply f to that we get the set {a}. f({a}) = {a}, and we've found the least fixed point. Other fixed points are any set of letters that contains a.

2. f(S) = if size(S) ≤ 3 then S else {}
Recall that this function f is not monotonic. It has no greatest fixed point. It does have a least fixed point (the empty set). Other fixed points are sets of letters with size ≤ 3.

#### Creating new lattices from old ones

We can create new lattices from old ones using cross-product: if L1, L2, ..., Ln are lattices, then so is the cross-product of L1, L2, ..., Ln (which we can write as: L1 x L2 x ... x Ln). The elements of the cross-product are tuples of the form:

<e1, e2, ... , en>
such that value ek belongs to lattice Lk

The ordering is element-wise: <e1, e2, ..., en> ⊆ <e1', e2', ..., en'> iff:

e1 ⊆ e1' and
e2 ⊆ e2', and
..., and
en ⊆ en'

If L1, L2, ..., Ln are complete lattices, then so is their cross-product. The top element is the tuple that contains the top elements of the individual lattices: <top of L1, top of L2, ... , top of Ln>, and the bottom element is the tuple that contains the bottom elements of the individual lattices: <bottom of L1, bottom of L2, ... , bottom of Ln>.
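As a small sketch of the element-wise ordering (my own illustration, assuming a cross-product of two set lattices):

```python
def leq_set(a, b):
    """Ordering in a single lattice of sets: subset."""
    return a <= b

def leq_tuple(t1, t2):
    """Element-wise ordering on tuples: t1 <= t2 iff ek <= ek' for all k."""
    return all(leq_set(e1, e2) for e1, e2 in zip(t1, t2))

t1 = (frozenset({'a'}), frozenset())
t2 = (frozenset({'a', 'b'}), frozenset({'c'}))
print(leq_tuple(t1, t2))   # True: each component is below its partner
print(leq_tuple(t2, t1))   # False
```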

#### Summary of lattice theory

• poset: a partially ordered set, includes a set of elements S and a partial ordering ⊆
• lattice: a poset with meet and join for every pair of elements
• complete lattice: a lattice with meet and join for all subsets of S
• monotonic functions: x ⊆ y implies f(x) ⊆ f(y)
• fixed points: x is a fixed point of function f (of type L→L) iff x = f(x).
If L is a complete lattice and f is monotonic, then f has a greatest and a least fixed point.
If L has no infinite descending chains then we can compute the greatest fixed point of f via iteration ( f(T), f(f(T)) etc.)

### Kildall's Lattice Framework for Dataflow Analysis

Recall that our informal definition of a dataflow problem included:

• a domain D of "dataflow facts",
• a dataflow fact "init" (the information true at the start of the program),
• an operator ⌈⌉ (used to combine incoming information from multiple predecessors),
• for each CFG node n, a dataflow function fn : D → D (that defines the effect of executing n)
and that our goal is to solve a given instance of the problem by computing "before" and "after" sets for each node of the control-flow graph. The problem is that, with no additional information about the domain D, the operator ⌈⌉, and the dataflow functions fn, we can't say in general whether a particular algorithm for computing the before and after sets works correctly (e.g., does the algorithm always halt? does it compute the MOP solution? if not, how does the computed solution relate to the MOP solution?).

Kildall addressed this issue by putting some additional requirements on D, ⌈⌉ , and fn. In particular he required that:

1. D be a complete lattice L such that for any instance of the dataflow problem, L has no infinite descending chains.
2. ⌈⌉ be the lattice's meet operator.
3. All fn be distributive.
He also required (essentially) that the iterative algorithm initialize n.after (for all nodes n other than the enter node) to the lattice's "top" value. (Kildall's algorithm is slightly different from the iterative algorithm presented here, but computes the same result.)

Given these properties, Kildall showed that:

• The iterative algorithm always terminates.
• The computed solution is the MOP solution.
It is interesting to note that, while his theorems are correct, the example dataflow problem that he uses (constant propagation) does not satisfy his requirements; in particular, the dataflow functions for constant propagation are not distributive (though they are monotonic). This means that the solution computed by the iterative algorithm for constant propagation will not, in general, be the MOP solution. Below is an example to illustrate this:
             1: enter
                 |
                 v
             2: if (...)
              /        \
             v          v
        3: a = 2     4: a = 3
             |          |
             v          v
        5: b = 3     6: b = 2
              \        /
               v      v
             7: x = a + b
                 |
                 v
             8: print(x)

The MOP solution for the final print statement includes the pair (x,5), since x is assigned the value 5 on both paths to that statement. However, the greatest solution to the set of equations for this program (the result computed using the iterative algorithm) finds that x is not constant at the print statement. This is because the equations require that n.before be the meet of m.after for all predecessors m; in particular, they require that the "before" set for node 7 (x = a + b) be empty, since the "after" sets of the two predecessors are {(a,2), (b,3)} and {(a,3), (b,2)}, respectively, and the intersection of those two sets is empty. Given that value for 7.before, the equations require that 7.after (and 8.before) say that x is not constant. We can only discover that x is constant after node 7 if both a and b are constant before node 7.
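The failure of distributivity in this example can be sketched concretely. The representation below (environments as Python dicts, a NOT_CONST sentinel) is my own illustration, not Kildall's:

```python
NOT_CONST = object()   # sentinel for "not a constant"

def meet(env1, env2):
    """Keep only the (var, value) pairs on which both environments agree."""
    return {v: env1[v] for v in env1 if v in env2 and env1[v] == env2[v]}

def f7(env):
    """Dataflow function for node 7: x = a + b."""
    out = dict(env)
    if isinstance(env.get('a'), int) and isinstance(env.get('b'), int):
        out['x'] = env['a'] + env['b']
    else:
        out['x'] = NOT_CONST
    return out

path1 = {'a': 2, 'b': 3}   # facts after nodes 3 and 5
path2 = {'a': 3, 'b': 2}   # facts after nodes 4 and 6

mop = meet(f7(path1), f7(path2))     # apply f7 along each path, then meet
iterative = f7(meet(path1, path2))   # meet first, then apply f7

print(mop['x'])                      # 5 -- the MOP solution knows x is constant
print(iterative['x'] is NOT_CONST)   # True -- the equations lose that fact
```

Distributivity would require f7(meet(path1, path2)) to equal meet(f7(path1), f7(path2)); as the two results show, it does not.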

In 1977, a paper by Kam and Ullman (Acta Informatica 7, 1977) extended Kildall's results to show that, given monotonic dataflow functions:

• The solutions to the set of equations form a lattice.
• The solution computed by the iterative algorithm is the greatest solution (using the lattice ordering).
• If the functions are monotonic but not distributive, then the solution computed by the iterative algorithm may be less than the MOP solution (using the lattice ordering).

To show that the iterative algorithm computes the greatest solution to the set of equations, we can "transform" the set of equations into a single, monotonic function L → L (for a complete lattice L) as follows:

Consider the right-hand side of each equation to be a "mini-function". For example, for the two equations:

n3.before = n1.after meet n2.after
n3.after = f3( n3.before )
The two mini-functions, g11 and g12, are:
g11(a, b) = a meet b
g12(a) = f3( a )

Define the function that corresponds to all of the equations to be:

f( <n1.before,n1.after,n2.before,n2.after ...> ) = <g11(..),g12(...),g21(..), g22(...),...>
Where the (...)s are replaced with the appropriate arguments to those mini-functions. In other words, function f takes one argument that is a tuple of values. It returns a tuple of values, too. The returned tuple is computed by applying the mini-functions associated with each of the dataflow equations to the appropriate inputs (which are part of the tuple of values that is the argument to function f).

Note that every fixed point of f is a solution to the set of equations! We want the greatest solution (i.e., the greatest fixed point). To guarantee that this solution exists we need to know that:

1. the type of f is L→L, where L is a complete lattice
2. f is monotonic

To show (1), note that each individual value in the tuple is an element of a complete lattice. (That is required by Kildall's framework.) So since cross product (tupling) preserves completeness, the tuple itself is an element of a complete lattice.

To show (2), note that the mini-functions that define each n.after value are monotonic (since those are the dataflow functions, and we've required that they be monotonic). It is easy to show that the mini-functions that define each n.before value are monotonic, too.

For a node n with k predecessors, the equation is:

n.before = m1.after meet m2.after meet ... meet mk.after
and the corresponding mini-function is:
f(a1, a2, ..., ak) = a1 meet a2 meet ... meet ak
We can prove that these mini-functions are monotonic by induction on k.

base case k=1

f(a1) = a1

We must show that given: a ⊆ a', f(a) ⊆ f(a').

For this f, f(a) = a, and f(a') = a', so this f is monotonic.

base case k=2

f(a1, a2) = a1 meet a2

We must show that given: a1 ⊆ a1' and a2 ⊆ a2', f(a1, a2) ⊆ f(a1', a2').

• by the definition of meet we know that:
(a1 meet a2) ⊆ a1 , and
(a1 meet a2) ⊆ a2
• by the transitivity of ⊆ we know that:
(a1 meet a2) ⊆ a1 ⊆ a1' implies (a1 meet a2) ⊆ a1', and
(a1 meet a2) ⊆ a2 ⊆ a2' implies (a1 meet a2) ⊆ a2'
• so (a1 meet a2) is a lower bound for both a1' and for a2'
• since (a1' meet a2') is (by definition) the greatest lower bound of a1' and a2', it must be that all other lower bounds are less; in particular: (a1 meet a2) must be ⊆ (a1' meet a2').

Induction Step

Assume that for all k < n:

a1 ⊆ a1' and a2 ⊆ a2' and ... and an-1 ⊆ an-1' implies
f(<a1, ..., an-1>) ⊆ f(<a1', ..., an-1'>)

Now we must show the same thing for k=n:
• f(<a1, a2, ..., an>) = a1 meet a2 meet ... meet an = f(<a1, a2, ..., an-1>) meet an
• let x = f(<a1, ..., an-1>)
• let x' = f(<a1', ..., an-1'>)
• the induction hypothesis says: x ⊆ x'
• we need to show: x meet an ⊆ x' meet an'
• this is true by the base case k=2!

Given that all the mini-functions are monotonic, it is easy to show that f (the function that works on the tuples that represent the nodes' before and after sets) is monotonic; i.e., given two tuples:

t1 = <e1, e2, ..., en>, and
t2 = <e1', e2', ..., en'>
such that: t1 ⊆ t2, we must show f(t1) ⊆ f(t2). Recall that, for a cross-product lattice, the ordering is element-wise; thus, t1 ⊆ t2 means: ek ⊆ ek', for all k. We know that all of the mini-functions g are monotonic, so for all k, gk(ek) ⊆ gk(ek'). But since the ordering is element-wise, this is exactly what it means for f to be monotonic!

We now know:

• f is a monotonic function of type L→L
• L is a complete lattice with no infinite descending chains

Therefore:

• f has a greatest fixed point
• It can be computed using f(T), f(f(T)), ...
This is not quite what the iterative algorithm does, but it is not hard to see that it is equivalent to one that does just this: initialize all n.before and n.after to top, then on each iteration, compute all of the "mini-functions" (i.e., recompute n.before and n.after for all nodes) simultaneously, terminating when there is no change. The actual iterative algorithm presented here is an optimization in that it only recomputes n.before and n.after for a node n when the "after" value of some predecessor has changed.
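The naive "recompute everything simultaneously" version described above can be sketched as follows (the solver and the tiny two-equation instance are my own illustration, not the notes' code):

```python
def solve(mini_functions, top):
    """mini_functions: one function per equation, each taking the whole
    tuple of current values and returning that component's new value.
    Starting every component at top yields the greatest fixed point."""
    state = tuple(top for _ in mini_functions)
    while True:
        new_state = tuple(g(state) for g in mini_functions)
        if new_state == state:       # no change anywhere: fixed point reached
            return state
        state = new_state

# A tiny instance over the lattice of subsets of {a, b}, meet = intersection:
#   component 0:  x = x meet {a}
#   component 1:  y = x meet y
TOP = frozenset({'a', 'b'})
gs = [
    lambda s: s[0] & frozenset({'a'}),
    lambda s: s[0] & s[1],
]
result = solve(gs, TOP)   # both components converge to {'a'}
```

Since the lattice is finite and each mini-function is monotonic, the state can only move down the lattice, so the loop must terminate.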

# SUMMARY

Given:
• A CFG,
• a domain of "dataflow facts" (a complete lattice L),
• a monotonic dataflow function for each CFG node, and
• an initial dataflow fact (an element of L)
the goal of dataflow analysis is to compute a "dataflow fact" (an element of L) for each CFG node. Ideally, we want the MOP (meet over all paths) solution, in which the fact at node n is the combination of the facts induced by all paths to n. However, for CFGs with cycles, it is not possible to compute this solution directly.

Another approach to solving a dataflow problem is to solve a system of equations that relates the dataflow facts that hold before each node to the facts that hold after the node.

Kildall showed that if the dataflow functions are distributive, then the (original version of the) iterative algorithm always terminates, and always finds the MOP solution. Kam and Ullman later showed that if the dataflow functions are monotonic then the iterative algorithm always finds the greatest solution to the set of equations. They also showed that if the functions are monotonic but not distributive, then that solution is not always the same as the MOP solution. It is also true that the greatest solution to the system of equations is always an approximation to the MOP solution (i.e., may be lower in the lattice of solutions).

Contents

Topic Introduction

A fundamental consideration in the design and implementation of any programming language is to determine which operations are invalid: if the programmer can express what values are valid or acceptable, automated tools can be leveraged to ensure that values adhere to that notion of acceptability.

One of the key ways that programmers express what is acceptable in the language is through the use of types. While most programmers have an intuitive understanding of the concept of types, these notions need to be formalized in order to implement a programming language.

Type Vocabulary

In order to work with types, we will need a definition. As with many topics in compilers and programming languages, the wide variety of uses for the term "type" can make it difficult to arrive at a definition that is both meaningful and broad enough to capture its practical use. For our purposes, we will define a type as a constraint upon the values that a particular program construct may take. For example, if we declare a variable v to be of type int32, we specify the constraint that the value is an integer that can be represented in 32 bits (and thus carries a maximum and minimum value defined by the integer representation). As another example, we might declare a variable u to be of type unsigned int32, which specifies the constraint that the value must be non-negative and thus raises the minimum value that u may hold compared to v. Most implementations would use the bits that would otherwise represent negative values to represent larger positive values, and thus u may also take on larger values than v.

In order to consider the full generality of even the most popular programming languages, it is worthwhile to consider some of the things that are not requirements on types:

• Types need not be specified explicitly by the programmer. Some languages (such as assembly languages) do not attach a type to a "variable" at all; instead, each operation carries with it a specifier for the typing of its operands.
• The language may internally assign types to program constructs at compile time when the types are not explicitly identified by the programmer, which is known as type inference. Type inference relies on algorithmic definitions to determine a possible assignment of types to program constructs (and will often reject a program if no possible typing can be derived).
• The language may not require that the types of constructs remain the same throughout the lifetime of the program's execution, or may not assign types until the program is run. Such languages are known as dynamically-typed languages, in contrast to the statically-typed languages that assign types at compile time.

Type Rules

The type rules of a language define how to determine expression types, and what is considered to be an error. The type rules specify, for every operator (including assignment), what types the operands can have, and what is the type of the result. For example, both C++ and Java allow the addition of an int and a double, and the result is of type double. However, while C++ also allows a value of type double to be assigned to a variable of type int, Java considers that an error.

Types Self Study

List as many of the operators that can be used in a Java program as you can think of (don't forget to think about the logical and relational operators as well as the arithmetic ones). For each operator, say what types the operands may have, and what is the type of the result.

Type Checking

Once type rules are established in a language, the effect of those rules needs to be respected throughout the program. The strictness of the type rules varies from language to language, so the amount of work that the compiler has to do to ensure adherence to the type rules varies as well. Determining violations of the type rules is referred to as type checking.

Determining Expression Types

Expressions are entities in a program that yield a value, and thus it can be said that expressions have a type (which is a shorthand for saying that the value yielded by the expression is of that type). Determining the type held by a particular variable or used in a (sub)expression is often simply called typing. A program for which there is no valid typing is referred to as being ill-typed, and a program for which a valid typing is assigned is referred to as being well-typed.

The wide variety of languages, with their various type rules, means that different techniques for typing are required. In a statically-typed language, types are determined at compile time. In a dynamically-typed language, the type of an expression is determined at runtime.

Static typing is typically associated with languages that are compiled to native code, because different machine-code instructions need to be generated based on the operands of an expression. For example, many hardware architectures have distinct floating-point and integer division instructions, so the correct instruction depends on the operand types being known before the final code is run. Although it is feasible to target native code with a dynamically-typed language, the compiler would need to output instruction sequences for each possible version of the operation that could reach the expression, as well as auxiliary instructions so that the correct sequence is identified and executed.

One advantage of static typing is that ill-typed programs can be identified before the program is deployed. However, Rice's Theorem looms large over static typing: there will necessarily be input programs for which the typing cannot be determined exactly. Consequently, approximate algorithms are used for static typing. These approximations manifest in a variety of ways.

Finding Type Errors

In addition to finding type errors caused by operators being applied to operands of the wrong type, the type checker must also find type errors having to do with expressions that, because of their context, must be boolean, and type errors having to do with function calls. Examples of the first kind of error include:

• the condition of an if statement
• the condition of a while loop
• the termination condition part of a for loop
and examples of the second kind of error include:
• calling something that is not a method
• calling a method with the wrong number of arguments
• calling a method with arguments of the wrong types
A Simple Typing Algorithm

Some languages have comparatively complex typing algorithms, involving type inference in which the type of a variable is determined without an explicit directive from the programmer. For the purposes of this class, we will use a much simpler algorithm that relies on a simple recursive walk over the AST, leveraging information populated by name analysis (one pass that performs name analysis and type analysis is possible, but conceptually we will consider name analysis to take place first).

Since expressions are the only program constructs that yield types, only AST nodes that are expression subclasses need to be assigned a type. Although other nodes will need to facilitate the type analysis pass, they will simply ensure that any enclosed expressions are reached by the analysis. Assigning types is slightly different based on the variety of expression:

• Literals will have a type that can be immediately determined based on the literal's value. Some literal values are admissible in exactly one type, and thus can be assigned that type. For example, the literal true is (obviously) boolean. A numeric literal may be assigned the "simplest" type of which the value is a member. For example, a node for the literal 1 will be assigned integer type, rather than a floating-point type, whereas 1.0 will be assigned a floating-point type.
• Variable Identifiers will have a type that is determined via name analysis: whichever type specified in the declaration to which the identifier is bound will be the type of the identifier.
• Function Identifiers will have a type that is the return type of the function, which is explicit in the function signature in C-like languages. Additionally, the expression for each actual argument needs to be compatible with the type of the matching formal argument declared explicitly in the function signature.
• Arithmetic Expressions will have a type that is recursively defined by the types of the operand subexpressions. For instance, to type a plus expression, recursively determine the type of the left operand subexpression, then recursively determine the type of the right operand. If both subexpression types are integers, the plus is integer addition and therefore yields an integer result type. If both subexpression types are floating-point, the plus is floating-point addition and therefore yields a floating-point result type. If the types are different, the type promotion rules come into play. For example, in C, if the left subexpression type is int and the right subexpression is float, then the int will be promoted to float and the plus will be floating-point addition. It is also worth noting that if the types are incompatible and cannot be made compatible via promotion, then the plus represents an ill-typed operation, and yields an error type.
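The recursive typing of literals and plus expressions described above can be sketched as follows. The tuple-based AST nodes here are hypothetical stand-ins, not the course's AST classes:

```python
INT, FLOAT, ERROR = 'int', 'float', 'error'

def type_of(node):
    """Recursively assign a type to an expression node.
    Nodes are tuples: ('intlit', v), ('floatlit', v), or ('plus', l, r)."""
    kind = node[0]
    if kind == 'intlit':
        return INT                    # e.g. the literal 1
    if kind == 'floatlit':
        return FLOAT                  # e.g. the literal 1.0
    if kind == 'plus':
        left = type_of(node[1])       # recursively type each operand
        right = type_of(node[2])
        if ERROR in (left, right):
            return ERROR              # propagate an existing error
        if left == right:
            return left               # int + int, or float + float
        if {left, right} == {INT, FLOAT}:
            return FLOAT              # the int operand is promoted
        return ERROR                  # incompatible, no promotion applies
    return ERROR

expr = ('plus', ('intlit', 1), ('floatlit', 2.0))   # the expression 1 + 2.0
print(type_of(expr))   # float
```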

## Contents

Introduction

In this set of notes we will consider:

• How storage is laid out at runtime.
• What information is stored for each method.
• What happens during method call and return for two different approaches to storage layout: static and stack allocation.

Storage Layout

There are many possible ways to organize memory. We will concentrate mainly on the standard Unix approach, illustrated below:

Usually, the stack is used to store one activation record for each currently active method, and the heap is used for dynamically allocated memory (i.e., memory allocated as a result of using the new operator). An activation record is a data structure used to hold information relevant to one method call. The exact structure of an activation record depends both on the language in use and on the particular implementation; a typical organization is shown in the following picture (the individual fields will be discussed in some detail below and in the next set of notes).
As mentioned above, activation records are usually stored on the stack. A new record is pushed onto the stack when a method is called, and is popped when the method returns. However, for some languages, activation records may be stored in the heap (this might be done, for example, in a concurrent language, in which method calls do not obey the last-in-first-out protocol of a stack) or in the static data area. We will briefly consider the latter approach, then look at the most common case of stack allocation. In both cases, we will consider what must be done when a method is called, when it starts executing, and when it returns.

## Static Allocation

Some old implementations of Fortran used this approach: there is no heap or stack, and all activation records are in the static data area, one per method. This means that every time a method is called, its parameters and local variables are stored in the same locations (which are known at compile time). This approach has some advantages and disadvantages when compared with stack or heap allocation of activation records:

• + fast access to all names (e.g., no need to compute the address of a variable at runtime)
• + no overhead of stack/heap manipulation
• - no recursion
• - no dynamic allocation
Using this approach, when a method is called, the calling method:
• Copies each argument into the corresponding parameter's space in the called method's activation record (AR).
• May save some registers (in its own AR).
• Performs a "Jump & Link": Jump to the first instruction of the called method, and put the address of the next instruction after the call (the return address) into the special RA register (the "return address" register).
The called method:
• Copies the return address from RA into its AR's return-address field.
• May save some registers (in its AR).
• May initialize local data.
When the called method is ready to return, it:
• Restores the values of any registers that it saved.
• Jumps to the address that it saved in its AR's return-address field.
Back in the calling method, the code that follows that call does the following:
• Restores any registers that it saved.
• If the called method was non-void (returned a value), put the return value (which may be in a special register or in the AR of the called method) in the appropriate place. For example, if the code was x = f();, then the return value should be copied into variable x.

Assume that static allocation is used, and that each activation record contains local variables, parameters, the return address, and (for non-void methods) the return value. Trace the execution of the following code by filling in the appropriate fields of the activation records of the three methods. Also think about where the string literals would be stored.

1.  void error(String name, String msg) {
2.    System.out.println("ERROR in method " + name + ": " + msg);
3.  }
4.
5.  int summation(int max) {
6.    int sum = 1;
7.    for (int k=1; k<=max; k++) {
8.      sum += k;
9.    }
10.   return sum;
11. }
12.
13. void main() {
14.   int x = summation(3);
15.   if (x != 6) error("main", "bad value returned by summation");
16. }


solution

Stack Allocation

Stack allocation is used to implement most modern programming languages. The basic idea is that:

• Each time a method is called, a new AR (also called a stack frame) is pushed onto the stack.
• The AR is popped when the method returns.
• A register (SP for "stack pointer") points to the top of the stack.
• Another register (FP for "frame pointer") points to the start of the current method's AR.
When a method is called, the calling method:
• May save some registers (in its own AR).
• If the language allows nested methods, may set up the access link; this means pushing the appropriate value -- more on that in the next set of notes -- onto the stack.
• Pushes the parameters onto the stack (into space that will be part of the called method's AR).
• Does a "Jump & Link" -- jumps to the 1st instruction of the called method, and puts the address of the next instruction (the one after the call) into register RA.
The called method:
• Pushes the return address (from RA) onto the stack (into its AR's "return address" field).
• Pushes the old FP into its AR's "control link" field.
• Sets the FP to point to the bottom of its AR (to the "access link" field if there is one; otherwise, to the first parameter). The address of that field is computed as follows: SP + (sizes of params) + (size of "access link" field) + (size of "return address" field) + (size of "control link" field). All of these sizes are computed at compile time. (Note that values are added to the SP because we are assuming that "lower" on the stack means a higher address.)
• May save some registers (by pushing them onto the stack).
• Sets up the "local data" fields. This may involve pushing actual values if the locals are initialized as part of their declarations, or it may just involve subtracting their total size from the SP.
When the method returns, it:
• Restores the values of any saved registers.
• Restores the old stack pointer (SP = FP).
• Restores the old frame pointer (FP = saved FP, i.e., the value in the control-link field).
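The push/pop protocol above can be sketched with a toy model. Everything here (the list-as-stack, the field names) is my own illustration, not generated code:

```python
stack = []    # grows by append; a frame's index stands in for the FP
fp = None     # no current frame before the program starts

def call(params, return_addr):
    """The caller pushes params; the callee saves RA and the old FP."""
    global fp
    frame = {'params': params,
             'ret_addr': return_addr,   # the AR's "return address" field
             'control_link': fp}        # the AR's "control link" (old FP)
    stack.append(frame)
    fp = len(stack) - 1                 # FP now points at the new frame

def ret():
    """Pop the AR (restoring the SP) and restore the caller's FP."""
    global fp
    frame = stack.pop()
    fp = frame['control_link']
    return frame['ret_addr']

call({}, 'exit')             # main's activation record
call({'x': 1}, 'main+1')     # main calls f1(1)
print(fp)                    # 1
print(ret())                 # main+1 -- control returns to main
print(fp)                    # 0
```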

Activation Records

Consider the following code:

void f2(int y) {
    f1(y);
}

void f1(int x) {
    if (x > 0) f2(x-1);
}

void main() {
    int a = 1;
    f1(1);
}

The following pictures show the activation records on the stack at different points during the code's execution (only the control link, parameter, and local variable fields are shown).
1. When the program starts:
2. After main calls f1:
3. After f1 calls f2:
4. After f2 calls f1:
After this, f1 returns (and its AR is popped), then f2 returns, then the first call to f1 returns, then the whole program ends.

Assume that stack allocation is used. Trace the execution of the following code by filling in the local variables, parameters, and control link fields of the activation records (recall that dynamically allocated storage is stored in the heap, not on the stack).

1.  void init(int[] A, int len) {
2.    for (int k=1; k<len; k++) {
3.      A[k] = k;
4.    }
5.  }
6.
7.  void main() {
8.    int[] x = new int[3];
9.    init(x, 3);
10. }


solution

Contents

In this set of notes we will consider how three different kinds of variables (local, global, and non-local) are accessed at runtime. For the purposes of this discussion, we will define these three categories as follows:

1. Local variables: variables that are declared in the method that accesses them.
2. Global variables: variables that are not declared in any method, but are accessible in all methods. For example, the public static fields of a Java class can be used in any method; variables can be declared at the file level in a C or C++ program, and used in any function; variables can be declared in the outermost scope of a Pascal program and can be used in any procedure or function.
3. Non-local variables: we will use this term for two situations:
1. In languages that allow sub-programs to be nested, it refers to variables that are declared in one sub-program and used in a nested sub-program. This is allowed, for example, in Pascal.
2. In languages with dynamic scope, it refers to variables that are used in a method without being declared there (so the use corresponds to the declaration in the "most recently called, still active" method).
Our discussion will include information specific to the last programming assignment (the code generator); i.e., how to generate MIPS code to access these three kinds of variables.

Local variables (including parameters) are stored in the activation record of the method in which they are declared. They are accessed at runtime using an offset from the frame pointer (FP). Since we're assuming that "up" in the stack means a lower address, these offsets will be negative numbers. For example, given this code:

        void P(int x) {
            int a, b;
            ...
        }

and assuming that activation records do not include an access link field or space for saved registers, P's activation records will be organized as follows:
                            <--- Stack Pointer
        |               |
        |_______________|
     b: |               |
        |_______________|
     a: |               |
        |_______________|   (control link)
        |_______________|   (return address)
     x: |               |
        |_______________|   <--- Frame Pointer

Assuming 4 bytes for each pointer and each integer, here are the offsets for each of P's locals:
• x: offset 0
• a: offset -12
• b: offset -16
The following MIPS code loads the values of a and b into registers t1 and t2, respectively:
lw $t1, -12($fp)  # load a
lw $t2, -16($fp)  # load b

To be able to generate this code at compile time (e.g., to process a statement like x = a + b), the offset of each local variable must be known by the compiler. Therefore, the offset of each local variable from the Frame Pointer should be stored as an attribute of the variable in the symbol table. This can be done as follows:
• Keep track of the current offset in a global variable (e.g., a static ASTnode field) or in the symboltable (i.e., add a new field to the SymbolTable class, with methods to set and get the field's value).
• When the symbol-table-building code starts processing a method, set the current offset to zero.
• For each parameter: add the name to the symboltable as usual, but also add the value of the current offset as an attribute, and update the current offset by subtracting the size of the parameter (which will depend on its type).
• After processing all parameters, subtract 8 from the current offset (to leave room for the return address and the control link).
• For each local variable, add the name to the symboltable with the current offset, and subtract the appropriate amount from the current offset.
Note that the "current offset" is only set to zero at the start of a method, not at the start of a nested block, so each variable declared (somewhere) in the method has its own, unique offset in that method's AR.
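The offset-assignment scheme above can be sketched as follows. This is a Python sketch with hypothetical names, not the actual SymbolTable class, and the middle-of-slot treatment of doubles is omitted:

```python
def assign_offsets(params, locals_, size_of):
    """params and locals_ are lists of (name, type) pairs; returns a map
    from each name to its offset (in bytes) from the Frame Pointer."""
    offsets = {}
    current = 0                  # set to zero at the start of each method
    for name, ty in params:
        offsets[name] = current  # record the current offset...
        current -= size_of[ty]   # ...then move past the parameter
    current -= 8                 # room for return address + control link
    for name, ty in locals_:
        offsets[name] = current
        current -= size_of[ty]
    return offsets

SIZES = {'int': 4, 'double': 8}
# P(int x) { int a, b; ... } from the example above:
print(assign_offsets([('x', 'int')], [('a', 'int'), ('b', 'int')], SIZES))
# {'x': 0, 'a': -12, 'b': -16}
```

Note that these offsets match the ones listed for P above: x at 0, a at -12, b at -16.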

For MIPS code, each double variable requires 8 bytes, and its offset is in the "middle" of those bytes. For example, given this code:

        void P(double x) {
            double a, b;
            ...
        }

variable x will be stored in the bytes at offsets 0 to -7 from the Frame Pointer, and the offset that should be stored in x's symboltable entry (and used to access x at runtime) will be -4. Here are the offsets for all of P's locals:
• x: offset -4
• a: offset -20
• b: offset -28
Special "double" registers (e.g., F0, F2, F12) must be used for double variables at runtime, and there are special opcodes for operations on double values. The following MIPS code loads the values of a and b into registers F0 and F2, respectively:
l.d $f0, -20($fp)  # load a
l.d $f2, -28($fp)  # load b


Assume that both an address and an integer require 4 bytes of storage, and that a double requires 8 bytes. Also assume that each activation record includes parameters, a return address, a control link, and space for local variables as illustrated above. What are the offsets (in bytes) for each of the parameters and local variables in the following functions?

void P1(int x, int y) {
    int a, b, c;
    ...
    while (...) {
        double a, w;
    }
}

void P2() {
    int x, y;
    ...
    if (...) {
        double a;
        ...
    }
    else {
        int b, c;
        ...
    }
}


solution

As noted above, global variables are stored in the static data area. Using MIPS code, each global is stored in space labeled with the name of the variable, and the variable is accessed at runtime using its name. (Since we will be using the SPIM simulator, and since SPIM has some reserved words that can also be used as Simple variable names, you will need to add an underscore to global variable names to prevent clashes with SPIM reserved words.)

For example, if the source code includes:

// global variables
int g;
double d;

The following code would be generated to reserve space in the static data area for variables g and d:
        .data          # put the following in the static data area
        .align 2       # align on a word boundary
_g:     .space 4       # set aside 4 bytes
        .data
        .align 2
_d:     .space 8

And the following code would be generated to load the value of variable g into register t0, and to load the value of variable d into register f0:

        lw    $t0, _g   # load contents of g into t0
        l.d   $f0, _d   # load contents of d into f0

Non-Local Variables

Recall that we are using the term "non-local variable" to refer to two situations:

1. In statically-scoped languages that allow nested sub-programs, a variable can be declared in one sub-program and accessed in another, nested sub-program. Such variables are "non-local" in the nested sub-program that accesses them. These variables are stored in the activation record of the sub-program that declares them. When they are used as non-local variables, that activation record is found at runtime either using access links or a display (discussed below).
2. In dynamically-scoped languages, a variable can be declared in one method, and accessed in a called method. The variable is non-local in the method that accesses it. Two different approaches to supporting the use of non-local variables in dynamically-scoped languages are discussed below.
Note that in languages (like C and C++) that permit the same name to be used in nested blocks within a method, we might also use the term "non-local variable" to refer to a use of a variable in one block that was declared in an enclosing block. However, this is not an interesting case in terms of runtime access. All of a method's variables (regardless of which block they are declared in) are stored in the method's activation record, and are accessed using offsets from the frame pointer, as discussed above.

First, let's consider an example (Pascal) program that includes accesses to non-local variables (the nesting levels of the procedures are given for later reference):

   +  program MAIN;
   |  var x: integer;
   |
   |  +  procedure P;
   | (2)    write(x);
   |  +
   |
   |  +  procedure Q;
   |  |  var y: integer = x;
   |  |
   |  |  +  procedure R;
   |  |  |     x = x + 1;
   |  |  |     y = y + x;
  (1)(2)(3)    if y < 6 call R;
   |  |  |     call P
   |  |  +
   |  |
   |  |  call R;
   |  |  call P;
   |  |  if x < 5 call Q;
   |  +
   |
   |  x = 2;
   |  call Q;
   +


solution

Trace the execution of this program, drawing the activation records for the main program and for each called procedure (include only local variables and control links in the activation records). Each time a non-local variable is accessed, determine which activation record it is stored in. Notice that the relative nesting levels of the variable's declaration and its use do not tell you how far down the stack to look for the activation record that contains the non-local variable. In fact, this number changes for different calls (e.g., due to recursion).

solution

The idea behind the use of access links is as follows:

• Add a new field to each AR -- the access link field.
• If P is (lexically) nested inside Q, then at runtime, P's AR's access link will point to the access link field in the AR of the most recent activation of Q.
• Therefore, at runtime, access links will form a chain corresponding to the nesting of sub-programs. For each use of a non-local x:
• At compile time, use the "Level Number" attribute of x and the "Current Level Number" (of the sub-program that is accessing x) to determine how many links of the chain to follow at runtime.
• If P at level i uses variable x, declared at level j, follow i-j links, then use x's "Offset" attribute to find x's storage space inside the AR.
Here's a snapshot of the stack when the example program given above is running, after the first call from R to P (showing only the access links and the local variables in the ARs):
			____________
P	|          |------------+   <-- FP
+==========+            |
R	|          |----------+ |
+==========+          | |
R    	|          |------+   | |
+==========+      |   | |
y:|    9     |      |   | |
Q     +----------|      |   | |
|          |--+ <-+ <-+ |
+==========+  |         |
x:|    4     |  |         |
MAIN	|----------|  |         |
|   null   |<-+      <--+
|__________|

To access the value of x from procedure R, access links must be followed. The number of links that must be followed is:
(level # of R) - (level # of decl of x)
= 3 - 1 = 2
So the code for y = x in procedure R would be as shown below. (Note: we assume that the access link field is the first field in the AR, i.e., the field pointed to by the FP. We also assume that both variable x and variable y are at offset -12 in their ARs: the access link is at offset 0, the return address is at offset -4, and the control link is at offset -8; since x and y are each the first variable declared in their respective program/procedure, each is at offset -12 in its AR.)
        move $t0, FP        # no links followed yet; t0 holds ptr to R's access link field
        lw   $t0, ($t0)     # 1 link: t0 holds ptr to Q's AR's access link field
        lw   $t0, ($t0)     # 2 links: t0 holds ptr to MAIN's AR's access link field
        lw   $t0, -12($t0)  # t0 holds value of x
        sw   $t0, -12(FP)   # y = x

The second load (lw $t0, ($t0)) would be repeated if we needed to follow more links.
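The link-chasing can also be modeled in a few lines of illustrative C++ (the AR struct and its field names are hypothetical stand-ins; x represents a variable stored at some fixed offset):

```cpp
#include <cassert>

// Toy model: each AR's accessLink points to the AR of the most recent
// activation of the lexically enclosing procedure.
struct AR {
    AR* accessLink;
    int x;            // stands in for a variable stored at some fixed offset
};

// A use at level i of a variable declared at level j follows i - j links.
int fetchNonLocal(AR* fp, int i, int j) {
    AR* ar = fp;
    for (int k = 0; k < i - j; ++k)
        ar = ar->accessLink;     // one extra load per link followed
    return ar->x;
}
```

For the snapshot above, fetching MAIN's x from R (levels 3 and 1) follows two links, just as the generated code does.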

How to set up access links:
• The link is set up by the calling procedure.
• How to set up the link depends on the relative nesting levels of the calling and called procedures.
There are two cases:
Case 1: The calling procedure's level is less than the called procedure's level. (This can only mean that the called procedure is nested directly inside the calling procedure; otherwise it would not be visible, and could not be called.) In this case, the called procedure's access link should point to the access link field of the calling procedure's AR. This case occurs in the example when Q calls R.

In this case, the calling procedure sets up the called procedure's access link by pushing the value of the FP just before the call, since the FP points to its access link. For example, when Q is calling R, here is a picture of the stack just after R's access link has been set up by Q:

                                             <-- SP
|----------|
|	   |------+
|==========|      |
y:|          |      |
Q     |----------|      |
|          |--+ <-+  <-- FP
|==========|  |
x:|          |  |
MAIN	|----------|  |
|   null   |<-+
|__________|

Case 2: The calling procedure's level is greater than or equal to the called procedure's level. In this case, the called procedure's access link should point to an AR that is already on the stack. The value for the new access link is found by starting with the value of the calling procedure's access link field (which is pointed to by the FP), and following X-Y access links, where:
X = calling procedure's level
Y = called procedure's level
The following code loads the value of the calling procedure's access link field into register t0:
        lw    $t0, 0(FP)

If X == Y (i.e., no links should be followed to find the value of the new access link), then the value of t0 should simply be pushed onto the stack. If X is greater than Y, then the code:

        lw    $t0, 0($t0)

should be generated X - Y times, before pushing the value of t0.
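Both cases can be captured in one illustrative C++ helper (hypothetical names; each accessLink field holds a pointer to the enclosing activation's AR, just as in the toy model above):

```cpp
#include <cassert>

struct AR { AR* accessLink; };

// Compute the access link to push for a call, given the caller's FP and
// the static nesting levels of the caller (X) and the callee (Y).
AR* newAccessLink(AR* callerFP, int X, int Y) {
    if (X < Y)                            // case 1: callee nested in caller
        return callerFP;                  // push the FP itself
    AR* link = callerFP->accessLink;      // lw $t0, 0(FP)
    for (int k = 0; k < X - Y; ++k)
        link = link->accessLink;          // lw $t0, 0($t0), generated X - Y times
    return link;                          // case 2: push this value
}
```

Checking it against the example: R (level 3) calling P (level 2) follows one link from R's access link and lands on MAIN's AR; Q (level 2) calling itself follows zero links and pushes its own access link unchanged.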

To illustrate this situation, consider two cases from the example code: R calls P, and Q calls itself. Recall that the nesting structure of the example program is:

      +--
|
|   +--
|   |
|  P|
|   |
|   +--
|
MAIN  |   +--
|   |
|   |     +--
|   |	|
|  Q|    R|
|   |	|
|   |	+--
|   |
|   +--
|
+--

When R is about to call P, the stack looks like this:
			+==========+
R	|          |----------+
+==========+          |
R    	|          |------+   |
+==========+      |   |
y:|    9     |      |   |
Q     +----------|      |   |
|          |--+ <-+ <-+
+==========+  |
x:|    4     |  |
MAIN	|----------|  |
|   null   |<-+
|__________|

Since P is nested inside MAIN, P's access link should point to MAIN's AR (i.e., the bottom AR on the stack). R's nesting level is 3 and P's nesting level is 2. Therefore, we start with the value of R's access link (the pointer to Q's AR) and follow one link. This retrieves the value of Q's access link, which is (as desired) a pointer to MAIN's AR. This is the value that will be pushed onto the stack (copied into P's AR as its access link).

When Q is about to call itself, the stack looks like this:

			|==========|
y:|          |
Q     |----------|
|          |--+
|==========|  |
x:|          |  |
MAIN	|----------|  |
|   null   |<-+
|__________|

The access link in the new AR for Q should be the same as the access link in the current AR for Q; namely, it should be a pointer to the AR for MAIN. This value is found by starting with the value of Q's access link (a pointer to MAIN's AR) and following zero links (because X = Y = 2).

Trace the execution of the example program again. Each time a procedure is called, determine which case applies in terms of how to set up the called procedure's access link. Then use the appropriate algorithm to find the value of the new access link, and draw the new AR with its access link.

Each time a non-local variable x is accessed, make sure that you find its activation record by following i - j access links (where i is the nesting level of the procedure that uses x and j is the nesting level of the procedure that declares x).

solution

Follow i - j links to find the AR with space for non-local x, where i is the nesting level of the procedure that uses x and j is the nesting level of the procedure that declares x.
To set up an access link:
• case #1: Calling procedure's level is less than called procedure's level: Push the value of the FP to be the access link field of the called procedure's AR.
• case #2: Calling procedure's level is greater than or equal to the called procedure's level: Find the value to push by starting with the value of the calling procedure's access link, then following X-Y links, where X = calling procedure's level, and Y = called procedure's level.

The motivation for using a display is to avoid the runtime overhead of following multiple access links to find the activation record that contains a non-local variable. The idea is to maintain a global "array" called the display. Ideally, the display is actually implemented using registers (one for each array element) rather than an actual array; however, it can also be implemented using an array in the static data area.

The size of the display is the maximum nesting depth of a procedure in the program (which is known at compile time). The display is used as follows:

• When procedure P at nesting level k is executing, DISPLAY[0],...DISPLAY[k-2] hold pointers to the ARs of the most recent activations of the k-1 procedures that lexically enclose P. DISPLAY[k-1] holds a pointer to P's AR.
• To access a non-local variable declared at level x, use DISPLAY[x-1] to get to the AR that holds x, then use the usual offset to get x itself.
To illustrate this, refer back to our running example program, outlined below:
      +--
|int x;
|
|   +--
|   |
|  P|...x...
|   |
|   +--
|
MAIN  |   +--
|   | int y;
|	  |
|   |     +--
|   |	|...x...y...
|  Q|    R|call R
|   |	|call P
|   |	+--
|   |
|   | call R
|   | call P
|   | if (...) call Q
|   +--
|
| call Q
+--

Below are two pictures comparing the use of access links with the use of a display. Both show the same moment at runtime.
        USING ACCESS LINKS                  USING A DISPLAY

____________                          ----------
P     |          |------------+	   P   |        |<--+
|==========|            |	       |========|   |   +--+
R     |          |----------+ |	   R   |        |<------|  | [2]
|==========|          | |             |========|   |   +--+
y:|          |          | |           y:|        |   +---|  | [1]
Q     |----------|          | |         Q   |--------|       +--+
|          |------+ <-+ |             |        |   +---|  | [0]
|==========|      |     |             |========|   |   +--+
x:|          |      |     |           x:|        |   | DISPLAY
MAIN    |----------|      |     |       MAIN  |--------|   |
|   null   |    <-+  <--+             |        |<--+
|__________|                          ----------

STACK                                STACK



To maintain the display, a new field (called the "save-display" field) is added to each activation record. The display is maintained as follows:

• When a procedure at nesting level k is called:
• The current value of DISPLAY[k-1] is saved in the "save-display" field of the new AR.
• DISPLAY[k-1] is set to point to (the save-display field of) the new AR.
• When the procedure returns, DISPLAY[k-1] is restored using the value saved in the "save-display" field (of the returning procedure).
This process is illustrated below, showing how the display and the save-display fields are updated when R calls P (only the local variables and save-display fields of the ARs are shown).
        Before R calls P:
----------------

________                  ____________
[2] |      |--\            R  |          |
|______|   -------------->|          |
[1] |      |--\               |==========|
|______|   \            y:|          |
[0] |      |--\ \          Q  |----------|
|______|   \ ------------>|          |
\             |==========|
\          x:|          |
\     MAIN  |----------|
---------->|          |
|==========|

After R calls P:
---------------

________
[2] |      |--\               ------------
|______|   \          P   |          |---+
[1] |      |----\------------>|          |   |
|______|     \            |==========|   |
[0] |      |--\   \       R   |          |   |
|______|   \   ---------->|          |   |
\             |==========|   |
\          x:|          |   |
\       Q   |----------|   |
\          |          |<--+
\         |==========|
\      y:|          |
\  MAIN |----------|
------>|          |
|----------|
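The save/restore protocol pictured above can be sketched in illustrative C++ (hypothetical names; in practice the DISPLAY would be registers or slots in the static data area):

```cpp
#include <cassert>
#include <vector>

struct AR { AR* saveDisplay; };            // the "save-display" field

std::vector<AR*> DISPLAY(10, nullptr);     // size = max nesting depth

void onCall(AR* newAR, int k) {            // a level-k procedure is called
    newAR->saveDisplay = DISPLAY[k - 1];   // save the current entry in the new AR
    DISPLAY[k - 1] = newAR;                // entry now points to the new AR
}

void onReturn(AR* ar, int k) {             // the level-k procedure returns
    DISPLAY[k - 1] = ar->saveDisplay;      // restore the saved entry
}
```

Note that a recursive call at the same level simply stacks the saved entries in the ARs themselves, so each return restores the display correctly.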


Trace the execution of the running example program, assuming that a display is used instead of access links. Each time a non-local variable x is accessed, make sure that you understand how to find its AR using the display.

Here are the trade-offs between access links and a display:
• Access links can require more time (at runtime) to access non-locals (especially when the non-local is many nesting levels away).
• It can also require more time to set up a new access link when a procedure is called (if the nesting level of the called procedure is much smaller than the nesting level of the calling procedure).
• Displays require more space (at runtime).
• If the compiler is flexible enough to implement the display using registers if enough are available, or using space in the static data area, then the compiler code itself may be rather complicated (and error-prone).
• In practice, sub-programs usually are not very deeply nested, so the runtime considerations may not be very important.

Recall that under dynamic scoping, a use of a non-local variable corresponds to the declaration in the "most recently called, still active" method. So the question of which non-local variable to use can't be determined at compile time. It can only be determined at run-time. There are two ways to implement access to non-locals under dynamic scope: "deep access" and "shallow access", described below.

Deep Access

Using this approach, given a use of a non-local variable, control links are used to search back in the stack for the most recent AR that contains space for that variable. Note that this requires that it be possible to tell which variables are stored in each AR; this is more natural for languages that are interpreted rather than being compiled (which was indeed the case for languages that used dynamic scope). Note also that the number of control links that must be followed cannot be determined at compile time; in fact, a different number of links may be followed at different times during execution, as illustrated by the following example program:

void P() { write x; }

void Q() {
    x = x + 1;
    if (x < 23) Q();
    else P();
}

void R() {
    int x = 20;
    Q();
    P();
}

void main() {
    int x = 10;
    R();
    P();
}
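Deep access can be modeled as an interpreter-style search down the stack (illustrative C++; each AR records which names it declares, which is why this approach fits interpreted languages so naturally):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model: each AR records the names of the variables it declares.
struct AR { std::set<std::string> declares; };

std::vector<AR> stack;                     // back() is the current AR

// Return how many control links must be followed from the current AR to
// reach the most recent AR that has space for `name` (-1 if unbound).
int linksToFollow(const std::string& name) {
    int links = 0;
    for (int i = (int)stack.size() - 1; i >= 0; --i, ++links)
        if (stack[i].declares.count(name)) return links;
    return -1;
}
```

Note how the same use of x costs a different number of links depending on the current shape of the call stack.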


Trace the execution of the program given above. Note that method P includes a use of non-local variable x. How many control links must be followed to find the AR with space for x each time P is called?

Shallow Access

Using this approach, space is allocated (in registers or in the static data area) for every variable name that is in the program (i.e., one space for variable x even if there are several declarations of x in different methods). For every reference to x, the generated code refers to the same location.

When a method is called, it saves, in its own AR, the current values of all of the variables that it declares itself (i.e., if it declares x and y, then it saves the values of x and y that are currently in the space for x and y). It restores those values when it finishes.

Note that this means that when a method accesses a non-local variable x, the value of x from the most-recently-called, still-active method is stored in the (single) location for x. There is no need to go searching down the stack!
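The save/restore protocol can be simulated for the example program in illustrative C++ (hypothetical names; x_slot is the single location for x, and output collects what P writes):

```cpp
#include <cassert>
#include <vector>

int x_slot;                         // the one location for every "x"
std::vector<int> output;            // what P writes

void P() { output.push_back(x_slot); }   // write x (a non-local use)

void Q() {                          // Q declares nothing: no save/restore
    x_slot = x_slot + 1;
    if (x_slot < 23) Q(); else P();
}

void R() {                          // R declares x
    int saved = x_slot;             // save the current value on entry
    x_slot = 20;
    Q();
    P();
    x_slot = saved;                 // restore on exit
}

void mainProgram() {                // main declares x
    int saved = x_slot;
    x_slot = 10;
    R();
    P();
    x_slot = saved;
}
```

Tracing this simulation is a good way to check your answer to Question 1 below.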

Question 1: Trace the execution of the program given above again, this time assuming that shallow access is used.

Question 2: What are the advantages and disadvantages of shallow access compared with deep access? (Consider both time and space requirements.)


Overview

In a Java program, all parameters are passed by value. However, there are three other parameter-passing modes that have been used in programming languages:

1. pass by reference
2. pass by value-result (also called copy-restore)
3. pass by name
We will consider each of those modes, both from the point of view of the programmer and from the point of view of the compiler writer.

First, here's some useful terminology:

1. Given a method header, e.g.:
       void f(int a, boolean b, int c)

we will use the terms parameters, formal parameters, or just formals to refer to a, b, and c.

2. Given a method call, e.g.:
       f(x, x==y, 6);

we will use the terms arguments, actual parameters, or just actuals to refer to x, x==y, and 6.

3. The term r-value refers to the value of an expression. So for example, assuming that variable x has been initialized to 2, and variable y has been initialized to 3:
   expression   r-value
   ----------   -------
   x            2
   y            3
   x+y          6
   x==y         false

4. The term l-value refers to the location or address of an expression. For example, the l-value of a global variable is the location in the static data area where it is stored. The l-value of a local variable is the location on the stack where it is (currently) stored. Expressions like x+y and x==y have no l-value. However, it is not true that only identifiers have l-values; for example, if A is an array, the expression A[x+y] has both an r-value (the value stored in the x+yth element of the array), and an l-value (the address of that element).
L-values and r-values get their names from the Left and Right sides of an assignment statement. For example, the code generated for the statement x = y uses the l-value of x (the left-hand side of the assignment) and the r-value of y (the right-hand side of the assignment). Every expression has an r-value. An expression has an l-value iff it can be used on the left-hand side of an assignment.

Value Parameters

Parameters can only be passed by value in Java and in C. In Pascal, a parameter is passed by value unless the corresponding formal has the keyword var; similarly, in C++, a parameter is passed by value unless the corresponding formal has the symbol & in front of its name. For example, in the Pascal and C++ code below, parameter x is passed by value, but not parameter y:

// Pascal procedure header
Procedure f(x: integer; var y: integer);

// C++ function header
void f(int x, int & y);

When a parameter is passed by value, the calling method copies the r-value of the argument into the called method's AR. Since the called method only has access to the copy, changing a formal parameter (in the called method) has no effect on the corresponding argument. Of course, if the argument is a pointer, then changing the "thing pointed to" does have an effect that can be "seen" in the calling procedure. For example, in Java, arrays are really pointers, so if an array is passed as an argument to a method, the called method can change the contents of the array, but not the array variable itself, as illustrated below:
void f( int[] A ) {
    A[0] = 10;   // change an element of parameter A
    A = null;    // change A itself (but not the corresponding actual)
}

void g() {
    int[] B = new int [3];
    B[0] = 5;
    f(B);
    //*** B is not null here, because B was passed by value
    //*** however, B[0] is now 10, because method f changed the first element
    //*** of the array pointed to by B
}


TEST YOURSELF #1

What is printed when the following Java program executes and why?

class Person {
    int age;
    String name;
}

class Test {
    static void changePerson(Person P) {
        P.age = 10;
        P = new Person();
        P.name = "Joe";
    }

    public static void main(String[] args) {
        Person P = new Person();
        P.age = 2;
        P.name = "Ann";
        changePerson(P);
        System.out.println(P.age);
        System.out.println(P.name);
    }
}


solution

Reference Parameters

When a parameter is passed by reference, the calling method copies the l-value of the argument into the called method's AR (i.e., it copies a pointer to the argument instead of copying the argument's value). Each time the formal is used, the pointer is followed. If the formal is used as an r-value (e.g., its value is printed, or assigned to another variable), the value is fetched from the location pointed to by the pointer. If the formal is assigned a new value, that new value is written into the location pointed to by the pointer (the new value is not written into the called method's AR).

If an argument passed by reference has no l-value (e.g., it is an expression like x+y), the compiler may consider this an error (that is what happens in Pascal, and is also done by some C++ compilers), or it may give a warning, then generate code to evaluate the expression, to store the result in some temporary location, and to copy the address of that location into the called method's AR (this is done by some C++ compilers).

In terms of language design, it seems like a good idea to consider this kind of situation an error. Here's an example of code in which an expression with no l-value is used as an argument that is passed by reference (the example was actually a Fortran program, but Java-like syntax is used here):

        void mistake(int x) {   // x is a reference parameter
            x = x + 1;
        }

        void main() {
            int a;
            mistake(1);
            a = 1;
            print(a);
        }

When this program was compiled and executed, the output was 2! That was because the Fortran compiler stored 1 as a literal at some address and used that address for all the literal "1"s in the program. In particular, that address was passed when "mistake" was called, and was also used to fetch the value to be assigned into variable a. When "mistake" incremented its parameter, the location that held the value 1 was incremented; therefore, when the assignment to a was executed, the location no longer held a 1, and so a was initialized to 2!

To understand why reference parameters are useful, remember that, although in Java all non-primitive types are really pointers, that is not true in other languages. For example, consider the following C++ code:

class Person {
  public:
    String name;
    int age;
};

void birthday(Person per) {
    per.age++;
}

void main() {
    Person P;
    P.age = 0;
    birthday(P);
    print(P.age);
}

Note that in main, variable P is a Person, not a pointer to a Person; i.e., main's activation record has space for P.name and P.age. Parameter per is passed by value (there is no ampersand), so when birthday is called from main, a copy of variable P is made (i.e., the values of its name and age fields are copied into birthday's AR). It is only the copy of the age field that is updated by birthday, so when the print statement in main is executed, the value that is output is 0.

This motivates some reasons for using reference parameters:

1. When the job of the called method is to modify the parameter (e.g., to update the fields of a class), the parameter must be passed by reference so that the actual parameter, not just a copy, is updated.
2. When the called method will not modify the parameter, and the parameter is very large, it would be time-consuming to copy the parameter; it is better to pass the parameter by reference so that a single pointer can be passed.
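Revisiting the birthday example: if per is passed by reference instead (note the &), the update reaches the caller's Person. A minimal C++ sketch (run is a hypothetical wrapper standing in for main):

```cpp
#include <cassert>
#include <string>

struct Person {
    std::string name;
    int age = 0;
};

void birthday(Person& per) {   // & makes per a reference parameter
    per.age++;                 // updates the caller's Person, not a copy
}

int run() {
    Person P;
    P.age = 0;
    birthday(P);
    return P.age;              // 1 this time, not 0
}
```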

TEST YOURSELF #2

Consider writing a method to sort the values in an array of integers. An operation that is used by many sorting algorithms is to swap the values in two array elements. This might be accomplished using a swap method:

static void swap(int x, int y) {
    int tmp = x;
    x = y;
    y = tmp;
}

Assume that A is an array of 4 integers. Draw two pictures to illustrate what happens when the call:
swap(A[0], A[1]);

is executed, first assuming that this is Java code (all parameters are passed by value), and then assuming that this is some other language in which parameters are passed by reference. In both cases, assume that the array itself is stored in the heap (i.e., the space for A in the calling method's AR holds a pointer to the space allocated for the array in the heap). Your pictures should show the ARs of the calling method and method swap.

solution

It is important to realize that the code generator will generate different code for a use of a parameter in a method, depending on whether it is passed by value or by reference. If it is passed by value, then it is in the called method's AR (accessed using an offset from the FP). However, if it is passed by reference, then it is in some other storage (another method's AR, or in the static data area). The value in the called method's AR is the address of that other location.

To make this more concrete, assume the following code:

void f(int a) {
    a = a - 5;
}

void main() {
    int x = 10;
    f(x);
}

Below is the code that would be generated for the statement a = a - 5, assuming (1) that a is passed by value and (2) assuming that a is passed by reference:

If a is passed by value, the generated code loads a directly from its offset in f's AR, subtracts 5, and stores the result back at that same offset. If a is passed by reference, the generated code first loads the pointer stored at a's offset in f's AR, loads the value through that pointer, subtracts 5, and stores the result back through the pointer.
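The difference can be approximated in C++, using a pointer parameter to model the reference (the function names are hypothetical; the comments sketch the corresponding loads and stores):

```cpp
#include <cassert>

// Passed by value: a's slot in f's AR holds the value itself.
void f_byValue(int a) {
    a = a - 5;        // load from a's offset, subtract, store back to a's offset
}

// Passed by reference: a's slot in f's AR holds the ADDRESS of the argument.
void f_byRef(int* a) {
    *a = *a - 5;      // load the pointer first, then load/store through it
}
```

Only the second version changes the caller's variable.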

TEST YOURSELF #3

The code generator will also generate different code for a method call depending on whether the arguments are to be passed by value or by reference. Consider the following code:

    int x, y;
x = y = 3;
f(x, y);

Assume that f's first parameter is passed by reference, and that its second parameter is passed by value. What code would be generated to fill in the parameter fields of f's AR?

Value-Result Parameters

Value-result parameter passing was used in Fortran IV and in Ada. The idea is that, as for pass-by-value, the values (not the addresses) of the actual parameters are copied into the called method's AR. However, when the called method ends, the final values of the parameters are copied back into the arguments. Value-result is equivalent to call-by-reference except when there is aliasing (note: "equivalent" means the program will produce the same results, not that the same code will be generated).

Two expressions that have the same l-value are called aliases. Aliasing can happen:

• via pointer manipulation,
• when a parameter is passed by reference and is also global to the called method,
• when the same expression is passed by reference as an argument more than once; e.g., call test(x, y, x).
We will look at examples of each of these below.

### Creating Aliases via Pointers

Pointer manipulation can create aliases, as illustrated by the following Java code. (Note: this kind of aliasing does not make pass-by-reference different from pass-by-value-result; it is included here only for completeness of the discussion of aliasing.)

Person p, q;
p = new Person();
q = p;
// now p.name and q.name are aliases (they both refer to the same location)
// however, p and q are not aliases (they refer to different locations)


Pictorially: p and q occupy two different locations, each of which holds a pointer to the same Person object in the heap, so p.name and q.name name the same location inside that object.

### Creating Aliases by Passing Globals as Arguments

This way of creating aliases (and the difference between reference parameters and value-result parameters in the presence of this kind of aliasing) are illustrated by the following C++ code:

int x = 1;          // a global variable

void f(int & a) {
    a = 2;          // when f is called from main, a and x are aliases
    x = 0;
}

main() {
    f(x);
    cout << x;
}

As stated above, passing parameters by value-result yields the same results as passing parameters by reference except when there is aliasing. The above code will print different values when f's parameter is passed by reference than when it is passed by value-result. To understand why, look at the following pictures, which show the effect of the code on the activation records (only variables and parameters are shown in the ARs, and we assume that variable x is in the static data area):
Here is a summary of the effect of the code on variable x (assume x is in the static data area):

                    Call-by-reference                Call-by-value-result
At time of call     a holds x's address; x = 1       a = 1 (a copy of x); x = 1
After a = 2         x = 2                            a = 2, x = 1
After x = 0         x = 0                            a = 2, x = 0
After the call      x = 0                            a's final value (2) is copied back; x = 2

So the program prints 0 when f's parameter is passed by reference, but 2 when it is passed by value-result.
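The two outcomes can be simulated in C++ (hypothetical function names; a pointer models the reference, and an explicit copy-in/copy-out models value-result):

```cpp
#include <cassert>

int x;                        // the global from the example

int runByReference() {
    x = 1;
    int* a = &x;              // a and x are aliases
    *a = 2;                   // a = 2 writes through to x
    x = 0;
    return x;                 // f returns; main prints x
}

int runByValueResult() {
    x = 1;
    int a = x;                // copy x's value in
    a = 2;                    // only the copy changes
    x = 0;
    x = a;                    // copy a's final value back out when f returns
    return x;
}
```

runByReference yields 0, while runByValueResult yields 2, matching the table above.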