 Overview
 Ambiguous Grammars
 Disambiguating Grammars
 List Grammars
 A Grammar for a Programming Language
 Summary
 more than one leftmost derivation of $w$ or,
 more than one rightmost derivation of $w$, or
 more than one parse tree for $w$
 A grammar is recursive in nonterminal $X$ if: $$X \stackrel{+}\longrightarrow{\Tiny\dots\,}X{\Tiny\,\dots}$$ (in one or more steps, $X$ derives a sequence of symbols that includes an $X$).
 A grammar is left recursive in $X$ if: $$X \stackrel{+}\longrightarrow X{\Tiny\,\dots}$$ (in one or more steps, $X$ derives a sequence of symbols that starts with an $X$).
 A grammar is right recursive in $X$ if: $$X \stackrel{+}\longrightarrow {\Tiny\dots\,}X$$ (in one or more steps, $X$ derives a sequence of symbols that ends with an $X$).
 For left associativity, use left recursion.
 For right associativity, use right recursion.
 One or more PLUSes (without any separator or terminator).
(Remember, any of the following three grammars defines
this language; you don't need all three lines).
 xList $\longrightarrow$ PLUS  xList xList
 xList $\longrightarrow$ PLUS  xList PLUS
 xList $\longrightarrow$ PLUS  PLUS xList
 One or more runs of one or more PLUSes, each run separated by commas:
 xList $\longrightarrow$ PLUS  xList COMMA xList
 xList $\longrightarrow$ PLUS  xList COMMA PLUS
 xList $\longrightarrow$ PLUS  PLUS COMMA xList
 One or more PLUSes, each PLUS terminated by a semicolon:
 xList $\longrightarrow$ PLUS SEMICOLON  xList xList
 xList $\longrightarrow$ PLUS SEMICOLON  xList PLUS SEMICOLON
 xList $\longrightarrow$ PLUS SEMICOLON  PLUS SEMICOLON xList
 Zero or more PLUSes (without any separator or terminator):
 xList $\longrightarrow$ $\varepsilon$  PLUS  xList xList
 xList $\longrightarrow$ $\varepsilon$  PLUS  xList PLUS
 xList $\longrightarrow$ $\varepsilon$  PLUS  PLUS xList
 Zero or more PLUSes, each PLUS terminated by a semicolon:
 xList $\longrightarrow$ $\varepsilon$  PLUS SEMICOLON  xList xList
 xList $\longrightarrow$ $\varepsilon$  PLUS SEMICOLON  xList PLUS SEMICOLON
 xList $\longrightarrow$ $\varepsilon$  PLUS SEMICOLON  PLUS SEMICOLON xList

The trickiest kind of list is a list of
zero or more x's, separated by commas.
To get it right, think of the definition as follows:

Either an empty list, or a nonempty list of x's separated
by commas.
xList $\longrightarrow$ $\varepsilon$  nonemptyList nonemptyList $\longrightarrow$ PLUS  PLUS COMMA nonemptyList  A program must be a list of functions (and nothing else); no statements may appear outside of a function (similar to C), nor may any variable declarations appear outside of a function (unlike C, which allows global variable declarations outside of functions).
 A program must contain at least one function, as the FunctionList cannot immediately derive $\varepsilon$.
 terminal symbol
 nonterminal symbol
 grammar rule (or production)
 derivation (leftmost derivation, rightmost derivation)
 parse (or derivation) tree
 the language defined by a grammar
 ambiguous grammar
Syntax for Programming Languages
Contents
While our primary focus in this course is compiler development, we will necessarily touch on some of the aspects of programming language development in the process. A programming language needs to define what constitutes valid syntax for the language (i.e. language definition). The compiler needs to determine whether a given stream of tokens belongs[1] During parsing, we're only interested in capturing the syntax (as opposed to the semantics) of the programming language. The syntax of a programming language is intuitively understood to be the structure of the code, whereas sematics is the meaning of the code. to that language (i.e. language recognition). Conveniently, a sufficiently rigorous definition of the language can be used as the basis of a recognizer: if there is some way to produce a given string according to the definition of the language, then it is necessarily part of the language.
Contextfree Grammars (CFGs) provide a promising basis for defining the syntax of a programming language: For one, the class of languages that CFGs recognize (the contextfree languages) are expressive enough to capture popular syntactic constructs. At the same time, CFGs are (relatively) uniform and compact. As such, programmers are expected to be able to read a CFGlike definition of a language and recognize whether or not a given string of input tokens belongs that language.
For a compiler, however, performing language recognition (making a yes/no determination of whether a string belongs to a language) is not enough to complete syntactic analysis. In order for a contextfree grammar to serve as a suitable definition for a programming language, there are a number of refinements that need to be considered. In this section, we will discuss some of these considerations.
We'll caveat the claim that "CFGs are a good way to define programminglanguage syntax" at the end of this reading, but it's worth noting that there are some compelling arguments to the contrary for practical purposes.
The following grammar (recalled from the prior) defines a simple language of mathematical expressions (with the productions numbered for reference):
P_{2}: Exp $\longrightarrow$ Exp minus Exp
P_{3}: Exp $\longrightarrow$ Exp divide Exp
P_{4}: Exp $\longrightarrow$ lparen Exp rparen
Consider the string
1  4 / 2
which would be tokenized as
It should be readily apparent to you that the string is recognized by the the language, and can be produced via the (leftmost) derivation sequence:
Step #2. $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ (via production P_{1})
Step #3. $\;\;\;\;\;\Longrightarrow$ intlit minus $Exp$ divide $Exp$ (via production P_{3})
Step #4. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide $Exp$ (via production P_{1})
Step #5. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit (via production P_{1})
This derivation sequence corresponds to the parse tree:
However, inspection of the grammar also reveals that there is another leftmost derivation that can recognize the string by choosing to apply the rules differently:
Step #1. Exp $\Longrightarrow$ Exp divide Exp (via production P_{3})
Step #2. $\;\;\;\;\;\Longrightarrow$ Exp minus Exp divide Exp (via production P_{2})
Step #3. $\;\;\;\;\;\Longrightarrow$ intlit minus Exp divide Exp (via production P_{1})
Step #4. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide Exp (via production P_{1})
Step #5. $\;\;\;\;\;\Longrightarrow$ intlit minus intlit divide intlit
(via production P_{1})
Which corresponds to the parse tree:
If for grammar $G$ and string $w$ there is:
then G is called an ambiguous grammar. (Note: the three conditions given above are equivalent; if one is true then all three are true.)
It is worth noting that ambiguity is not a problem for certain applications of ContextFree Grammars. As discussed in the intro, language recognition is satisfied by finding a one derivation for the string, and the string will certainly maintain membership in the language if there is more than one derivation possible. However, for the task of parsing, ambiguous grammars cause problems, both in the correctness of the parse tree and in the efficiency of building it. We will briefly explore both of these issues.
A problem of correctness: The major issue with ambiguous parse trees is that the underlying structure of the language is illdefined. In the above example, subtraction and division can swap precedence; the first parse tree groups 4/2, while the second groups 14. These two groupings correspond to expressions with different values: the first tree corresponds to (1(4/2)) and the second tree corresponds to ((14)/2).
A problem of efficiency: As we will see later, a parser can use significantly more efficient algorithms (in terms of BigO complexity) by adding constraints to the language that it recognizes. One such assumption of some parsing algorithms is that the language recognized is unambiguous. To some extent, this issue can come up independently of correctness. For example, the following grammar has no issue of grouping (it only ever produces intlit), but production P_{1} can be applied any number of times before producing the integer literal token.
P_{2}: Exp $\longrightarrow$ intlit
It's useful to know how to translate an ambiguous grammar into an equivalent unambiguous grammar Since every programming language includes expressions, we will discuss these operations in the context of an expression language. We are particularly interesting in forcing unambiguous precedence and associativies of the operators.
To write a grammar whose parse trees express precedence correctly, use a different nonterminal for each precedence level. Start by writing a rule for the operator(s) with the lowest precedence ("" in our case), then write a rule for the operator(s) with the next lowest precedence, etc:
P_{2}: Exp $\longrightarrow$ Term
P_{3}: Term $\longrightarrow$ Term divide Term
P_{4}: Term $\longrightarrow$ Factor
P_{5}: Factor $\longrightarrow$ intlit
P_{6}: Factor $\longrightarrow$ lparen Exp rparen
Now let's try using these new rules to build parse trees for
1  4 / 2
.
First, a parse tree that correctly reflects that fact that division has higher precedence than subtraction:
Now we'll try to construct a parse tree that shows the wrong precedence:
This grammar captures operator precedence, but it is still ambiguous! Parse trees using this grammar may not correctly express the fact that both subtraction and division are left associative; e.g., the expression: 532 is equivalent to: ((53)2) and not to: (5(32)).
Draw two parse trees for the expression 532 using the current expression grammar:

exp → exp MINUS exp  term
term → term DIVIDE term  factor
factor → INTLITERAL  LPAREN exp RPAREN
To understand how to write expression grammars that correctly reflect the associativity of the operators, you need to understand about recursion in grammars.
The grammar given above for arithmetic expressions is both left and right recursive in nonterminals Exp and Term (try writing the derivation steps that show this).
To write a grammar that correctly expresses operator associativity:
Here's the correct grammar:
Term $\longrightarrow$ Term divide Factor  Factor
Factor $\longrightarrow$ intlit  lparen Exp rparen
And here's the (one and only) parse tree that can be built for 5  3  2 using this grammar:
Now let's consider a more complete expression grammar, for arithmetic expressions with addition, multiplication, and exponentiation, as well as subtraction and division. We'll use the token pow for the exponentiation operator, and we'll use "**" as the corresponding lexeme; e.g., "two to the third power" would be written: 2 ** 3, and the corresponding sequence of tokens would be: intlit pow intlit.
Here's an ambiguous contextfree grammar for this language:
Exp  $\longrightarrow$  Exp plus Exp  Exp minus Exp  Exp times Exp  Exp divide Exp 
  Exp pow Exp  lparen Exp rparen  intlit 
First, we'll modify the grammar so that parse trees correctly reflect the fact that addition and subtraction have the same, lowest precedence; multiplication and division have the same, middle precedence; and exponentiation has the highest precedence:
exp  →  exp PLUS exp   exp MINUS exp   term 
term  →  term TIMES term   term DIVIDE term   factor 
factor  →  factor POW factor   exponent  
exponent  →  INTLITERAL   LPAREN exp RPAREN 
This grammar is still ambiguous; it does not yet reflect the associativities of the operators. So next we'll modify the grammar so that parse trees correctly reflect the fact that all of the operators except exponentiation are left associative (and exponentiation is right associative; e.g., 2**3**4 is equivalent to: 2**(3**4)):
exp  → exp PLUS term   exp MINUS term   term 
term  → term TIMES factor   term DIVIDE factor   factor 
factor  → exponent POW factor   exponent  
exponent  → INTLITERAL   LPAREN exp RPAREN 
Finally, we'll modify the grammar by adding a unary operator, unary minus, which has the highest precedence of all (e.g., 3**4 is equivalent to: (3)**4, not to (3**4). Note that the notion of associativity does not apply to unary operators, since associativity only comes into play in an expression of the form: x op y op z.
exp  → exp PLUS term   exp MINUS term   term 
term  → term TIMES factor   term DIVIDE factor   factor 
factor  → exponent POW factor   exponent  
exponent  → MINUS exponent   final  
final  → INTLITERAL   LPAREN exp RPAREN 
Below is the grammar we used earlier for the language of boolean expressions, with two possible operands: true and false, and three possible operators: and, or, not:
BExp$\;\longrightarrow\;$false
BExp $\;\longrightarrow\;$ BExp or BExp
BExp $\;\longrightarrow\;$ BExp and BExp
BExp $\;\longrightarrow\;$ not BExp
BExp $\;\longrightarrow\;$ lparen BExp rparen
Question 1: Add nonterminals so that or has lowest precedence, then and, then not. Then change the grammar to reflect the fact that both and and or are left associative.
Question 2:
Draw a parse tree (using your final grammar for Question 1)
for the expression: true and not true
.
Another kind of grammar that you will often need to write is a grammar that defines a list of something. There are several common forms. For each form given below, we provide three different grammars that define the specified list language.
A Grammar for a Programming Language
Using CFGs for programming language syntax has the notational advantage that the problem can be broken down into pieces. For example, let us begin sketching out a simple scripting language, that starts with productions for some toplevel nonterminals:
Program  $\longrightarrow$  FunctionList 
FunctionList  $\longrightarrow$  Function  Function FunctionList 
With these three productions, the basic shape of the program is already taking form, and some important constraints are put in place:
We can now move on to specifying the syntax of a function: A Function is the word "function", optionally preceded by the word "public", followed by an identifier (signifying the name of the function), followed by an open curly brace, followed by the function body, followed by a closing curly brace:
Function  $\longrightarrow$  public function id lcurly FunctionBody rcurly 
  function id lcurly FunctionBody rcurly 
Note that the above definition is hardly unique. Consider this alternative, but equivalent, definition for a function:
Function  $\longrightarrow$  Access function id lcurly FunctionBody rcurly 
Access  $\longrightarrow$  public  $\varepsilon$ 
Note that the syntax of the language does not provide a list of arguments that the function takes (many popular programming languages expect that a function has a set of parenthesis before the function body).
A function body is a list of zero or more statements:
FunctionBody  $\longrightarrow$  $\varepsilon$ 
  StatementList  
StatementList  $\longrightarrow$  Statement 
  Statement StatementList 
From here, we can define various statement types, and so on.
A benefit of the ContextFree Grammar is that a "subgrammar" can be plugged in to allow complex parse trees to be grown at the nonterminal at which they are rooted. For example, we could allow the rule:
Statement  $\longrightarrow$  id equals Exp 
(where equals is the symbol =, used here for assignment)
This grammar can then make use of the expression grammar that we defined above to build parse trees with arbitrarily complex mathematical expressions on the right hand side of an assignment statement.
SummaryTo understand how a parser works, we start by understanding contextfree grammars, which are used to define the language recognized by the parser. Important terminology includes:
Two common kinds of grammars are grammars for expression languages, and grammar for lists. It is important to know how to write a grammar for an expression language that expresses operator precedence and associativity. It is also important to know how to write grammars for both nonempty and possibly empty lists, and for lists both with and without separators and terminators.