EECS665 – Compiler Construction

Parsing

Overview

Given a CFG and a string (a sequence of tokens), the goal of parsing is to determine whether the string is in the language of the CFG. The answer is yes iff the string can be derived from the CFG's start nonterminal. Equivalently, the answer is yes iff we can build a parse tree for the string, with the CFG's start nonterminal at the root.

There are a number of algorithms that can answer this question for any CFG, even ambiguous CFGs. Their worst-case running time is O(N³), where N is the length of the input (the number of tokens in the input string). Some algorithms have better running times for non-ambiguous CFGs. For example, there is an algorithm called Earley's algorithm that has worst-case runtime O(N²) for non-ambiguous grammars, but even a quadratic-time algorithm may considered too slow.

Fortunately, there are classes of grammars for which O(n) parsers can be built (and given a grammar, we can quickly test whether it is in such a class). Two such classes are:

and

LALR(1) grammars are:

More general than LL(1) grammars (every LL(1) grammar is also LALR(1) but not vice versa).
The class of grammars accepted by the parser generators yacc, bison, and Java Cup.
Parsed by bottom-up parsers (they construct the derivation tree by going from terminals up to nonterminals, then attaching several nonterminals together to form a new nonterminal, etc, ending with just the start nonterminal).
Harder to understand than LL(1) parsers (i.e., it is hard to understand how the LALR(1) parser works, and what makes a grammar LALR(1)).

Sample Parsers

In this class, we will learn about three parsing techniques:

A general, bottom-up parsing algorithm called the CYK algorithm.
A technique for parsing LL(1) grammars top-down, called predictive parsing.
A technique for bottom-up parsing that works for LALR(1) grammars, but since that is rather complicated, we will study a version of the algorithm that only works for a subset of the LALR(1) grammars called SLR(1) grammars.