Due on
October 7 |
Due on
October 7 |
None yet!
For this trial, you will create a parser-generator
named dragonparse
, which will
take a BNF-like grammar and produce a corresponding SLR
parser.
Like Trial 1, you may use whatever coding language you like, subject to the following constraints:
Note that for this trial, the generated parser only checks for language membership, it does not do syntax-directed translation. Your programs should adhere to formats given below for the grammar file (given to the parser-generator), and the tokens file (given to the generated parser).
You should turn in a tarred directory (similar to a Trial 1 submission) containing the following:
make build
should still succeed as a no-op.)
run: This target will run the
dragonparse executable on a file named
input.grammar containing a grammar spec.
If the grammar is not SLR, dragonparse should print "Bad grammar" and terminate with a non-zero
exit code. If the grammar is SLR,
dragonparse should generate a parser
executable named parser
capable of recognizing token streams in the language
specified by input.grammar.
If your
dragonparse executable produces source
code (like bison), then this target should also compile
that source code into the parser
executable.
test: This target will run the
parser
executable on a file named
test.tokens
representing a token stream in the
language. If the token stream is a valid string in the
language, parser
should print "accepted"
to the output stream and terminate with exit code 0.
If the token stream is not a valid string in the language,
parser
should print "rejected" to the output
stream and terminate with exit code 1.
Your dragonparse
program should take
an input grammar in BNF-like format, as follows:
The grammar will consist of a set of production rules, which may take one of two forms:
If the second form is used, then the LHS symbol should be assumed to be the last LHS nonterminal specified in the grammar. The first form may be used multiple times to specify distinct rules with the same LHS symbol. The second form may only appear after the first form. The first LHS symbol should be considered the start symbol.
Symbols (both terminals and nonterminals) are sequences of one or more alphanumeric characters, delimited by the space character. The set of nonterminal symbols are differentiated by appearing on the LHS of the grammar - any symbol that does not appear on the LHS of some production rule is considered to be a terminal.
Symbol strings consist of zero or more space-delimited symbols. If a symbol string has zero symbols, it specifies an ε-production rule
A ::= B | B ::= C B ::= hello | w o rld
The set of non-terminals of this grammar is {A,B} (since those are the only symbols to appear on the LHS of the production rules). The set of terminals of this grammar is {C,hello,w,o,rld}. The nonterminal A may produce a B or produce the empty string (via production 2). The nonterminal B may produce the symbol hello, the single symbol C, or the three terminals w,o,rld.
The above format is designed for simplicity. While your adherence to this spec is greatly appreciated in grading, parsing the input is not really the focus of this assisgnment. You should not focus on rejecting malformed input grammars or on gotchas in the grammar spec.
The parsers generated by dragonparse should take an input file formatted as a stream of terminal tokens. The intention is to match the output of a lexer generated by the code in Trial 1. The format is as follows:
The token stream consists of zero or more token lines. There is no EOF token line, so it is recommended that you insert an EOF pseudotoken after the last token line has been read.
Each line of the token stream has one of two forms:
Tokens are alphanumeric sequences of characters, and correspond to the terminals of the grammar. Our test token streams will only include tokens that are terminals of the language.
Lexemes correspond to the actual characters matched by the token. Note that the lexeme may include a space character (most likely as part of a string).
Positions correspond to the line and column at which a lexeme was matched. Positions consist of a left bracket, followed by a line number, followed by a column number, followed by a right bracket.
Although not a major focus of our examples, LR parsers can handle (some) grammars with production rules that produce ε. For the most part, ε behaves like a terminal, with the one exception that the item with the marker before the ε is considered to be the same item as the one with the marker after that ε for a given rule. For example,
L ::= ε
only yields the one item:
L ::= ε ·
,
which is considered to be the same item as
L ::= · ε
.
This ε-item counts as being a item with the dot at the end of the production, thus being eligible for a reduce action.
The notation used above is particularly helpful in this regard. Consider the first production from this grammar (which is equivalent to the example above):A ::= | B B ::= C B ::= hello | w o rldshould yield only the one item
A ::= ·
and no others.