Trial 3
Due on October 2nd&nbs-05:00;11:59&nbs-05:00;PM (Not accepted late).
Updates

None yet!

Overview

For this trial, you will create a parser-generator named dragonparse, which will take a BNF-like grammar and produce a corresponding SLR parser.

Like Trial 1, you may use whatever coding language you like, subject to the following constraints:

  • Your code must build and run on the cycle servers under a non-privileged user account
  • Your program must be self-contained (no code in other user directories)
  • Your work must be substantially your own (no cheating)

Note that for this trial, the generated parser only checks for language membership, it does not do syntax-directed translation. Your programs should adhere to formats given below for the grammar file (given to the parser-generator), and the tokens file (given to the generated parser).

Deliverables

You should turn in a tarred directory (similar to a Trial 1 submission) containing the following:

  • The source files for your parser-generator
  • A test grammar on which to run the parser-generator, called input.grammar
  • A test token stream on which to run the generator parser, called test.tokens
  • A Makefile with the following three targets:
    • build: This target will build the dragonparse executable from source code (you may use a language that does not require compilation, but make build should still succeed as a no-op.)
    • run: This target will run the dragonparse executable on a file named input.grammar containing a grammar spec. If the grammar is not SLR, dragonparse should print "Bad grammar" and terminate with a non-zero exit code. If the grammar is SLR, dragonparse should generate a parser executable named parser capable of recognizing token streams in the language specified by input.grammar.

      If your dragonparse executable produces source code (like bison), then this target should also compile that source code into the parser executable.

    • test: This target will run the parser executable on a file named test.tokens representing a token stream in the language. If the token stream is a valid string in the language, parser should print "accepted" to the output stream and terminate with exit code 0. If the token stream is not a valid string in the language, parser should print "rejected" to the output stream and terminate with exit code 1.

Grammar Format

Your dragonparse program should take an input grammar in BNF-like format, as follows:

Production rules

The grammar will consist of a set of production rules, which may take one of two forms:

  1. A single LHS non-terminal symbol, then the literal characters ::=, then a RHS symbol string, then a terminating newline character (\n)
  2. The literal character |, then a RHS symbol string, then a terminating newline character

If the second form is used, then the LHS symbol should be assumed to be the last LHS nonterminal specified in the grammar. The first form may be used multiple times to specify distinct rules with the same LHS symbol. The second form may only appear after the first form. The first LHS symbol should be considered the start symbol.

Symbols

Symbols (both terminals and nonterminals) are sequences of one or more alphanumeric characters, delimited by the space character. The set of nonterminal symbols are differentiated by appearing on the LHS of the grammar - any symbol that does not appear on the LHS of some production rule is considered to be a terminal.

Symbol Strings

Symbol strings consist of zero or more space-delimited symbols. If a symbol string has zero symbols, it specifies an ε-production rule

Example

A ::= B
   |
B ::= C
B ::= hello
   | w o rld
		

The set of non-terminals of this grammar is {A,B} (since those are the only symbols to appear on the LHS of the production rules). The set of terminals of this grammar is {C,hello,w,o,rld}. The nonterminal A may produce a B or produce the empty string (via production 2). The nonterminal B may produce the symbol hello, the single symbol C, or the three terminals w,o,rld.

The above format is designed for simplicity. While your adherence to this spec is greatly appreciated in grading, parsing the input is not really the focus of this assisgnment. You should not focus on rejecting malformed input grammars or on gotchas in the grammar spec.

Token Stream Format

The parsers generated by dragonparse should take an input file formatted as a stream of terminal tokens. The intention is to match the output of a lexer generated by the code in Trial 1. The format is as follows:

Token Streams

The token stream consists of zero or more token lines. There is no EOF token line, so it is recommended that you insert an EOF pseudotoken after the last token line has been read.

Token Lines

Each line of the token stream has one of two forms:

  1. A terminal symbol, followed by a space literal, followed by a position, followed by a newline character literal.
  2. A terminal symbol, followed by a colon literal, followed by a lexeme, followed by a space literal, followed by a position, followed by a newline character literal.

Tokens

Tokens are alphanumeric sequences of characters, and correspond to the terminals of the grammar. Our test token streams will only include tokens that are terminals of the language.

Lexemes

Lexemes correspond to the actual characters matched by the token. Note that the lexeme may include a space character (most likely as part of a string).

Positions

Positions correspond to the line and column at which a lexeme was matched. Positions consist of a left bracket, followed by a line number, followed by a column number, followed by a right bracket.

Advice
This section contains some tips for this trial, and is not required reading.

Handling ε-Production Rules

Although not a major focus of our examples, LR parsers can handle (some) grammars with production rules that produce ε. For the most part, ε behaves like a terminal, with the one exception that the item with the marker before the ε is considered to be the same item as the one with the marker after that ε for a given rule. For example,

L ::= ε only yields the one item: L ::= ε · , which is considered to be the same item as L ::= · ε .

This ε-item counts as being a item with the dot at the end of the production, thus being eligible for a reduce action.

The notation used above is particularly helpful in this regard. Consider the first production from this grammar (which is equivalent to the example above):
A ::= 
   |  B
B ::= C
B ::= hello
   | w o rld
		
should yield only the one item A ::= · and no others.