Due on
September 16 |
Due on
September 16 |
None yet!
In Project 1, you used a scanner-generator (e.g., Flex). In this assignment, you will create a scanner-generator. Your scanner-generator should work much like Flex, though it will use a decidedly stripped down format.
The primary artifact produced in this assignment is an executable,
which you should call dragonlex
, which will take
an input file (a lexical specification) and produce a corresponding
lexer.
Additionally, you'll
need to submit another couple of files that help to evaluate
your code.
The exact design of the
dragonlex
-produced lexer is your choice. However, in
order to make evaluating your project reasonable, you should
submit your code alongside a makefile with the following targets:
make build
can
be a no-op.
lexer
. Note that
your dragonlex program might (as Flex does) output source code. If you
choose that behavior, then the run target should not only run dragonlex
but also compile the output source code (along with driver code you distribute)
into the runnable program named lexer
.
lexer
on a file named test.txt
and outputs the token stream to
test.tokens
.
Note that your targets should work on the cycle machines without installation of extra code.
The lexer produced by dragonlex
will be far more
constrained than Flex, that will strictly produce a token stream
of a particularly simple design. Each token should appear on its
own line, where each line is the name of the token-match, followed
by a space, followed by the line-match, as follows:
SLASH
.
The token-match for INTLIT (which does need to keep its lexeme)
is INTLIT:12
when it matches the character string
12
.
The input to dragonlex
will much simpler than that
consumed by Flex. The spec should be line-based, where each line
is of one of the form:
<regex> <action>Where
<regex>
is a regular expression, using
the operators/operands described below.
<action>
is an action that will be fired
when the corresponding regular expression is matched. Action may
be:
true
or false
.
When true, outputting the token should be suffixed with the
lexeme matched by the regular expression (as detailed in the
output specification section above).
Note that your targets should work on the cycle servers without installation of extra code.
Assume the input spec is
dog NOUN true bites VERB false m(a)+n NOUN false [ ] (SKIP) . (ERR) "bad input"
When run on this spec, dragonlex
should produce a lexer.
When that lexer is run on the input
maaan bites dog
then the lexer should output
NOUN [1,1] VERB [1,7] NOUN:dog [1,13]
Unlike Flex, dragonlex does not need to support macros. However, you should support the following operators:
Assume that only the printable ascii tokens exist in the input file.
Your dragonlex
program should generate its lexer
using the techniques described in class: transform each regular
expression in the spec to it's expression tree, build a finite
state machine from the expression tree, then build a table-based
tokenizer that yields the longest-match based token stream.
If more than one match of the same length is found, then the highest regular expression in the spec should be used.
All of the code required to generate and operate the lexer should be included in a directory named t1. Create a gzipped tarball with the t1 directory in it and submit it to the t1 folder in Blackboard.
You should consider a pre-processing phase to rewrite each regex into a simpler version. For example, you can rewrite a bracketed selection operator into a set of alternations. You may also rewrite a negated selection operator into a positive selection operator.