Trial 1
Due on September 16th 11:59 PM (Not accepted late).
Updates

None yet!

Overview

In Project 1, you used a scanner-generator (e.g., Flex). In this assignment, you will create a scanner-generator. Your scanner-generator should work much like Flex, though it will use a decidedly stripped down format.

Deliverables

The primary artifact produced in this assignment is an executable, which you should call dragonlex, which will take an input file (a lexical specification) and produce a corresponding lexer. Additionally, you'll need to submit another couple of files that help to evaluate your code. The exact design of the dragonlex-produced lexer is your choice. However, in order to make evaluating your project reasonable, you should submit your code alongside a makefile with the following targets:

  • build: This target will build the dragonlex executable from source code (you are free to use a language that does not require compilation, in which case make build can be a no-op.
  • run: This target will run the dragonlex executable on a file named A.spec, where input.spec is a sample input specification that corresponds to the lexical specification of the A language. This target should generate a scanner executable named lexer. Note that your dragonlex program might (as Flex does) output source code. If you choose that behavior, then the run target should not only run dragonlex but also compile the output source code (along with driver code you distribute) into the runnable program named lexer.
  • test: This target tests out the scanner named lexer on a file named test.txt and outputs the token stream to test.tokens.
  • clean: This target should remove any code generated by dragonlex or any other makefile target.

Note that your targets should work on the cycle machines without installation of extra code.

Output Specification

The lexer produced by dragonlex will be far more constrained than Flex, that will strictly produce a token stream of a particularly simple design. Each token should appear on its own line, where each line is the name of the token-match, followed by a space, followed by the line-match, as follows:

  • The token-match is the name of the token and, if the spec indicates that the lexeme should be kept, is suffixed by a colon and then the lexeme. If the spec indicates that the lexeme should not be kept, the token-match is just the name of the token. For example, the token-match for the SLASH token (which does not need to keep its lexeme), is just SLASH. The token-match for INTLIT (which does need to keep its lexeme) is INTLIT:12 when it matches the character string 12.
  • The line-match indicates where in the input stream the token started. In particular, it should be a left-bracket, followed by the line, followed by a comma, followed by the column, followed by a right-bracket.
Finally, the end of the spec should output EOF [<L>,<C<] where L is the end of file line and C is the end of file column. Essentially, the output from the scanner output from dragonlex should match the output from project 1.

Input Specification

The input to dragonlex will much simpler than that consumed by Flex. The spec should be line-based, where each line is of one of the form:

			<regex> <action>
			
Where
  • <regex> is a regular expression, using the operators/operands described below.
  • <action> is an action that will be fired when the corresponding regular expression is matched. Action may be:
    • (SKIP)
      When the regex is matched, do nothing. This can be used for discarding whitespace.
    • (ERR) <string> When the regex is matched, output the string to stderr and do not return a token.
    • <token> <keep> When the regex is matched, output the given token to the output file. The token may be any alphanumeric string. The keep value may be either true or false. When true, outputting the token should be suffixed with the lexeme matched by the regular expression (as detailed in the output specification section above).

Note that your targets should work on the cycle servers without installation of extra code.

Example

Assume the input spec is

      dog NOUN true
      bites VERB false
      m(a)+n NOUN false
      [ ] (SKIP)
      . (ERR) "bad input"
			

When run on this spec, dragonlex should produce a lexer. When that lexer is run on the input

      maaan bites dog
      

then the lexer should output

      NOUN [1,1]
      VERB [1,7]
      NOUN:dog [1,13]
      

Regular Expression Format

Unlike Flex, dragonlex does not need to support macros. However, you should support the following operators:

  • Parentheses for grouping
  • . for any character except for newline
  • \t for tab
  • \" for double-quote
  • \' for single-quote
  • \\ for backslash literal
  • \n for newline
  • \_ for space
  • | for alternation
  • ? for optional
  • * for recursive transitive closure
  • + for transitive closure
  • Concatentation by putting two characters next to each other
  • [ ] for selecting a single character from a set
  • [ - ] for selecting a single character in a range
  • [^ ] for inverting a selection

Assume that only the printable ascii tokens exist in the input file.

Behavior

Your dragonlex program should generate its lexer using the techniques described in class: transform each regular expression in the spec to it's expression tree, build a finite state machine from the expression tree, then build a table-based tokenizer that yields the longest-match based token stream.

If more than one match of the same length is found, then the highest regular expression in the spec should be used.

Submission

All of the code required to generate and operate the lexer should be included in a directory named t1. Create a gzipped tarball with the t1 directory in it and submit it to the t1 folder in Blackboard.

Advice

You should consider a pre-processing phase to rewrite each regex into a simpler version. For example, you can rewrite a bracketed selection operator into a set of alternations. You may also rewrite a negated selection operator into a positive selection operator.