In this project, you will begin construction of
dmc, the compiler for
our new language, named (sigh) "
We will continue to build on the codebase started in this project throughout
the semester until we have a fully-functioning compiler.
For this first assignment, you will implement the scanner module using
the scanner-generator Flex.
The scanner of
dmc will transform
character sequences from an input file into a token stream.
The ultimate intention is that the input file will contain
what the user expects to be valid Drewno Mars source code
(though it may be invalid / contain errors).
Of course, the scanner only cares if the input contains
valid tokens so don't expect that input will correspond to
a valid program or try to identify errors beyond those listed below.
In later projects, the token stream will be consumed internally as part
of the compiler's frontend, but for now the code should dump the token
stream in text format, and report any errors to stderr.
The main "action" of the project will be building a lexical specification (a .l file) that recognizes tokens and reports errors according to the format listed below.
You should submit a single tarred-and-zipped directory containing source code and a makefile such that the following invocations build and execute your code:
./dmc <infile.dm> -t <tokens.txt> 2> <errors.txt>
You should include a file named TEAM.txt in your tarball with
each person who worked on the project included in the following format.
You may work with a partner for the project. If you do, include both names. Otherwise, include only your name in
You should expect the input to be a file containing ASCII characters. The lexical details of Drewno Mars are presented in the lexical details section of the language specification. That document should serve as a complete guide for all of the tokens you need to recognize.
Whitespace separates tokens and changes the character counter, but should otherwise be ignored (except inside a string literal).
If you are not sure which token kind matches which token, ask!
The token stream should be output with 1 token per line in the form
<tokenspec> [<l>,<c>]Where <tokenspec> is the name of the token for all tokens except for:
where <string> is the string lexeme matched (including quotation marks).
Newline and tab specifiers should be printed literally.
where <values> is the integer lexeme matched.
where <lexeme> is the lexeme matched.
The scanner should handle the following errors
should be replaced by the position of the first line and character of the error, respectively.
<2 should be replaced by the position of the last line and
character of the error, respectively.)
Issue the error message:
Illegal character <ch>
<ch> is the illegal
character) and resume lexing after the character.
Unterminated string literals:
A string literal is considered to be unterminated if there is a
newline or end-of-file before the closing quote.
Issue the error message:
Unterminated string literal detected
and continue past the unterminated string literal. Start looking for the
next token after the newline.
Bad escape sequences in string literals:
This case occurs if a string inclues a bad
escape sequence; i.e., a backslash followed by a q (since
\q isn't part of Drewno Mars).
Issue the error message:
String literal with bad escape seqeunce detected
and ignore the string literal and start looking for the
next token after the closing quote.
Bad escape sequences in unterminated string literals:
If the string literal has a bad escape sequence
and is unterminated, issue the error message
Unterminated string literal with bad escaped sequence detected
Start looking for the next token after the newline.
Note that a string literal that has a newline immediately after a
backslash should be treated as having a bad escaped character
and being unterminated. For example, given:
"very bad string \
If an integer literal is larger than the 32-bit
of C (i.e. 2147483647)
issue the error message
Integer literal overflow.
Note that because negative numbers are lexed as two tokens, underflow is not reported.Line/Column numbers
For unterminated string literals, bad string literals, and bad integer literals, the line and column numbers used in the error message should correspond to the position of the first character in the string/integer literal.
It would be super cool and totally possible to complete this project from scratch, using only your own code. However, you may choose to use the "starter code". This code contains the basic structure of a reasonable solution for the project. The starter code is provided here: p1.tgz
Hopefully, you can see what this code does by looking at it (it would be good practice to poke around), but there is also some documentation about these files written on the Project 1 doc page.
If you want to set up Flex on your own machine, it's
available in Ubuntu via
apt install flex. If you
are working on an EECS Linux machine, it will already be
Docs: If you need more guidance with Flex for generating lexers, check out the on-line Flex reference manual, and/or the class Flex notes for information about writing a Flex specification.
Gotchas: Flex-generated scanners have some behaviors that can appear surprising to beginners. A couple of things to remember:
and you can see them in action in
For example, the kind for the token to be returned when an integer
literal is recognized is INTLITERAL and the token to be
returned when the reserved word int is recognized is
Note that code telling Flex to return the special EOF token on
end-of-file has already been included in the file
Drewno Mars.l -- you don't have to include
a specification for that token.
The main job of the scanner is to identify and return the next token. The value to be returned includes:
Your scanner will return this information by creating a new Token
object in the action associated with each regular expression that defines a
token (the Token type is defined externally; you don't need to
look at that definition). A Token includes a field of type
int for the token kind, and a field of store auxilary information
about the token (its line, column, and value - if it needs to store a value).
The given .l file already includes an example of how to call the
Token (subclass) constructors.
tokens.cpp file has example
code that accesses the fields of a
In your compiler, the
Token will actually be subclassed, with
a value() function to get the actual value;
that type is defined in lexer.l.
Every TokenVal includes a linenum field, and a
Subtypes of Token with more fields will be used for
the values associated with identifier, integer literal, and string literal
tokens. These subtypes, IntLitToken,
IDToken, and StrLitToken are also defined in
You should keep track of the line and character number, but it's not really
that hard. Line counting is already done for you in the sample file, and column
counting can be done by adjusting the charNum variable according to the amount
of characters consumed.
The code in
does this for some of the patterns that it defines, and you should be able
to figure out how to do the same thing for the new patterns that you add.
The Flex scanner also provides a macro
yytext that returns
the actual text that matches a regular expression.
You will find it useful to use this method in the actions you write
in your Flex specification.
Note that, for the integer literal token, you will need to convert a string (the value scanned) to an int (the value to be returned). You should use code like the following:
double d = std::stod(yytext); // convert String to double // INSERT CODE HERE TO CHECK FOR BAD VALUE -- SEE ERRORS AND WARNINGS BELOW int k = atoi(yytext) // convert to int
A good practice is to write test cases for your project, as discussed in the Flex Lab. You should run your test suite by invoking the command
which runs the test rule in the Makefile (which should, in turn, invoke each element of the test suite).
To test that your scanner correctly handles an unterminated string literal with end-of-file before the closing quote, you will need to create a file that does NOT end in a newline, which many editors do automatically.
On a Linux machine, you can tell that there is no final newline by
typing: cat eof.txt or od -a eof.txt
od will print the newline explicitly. For cat, you should see your command-line prompt at the end of the last line of the output instead of at the beginning of the following line for cat, and the file will not display a newline.
In order to create a file with no final newline, you can issue the command
echo -n "Hello world" > no_newline.txtWhich will create a file "no_newline.txt" containing the text "Hello world" with no newline afterwards. You may also want to create a file with a quote, which can be done using
echo -n "\"Hello world" > no_newline.txt