EECS665 – Compiler Construction

Project 1

Due on September 9th 11:59 PM (Accepted with no penalty).

Accepted late by September 10th 11:59 PM for 1 penalty

Accepted late by September 11th 11:59 PM for 2 penalties

Not accepted on or after September 12th 12:00 AM

Each penalty incurs a 15% stacking reduction on the project grade or uses 1 "penalty token". Penalty tokens are applied by course personnel to your maximum benefit.

Updates

None yet!

Resources

For a list of the tokens in A, check out the lexical details page
The starter code is here: p1.tgz
If you're confused about the correct output for a specific case, you can submit a sample to the Project Oracle and it will respond with the expected output.

Overview

In this project, you will begin construction of ac, the compiler for our new language, named (sigh) "A". We will continue to build on the codebase started in this project throughout the semester until we have a fully-functioning compiler. For this first assignment, you will implement the scanner module using the scanner-generator Flex.

The scanner of ac will transform character sequences from an input file into a token stream. The ultimate intention is that the input file will contain what the user expects to be valid A source code (though it may be invalid / contain errors). Of course, the scanner only cares if the input contains valid tokens so don't expect that input will correspond to a valid program or try to identify errors beyond those listed below. In later projects, the token stream will be consumed internally as part of the compiler's frontend, but for now the code should dump the token stream in text format, and report any errors to stderr.

The main "action" of the project will be building a lexical specification (a .l file) that recognizes tokens and reports errors according to the format listed below.

Requirements
This section contains information that affects how your project will be graded. It should be considered mandatory reading.

Your submission

You should submit a single tarred-and-zipped directory containing source code and a makefile such that the following invocations build and execute your code:


	make


./ac <infile.a> -t <tokens.txt> 2> <errors.txt>

(FYI: what the 2> means)

Where

<infile.a> is an input file to the compiler (described in detail below).
<tokens.txt> is the file to which the token stream will be written in the required output format (described in detail below).
<errors.txt> is the file to which the required error messages will be written in the required error message format (described in detail below).

You should include a file named TEAM.txt in your tarball with each person who worked on the project included in the following format. (lastname1,firstname1) (lastname2,firstname2) You may work with a partner for the project. If you do, include both names. Otherwise, include only your name in TEAM.txt.

Input Format

You should expect the input to be a file containing ASCII characters. The lexical details of A are presented in the lexical details section of the language specification. That document should serve as a complete guide for all of the tokens you need to recognize.

Whitespace separates tokens and changes the character counter, but should otherwise be ignored (except inside a string literal).

If you are not sure which token kind matches which token, ask!

Output Format

The token stream should be output with 1 token per line in the form

<tokenspec> [<l>,<c>]

Where <tokenspec> is the name of the token for all tokens except for:

string literals (STRINGLITERAL), which will be of the form STRINGLIT:<string> where <string> is the string lexeme matched (including quotation marks). Newline and tab specifiers should be printed literally.
integer literals (INTLITERAL), which will be of the form INTLIT:<value> where <values> is the integer lexeme matched.
Identifiers (ID), which will be of the form ID:<lexeme> where <lexeme> is the lexeme matched.

Errors and Warnings

The scanner should handle the following errors (<l_{1_>} and <c₁> should be replaced by the position of the first line and character of the error, respectively. <₂ and <₂ should be replaced by the position of the last line and character of the error, respectively.)

Illegal characters: Issue the error message: FATAL [<l₁>,<c₁>-<l₂>,<c₂>]: Illegal character <ch> (where <ch> is the illegal character) and resume lexing after the character.

Unterminated string literals: A string literal is considered to be unterminated if there is a newline or end-of-file before the closing quote. Issue the error message: FATAL [<l₁>,<c₁>-<l₂>,<c₂>]: Unterminated string literal detected and continue past the unterminated string literal. Start looking for the next token after the newline.

Bad escape sequences in string literals: This case occurs if a string inclues a bad escape sequence; i.e., a backslash followed by a q (since \q isn't part of A). Issue the error message: FATAL [<l₁>,<c₁>-<l₂>,<c₂>]: String literal with bad escape seqeunce detected and ignore the string literal. Start looking for the next token after the closing quote.

Bad escape sequences in unterminated string literals: If the string literal has a bad escape sequence and is unterminated, issue the error message FATAL [<l₁>,<c₁>-<l₂>,<c₂>]: Unterminated string literal with bad escaped sequence detected Start looking for the next token after the newline. Note that a string literal that has a newline immediately after a backslash should be treated as having a bad escaped character and being unterminated. For example, given:


"very bad string \

abc

the scanner should report an unterminated string literal with a bad escaped character on line 1, and an identifier on line 2.

Integer literals overflow

If an integer literal is larger than the 32-bit INT_MAX of C (i.e. 2147483647) issue the error message FATAL [<l₁>,<c₁>-<l₂>,<c₂>]: Integer literal overflow.

Note that because negative numbers are lexed as two tokens, underflow is not reported.

Line/Column numbers

For unterminated string literals, bad string literals, and bad integer literals, the line and column numbers used in the error message should correspond to the position of the first character in the string/integer literal.

Advice
This section contains tips and guidance. It may be helpful, but it's optional reading.

Getting Started

It would be super cool and totally possible to complete this project from scratch, using only your own code. However, you may choose to use the "starter code". This code contains the basic structure of a reasonable solution for the project. The starter code is provided here: p1.tgz

Hopefully, you can see what this code does by looking at it (it would be good practice to poke around), but there is also some documentation about these files written on the Project 1 doc page.

Flex

Setup: If you want to set up Flex on your own machine, it's available in Ubuntu via apt install flex. If you are working on an EECS Linux machine, it will already be installed.

Docs: If you need more guidance with Flex for generating lexers, check out the on-line Flex reference manual, and/or the class Flex notes for information about writing a Flex specification.

Gotchas: Flex-generated scanners have some behaviors that can appear surprising to beginners. A couple of things to remember:

Scanners start greedily consuming characters from the end of the last match until no longer token can possibly be produced. If there are multiple tokens that could all be of that same longest length, the rule that appears earliest in the Flex file is used. Thus, you should make sure your reserved keywords appear in your Flex file before your identifiers to ensure keywords pre-empt identifiers.
Token "kinds" (i.e., values to be returned by the scanner) are defined in the file frontend.hh, and you can see them in action in scanner.cpp, For example, the kind for the token to be returned when an integer literal is recognized is INTLITERAL and the token to be returned when the reserved word int is recognized is INT.
Note that code telling Flex to return the special EOF token on end-of-file has already been included in the file A.l -- you don't have to include a specification for that token.

What the Scanner Should Do

The main job of the scanner is to identify and return the next token. The value to be returned includes:

The token "kind" (e.g., INTLITERAL). Token kinds are defined in the file frontend.hh.
The line number in the input file on which the token starts.
The number of the character on that line at which the token starts.
For identifiers, integer literals, and string literals: the actual value (a String, an int, or a String, respectively). For a string literal, the value should include the double quotes that surround the string, as well as any backslashes used inside the string as part of an "escaped" character.

Your scanner will return this information by creating a new Token object in the action associated with each regular expression that defines a token (the Token type is defined externally; you don't need to look at that definition). A Token includes a field of type int for the token kind, and a field of store auxilary information about the token (its line, column, and value - if it needs to store a value). The given .l file already includes an example of how to call the Token (subclass) constructors. The given tokens.cpp file has example code that accesses the fields of a Token.

In your compiler, the Token will actually be subclassed, with a value() function to get the actual value; that type is defined in lexer.l. Every TokenVal includes a linenum field, and a charnum field. Subtypes of Token with more fields will be used for the values associated with identifier, integer literal, and string literal tokens. These subtypes, IntLitToken, IDToken, and StrLitToken are also defined in A.l.

You should keep track of the line and character number, but it's not really that hard. Line counting is already done for you in the sample file, and column counting can be done by adjusting the charNum variable according to the amount of characters consumed. The code in A.l does this for some of the patterns that it defines, and you should be able to figure out how to do the same thing for the new patterns that you add.

The Flex scanner also provides a macro yytext that returns the actual text that matches a regular expression. You will find it useful to use this method in the actions you write in your Flex specification.

Note that, for the integer literal token, you will need to convert a string (the value scanned) to an int (the value to be returned). You should use code like the following:

double d = std::stod(yytext); // convert String to double
// INSERT CODE HERE TO CHECK FOR BAD VALUE -- SEE ERRORS AND WARNINGS BELOW
int k =  atoi(yytext)    // convert to int

Testing

A good practice is to write test cases for your project, as discussed in the Flex Lab. You should run your test suite by invoking the command

make test

which runs the test rule in the Makefile (which should, in turn, invoke each element of the test suite).

To test that your scanner correctly handles an unterminated string literal with end-of-file before the closing quote, you will need to create a file that does NOT end in a newline, which many editors do automatically.

On a Linux machine, you can tell that there is no final newline by typing: cat eof.txt or od -a eof.txt
od will print the newline explicitly. For cat, you should see your command-line prompt at the end of the last line of the output instead of at the beginning of the following line for cat, and the file will not display a newline.

In order to create a file with no final newline, you can issue the command

echo -n "Hello world" > no_newline.txt

Which will create a file "no_newline.txt" containing the text "Hello world" with no newline afterwards. You may also want to create a file with a quote, which can be done using

echo -n "\"Hello world" > no_newline.txt