EECS665 – Compiler Construction – Drew Davidson


    Welcome

    Hello, and welcome to the course readings for Drew Davidson's EECS 665, "Compiler Construction". This set of course notes has been shaped by me (Drew) to serve as a written companion to the course lectures and discussions offered as part of EECS665. As such, it's written pretty informally, and will often draw on my personal observations and experience. These readings are not intended to serve as a standalone reference for compiler topics or as a guide for building a compiler. Thus, I'll start these readings with a disclaimer exactly opposite to that of most textbooks: they are not for a general audience, and they are intended to be read straight through from front to back.

    Some additional disclaimers: In the interest of brevity, these readings will occasionally adopt a definition and stick to it without fully exploring alternative definitions, or considering all implications of the definition. If you feel that a superior definition exists, please do let me know. Furthermore, the readings constitute a living document, updated and refined when possible. If you see a typo, mistake, or even just something where you'd like some more writing, please do let me know.

    While much of the material in the course readings has been updated, modified, or written by me, it is based originally on a set of readings created by Susan Horwitz at the University of Wisconsin-Madison. In the interest of providing the best possible guide to the subject, I will frequently read or re-read classic textbooks on compilers, as well as whatever course materials from other classes might be floating around on the internet. It is my intention to synthesize this information and provide wholly original material herein. That being said, if you see an example or text in the readings that you feel has not been appropriately credited to an external source, please contact me at your earliest convenience so that we can arrange for the material to be taken down or cited, or come to some other arrangement.

    Introduction

    What is a compiler? A common answer amongst my students is something like "A program that turns code into a program", which is honestly not bad (if a bit imprecise). Without dwelling overly much on the messy details, let's refine this definition and run with it. For this class, we'll say that a compiler is:

    • A recognizer of some source language S.
    • A translator (of programs written in S into programs written in some object or target language T).
    • A program written in some host language H.

    Here's a simple pictorial view:

    [Figure: a compiler, itself a program written in language H, takes as input source code written in language S and produces as output target code written in language T, along with an error log.]

    A compiler operates in phases; each phase translates the source program from one representation to another. Different compilers may include different phases, and/or may order them somewhat differently. A typical organization is shown below.

    [Figure: a typical organization of compiler phases. The source program (a sequence of characters) passes through the lexical analyzer (scanner), which produces a sequence of tokens; the syntax analyzer (parser), which produces an abstract-syntax tree (AST); the semantic analyzer, which produces an augmented, annotated AST; the intermediate code generator, which produces intermediate code; the optimizer, which produces optimized intermediate code; and the code generator, which produces the object program (assembly or machine code). Any phase may write to the error log.]

    Below, we look at each phase of the compiler, giving a high-level overview of how the code snippet position = initial + rate * 60 is represented throughout the above workflow.

    The Scanner

    The scanner performs lexical analysis, which means grouping the input characters into the fundamental units of the source language. If some character(s) of the input cannot be part of any valid group, the scanner may also report lexical errors.

    Some terminology for lexical analysis:

    • The scanner groups the characters into lexemes (sequences of characters that "go together").
    • Each lexeme corresponds to a token; the scanner forwards each token (plus maybe some additional information, such as the position at which the token was found in the input file) to the parser.

    The definitions of what counts as a lexeme, a token, or a bad character all depend on the source language.

    Example 1

    Here are some C-like lexemes and the corresponding tokens:

      lexeme:              ;         =         index     tmp       37        102

      corresponding token: semicolon assign    id        id        intlit    intlit
      

    Note that multiple lexemes can correspond to the same token (e.g., the index and tmp lexemes both correspond to id tokens).

    Example 2 (the running example)

    Given the source code:

      position  =  initial +  rate   *    60 ;
      

    a C-like scanner would return the following sequence of tokens:

      id assign id plus id times intlit semicolon
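
    To make this concrete, here's a minimal sketch, in C, of a hand-written scanner for just these tokens. It is an illustration only: the token names are hypothetical, and real scanners are usually generated automatically from regular-expression specifications by tools such as flex.

        #include <ctype.h>
        #include <stdio.h>

        /* Hypothetical token codes for the tiny language above. */
        enum token { TOK_ID, TOK_INTLIT, TOK_ASSIGN, TOK_SEMICOLON,
                     TOK_PLUS, TOK_TIMES, TOK_EOF, TOK_ERROR };

        /* Return the next token from stdin, skipping whitespace. */
        enum token next_token(void) {
            int c = getchar();
            while (isspace(c)) c = getchar();
            if (c == EOF) return TOK_EOF;
            if (isalpha(c) || c == '_') {          /* lexeme: an identifier */
                while (isalnum(c) || c == '_') c = getchar();
                ungetc(c, stdin);                  /* put back the lookahead */
                return TOK_ID;
            }
            if (isdigit(c)) {                      /* lexeme: an integer literal */
                while (isdigit(c)) c = getchar();
                ungetc(c, stdin);
                return TOK_INTLIT;
            }
            switch (c) {                           /* one-character lexemes */
            case '=': return TOK_ASSIGN;
            case ';': return TOK_SEMICOLON;
            case '+': return TOK_PLUS;
            case '*': return TOK_TIMES;
            default:  return TOK_ERROR;            /* lexical error */
            }
        }

        int main(void) {
            for (enum token t = next_token(); t != TOK_EOF; t = next_token())
                printf("%d ", t);                  /* print each token's code */
            printf("\n");
            return 0;
        }

    Fed the running example, this loop yields the token sequence above, one token per call. Note that, like a real scanner, it simply discards whitespace rather than tokenizing it.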
      
    Error Handling in the Scanner

    The errors that a scanner might report are usually due to characters that cannot be a valid part of any token.

    Erroneous characters for Java source include hash (#) and control-a.
    Erroneous characters for C source include the at-sign (@) and the backtick (`).

    The Parser

    The parser performs syntactic analysis, which means grouping the tokens of the language into constructs of the programming language (such as loops, expressions, functions, etc.). A good analogy is to think of the scanner as building the "words" of the programming language, and the parser as building the "sentences" or "grammatical phrases" of the source program.

    • Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
    • Finds syntax errors. For example, the source code
      5 position = ;
      could correspond to the sequence of tokens:
      intlit id assign semicolon
      Each of these is a legal token, but the sequence as a whole is erroneous.
    • Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
    • Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

    Example

    source code:   position = initial + rate * 60 ;

    Abstract syntax tree:

    [Figure: the abstract-syntax tree for the running example. The root is =, whose children are position and +; the + node's children are initial and a * node, whose children are rate and 60.]

    Key features of the Abstract-Syntax Tree (AST):

    • The interior nodes of the tree are operators.
    • A node's children are its operands.
    • Each subtree forms a "logical unit", e.g., the subtree with * at its root shows that, because multiplication has higher precedence than addition, rate*60 must be computed as a unit (rather than initial+rate).
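
    To make the data structure concrete, here's one possible way to represent such a tree in C. This is only a sketch under assumed names; a real compiler would have many more node kinds and would also store source positions for error reporting.

        #include <stdlib.h>

        /* Hypothetical AST node kinds. Interior nodes are operators; leaves are
           identifiers or integer literals. K_I2F (an int-to-float conversion)
           is inserted later, by the semantic analyzer. */
        enum kind { K_ASSIGN, K_ADD, K_MUL, K_ID, K_INTLIT, K_I2F };

        struct ast {
            enum kind kind;
            struct ast *left, *right;   /* operands; NULL for leaves */
            const char *name;           /* K_ID leaves only */
            int value;                  /* K_INTLIT leaves only */
        };

        static struct ast *node(enum kind k, struct ast *l, struct ast *r) {
            struct ast *n = calloc(1, sizeof *n);
            n->kind = k; n->left = l; n->right = r;
            return n;
        }

        static struct ast *leaf_id(const char *name) {
            struct ast *n = node(K_ID, NULL, NULL);
            n->name = name;
            return n;
        }

        static struct ast *leaf_int(int value) {
            struct ast *n = node(K_INTLIT, NULL, NULL);
            n->value = value;
            return n;
        }

        /* The tree a parser might build for: position = initial + rate * 60 */
        struct ast *running_example(void) {
            return node(K_ASSIGN, leaf_id("position"),
                        node(K_ADD, leaf_id("initial"),
                             node(K_MUL, leaf_id("rate"), leaf_int(60))));
        }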

    Reality Check!

    While the above presentation has implied that syntactic analysis begins strictly after lexical analysis has been completed, the two phases are often interleaved, with the parser serving as the "driver" module: the parser asks the lexical analyzer for one token at a time, and each token is then added to the AST or rejected as syntactically invalid. Thus, the AST is built incrementally. One reason for this behavior is that if problems are detected early in the parse, the compiler can emit an error quickly. This "fail-fast" behavior is a major design goal in compiler construction, as it can save users time and compute cycles.

    The Semantic Analyzer

    The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors. It may also annotate and/or change the abstract syntax tree (e.g., it might annotate each node that represents an expression with its type).

    Example:

    Abstract syntax tree before semantic analysis: the tree produced by the parser, as shown above.

    Abstract syntax tree after semantic analysis:

    [Figure: the AST after semantic analysis. Each expression node is now annotated with a type: position, initial, rate, =, +, and * are float, while the literal 60 is int. An intToFloat node (a type promotion) has been inserted between * and 60 to convert the int operand to a float.]
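
    Reusing the hypothetical struct ast (and node constructor) from the parser section, a bare-bones version of this pass might look like the sketch below. It computes the type of each expression and splices in a K_I2F (intToFloat) node wherever an int operand meets a float operand. To keep the sketch self-contained, all identifiers are simply assumed to be floats; a real analyzer would consult a symbol table.

        enum type { T_INT, T_FLOAT };

        /* Return the type of expression n, inserting intToFloat conversion
           nodes as needed so that mixed int/float operations become all-float. */
        enum type check(struct ast *n) {
            switch (n->kind) {
            case K_INTLIT: return T_INT;
            case K_ID:     return T_FLOAT;   /* assumption: every id is a float */
            case K_I2F:    return T_FLOAT;
            default: {                       /* K_ASSIGN, K_ADD, K_MUL */
                enum type lt = check(n->left);
                enum type rt = check(n->right);
                if (lt == T_FLOAT && rt == T_INT) {
                    n->right = node(K_I2F, n->right, NULL);  /* promote */
                    rt = T_FLOAT;
                }
                if (rt == T_FLOAT && lt == T_INT) {
                    n->left = node(K_I2F, n->left, NULL);
                    lt = T_FLOAT;
                }
                return T_FLOAT;              /* both operands are now float */
            }
            }
        }

    On the running example, the only change is the one shown in the figure: the literal 60 is an int operand of a float multiplication, so it gets wrapped in an intToFloat node.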

    The Intermediate Code Generator

    The intermediate code generator translates from abstract-syntax tree to intermediate code. One possibility is 3-address code (code in which each instruction involves at most 3 operands). Below is an example of 3-address code for the abstract-syntax tree shown above. Note that in this example, the second and third instructions each have exactly three operands (the location where the result of the operation is stored, plus two source operands); the first and fourth instructions have just two operands ("temp1" and "60" for instruction 1, and "position" and "temp3" for instruction 4).

    		temp1 = inttofloat(60)
    		temp2 = rate * temp1
    		temp3 = initial + temp2
    		position = temp3
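
    Continuing the sketch from the earlier sections: this phase can be implemented as a recursive walk over the annotated AST, where each operator node emits one instruction into a fresh temporary and returns that temporary's name. As before, this is illustrative code under assumed names, not the only way to do it.

        #include <stdio.h>
        #include <stdlib.h>

        static const char *fresh_temp(void) {
            static int counter = 0;
            char *name = malloc(16);
            snprintf(name, 16, "temp%d", ++counter);
            return name;                       /* leaked; fine for a demo */
        }

        /* Emit three-address code for expression n; return the name of the
           place (variable, literal, or temporary) that holds its value. */
        static const char *gen(struct ast *n) {
            const char *name;
            switch (n->kind) {
            case K_ID:
                return n->name;
            case K_INTLIT: {
                char *lit = malloc(16);
                snprintf(lit, 16, "%d", n->value);
                return lit;
            }
            case K_I2F:
                name = fresh_temp();
                printf("%s = inttofloat(%s)\n", name, gen(n->left));
                return name;
            case K_ADD:
            case K_MUL: {
                const char *l = gen(n->left), *r = gen(n->right);
                name = fresh_temp();
                printf("%s = %s %c %s\n", name, l,
                       n->kind == K_ADD ? '+' : '*', r);
                return name;
            }
            default:                           /* K_ASSIGN */
                name = gen(n->right);
                printf("%s = %s\n", n->left->name, name);
                return n->left->name;          /* the assignment's target */
            }
        }

    Run on the annotated AST for the running example, this prints exactly the four instructions shown above.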
    

    The Optimizer

    The optimizer tries to improve code generated by the intermediate code generator. The goal is usually to make code run faster, but the optimizer may also try to make the code smaller. In the example above, an optimizer might first discover that the conversion of the integer 60 to a floating-point number can be done at compile time instead of at run time. Then it might discover that there is no need for "temp1" or "temp3". Here's the optimized code:

    		temp2 = rate * 60.0
    		position = initial + temp2
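
    In terms of the AST sketch from earlier, the first of these improvements is a small constant-folding pass: wherever an intToFloat node sits directly above an integer literal, the conversion can be done now rather than at run time. A minimal version might look like this (it assumes a K_FLOATLIT node kind and a double fvalue field have been added to struct ast; eliminating temp1 and temp3 would be a separate copy-propagation pass over the intermediate code).

        /* Fold inttofloat(<int literal>) into a float literal at compile time.
           Takes a pointer to the child slot so the conversion node can be
           spliced out of the tree. */
        static void fold_conversions(struct ast **slot) {
            struct ast *n = *slot;
            if (n == NULL) return;
            fold_conversions(&n->left);
            fold_conversions(&n->right);
            if (n->kind == K_I2F && n->left->kind == K_INTLIT) {
                struct ast *lit = n->left;
                lit->kind = K_FLOATLIT;            /* assumed extra node kind */
                lit->fvalue = (double)lit->value;  /* 60 becomes 60.0 now */
                *slot = lit;                       /* splice out the conversion */
                free(n);
            }
        }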
    

    The Code Generator

    The code generator generates assembly code from (optimized) intermediate code. For example, the following x64 code might be generated for our running example:

        movl    rate(%rip), %eax        # load rate into %eax
        imull   $60, %eax, %edx         # %edx = rate * 60
        movl    initial(%rip), %eax     # load initial into %eax
        addl    %edx, %eax              # %eax = initial + rate*60
        movl    %eax, position(%rip)    # store the result into position
    

    It's worth noting that the above representation of the running example is still a text file, though one that the compiler proper considers to be the final output. There is still work left to be done before we have an executable program. Usually, a collection of programs, collectively referred to as a compiler toolchain, works in sequence to build the executable:

    • The assembler transforms an assembly code file (such as one produced by the compiler) into a relocatable object file (typically given the .o extension). Object files may in turn be bundled into libraries, either statically-linked (typically given the .a extension) or dynamically-linked (typically given the .so extension).
    • The linker merges the various object files and static libraries into a single executable; references to dynamic libraries are left to be resolved when the program is loaded.

    When it comes time for the program to actually be run, the loader maps the executable file into memory (alongside whatever dynamic libraries may be required) as a process, and then hands control to the program so that execution can begin. In modern workstation computing, the loader is integrated into the operating system, but in some legacy, specialized, and embedded systems the loader is still a distinct program.

    Students who use compilers like gcc or clang are often surprised at the existence of distinct "post-compilation" steps to produce an executable. Indeed, the term "compiler" is often used to refer to the full "compiler toolchain".

    Nevertheless, the workflow from compilation to assembly to linkage is represented as a series of distinct modules within many popular compilers, and each of the tools can be invoked separately. For example, the gcc toolchain for C is built from cc1 (the C compiler proper), as (the GNU assembler), and ld (the GNU linker).
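
    For the curious, here is one way to watch the stages run separately on a typical Linux system (the file names are illustrative; gcc is used for the final step because invoking ld by hand would also require naming the C runtime startup files):

        gcc -S prog.c -o prog.s    # compiler proper: C source -> assembly
        as prog.s -o prog.o        # assembler: assembly -> relocatable object file
        gcc prog.o -o prog         # link: object file(s) -> executable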