EECS665 – Compiler Construction – Drew Davidson $\newcommand{\bigsqcap}{\mathop ⨅}$



    Topic IntroductionL

    A fundamental consideration in the design and implementation of any programming language is to determine which operations are invalid: if the programmer can express what values are valid or acceptable, automated tools can be leveraged to ensure that values adhere to that notion of acceptability.

    One of the key ways that programmers express what is acceptable in the language is through the use of types. While most programmers have an intuitive understanding of the concept of types, these notions need to be formalized in order to implement a programming language.

    Type VocabularyL

    In order to work with types, we will need a definition. As with many topics in compilers and programming languages, the wide variety of uses for the term "type" can make it difficult to arrive at a meaningful definition that is also broad enough to capture its practical use. For our purposes, we will define a type as a constraint upon the values that a particular program construct may take. For example, if we declare a variable v to be of type int32, we specify the constrains that the value is an integer that can be represented in 32-bits (and thus carries a maximum and minimum value defined by the integer representation). As another example, we might declare a variable u to be of type unsigned int32, which specifies the constraint that the value must be non-negative and thus raises the minimum value that u may hold compared to v. As a result, most implementations would use the extra bits that otherwise would be used to represent negative values in order to represent larger positive values, and thus u may take on larger values than v.

    In order to consider the full generality of even the most popular programming languages, it is worthwhile to consider some of the things that are not requirements on types:

    • Types need not be specified explicitly by the programmer. Some languages (such as assembly languages) do not attach a type to a "variable" at all. Instead, the operation carries with it a specifier for what a typing on the operands.
    • The language may internally assign types to program constructs at compile time when the types are not explicitly identified by the programmer, which is known as type inference. Type inference relies on algorithmic definitions to determine to determine a possible assignment of types to program constructs (and will often reject a program if no possible typing can be derived).
    • The language may not require that the types of constructs remain the same throughout the lifetime of the program's execution, or may not assign types until the program is run. Such languages are known as dynamically-typed languages, in contrast to the statically-typed languages that assign types at compile time.

    Type RulesL

    The type rules of a language define how to determine expression types, and what is considered to be an error. The type rules specify, for every operator (including assignment), what types the operands can have, and what is the type of the result. For example, both C++ and Java allow the addition of an int and a double, and the result is of type double. However, while C++ also allows a value of type double to be assigned to a variable of type int, Java considers that an error.

    Types Self Study

    List as many of the operators that can be used in a Java program as you can think of (don't forget to think about the logical and relational operators as well as the arithmetic ones). For each operator, say what types the operands may have, and what is the type of the result.

    Type CheckingL

    Once type rules are established in a language, the effect of those rules need to be respected throughout the program. The strictness of the type rules varies from language to language, so the amount of work that the compiler has to do to ensure adherence to the type rules varies. Determining violations of the type rules is referred to as type checking.

    Determining Expression Types

    Expressions are entities in a program that yield a value, and thus it can be said that expressions have a type (which is a shorthand for saying that the value yielded by the expression is of that type). Determining the type held by a particular variable or used in a (sub)expression is often simply called typing. A program for which there is no valid typing is referred to being ill-typed, and a program for which a valid typing is assigned is referred to as being well-typed.

    The wide variety of languages, with their various type rules, mean that different techniques for typing. In a statically-typed language, types are determined at compile time. In a dynamically-typed language, the type of an expression is determined at runtime.

    Static typing is typically associated with languages that are compiled to native code, because different machine code instructions need to be generated based on the operands of an expression. For example, many hardware architectures have distinct floating-point and integer division instructions. Thus the correct instruction depends on the types being final code is run. Although it is feasible to target native code with a dynamically-typed language, the compiler would need to output instruction sequences for each possible version of the operation that could reach the expression, as well as auxiliary instructions so the correct sequence is identified and executed.

    One advantage of static typing is that ill-typed programs can be identified before the program is deployed. However, Rice's Theorem looms large over static typing: there will necessarily be an input program in which the typing cannot be accurately determined. Consequently, approximate algorithms are used for static typing. Approximations are manifested in a variety of ways.

    Finding Type Errors

    In addition to finding type errors caused by operators being applied to operands of the wrong type, the type checker must also find type errors having to do with expressions that, because of their context must be boolean, and type errors having to do with function calls. Examples of the first kind of error include:

    • the condition of an if statement
    • the condition of a while loop
    • the termination condition part of a for loop
    and examples of the second kind of error include:
    • calling something that is not a method
    • calling a method with the wrong number of arguments
    • calling a method with arguments of the wrong types
    A Simple Typing AlgorithmL

    Some languages have comparatively complex typing algorithms, involving type inference in which the type of a variable is determined without an explicit directive from the programmer. For the purposes of this class, we will use a much simpler algorithm that relies on a simple recursive walk over the AST, leveraging information populated by name analysis (one pass that performs name analysis and type analysis is possible, but conceptually we will consider name analysis to take place first).

    Since expressions are the only program constructs that yield types, only AST nodes that are expression subclasses need to be assigned a type. Although other nodes will need to facilitate the type analysis pass, they will simply ensure that any enclosed expressions are reached by the analysis. Assigning types is slightly different based on the variety of expression:

    • Literals will have a type that can be immediately determined based on the literal's value. Some literal values, are admissable in exactly one type, and thus can be assigned to that type. For example, literal true is (obviously) boolean. A numeric literal may be assigned to the "simplest" type for which the value is a member. For example, a node for the literal 1 will be assigned integer type, rather than a floating-point type, whereas 1.0 will be a floating-point.
    • Variable Identifiers will have a type that is determined via name analysis: whichever type specified in the declaration to which the identifier is bound will be the type of the identifier.
    • Function Identifiers will have a type that is the return type of the function, which is explicit in the function signature in C-link languages. Additionally, the expression for each actual argument needs to be compatible with the matching formal argument type declared explicitly in the function signature.
    • Arithmetic Expressions will have a type that is recursively defined by the types of the operand subexpressions. For instance, to evaluate a plus expression, recursively determine the type of the left operand subexpression, then recursively determine the type of the right operand. If both subexpression types are integers, the plus is integer addition and therefore yields an integer result type. If both subexpression types are floating-point, the plus is floating-point addition and therefore yields an integer result type. If the types are different, the type promotion rules come into play. For example, in C, if the left subexpression type is int and the right subexpression is float, then the int will be promoted to float and the plus will be floating-point addition. It is also worth noting that if the types are incompatible and cannot be made compatible via promotion, then the plus represents an ill-typed operation, and yields an error type.