EECS665 – Compiler Construction – Drew Davidson $\newcommand{\bigsqcap}{\mathop ⨅}$

    Flex


    Contents

    Overview

    Flex is a scanner generator that produces C++ code. Here's a picture illustrating how to create and run a program using Flex:

    image/svg+xml g++ -c Flex g++ -c g++ -o prog FlexScanner.o main.o Scanner Source Code (.cpp file) main.cpp (.cpp file) Scanner Specification (.lex file) Character Stream Tokenized Output ./prog

    The input to Flex is a specification that includes a set of regular expressions and associated actions. By default, the output of Flex is a C source program that runs a scanner. Using the options we will be using, Flex will output a C++ source file that defines a class encapsulating the scanner. In case of the C program, there is a function yylex. In case of the C++ program, the yylex function will be a member of the scanner class. yylex includes a constructor that allows for an input stream and/or output stream (of class istream and ostream, respectively) It also includes a method called yylex, which returns the next token in the input.

    The picture above assumes that a class named P2 has been defined that contains the core program of interest. That program will declare an object of (abstract) type FlexLexer, and initialize it to the concrete subclass yyFlexLexer. The program then repeatedly calls the yylex function to get tokens until the FlexLexer runs out of tokens.

    Format of a Flex Specification

    A Flex specification has three parts, separated by double percent signs:

    1. User code: this code is plopped into the output code and is appropriate for code definitions used in the actions.
    2. Flex directives: This includes macro definitions (described below). See the Flex Reference Manual for more information about this part of the specification.
    3. Regular expression rules: These rules specify how to divide up the input into tokens. Each rule includes an optional state list, a regular expression, and an associated action.
    We will discuss the regular expression rules part first.

    Regular Expressions Rules

    The state-list part of a rule is discussed below. Ignoring state-lists for now, the form of a regular expression rule is:

    image/svg+xml regular expression { action } The pattern to code to execute when the pattern is matched be matched

    When the scanner's yylex method is called, it repeats:

    1. Find the longest sequence of characters in the input (starting with the current character) that matches a regular-expression pattern.
    2. Perform the associated action.

    until an action causes the yylex method to return. If there are several patterns that match the same (longest) sequence of characters, then the first such pattern is considered to be matched (so the order of the regular-expression rules can be important).

    If an input character is not matched in any pattern, the scanner throws an exception. It is not good to have a scanner that can "crash" on bad input, so it is important to make sure that there can be no such unmatched characters!

    The regular expressions are similar to the ones discussed in the scanner notes. Here's how they are used to match the input:

    • Most characters match themselves. For example:
      • abc
      • ==
      • while
      are three patterns that match exactly those sequences of characters (note that writing one character after another means "followed by" as usual).

    • Characters (even special characters, except backslash) enclosed in double quotes match themselves. For example, the following patterns are equivalent to the three given above:
      • "abc"
      • "=="
      • "while"
      And the following pattern:

      "a|b"

      matches the three-character sequence: a then | then b, rather than matching a single a or a single b.

    • The following characters have the usual special meanings as regular expression operators:
      | means "or"
      * means zero or more instances of
      + means one or more instances of
      ? means zero or one instance of
      ( ) are used for grouping

    • The dot character matches any character except the newline character. It is usually used in the last rule in the specification, to match all "bad" characters (and the associated action issues an error message).

    • The backslash is a special escape character:
      \nnewline
      \ttab
      \"double quote
      To match a backslash character, put two backslashes in a character class (see below). See the Flex Reference Manual for a complete list of the special characters escaped by a backslash.

    • The carat and dollar-sign characters: ^ and $, are special characters. When the carat is used as the first character in a pattern, it causes the pattern to match only at the beginning of a line (i.e., only if the previous character was a newline). When the dollar sign is used as the last character in a pattern, it causes the pattern to match only at the end of a line (i.e., only if the next character is a newline).

    • The regular expression can include character classes, delimited by square brackets:
      • A character class will match one character.
      • If no special characters are used inside the character class, then the character class matches any of the characters it includes inside its square brackets. For example: [abc] matches an a, or a b, or a c, so it is the same as: a|b|c.
      • Here are the characters that are "special" inside a character class:
        - means a range of characters; e.g., a-z means "a to z".
        ^ is only a special character if it is the first character in the square brackets; it means not any of the following characters. So for example, [^abc] matches any character other than an a, or a b, or a c.
        \ is used as an escape character with n, b, ", etc as usual; it can also be used to escape the characters that are special inside a character class (e.g., [a\-z] matches an a or a - or a z, and [\\] matches a backslash.
        " can be used around characters that are special inside a character class to make them match themselves (e.g., ["\"] matches a backslash, and ["-"] matches a hyphen. To include a double-quote character in a character class, escape it with a backslash.

    Note that whitespace only matches itself if it is inside quotes or in a character class; otherwise, it ends the current pattern. So the two rules:

    [a bc] {}

    and

    a|" "|b|c {}

    are equivalent; each matches an a, or a space, or a b, or a c, while the rule:

    a bc {}

    causes an error when you try to process your specification.


    You Try #1

    Question 1: The character class [a-zA-Z] matches any letter. Write a character class that matches any letter or any digit.

    Question 2: Write a pattern that matches any Pascal identifier (a sequence of one or more letters and/or digits, starting with a letter).

    Question 3: Write a pattern that matches any C identifier (a sequence of one or more letters and/or digits and/or underscores, starting with a letter or underscore).

    Question 4: Write a pattern that matches any C identifier that does not end with an underscore.

    solution


    Flex directives

    Recall that the second part of a Flex specification contains directives. This can include specifying the value that should be returned on end-of-file, specifying that line counting should be turned on, and specifying that the scanner will be used with the Bison parser generator. (See the Flex Reference Manual for more information about directives.)

    The directives part also includes macro definitions. The form of a macro definition is:

    name regular-expression

    where name is any valid C identifier, and regular-expression is any regular expression as defined above.

    Here are some examples:

    DIGIT [0-9]
    LETTER [a-zA-Z]
    WHITESPACE [ \t\n]

    Once a macro has been defined, it can be used in a regular expression (either to define another macro, or in the "Regular Expression Rules" part of the Flex specification. To use a macro, just use its name inside curly braces. For example, given the above macro definitions, the following pattern could be used to match Pascal identifiers:

    {LETTER}({LETTER}|{DIGIT})*

    You Try #2

    Define a macro named NOTSPECIAL that matches any character except a newline, double quote, or backslash.

    solution


    Comments

    You can include comments with no special actions in the first and third parts of your Flex specification. In the second part, you should prefix your comment with a space character, or else Flex will think your comment is a regex pattern. think they are part of a pattern). Flex comments are like C++ multiline comments: they start with /* and end with */

    States

    Recall that each regular expression rule (a pattern and the action to be performed when the pattern is matched) can optionally include a list of states at the beginning of the pattern. For example:

    <STATE1, STATE2>"abc" { }

    is a rule that starts with a list of two states (named STATE1 and STATE2).

    Each time the scanner is called, it is in some state. Initially, it is in a special state called INITIAL. It will stay in that state unless it matches a pattern whose corresponding action includes code that causes it to change to another state. For example, given the rule:

    "xyz" { BEGIN(STATE1); }

    if the input contains "xyz", then the call to BEGIN will be executed, and the scanner will enter the STATE1 state.

    If a rule has no list of states, then it will be matched in any state; however, if it has a list of states, then it will be matched only when the scanner is in one of those states. So for example, the rule for "abc" given above will only be matched after the rule for "xyz" has been matched.

    Every state other than INITIAL must be declared in the Flex directives part of the Flex specification. Here's an example declaration:

    %x STATE1

    Suppose that for floating-point numbers you want your scanner to return two values: the value before the decimal point, and the value after the decimal point. Here's an example of using a Flex state to do that (using some pseudo-code):

    DIGIT [0-9]
    DOT   "."
    
    %x DOTSTATE
    
    %%
    
    <INITIAL>{DIGIT}+{DOT}  { BEGIN(DOTSTATE);
                                  -- save the value so far --
                            }
    
    <DOTSTATE>{DIGIT}+      { BEGIN(INITIAL);
                                  -- return the saved value and the new one --
                            }

    You Try #3

    A quoted string consists of three parts:

    1. A double quote.
    2. Some text.
    3. A double quote.
    The text can contain any characters except a newline or a single double-quote character. It can contain an "escaped" quote, which is two double-quote characters in a row.

    Use Flex states to write a specification to recognize quoted strings, and to return the number of escaped quotes in each such string. To declare a counter, declare a class with a static, public int field, in the "User Code" part of the Flex specification, and update/return that static field.

    solution


    yyline and yytext

    If you turn line counting on (by including yylineno in the "directives" part of the specification), you can use the variable yylineno in the actions that you write for the regular expressions. The value of yyline will be the current line number in the input file, counting from zero (so to use that number in error messages printed by your scanner, you will need to add one to yylineno).

    You can also use the yytext global in your actions. This global contains the sequence of characters that was just matched.

    A Small Example

    Here is a small (but complete!) Flex specification:

    DIGIT		[0-9]
    LETTER		[a-zA-Z]
    WHITESPACE	[ \t\n]
    
    %option c++
    
    %%
    
    {LETTER}({LETTER}|{DIGIT})* { std::cout << "ID " << yytext;}
    {DIGIT}+                    { std::cout << "INT";}
    "="                         { std::cout << "ASSIGN";}
    "=="                        { std::cout << "EQUALS";}
    {WHITESPACE}*               { }
    .               	    { std::cout << "bad char";}
    
    %%
    
    int main(){
    	FlexLexer * lexer = new yyFlexLexer();
    	lexer->yylex();
    }
    
    

    Note that the actions in this example are not what you would really put in a Flex specification for a scanner. Instead of printing, the first four actions should return the appropriate tokens.

    Quick Reference Guide

    Operators and Special Symbols in Flex

    The following table summarizes the operators and special symbols used in Flex. Note that some characters have an entirely different meaning when used in a regular expression and in a character class. Character classes are always delimited by square brackets; they can be used in the regular expressions that define macros, as well as in the regular expressions used to specify a pattern to be matched in the input.

    Symbol Meaning in Regular Expressions Meaning in Character Classes
    ( Matches with ) to group sub-expressions. Represents itself.
    ) Matches with ( to group sub-expressions. Represents itself.
    [ Begins a character class. Represents itself.
    ] Is illegal. Ends a character class.
    { Matches with } to delimit a macro name. Matches with } to delimit a macro name.
    } Matches with { to delimit a macro name. Represents itself or matches with { to delimit a macro name.
    " Matches with " to delimit strings (only \ is special within strings). Matches with " to delimit a string of characters that belong to the character class.  Only \" is special within the string.
    \ Escapes special characters (n, t, etc). Also used to specify a character by its octal, hexadecimal, or unicode value. Escapes characters that are special inside a character class.
    . Matches any one character except newline. Represents itself.
    | Alternation (or) operator. Represents itself.
    * Kleene closure operator (zero or more matches). Represents itself.
    + Positive closure operator (one or more matches). Represents itself.
    ? Optional choice operator (zero or one matches). Represents itself.
    ^ Matches only at beginning of a line. When it is the first character in the character class, complements the remaining characters in the class.
    $ Matches only at end of a line. Represents itself.
    - Represents itself. Range of characters operator.