Writing a Static Analyser in Rust

To try out the concept of Literate Programming (using the awesome tango crate), I'm going to write a small static analyser for a basic programming language. Because it's much more interesting to analyse a programming language found in the wild than some contrived example, we're going to analyse Delphi (a Pascal variant).

Believe it or not, this entire book is the actual source code for this static analyser. Check out the repo on GitHub if you want to see more.

Here's your basic Hello World:

procedure TForm1.ShowAMessage;
begin
  ShowMessage('Hello World!');
end;

All you need to do is use the IDE to hook that procedure up so it runs whenever a button is clicked, and it'll show a "Hello World!" dialog.

Note: The API docs for this crate should be placed alongside the book. You can access them here (you'll need to use cargo doc --open if viewing locally).

First up, let's add some top-level docs and import some crates we're going to need:


# #![allow(unused_variables)]
#fn main() {
//! A parser and static analysis library for exploring Delphi code.
//!
//! This is written using a *Literate Programming* style, so you may find it
//! easier to inspect the [rendered version] instead.
//!
//! [rendered version]: https://michael-f-bryan.github.io/static-analyser-in-rust/

#![deny(missing_docs)]

#[cfg(test)]
#[macro_use]
extern crate pretty_assertions;
#[macro_use]
extern crate error_chain;
extern crate serde;
#[macro_use]
extern crate serde_derive;
#}

There are several steps you need to perform to do static analysis; the first is tokenizing (often called lexical analysis). This stage turns the characters in the raw source code into Tokens like "if", "begin", integer literals, and operators.
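As a quick preview of where this chapter is heading, here's roughly how the tokenize() function we'll build in the lex module behaves: each TokenKind comes back paired with the half-open byte range it was read from. The snippet mirrors one of the tests written later on.

use static_analyser::lex::{tokenize, TokenKind};

fn preview() {
    // "foo = 1 + 2.34" becomes five (TokenKind, start, end) triples.
    let tokens = tokenize("foo = 1 + 2.34").unwrap();

    assert_eq!(tokens[0], (TokenKind::Identifier("foo".to_string()), 0, 3));
    assert_eq!(tokens[1], (TokenKind::Equals, 4, 5));
    assert_eq!(tokens[4], (TokenKind::Decimal(2.34), 10, 14));
}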


# #![allow(unused_variables)]
#fn main() {
pub mod lex;
#}

Next we do the parsing stage (syntactic analysis) to convert our stream of tokens into an Abstract Syntax Tree. This is a representation of the program as it exists on disk.

A lot of static analysis passes can get away with working purely at the AST level. For example, if you want to make sure people don't accidentally divide by zero it's just a case of looking for all division nodes and checking that the right hand side isn't a numeric literal representing zero (e.g. 0 or 0.0).
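As a rough illustration, such a check might look something like the sketch below. The Expr type here is a hypothetical stand-in; the real AST nodes are defined in the parse module later on.

// Hypothetical AST fragment, just to illustrate the shape of the check.
enum Expr {
    Integer(usize),
    Decimal(f64),
    Divide(Box<Expr>, Box<Expr>),
    // ... other node kinds elided
}

/// Does any division in this expression tree have a literal zero on its
/// right hand side?
fn divides_by_literal_zero(expr: &Expr) -> bool {
    match *expr {
        Expr::Divide(ref lhs, ref rhs) => {
            is_literal_zero(rhs)
                || divides_by_literal_zero(lhs)
                || divides_by_literal_zero(rhs)
        }
        _ => false,
    }
}

fn is_literal_zero(expr: &Expr) -> bool {
    match *expr {
        Expr::Integer(0) => true,
        Expr::Decimal(d) => d == 0.0,
        _ => false,
    }
}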

Another useful lint which can be applied at this level is cyclomatic complexity, i.e. how "complex" a function/procedure is. This is normally just a case of walking the body of a function and counting the number of branches, loops, and try/catch blocks encountered.


# #![allow(unused_variables)]
#fn main() {
#[macro_use]
pub mod parse;
#}

The third step is type checking and generating a High-level Intermediate Representation (HIR). This step is often referred to as "lowering" (converting from a high-level representation to a lower-level one).

While the AST is very flexible and useful, it works at the language syntax level and completely misses the semantics of a language. This means an expression like 'foo' + 42 or dereferencing a float is a perfectly valid AST node.

To perform some of the more advanced analyses we'll need to have access to the full context surrounding an expression to determine if it is valid. This typically involves figuring out the type for each variable and expression, as well as resolving imports and stitching multiple unit files into one cohesive data structure.
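To give a flavour of what the lowered form might carry, here's a purely hypothetical sketch of a typed expression. None of these types exist in the crate (the lowering module is still a stub); they just show the kind of information a HIR node needs compared to an AST node.

// Hypothetical only: a lowered expression knows its resolved type, whereas
// an AST node only knows what the source text looked like.
enum Type {
    Integer,
    Real,
    PascalString,
}

struct TypedExpr {
    /// The type this expression evaluates to, worked out during lowering.
    ty: Type,
    kind: TypedExprKind,
}

enum TypedExprKind {
    Literal(String),
    /// A variable reference, fully resolved to the unit which declares it.
    Variable { unit: String, name: String },
    BinaryOp(Box<TypedExpr>, Box<TypedExpr>),
}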


# #![allow(unused_variables)]
#fn main() {
pub mod lowering;
#}

Now that we've finally resolved all imports and types, we're guaranteed to have a syntactically and semantically valid program. That doesn't mean it's correct though! At this stage we can create passes which employ the full strength of the compiler/static analyser to check the logic of our program. This is where the actual analysis passes and lints live, which is the whole point of the exercise.
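The analysis module is still a stub, but an individual pass could plausibly end up shaped like the hypothetical sketch below; the Pass trait, Program, and Diagnostic types are all invented for illustration.

// Hypothetical sketch only; no such trait exists in the crate yet.
struct Program;      // placeholder for the fully lowered program
struct Diagnostic;   // placeholder for a reported problem

trait Pass {
    /// A human-readable name for the pass, e.g. "divide-by-zero".
    fn name(&self) -> &str;

    /// Inspect the lowered program and report any problems found.
    fn run(&mut self, program: &Program) -> Vec<Diagnostic>;
}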


# #![allow(unused_variables)]
#fn main() {
pub mod analysis;
#}

We also need to handle internal errors. To keep things clean, let's put that in its own module too.


# #![allow(unused_variables)]
#fn main() {
pub mod errors;
#}

Another very important thing to have is a mapping which lets you talk about a logical chunk of code (e.g. this function body or that string literal) and retrieve the corresponding source code. This will be crucial in the later steps where we want to show the user where an error occurred.
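As a small preview of the API we'll end up with (the CodeMap and FileMap types are covered in the codemap chapter near the end of the book):

use static_analyser::codemap::CodeMap;

fn preview() {
    let mut codemap = CodeMap::new();
    let file = codemap.insert_file("hello.pas", "procedure Foo; begin end;");

    // Ask the file for a span covering bytes 0..9 ("procedure"), then use
    // the codemap to get the original text back out again.
    let span = file.insert_span(0, 9);
    assert_eq!(codemap.lookup(span), "procedure");
}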


# #![allow(unused_variables)]
#fn main() {
pub mod codemap;
#}

Finally, there's the Driver. He's in charge of the show and is usually the thing you'll want to invoke or hook into to tweak the analysis process.
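Using it will eventually look something like this (Driver is defined in the driver module and re-exported at the crate root):

use static_analyser::Driver;

fn preview() {
    let mut driver = Driver::new();

    // The driver owns the CodeMap, so this is where you'd register the
    // files you want analysed.
    let _codemap = driver.codemap();
}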


# #![allow(unused_variables)]
#fn main() {
mod driver;
pub use driver::Driver;
#}

A Note on Project Design

A lot of the time, if you need to write a parser you'll want to use some sort of parser combinator or generator library. This greatly decreases the effort and time required, but you often trade that off with poor error handling and error messages. Because we're writing a tool for analysing your code, it stands to reason that if the user passes in dodgy code, we can detect this (without crashing) and emit a useful error message. All of this means that we'll want to write the lexing and parsing stuff by hand instead of deferring to another tool.

If you are following along at home, click through to one of the sections to learn about it in more detail.

Lexical Analysis

It's always nice to add doc-comments so rustdoc knows what this module does.


# #![allow(unused_variables)]
#fn main() {
//! Module for performing lexical analysis on source code.
#}

Before anything else, let's import some things we'll require.


# #![allow(unused_variables)]
#fn main() {
use std::str;
use codemap::Span;
use errors::*;
#}

A lexer's job is to turn normal strings (which a human can read) into something more computer-friendly called a Token. In this crate, a Token is made up of a Span (more about that later) and a TokenKind which tells us what kind of token we're dealing with. A TokenKind can represent many different things, so it makes sense to use a Rust enum here.


# #![allow(unused_variables)]
#fn main() {
/// Any valid token in the Delphi programming language.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[allow(missing_docs)]
#[serde(tag = "type")]
pub enum TokenKind {
    Integer(usize),
    Decimal(f64),
    Identifier(String),
    QuotedString(String),
    Asterisk,
    At, 
    Carat, 
    CloseParen, 
    CloseSquare, 
    Colon,
    Dot, 
    End,
    Equals,
    Minus, 
    OpenParen, 
    OpenSquare, 
    Plus,
    Semicolon,
    Slash,
}
#}

We'll also want to implement some helpers to make conversion more ergonomic.


# #![allow(unused_variables)]
#fn main() {
impl From<String> for TokenKind {
    fn from(other: String) -> TokenKind {
        TokenKind::Identifier(other)
    }
}

impl<'a> From<&'a str> for TokenKind {
    fn from(other: &'a str) -> TokenKind {
        TokenKind::Identifier(other.to_string())
    }
}

impl From<usize> for TokenKind {
    fn from(other: usize) -> TokenKind {
        TokenKind::Integer(other)
    }
}

impl From<f64> for TokenKind {
    fn from(other: f64) -> TokenKind {
        TokenKind::Decimal(other)
    }
}
#}

Tokenizing Individual Atoms

To make things easy, we'll break tokenizing up into little functions which take some string slice (&str) and spit out either a token or an error.


# #![allow(unused_variables)]
#fn main() {
fn tokenize_ident(data: &str) -> Result<(TokenKind, usize)> {
    // identifiers can't start with a number
    match data.chars().next() {
        Some(ch) if ch.is_digit(10) => bail!("Identifiers can't start with a number"),
        None => bail!(ErrorKind::UnexpectedEOF),
        _ => {},
    }

    let (got, bytes_read) = take_while(data, |ch| ch == '_' || ch.is_alphanumeric())?;

    // TODO: Recognise keywords using a `match` statement here.

    let tok = TokenKind::Identifier(got.to_string());
    Ok((tok, bytes_read))
}
#}

The take_while() function is just a helper which calls a predicate on each character, continuing until the predicate no longer returns true.

It's pretty simple: you just keep track of the current index, then afterwards convert everything from the start up to that index into a &str, making sure to also return the number of bytes consumed (that will be useful later on when we deal with spans).


# #![allow(unused_variables)]
#fn main() {
/// Consumes bytes while a predicate evaluates to true.
fn take_while<F>(data: &str, mut pred: F) -> Result<(&str, usize)>  
where F: FnMut(char) -> bool
{
    let mut current_index = 0;

    for ch in data.chars() {
        let should_continue = pred(ch);

        if !should_continue {
            break;
        }

        current_index += ch.len_utf8();
    }

    if current_index == 0 {
        Err("No Matches".into())
    } else {
        Ok((&data[..current_index], current_index))
    }
}
#}
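For example, feeding it the predicate we'll use for identifiers behaves like this:

// take_while() hands back the matched prefix plus the number of bytes read.
let (matched, bytes_read) =
    take_while("foo_123 bar", |ch| ch == '_' || ch.is_alphanumeric()).unwrap();
assert_eq!(matched, "foo_123");
assert_eq!(bytes_read, 7);

// If the very first character fails the predicate we get an error instead.
assert!(take_while("!!!", |ch| ch.is_alphanumeric()).is_err());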

Now let's test our tokenizer functions! To make life easier, we'll create a helper macro which generates a test for us. We just need to pass in a test name, the function being tested, an input string, and the expected output; the macro will do the rest.


# #![allow(unused_variables)]
#fn main() {
macro_rules! lexer_test {
    (FAIL: $name:ident, $func:ident, $src:expr) => {
        #[cfg(test)]
        #[test]
        fn $name() {
            let src: &str = $src;
            let func = $func;

            let got = func(src);
            assert!(got.is_err(), "{:?} should be an error", got);
        }
    };
    ($name:ident, $func:ident, $src:expr => $should_be:expr) => {
        #[cfg(test)]
        #[test]
        fn $name() {
            let src: &str = $src;
            let should_be = TokenKind::from($should_be);
            let func = $func;

            let (got, _bytes_read) = func(src).unwrap();
            assert_eq!(got, should_be, "Input was {:?}", src);
        }
    };
}
#}

Now a test to check tokenizing identifiers becomes trivial.


# #![allow(unused_variables)]
#fn main() {
lexer_test!(tokenize_a_single_letter, tokenize_ident, "F" => "F");
lexer_test!(tokenize_an_identifer, tokenize_ident, "Foo" => "Foo");
lexer_test!(tokenize_ident_containing_an_underscore, tokenize_ident, "Foo_bar" => "Foo_bar");
lexer_test!(FAIL: tokenize_ident_cant_start_with_number, tokenize_ident, "7Foo_bar");
lexer_test!(FAIL: tokenize_ident_cant_start_with_dot, tokenize_ident, ".Foo_bar");
#}

Note that the macro calls TokenKind::from() on the expected value for us. Because we've defined From<&'a str> for TokenKind, we can use "Foo" as shorthand for the expected output.
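For reference, a call like lexer_test!(tokenize_an_identifer, tokenize_ident, "Foo" => "Foo") expands to roughly the following test:

#[cfg(test)]
#[test]
fn tokenize_an_identifer() {
    let src: &str = "Foo";
    let should_be = TokenKind::from("Foo");
    let func = tokenize_ident;

    let (got, _bytes_read) = func(src).unwrap();
    assert_eq!(got, should_be, "Input was {:?}", src);
}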

It's also fairly easy to tokenize integers; they're just a continuous string of digits. However, if we also want to deal with decimal numbers we need to accept something that may look like two integers separated by a dot. In this case the predicate needs to keep track of how many .'s it has seen, returning false the moment it sees a second one.


# #![allow(unused_variables)]
#fn main() {
/// Tokenize a numeric literal.
fn tokenize_number(data: &str) -> Result<(TokenKind, usize)> {
    let mut seen_dot = false;

    let (decimal, bytes_read) = take_while(data, |c| {
        if c.is_digit(10) {
            true
        } else if c == '.' {
            if !seen_dot {
                seen_dot = true;
                true
            } else {
                false
            }
        } else {
            false
        }
    })?;

    if seen_dot {
        let n: f64 = decimal.parse()?;
        Ok((TokenKind::Decimal(n), bytes_read))
    } else {
        let n: usize = decimal.parse()?;
        Ok((TokenKind::Integer(n), bytes_read))

    }
}
#}

Something interesting about this approach is that a literal like 12.4.789 will be lexed as the decimal 12.4, leaving the trailing .789 for the next tokenizer call to deal with.


# #![allow(unused_variables)]
#fn main() {
lexer_test!(tokenize_a_single_digit_integer, tokenize_number, "1" => 1);
lexer_test!(tokenize_a_longer_integer, tokenize_number, "1234567890" => 1234567890);
lexer_test!(tokenize_basic_decimal, tokenize_number, "12.3" => 12.3);
lexer_test!(tokenize_string_with_multiple_decimal_points, tokenize_number, "12.3.456" => 12.3);
lexer_test!(FAIL: cant_tokenize_a_string_as_a_decimal, tokenize_number, "asdfghj");
lexer_test!(tokenizing_decimal_stops_at_alpha, tokenize_number, "123.4asdfghj" => 123.4);
#}

One last utility we're going to need is the ability to skip past whitespace characters and comments. These will be implemented as two separate functions which are wrapped by a single skip().

Let's deal with whitespace first seeing as that's easiest.


# #![allow(unused_variables)]
#fn main() {
fn skip_whitespace(data: &str) -> usize {
    match take_while(data, |ch| ch.is_whitespace()) {
        Ok((_, bytes_skipped)) => bytes_skipped,
        _ => 0,
    }
}

#[test]
fn skip_past_several_whitespace_chars() {
    let src = " \t\n\r123";
    let should_be = 4;

    let num_skipped = skip_whitespace(src);
    assert_eq!(num_skipped, should_be);
}

#[test]
fn skipping_whitespace_when_first_is_a_letter_returns_zero() {
    let src = "Hello World";
    let should_be = 0;

    let num_skipped = skip_whitespace(src);
    assert_eq!(num_skipped, should_be);
}
#}

TODO: Tokenize string literals

According to the internets, a comment in Delphi can be written multiple ways.

Commenting Code

Delphi uses // for a single line comment and both {} and (**) for multiple line comments. Although you can nest different types of multiple line comments, it is recommended that you don't.

Compiler Directives - $

A special comment. Delphi compiler directives are in the form of {$DIRECTIVE}. Of interest for comments is using the $IFDEF compiler directive to remark out code.


# #![allow(unused_variables)]
#fn main() {
fn skip_comments(src: &str) -> usize {
    let pairs = [("//", "\n"), ("{", "}"), ("(*", "*)")];

    for &(pattern, matcher) in &pairs {
        if src.starts_with(pattern) {
            let leftovers = skip_until(src, matcher);
            return src.len() - leftovers.len();
        }
    }

    0
}

fn skip_until<'a>(mut src: &'a str, pattern: &str) -> &'a str {
    while !src.is_empty() && !src.starts_with(pattern) {
        let next_char_size = src.chars().next().expect("The string isn't empty").len_utf8();
        src = &src[next_char_size..];
    }

    // If we reached the end of the input without finding the pattern (e.g.
    // an unterminated comment), there's nothing left to skip past.
    if src.is_empty() {
        src
    } else {
        &src[pattern.len()..]
    }
}

macro_rules! comment_test {
    ($name:ident, $src:expr => $should_be:expr) => {
        #[cfg(test)]
        #[test]
        fn $name() {
            let got = skip_comments($src);
            assert_eq!(got, $should_be);
        }
    }
}

comment_test!(slash_slash_skips_to_end_of_line, "// foo bar { baz }\n 1234" => 19);
comment_test!(comment_skip_curly_braces, "{ baz \n 1234} hello wor\nld" => 13);
comment_test!(comment_skip_round_brackets, "(* Hello World *) asd" => 17);
comment_test!(comment_skip_ignores_alphanumeric, "123 hello world" => 0);
comment_test!(comment_skip_ignores_whitespace, "   (* *) 123 hello world" => 0);
#}

Lastly, we group the whitespace and comment skipping together seeing as they both do the job of skipping characters we don't care about.


# #![allow(unused_variables)]
#fn main() {
/// Skip past any whitespace characters or comments.
fn skip(src: &str) -> usize {
    let mut remaining = src;

    loop {
        let ws = skip_whitespace(remaining);
        remaining = &remaining[ws..];
        let comments = skip_comments(remaining);
        remaining = &remaining[comments..];

        if ws + comments == 0 {
            return src.len() - remaining.len();
        }
    }
}
#}

The Main Tokenizer Function

To tie everything together, we'll use a function which matches the next character against various patterns in turn. This is essentially just a big match statement which defers to the small tokenizer functions we've built up until now.


# #![allow(unused_variables)]
#fn main() {
/// Try to lex a single token from the input stream.
pub fn tokenize_single_token(data: &str) -> Result<(TokenKind, usize)> {
    let next = match data.chars().next() {
        Some(c) => c,
        None => bail!(ErrorKind::UnexpectedEOF),
    };

    let (tok, length) = match next {
        '.' => (TokenKind::Dot, 1),
        '=' => (TokenKind::Equals, 1),
        '+' => (TokenKind::Plus, 1),
        '-' => (TokenKind::Minus, 1),
        '*' => (TokenKind::Asterisk, 1),
        '/' => (TokenKind::Slash, 1),
        '@' => (TokenKind::At, 1),
        '^' => (TokenKind::Carat, 1),
        '(' => (TokenKind::OpenParen, 1),
        ')' => (TokenKind::CloseParen, 1),
        '[' => (TokenKind::OpenSquare, 1),
        ']' => (TokenKind::CloseSquare, 1),
        '0' ... '9' => tokenize_number(data).chain_err(|| "Couldn't tokenize a number")?,
        // identifiers may start with a letter or an underscore
        c if c == '_' || c.is_alphabetic() => tokenize_ident(data)
            .chain_err(|| "Couldn't tokenize an identifier")?,
        other => bail!(ErrorKind::UnknownCharacter(other)),
    };

    Ok((tok, length))
}
#}

Now let's test it; in theory we should get results identical to the other tests written so far.


# #![allow(unused_variables)]
#fn main() {
lexer_test!(central_tokenizer_integer, tokenize_single_token, "1234" => 1234);
lexer_test!(central_tokenizer_decimal, tokenize_single_token, "123.4" => 123.4);
lexer_test!(central_tokenizer_dot, tokenize_single_token, "." => TokenKind::Dot);
lexer_test!(central_tokenizer_plus, tokenize_single_token, "+" => TokenKind::Plus);
lexer_test!(central_tokenizer_minus, tokenize_single_token, "-" => TokenKind::Minus);
lexer_test!(central_tokenizer_asterisk, tokenize_single_token, "*" => TokenKind::Asterisk);
lexer_test!(central_tokenizer_slash, tokenize_single_token, "/" => TokenKind::Slash);
lexer_test!(central_tokenizer_at, tokenize_single_token, "@" => TokenKind::At);
lexer_test!(central_tokenizer_carat, tokenize_single_token, "^" => TokenKind::Carat);
lexer_test!(central_tokenizer_equals, tokenize_single_token, "=" => TokenKind::Equals);
lexer_test!(central_tokenizer_open_paren, tokenize_single_token, "(" => TokenKind::OpenParen);
lexer_test!(central_tokenizer_close_paren, tokenize_single_token, ")" => TokenKind::CloseParen);
lexer_test!(central_tokenizer_open_square, tokenize_single_token, "[" => TokenKind::OpenSquare);
lexer_test!(central_tokenizer_close_square, tokenize_single_token, "]" => TokenKind::CloseSquare);
#}

Tying It All Together

Now we can write the overall tokenizer function. However, because this process involves a lot of state, it'll be easier to encapsulate everything in its own type while still exposing a high-level tokenize() function to users.


# #![allow(unused_variables)]
#fn main() {
struct Tokenizer<'a> {
    current_index: usize,
    remaining_text: &'a str,
}

impl<'a> Tokenizer<'a> {
    fn new(src: &str) -> Tokenizer {
        Tokenizer {
            current_index: 0,
            remaining_text: src,
        }
    }

    fn next_token(&mut self) -> Result<Option<(TokenKind, usize, usize)>> {
        self.skip_whitespace();

        if self.remaining_text.is_empty() {
            Ok(None)
        } else {
            let start = self.current_index;
            let tok = self._next_token()
                .chain_err(|| ErrorKind::MessageWithLocation(self.current_index,
                    "Couldn't read the next token"))?;
            let end = self.current_index;
            Ok(Some((tok, start, end)))
        }
    }

    fn skip_whitespace(&mut self) {
        let skipped = skip(self.remaining_text);
        self.chomp(skipped);
    }

    fn _next_token(&mut self) -> Result<TokenKind> {
        let (tok, bytes_read) = tokenize_single_token(self.remaining_text)?;
        self.chomp(bytes_read);

        Ok(tok)
    }

    fn chomp(&mut self, num_bytes: usize) {
        self.remaining_text = &self.remaining_text[num_bytes..];
        self.current_index += num_bytes;
    }
}

/// Turn a string of valid Delphi code into a list of tokens, including the 
/// location of that token's start and end point in the original source code.
///
/// Note the token indices represent the half-open interval `[start, end)`, 
/// equivalent to `start .. end` in Rust.
pub fn tokenize(src: &str) -> Result<Vec<(TokenKind, usize, usize)>> {
    let mut tokenizer = Tokenizer::new(src);
    let mut tokens = Vec::new();

    while let Some(tok) = tokenizer.next_token()? {
        tokens.push(tok);
    }

    Ok(tokens)
}
#}

Because we also want to make sure the locations of tokens are correct, testing this will be a little more involved. We essentially need to write some (valid) Delphi code, manually inspect it, then make sure we get back exactly what we expect, byte indices and all.


# #![allow(unused_variables)]
#fn main() {
#[cfg(test)]
#[test]
fn tokenize_a_basic_expression() {
    let src = "foo = 1 + 2.34";
    let should_be = vec![
        (TokenKind::from("foo"), 0, 3),
        (TokenKind::Equals, 4, 5),
        (TokenKind::from(1), 6, 7),
        (TokenKind::Plus, 8, 9),
        (TokenKind::from(2.34), 10, 14),
    ];

    let got = tokenize(src).unwrap();
    assert_eq!(got, should_be);
}

#[cfg(test)]
#[test]
fn tokenizer_detects_invalid_stuff() {
    let src = "foo bar `%^&\\";
    let index_of_backtick = 8;

    let err = tokenize(src).unwrap_err();
    match err.kind() {
        &ErrorKind::MessageWithLocation(loc, _) => assert_eq!(loc, index_of_backtick),
        other => panic!("Unexpected error: {}", other),
    }
}
#}

You'll probably notice that we're returning a TokenKind and a pair of integers inside a tuple, which isn't overly idiomatic. Idiomatic Rust would bundle these up into a more strongly typed Token made up of a TokenKind and a Span, where the span corresponds to the start and end indices of the token.

The reason we do things slightly strangely is that we're using a CodeMap to manage all these Spans, so when the caller invokes tokenize() it's their responsibility to insert the token locations into a CodeMap. Returning a plain tuple of integers means we can defer dealing with the CodeMap until later on, vastly simplifying the tokenizing code.
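Concretely, the flow from raw source code to proper Tokens ends up looking like this (insert_file() and register_tokens() live in the codemap module):

use static_analyser::codemap::CodeMap;
use static_analyser::lex;

fn preview() {
    let mut codemap = CodeMap::new();
    let file = codemap.insert_file("unit1.pas", "foo = 1 + 2.34");

    // The lexer hands back plain (TokenKind, start, end) triples...
    let raw = lex::tokenize(file.contents()).unwrap();

    // ...and the FileMap turns them into proper Tokens with real Spans.
    let tokens = file.register_tokens(raw);
    assert_eq!(tokens.len(), 5);
}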

For completeness though, here is the Token people will be using. We haven't created any in this module, but it makes sense for its definition to be here.


# #![allow(unused_variables)]
#fn main() {
/// A valid Delphi source code token.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct Token {
    /// The token's location relative to the rest of the files being 
    /// processed.
    pub span: Span,
    /// What kind of token is this?
    pub kind: TokenKind,
}

impl Token {
    /// Create a new token out of a `Span` and something which can be turned 
    /// into a `TokenKind`.
    pub fn new<K: Into<TokenKind>>(span: Span, kind: K) -> Token {
        let kind = kind.into();
        Token { span, kind }
    }
}

impl<T> From<T> for Token 
where T: Into<TokenKind> {
    fn from(other: T) -> Token {
        Token::new(Span::dummy(), other)
    }
}
#}

And that's about it for lexical analysis. We've now got the basic building blocks of a compiler/static analyser, and are able to move on to the next step... actually making sense of all these tokens!

The Parsing Stage

The core part of the parsing stage is our Parser. This is what actually converts the tokens from a single file into an Abstract Syntax Tree which we can analyse.


# #![allow(unused_variables)]
#fn main() {
//! Parse a stream of `Tokens` into an *Abstract Syntax Tree* we can use for
//! the later steps.
#![allow(missing_docs, dead_code, unused_imports)]

#[macro_use]
mod macros;
mod parser;
pub use self::parser::Parser;
#}

The other important data structure in this module is the Abstract Syntax Tree and its various types of nodes.


# #![allow(unused_variables)]
#fn main() {
mod ast;
pub use self::ast::{Literal, LiteralKind, Ident, DottedIdent};
#}

If you are following along at home you'll probably want to keep the pages for both the Parser and the AST open in tabs so you can swap between them easily.

Macros

To make testing easier, we're going to create a tok!() macro which uses the magic of Into<Token> to intelligently create tokens.

/// Shorthand macro for generating a token from *anything* which can be 
/// converted into a `TokenKind`, or any of the `TokenKind` variants.
///
/// # Examples
///
/// ```
/// #[macro_use]
/// extern crate static_analyser;
///
/// # fn main() {
/// tok!(Dot);
/// tok!(123);
/// tok!(3.14);
/// tok!(OpenParen);
/// # }
/// ```
#[macro_export]
macro_rules! tok {
    ($thing:tt) =>  {
        {
            #[allow(unused_imports)]
            use $crate::lex::TokenKind::*;
            $crate::lex::Token::from($thing)
        }
    };
}

We also want to add some tests to make sure the code it expands to is sane. Amusingly enough, we're going to write a macro to help generate tests to test our tok!() macro. This helps sidestep the otherwise unnecessary boilerplate around creating loads of almost identical tests which deal with different types and syntactic structures.


# #![allow(unused_variables)]
#fn main() {
#[cfg(test)]
mod tests {
    use codemap::Span;
    use lex::{Token, TokenKind};

    macro_rules! token_macro_test {
        ($name:ident, $from:tt => $to:expr) => {
            #[test]
            fn $name() {
                let got: Token = tok!($from);
                let should_be = Token::new(Span::dummy(), $to);

                assert_eq!(got, should_be);
            }
        }
    }

    token_macro_test!(tok_expands_to_dot, Dot => TokenKind::Dot);
    token_macro_test!(tok_expands_to_openparen, OpenParen => TokenKind::OpenParen);
    token_macro_test!(tok_expands_to_integer, 1234 => TokenKind::Integer(1234));
    token_macro_test!(tok_expands_to_decimal, 12.34 => TokenKind::Decimal(12.34));
    token_macro_test!(tok_expands_to_identifier, "Hello" => TokenKind::Identifier("Hello".to_string()));
}
#}

When used this way, macros have a tendency to give horrible error messages. To make sure that wouldn't happen here, I tried passing in a non-existent TokenKind variant to see what would happen:

error[E0425]: cannot find value `DoesntExist` in this scope
--> src/parse/macros.rs:41:55
|
41 |     token_macro_test!(use_tok_with_nonexistent_thing, DoesntExist => TokenKind::Dot);
|                                                          ^^^^^^^^^^^ not found in this scope

I'm fairly happy with the results.

Parsing

Now that we've turned the source code into tokens we can construct a more computer-friendly representation of the program. This representation is often called an Abstract Syntax Tree because it's a high-level tree data structure which reflects a program's syntax.

The General Idea

Before we start parsing, let's look at an example chunk of Delphi code to get a feel for the language. A unit file is the basic building block of a Delphi program, analogous to a *.c file. In GUI programs the Main() function typically lives elsewhere, because an application's entry point is usually managed by the GUI framework or IDE.

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls;

type
  TForm1 = class(TForm)
    Label1: TLabel;      // The label we have added
    Button1: TButton;    // The button we have added
    procedure Button1Click(Sender: TObject);
  private
    { private declarations }
  public
    { public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

// The button action we have added
procedure TForm1.Button1Click(Sender: TObject);
begin
  Label1.Caption := 'Hello World';    // Label changed when button pressed
end;

end.

At a very high level, a unit file consists of a unit statement declaring the unit's name, followed by the interface section (kinda like a *.h file), then an implementation section, before ending with an end..

There's a formal notation for expressing a language's grammar called Backus–Naur form. The previous paragraph would translate to something like the following:

file        = unit_decl interface implementation "end.";
unit_decl   = "unit" unit_name SEMICOLON;
unit_name   = WORD;
interface   = "interface" uses types vars;
uses        = "uses" WORD ("," WORD)* SEMICOLON
             | ε;
types       = "type" type_decl*;
vars        = "var" var_decl*;

With the terminals (WORD, SEMICOLON, and friends) being their usual selves.

Delphi has a pretty simple syntax, so we're going to use a standard recursive descent parser. This is just an object which has a method roughly corresponding to each rule in the language's grammar.
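To make that correspondence concrete, a rule like unit_decl = "unit" unit_name SEMICOLON would turn into a method shaped roughly like the sketch below. The expect_keyword() and expect() helpers (and the UnitDecl node) are hypothetical; the real Parser only gets the peek()/next() primitives shown in the next section, but the overall shape is the same.

// Hypothetical sketch of a recursive descent rule; these helpers and the
// UnitDecl node don't exist in the real crate.
impl Parser {
    fn parse_unit_decl(&mut self) -> Result<UnitDecl> {
        self.expect_keyword("unit")?;         // "unit"
        let name = self.parse_ident()?;       // unit_name
        self.expect(TokenKind::Semicolon)?;   // SEMICOLON

        Ok(UnitDecl { name })
    }
}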

The Parser Object

As usual, before we can do anything else we're going to have to import a couple of dependencies.


# #![allow(unused_variables)]
#fn main() {
use std::rc::Rc;

use lex::{Token, TokenKind};
use codemap::{Span, FileMap};
use parse::ast::{Literal, LiteralKind, Ident, DottedIdent};
use errors::*;
#}

The Parser itself just contains the tokens and their corresponding FileMap.


# #![allow(unused_variables)]
#fn main() {
/// A parser for turning a stream of tokens into a Abstract Syntax Tree.
#[derive(Debug)]
pub struct Parser {
  tokens: Vec<Token>,
  filemap: Rc<FileMap>,
  current_index: usize,
}

impl Parser {
  /// Create a new parser.
  pub fn new(tokens: Vec<Token>, filemap: Rc<FileMap>) -> Parser {
    let current_index = 0;
    Parser { tokens, filemap, current_index }
  }

  /// Peek at the current token.
  fn peek(&self) -> Option<&TokenKind> {
    self.tokens.get(self.current_index).map(|t| &t.kind)
  }

  /// Get the current token, moving the index along one.
  fn next(&mut self) -> Option<&Token> {
    let tok = self.tokens.get(self.current_index);

    if tok.is_some() {
      self.current_index += 1;
    }

    tok
  }
}
#}

We'll implement the various grammar rules from the bottom up, meaning we'll start with the very basics like expressions, then build things up until we get to the overall program.

First up, let's have a go at parsing Literals. We do it in two steps: first you peek at the next token to make sure it's a kind you expect, then you unpack the token and convert it into its equivalent AST node. A lot of the pattern matching boilerplate can be minimised with judicious use of macros.


# #![allow(unused_variables)]
#fn main() {
impl Parser {
  fn parse_literal(&mut self) -> Result<Literal> {
    match self.peek() {
      Some(&TokenKind::Integer(_)) | 
      Some(&TokenKind::Decimal(_)) | 
      Some(&TokenKind::QuotedString(_)) => {},
      Some(_) => bail!("Expected a literal"),
      None => bail!(ErrorKind::UnexpectedEOF),
    };

    let next = self.next().expect("unreachable");
    let lit_kind = match next.kind {
      TokenKind::Integer(i) => LiteralKind::Integer(i),
      TokenKind::Decimal(d) => LiteralKind::Decimal(d),
      TokenKind::QuotedString(ref s) => LiteralKind::String(s.clone()),
      ref other => panic!("Unreachable token kind: {:?}", other),
    };

    Ok(Literal {
      span: next.span,
      kind: lit_kind
    })
  }
}
#}

As with the tokenizing module, we're going to need lots of tests to check that our parser recognises things as we expect. Unfortunately the types and syntactic structures involved are slightly different, so we'll use macros to abstract away a lot of the boilerplate.


# #![allow(unused_variables)]
#fn main() {
macro_rules! parser_test {
  ($name:ident, $method:ident, $src:expr => $should_be:expr) =>  {
    #[cfg(test)]
    #[test]
    fn $name() {
      // create a codemap and tokenize our input string
      let mut codemap = $crate::codemap::CodeMap::new();
      let filemap = codemap.insert_file("dummy.pas", $src);
      let tokenized = $crate::lex::tokenize(filemap.contents())
        .chain_err(|| "Tokenizing failed")
        .unwrap();
      let tokens = filemap.register_tokens(tokenized);

      let should_be = $should_be;

      let mut parser = Parser::new(tokens, filemap);
      let got = parser.$method().unwrap();

      assert_eq!(got, should_be);
    }
  }
}
#}

Now that we have our basic test harness set up, let's see if literal parsing works.


# #![allow(unused_variables)]
#fn main() {
parser_test!(integer_literal, parse_literal, "123" => LiteralKind::Integer(123));
parser_test!(parse_float_literal, parse_literal, "12.3" => LiteralKind::Decimal(12.3));
// TODO: re-enable this when string parsing is implemented
// parser_test!(parse_string_literal, parse_literal, "'12.3'" => LiteralKind::String("12.3".to_string()));
#}

Another easy thing to implement is parsing identifiers and dotted identifiers (e.g. foo.bar.baz). To recognise a dotted identifier you first look for one identifier, then keep taking pairs of dots and idents until there are no more.


# #![allow(unused_variables)]
#fn main() {
impl Parser {
  fn parse_ident(&mut self) -> Result<Ident> {
    match self.peek() {
      Some(&TokenKind::Identifier(_)) => {},
      _ => bail!("Expected an identifier"),
    }

    let next = self.next().unwrap();

    if let TokenKind::Identifier(ref ident) = next.kind {
      Ok(Ident {
        span: next.span,
        name: ident.clone(),
      })
    } else {
      unreachable!()
    }
  }

  fn parse_dotted_ident(&mut self) -> Result<DottedIdent> {
    let first = self.parse_ident()?;
    let mut parts = vec![first];

    while self.peek() == Some(&TokenKind::Dot) {
      let _ = self.next();
      let next = self.parse_ident()?;
      parts.push(next);
    }

    // the span for a dotted ident should be the union of the spans for
    // each of its components.
    let span = parts.iter()
                    .skip(1)
                    .fold(parts[0].span, |l, r| self.filemap.merge(l, r.span));

    Ok(DottedIdent { span, parts })    
  }
}
#}

Using our parser_test!() macro makes these a piece of cake to test.


# #![allow(unused_variables)]
#fn main() {
parser_test!(parse_a_basic_ident, parse_ident, "foo" => "foo");
parser_test!(parse_a_dotted_ident, parse_dotted_ident, "foo.bar.baz" => ["foo", "bar", "baz"]);
parser_test!(parse_a_single_ident_as_dotted, parse_dotted_ident, "foo" => ["foo"]);
#}

The Abstract Syntax Tree


# #![allow(unused_variables)]
#![allow(missing_docs)]
#fn main() {
use codemap::Span;
#}

The most basic element of the AST is a Literal. These are either integer, float, or string literals.


# #![allow(unused_variables)]
#fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum LiteralKind {
    Integer(usize),
    Decimal(f64),
    String(String),
}

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct Literal {
    pub span: Span,
    pub kind: LiteralKind,
}
#}

We also want to add a couple From impls so creating a LiteralKind is easy to do.


# #![allow(unused_variables)]
#fn main() {
impl From<usize> for LiteralKind {
    fn from(other: usize) -> LiteralKind {
        LiteralKind::Integer(other)
    }
}

impl From<f64> for LiteralKind {
    fn from(other: f64) -> LiteralKind {
        LiteralKind::Decimal(other)
    }
}

impl PartialEq<LiteralKind> for Literal {
    fn eq(&self, other: &LiteralKind) -> bool {
        &self.kind == other
    }
}
#}

We also want to deal with identifiers and dot-separated identifiers.


# #![allow(unused_variables)]
#fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct Ident {
    pub span: Span,
    pub name: String,
}

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct DottedIdent {
    pub span: Span,
    pub parts: Vec<Ident>,
}

impl<'a> PartialEq<&'a str> for Ident {
    fn eq(&self, other: &&str) -> bool {
        &self.name == other
    }
}

impl<'a, T: AsRef<[&'a str]>> PartialEq<T> for DottedIdent {
    fn eq(&self, other: &T) -> bool {
        self.parts.iter()
            .zip(other.as_ref().iter())
            .all(|(l, r)| l == r)
    }
}
#}

Type Checking and Lowering to HIR


# #![allow(unused_variables)]
#fn main() {
//! Typechecking and basic control flow graph generation.
#}

TODO: Implement some basic type-checking and do "lowering" from the AST to a HIR.

Static Analysis


# #![allow(unused_variables)]
#fn main() {
//! Static analysis passes (i.e. the whole point).
#}

TODO: Create a trait for lints and passes.

The Driver


# #![allow(unused_variables)]

#fn main() {
use codemap::{CodeMap, FileMap, Span};
#}

The Driver contains a CodeMap and various other configuration settings required to run the analysis.


# #![allow(unused_variables)]
#fn main() {
/// The driver is in charge of orchestrating the whole analysis process and 
/// making sure all the bits and pieces integrate nicely.
#[derive(Debug)]
pub struct Driver {
    codemap: CodeMap,
}
#}

He has various methods to allow users to add files to be analysed, as well as other convenience methods for setting things up.


# #![allow(unused_variables)]
#fn main() {
impl Driver {
    /// Create a new driver.
    pub fn new() -> Driver {
        Driver {
            codemap: CodeMap::new(),
        }
    }

    /// Get access to the driver's `CodeMap`.
    pub fn codemap(&mut self) -> &mut CodeMap {
        &mut self.codemap
    }
}

impl Default for Driver {
    fn default() -> Driver {
        Driver::new()
    }
}
#}

Error Handling

This is just some code to hook error-chain up so we can use it for internal errors. Feel free to skip past this.


# #![allow(unused_variables)]
#fn main() {
//! Types and traits used for internal errors.

error_chain!{
    errors {
        /// Got to the end of the input stream but was expecting more.
        UnexpectedEOF {
            display("Unexpected EOF")
            description("Unexpected EOF")
        }

        /// Reached an unknown character while lexing.
        UnknownCharacter(ch: char) {
            display("Unknown Character, {:?}", ch)
            description("Unknown Character")
        }

        /// A message which corresponds to some location in the source code.
        MessageWithLocation(loc: usize, msg: &'static str) {
            display("{} at {}", msg, loc)
            description("Custom Error")
        }
    }

    foreign_links {
        Io(::std::io::Error) #[doc = "Wrapper around a `std::io::Error`"];
        Utf8(::std::str::Utf8Error) #[doc = "An error parsing data as UTF-8"];
        FloatParsing(::std::num::ParseFloatError) #[doc = "A float parsing error"];
        IntParsing(::std::num::ParseIntError) #[doc = "An integer parsing error"];
        
    }
}
#}

The CodeMap

A CodeMap gives you a central mapping from spans to their location in the group of files being analysed.

As usual, let's add in a couple of imports and some module-level documentation.


# #![allow(unused_variables)]
#fn main() {
//! A mapping from arbitrary locations and sections of source code to their
//! contents.

use std::collections::HashMap;
use std::ops::Range;
use std::rc::Rc;
use std::cmp;
use std::cell::RefCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use lex::{Token, TokenKind};
#}

We start off with a Span. This is really just a wrapper around an integer, with the assumption that a span always corresponds to something in the CodeMap it came from. This means that using a span from one CodeMap with another will panic if you're lucky, or silently give you garbage if you're not.


# #![allow(unused_variables)]
#fn main() {
/// A unique identifier pointing to a substring in some file.
///
/// To get back the original string this points to you'll need to look it up
/// in a `CodeMap` or `FileMap`. 
#[derive(Copy, Clone, Debug, PartialEq, Hash, Eq, Serialize, Deserialize)]
pub struct Span(usize);

impl Span {
    /// Returns the special "dummy" span, which matches anything. This should
    /// only be used internally to make testing easier.
    pub(crate) fn dummy() -> Span {
        Span(0)
    }
}
#}

For our purposes, the CodeMap will just contain a list of FileMaps. These keep track of their name, contents, and the mapping of spans to locations in that content.


# #![allow(unused_variables)]
#fn main() {
/// A mapping of `Span`s to the files in which they are located.
#[derive(Debug)]
pub struct CodeMap {
    next_id: Rc<AtomicUsize>,
    files: Vec<Rc<FileMap>>,
}

/// A mapping which keeps track of a file's contents and allows you to cheaply
/// access substrings of the original content.
#[derive(Clone, Debug)]
pub struct FileMap {
    name: String,
    contents: String,
    next_id: Rc<AtomicUsize>,
    items: RefCell<HashMap<Span, Range<usize>>>
}
#}

The CodeMap has a couple of useful methods for adding new files and looking up the string corresponding to a span.


# #![allow(unused_variables)]
#fn main() {
impl CodeMap {
    /// Create a new, empty `CodeMap`.
    pub fn new() -> CodeMap {
        let next_id = Rc::new(AtomicUsize::new(1));
        let files = Vec::new();
        CodeMap { next_id, files }
    }

    /// Add a new file to the `CodeMap` and get back a reference to it.
    pub fn insert_file<C, F>(&mut self, filename: F, contents: C) -> Rc<FileMap> 
    where F: Into<String>,
          C: Into<String>,
    {
        let filemap = FileMap {
            name: filename.into(),
            contents: contents.into(),
            items: RefCell::new(HashMap::new()),
            next_id: Rc::clone(&self.next_id),
        };
        let fm = Rc::new(filemap);
        self.files.push(Rc::clone(&fm));

        fm
    }

    /// Get the substring that this `Span` corresponds to.
    pub fn lookup(&self, span: Span) -> &str {
        for filemap in &self.files {
            if let Some(substr) = filemap.lookup(span) {
                return substr;
            }
        }

        panic!("Tried to lookup {:?} but it wasn't in any \
            of the FileMaps... This is a bug!", span)
    }

    /// The files that this `CodeMap` contains.
    pub fn files(&self) -> &[Rc<FileMap>] {
        self.files.as_slice()
    }
}

impl Default for CodeMap {
    fn default() -> CodeMap {
        CodeMap::new()
    }
}
#}

You may have noticed that FileMap contains a RefCell<HashMap<_>>. This is because we want to pass around multiple pointers to a file mapping, yet still be able to add new spans if we want to. It also contains a reference to the parent CodeMap's counter so when we insert new spans into the FileMap they'll still get globally unique IDs.


# #![allow(unused_variables)]
#fn main() {
impl FileMap {
    /// Get the name of this `FileMap`.
    pub fn filename(&self) -> &str {
        &self.name
    }

    /// Get the entire content of this file.
    pub fn contents(&self) -> &str {
        &self.contents
    }

    /// Lookup a span in this `FileMap`.
    ///
    /// # Panics
    ///
    /// If the `FileMap`'s `items` hashmap contains a span, but that span 
    /// **doesn't** point to a valid substring this will panic. If you ever
    /// get into a situation like this then things are almost certainly FUBAR.
    pub fn lookup(&self, span: Span) -> Option<&str> {
        let range = match self.range_of(span) {
            Some(r) => r,
            None => return None,
        };

        match self.contents.get(range.clone()) {
            Some(substr) => Some(substr),
            None => panic!("FileMap thinks it contains {:?}, \
                but the range ({:?}) doesn't point to anything valid!", span, range),
        }
    }

    /// Get the range corresponding to this span.
    pub fn range_of(&self, span: Span) -> Option<Range<usize>> {
        self.items.borrow().get(&span).cloned() 
    }
}
#}

Users can freely add new spans to a FileMap. To do this we take the start and end indices, create a new span ID by incrementing our counter, then insert the new span and range into items. In debug builds we do bounds checks, but otherwise it's assumed that the start and end indices are both within bounds and lie on valid codepoint boundaries.


# #![allow(unused_variables)]
#fn main() {
impl FileMap {
    /// Ask the `FileMap` to give you the span corresponding to the half-open
    /// interval `[start, end)`.
    ///
    /// # Panics
    ///
    /// In debug mode, this will panic if either `start` or `end` are outside
    /// the source code or if they don't lie on a codepoint boundary.
    ///
    /// It is assumed that the `start` and `end` indices were originally obtained
    /// from the file's contents.
    pub fn insert_span(&self, start: usize, end: usize) -> Span {
        debug_assert!(self.contents.is_char_boundary(start), 
            "Start doesn't lie on a char boundary");
        debug_assert!(self.contents.is_char_boundary(end), 
            "End doesn't lie on a char boundary");
        debug_assert!(start < self.contents.len(), 
            "Start lies outside the content string");
        debug_assert!(end <= self.contents.len(), 
            "End lies outside the content string");

        let range = start..end;

        if let Some(existing) = self.reverse_lookup(&range) {
            return existing;
        }

        let span_id = self.next_id.fetch_add(1, Ordering::Relaxed);
        let span = Span(span_id);

        self.items.borrow_mut().insert(span, range);
        span
    }

    /// We don't want to go and add duplicate spans unnecessarily so we 
    /// iterate through all existing ranges to see if this one already
    /// exists. 
    fn reverse_lookup(&self, needle: &Range<usize>) -> Option<Span> {
        self.items.borrow()
            .iter()
            .find(|&(_, range)| range == needle)
            .map(|(span, _)| span)
            .cloned()
    }

    /// Merge two spans to get the span which includes both.
    ///
    /// As usual, the constraints from `insert_span()` also apply here. If
    /// you try to enter two spans from different `FileMap`s, it'll panic.
    pub fn merge(&self, first: Span, second: Span) -> Span {
        let range_1 = self.range_of(first).expect("Can only merge spans from the same FileMap");
        let range_2 = self.range_of(second).expect("Can only merge spans from the same FileMap");

        let start = cmp::min(range_1.start, range_2.start);
        let end = cmp::max(range_1.end, range_2.end);

        self.insert_span(start, end)
    }
}
#}

To help out after the tokenizing step, let's add a method which takes a bunch of tokens and registers them with a FileMap. The same caveats as insert_span() apply here.


# #![allow(unused_variables)]
#fn main() {
impl FileMap {
    /// Register a set of tokenized inputs and turn them into a proper stream
    /// of tokens. Note that all the caveats from `insert_span()` also apply 
    /// here.
    pub fn register_tokens(&self, tokens: Vec<(TokenKind, usize, usize)>) -> Vec<Token> {
        let mut registered = Vec::new();

        for (kind, start, end) in tokens {
            let span = self.insert_span(start, end);
            let token = Token::new(span, kind);
            registered.push(token);
        }

        registered
    }
}
#}

To test that our CodeMap and FileMap behave as we expect them to, let's create some dummy "files" and try to create spans in them.


# #![allow(unused_variables)]
#fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn insert_a_file_into_a_codemap() {
        let mut map = CodeMap::new();
        let filename = "foo.rs";
        let content = "Hello World!";

        assert_eq!(map.files.len(), 0);
        let fm = map.insert_file(filename, content);

        assert_eq!(fm.filename(), filename);
        assert_eq!(fm.contents(), content);
        assert_eq!(map.files.len(), 1);
    }

    #[test]
    fn get_span_for_substring() {
        let mut map = CodeMap::new();
        let src = "Hello World!";
        let fm = map.insert_file("foo.rs", src);

        let start = 2;
        let end = 5;
        let should_be = &src[start..end];

        let span = fm.insert_span(start, end);
        let got = fm.lookup(span).unwrap();
        assert_eq!(got, should_be);
        assert_eq!(fm.range_of(span).unwrap(), start..end);

        let got_from_codemap = map.lookup(span);
        assert_eq!(got_from_codemap, should_be);
    }

    #[test]
    fn spans_for_different_ranges_are_always_unique() {
        let mut map = CodeMap::new();
        let src = "Hello World!";
        let fm = map.insert_file("foo.rs", src);

        let mut spans = Vec::new();

        for start in 0..src.len() {
            for end in start..src.len() {
                let span = fm.insert_span(start, end);
                assert!(!spans.contains(&span), 
                    "{:?} already contains {:?} ({}..{})", 
                    spans, span, start, end);
                assert!(span != Span::dummy());

                spans.push(span);
            }
        }
    }

    #[test]
    fn spans_for_identical_ranges_are_identical() {
        let mut map = CodeMap::new();
        let src = "Hello World!";
        let fm = map.insert_file("foo.rs", src);

        let start = 0;
        let end = 5;

        let span_1 = fm.insert_span(start, end);
        let span_2 = fm.insert_span(start, end);

        assert_eq!(span_1, span_2);
    }

    #[test]
    fn join_multiple_spans() {
        let mut map = CodeMap::new();
        let src = "Hello World!";
        let fm = map.insert_file("foo.rs", src);

        let span_1 = fm.insert_span(0, 2);
        let span_2 = fm.insert_span(3, 8);

        let joined = fm.merge(span_1, span_2);
        let equivalent_range = fm.range_of(joined).unwrap();

        assert_eq!(equivalent_range.start, 0);
        assert_eq!(equivalent_range.end, 8);
    }
}
#}