LALRPOP

LALRPOP is a parser generator, similar in principle to YACC, ANTLR, Menhir, and other such programs. In general, it has the grand ambition of being the most usable parser generator ever. This ambition is most certainly not fully realized: right now, it's fairly standard, maybe even a bit subpar in some areas. But hey, it's young. For the most part, this README is intended to describe the current behavior of LALRPOP, but in some places it includes notes for planned future changes.

Crash course on parsers

If you've never worked with a parser generator before, or aren't really familiar with context-free grammars, this section is just a very brief introduction into the basic idea. Basically a grammar is a nice way of writing out what kinds of inputs are legal. In our example, we want to support parenthesized numbers, so things like 123, (123), etc. We can express this with a simple grammar like:

Term = Num | "(" Term ")"

Here we say we are trying to parse a term, and a term can either be a number (Num) or some other term enclosing in parentheses (here I did not define what a number is, but in the real LALRPOP example we'll do that with a regular expression). Now imagine a potential input like ((123)). We can show how this would be parsed by writing out something called a "parse tree":

(  (  1  2  3  )  )
|  |  |     |  |  |
|  |  +-Num-+  |  |
|  |     |     |  |
|  |   Term    |  |
|  |     |     |  |
|  +---Term----+  |
|        |        |
+------Term-------+

Here you can see that we parsed ((123)) by finding a Num in the middle, calling that Num a Term, and matching up the parentheses to form two more terms on top of that.

Note that this parse tree is not a data structure but more a visualization of the parse. I mean, you can build up a parse tree as a data structure, but typically you don't want to: it is more detailed than you need. For example, you may not be that interested in the no-op conversion from a Num to a Term. The other weird thing about a parse tree is that it is intimately tied to your grammar, but often you have some existing data structures you would like to parse into -- so if you built up a parse tree, you'd then have to convert from the parse tree into those data structures, and that might be annoying.

Therefore, what a parser generator usually does, is instead let you choose how to represent each node in the parse tree, and how to do the conversions. You give each nonterminal a type, which can be any Rust type, and you write code that will execute each time a new node in the parse tree would have been constructed. In fact, in the examples that follow, we'll eventually build up something like a parse tree, but in the beginning, we won't do that at all. Instead, we'll represent each number and term as an i32, and we'll propagate this value around.

To make this a bit more concrete, here's a version of the grammar above written in LALRPOP notation (we'll revisit this again in more detail of course). You can see that the Term nonterminal has been given the type i32, and that each of the definitions has some code that follows a => symbol. This is the code that will execute to convert from the thing that was matched (like a number, or a parenthesized term) into an i32:

Term: i32 = {
    Num => /* ... number code ... */,
    "(" Term ")" => /* ... parenthesized code ... */,
};

OK, that's enough background, let's do this for real!

For getting started with LALRPOP, it's probably best if you read the tutorial, which will introduce you to the syntax of LALRPOP files and so forth.

But if you've done this before, or you're just the impatient sort, here is a quick 'cheat sheet' for setting up your project. First, add the following lines to your Cargo.toml:

# The generated code depends on lalrpop-util.
[dependencies]
lalrpop-util = "0.20.2"

# Add a build-time dependency on the lalrpop library:
[build-dependencies]
lalrpop = "0.20.2"
# If you are supplying your own external lexer you can disable default features so that the
# built-in lexer feature is not included
# lalrpop = { version = "0.20.2", default-features = false }

Next create a build.rs file that looks like:

fn main() {
    lalrpop::process_root().unwrap();
}

(If you already have a build.rs file, you should be able to just call process_root in addition to whatever else that file is doing.)

In this case, process_root simply uses the default settings, which takes files in src/ ending with the .lalrpop extension, and generates corresponding Rust source files with the same name in OUT_DIR. If you want to configure how LALRPOP executes, see the advanced setup section.

The [lalrpop_mod!][lalrpop_mod] macro generates a wrapper module in your crate so that you can use the generated parser from your code. For example, if the source grammar is located in grammar.lalrpop, adding the following line to lib.rs will create a corresponding grammar submodule (note that you can also add this line to a foo.rs module definition instead, which will then create a submodule foo::grammar):

#![allow(unused)]
fn main() {
lalrpop_mod!(grammar);

[lalrpop_mod]: https://docs.rs/lalrpop-util/latest/lalrpop_util/macro.lalrpop_mod.html

### Running manually

If you prefer, you can also run the `lalrpop` crate as an
executable. Simply run `cargo install lalrpop` and then you will get a
`lalrpop` binary you can execute, like so:

}

lalrpop file.lalrpop


This will generate `file.rs` for you. Note that it only executes if
`file.lalrpop` is newer than `file.rs`; if you'd prefer to execute
unconditionally, pass `-f` (also try `--help` for other options).

Cheatsheet

Users of Lalrpop have compiled the following cheatsheet table as a quick way to look up useful Lalrpop-isms. If you are looking for a specific piece of functionality, use this table to jump to the right section.

namesnippetdescriptiontutorial
position<left: @L> T <right: @R>captures the offset of the first byte and the offset of the last byte plus one (as left and right respectively)index pointer
error_recovery! => { ... }recovers from parser errorsError recovery
grammar_parametergrammar(scale: isize);input parameters usable in the generated parserPassing state parameter
custom_error"e" =>? Err(ParseError::User { error: "an error" })makes an action fallibleFallible actions
custom_macrosComma<T> = { ... }makes a non-terminal generic in other non-terminalsMacros
quantifier_macros<Num?> <Num*> <Num+>a non-terminal which can appear 0..1, 0+, 1+ timesMacros
tuple_macro<a:(<Num> ",")*>applies a quantifier to a group of matchesMacros
externextern { }allows to override some parts of the generated parserWriting a custom lexer
extern_errortype Error = MyError;sets the error to use in the ParseError::User variantWriting a custom lexer
extern_locationtype Location = MyLoc;sets the type to for locations instead of usizeWriting a custom lexer
extern_tokenum MyToken { }declares the type of lexer tokens to be consumed by the generated parserUsing tokens with references
auto_parameters<>refers to all the parameters of the non-terminal as a tupleType inference
conditional actionsExpr<I> = { ... , <T> if I == "a" => (), ...}Conditional definition of a macro's alternativeindex pointer
precedence#[precedence(level="0")]creates a hierarchy to parser actions for which ones should be applied firstHandling full expressions

This is a tutorial for how to write a complete parser for a simple calculator using LALRPOP.

If you are unfamiliar with what a parser generator is, you should read Crash course on parsers first.

This tutorial is still incomplete. There are a few example grammars in the doc section of the repository.

Here are some topics that I aim to cover when I get time to write about them:

  • Advice for resolving shift-reduce and reduce-reduce conflicts
  • Passing state and type/lifetime parameters to your action code (see e.g. this test invoked from here).
  • Location tracking with @L and @R (see e.g. this test).
  • Integrating with external tokenizers (see e.g. this test invoked from here).
  • Conditional macros (no good test to point you at yet, sorry)
  • Fallible action code that produces a Result (see e.g. this test invoked from here).
  • Converting to use LALR(1) instead of LR(1) (see e.g. this test invoked from here).
  • Plans for future features

Adding LALRPOP to your Cargo.toml file

LALRPOP works as a preprocessor that is integrated with cargo. When LALRPOP is invoked, it will search your source directory for files with the extension lalrpop and create corresponding rs files. So, for example, if we have a file calculator.lalrpop, the preprocessor will create a Rust file calculator.rs. As an aside, the syntax of LALRPOP intentionally hews fairly close to Rust, so it should be possible to use the Rust plugin to edit lalrpop files as well, as long as it's not too picky (the emacs rust-mode, in particular, works just fine).

To start, let's use cargo new to make a new project. We'll call it calculator:

> cargo new --bin calculator

We now have to edit the generated calculator/Cargo.toml file to invoke the LALRPOP preprocessor. The resulting file should look something like:

[package]
name = "calculator"
version = "0.1.0"
authors = ["Niko Matsakis <niko@alum.mit.edu>"]
edition = "2021"

[build-dependencies] # <-- We added this and everything after!
lalrpop = "0.20.2"

[dependencies]
lalrpop-util = { version = "0.20.2", features = ["lexer", "unicode"] }

Cargo can run build scripts as a pre-processing step, named build.rs by default. The [build-dependencies] section specifies the dependencies for build scripts -- in this case, just LALRPOP.

The [dependencies] section describes the dependencies that LALRPOP needs at runtime. All LALRPOP parsers require at least the lalrpop-util crate.

Next we have to add build.rs itself. For those unfamiliar with this feature, the build.rs file should be placed next to your Cargo.toml file and not inside the src folder with the rest of your Rust code. This should just look like the following:

fn main() {
    lalrpop::process_root().unwrap();
}

The function process_root processes your src directory, converting all lalrpop files into rs files, and saving them to OUT_DIR. It is smart enough to check timestamps and do nothing if the rs file is newer than the lalrpop file, and to mark the generated rs file as read-only. It returns an io::Result<()>, so the unwrap() call just asserts that no file-system errors occurred.

The lalrpop_mod! macro generates a wrapper module in your crate so that you can use the generated parser from your code. For example, if the source grammar is located in grammar.lalrpop, adding the following line to lib.rs will create a corresponding grammar submodule (note that you can also add this line to a foo.rs module definition instead, which will then create a submodule foo::grammar):

#![allow(unused)]
fn main() {
lalrpop_mod!(grammar);
}

Parsing parenthesized numbers

OK, now we're all set to start making a LALRPOP grammar. Before we tackle full expressions, let's start with something simple -- really simple. Let's just start with parenthesized integers, like 123 or (123) or even (hold on to your hats) (((123))). Wow.

To handle this, we'll need to add a calculator1.lalrpop as shown below. Note: to make explaining things easier, this version is maximally explicit; the next section will make it shorter by employing some shorthands that LALRPOP offers.

use std::str::FromStr;

grammar;

pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();

Let's look at this bit by bit. The first part of the file is the use statement and the grammar declaration. You'll find these at the top of every LALRPOP grammar. Just as in Rust, the use statement just brings names in scope: in fact, these use statements are just copied verbatim into the generated Rust code as needed.

A note about underscores and hygiene: LALRPOP generates its own names that begin with at least two leading underscores. To avoid conflicts, it will insert more underscores if it sees that you use identifiers that also have two underscores. But if you use glob imports that bring in names beginning with __, you may find you have invisible conflicts. To avoid this, don't use a glob (or define some other name with two underscores somewhere else).

Nonterminal declarations. After the grammar declaration comes a series of nonterminal declarations. This grammar has two nonterminals, Term and Num. A nonterminal is just a name that we give to something which can be parsed. Each nonterminal is then defined in terms of other things.

Let's start with Num, at the end of the file, which is declared as follows:

Num: i32 =
    <s:r"[0-9]+"> => i32::from_str(s).unwrap();

This declaration says that the type of Num is i32. This means that when we parse a Num from the input text, we will produce a value of type i32. The definition of Num is <s:r"[0-9]+">. Let's look at this from the inside out. The notation r"[0-9]+" is a regex literal -- this is the same as a Rust raw string. (And, just as in Rust, you can use hashes if you need to embed quotes, like r#"..."..."#.) It will match against a string of characters that matches the regular expression: in this case, some number of digits. The result of this match will be a slice &'input str into the input text that we are parsing (no copies are made).

This regular expression is wrapped in angle brackets and labeled: <s:r"[0-9]+">. In general, angle brackets are used in LALRPOP to indicate the values that will be used by the action code -- that is, the code that executes when a Num is parsed. In this case, the string that matches the regular expression is bound to the name s, and the action code i32::from_str(s).unwrap() parses that string and creates an i32. Hence the result of parsing a Num is an i32.

OK, now let's look at the nonterminal Term:

pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

First, this nonterminal is declared as pub. That means that LALRPOP will generate a public struct (named, as we will see, TermParser) that you can use to parse strings as Term. Private nonterminals (like Num) can only be used within the grammar itself, not from outside.

The Term nonterminal has two alternative definitions, which is indicated by writing { alternative1, alternative2 }. In this case, the first alternative is <n:Num>, meaning that a term can be just a number; so 22 is a term. The second alternative is "(" <t:Term> ")", which indicates that a term can also be a parenthesized term; so (22) is a term, as is ((22)), ((((((22)))))), and so on.

Invoking the parser. OK, so we wrote our parser, how do we use it? For every nonterminal Foo declared as pub, LALRPOP will export a FooParser struct with a parse method that you can call to parse a string as that nonterminal. Here is a simple test that we've added to our main.rs file which uses this struct to test our Term nonterminal:

#![allow(unused)]
fn main() {
use lalrpop_util::lalrpop_mod;

lalrpop_mod!(pub calculator1); // synthesized by LALRPOP

#[test]
fn calculator1() {
    assert!(calculator1::TermParser::new().parse("22").is_ok());
    assert!(calculator1::TermParser::new().parse("(22)").is_ok());
    assert!(calculator1::TermParser::new().parse("((((22))))").is_ok());
    assert!(calculator1::TermParser::new().parse("((22)").is_err());
}
}

The full signature of the parse method looks like this:

#![allow(unused)]
fn main() {
fn parse<'input>(&self, input: &'input str)
                     -> Result<i32, ParseError<usize,(usize, &'input str),()>>
                     //        ~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                     //         |                       |
                     // Result upon success             |
                     //                                 |
                     //             Error enum defined in the lalrpop_util crate
{
    ...
}
}

Type inference

OK, now that we understand the calculator1 example, let's look at some of the shorthands that LALRPOP offers to make it more concise. This code is found in the calculator2 demo.

To start, let's look at the definition of Term we saw before:

pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

The action code here is somewhat interesting. In both cases, it's not doing any new work, it's just selecting a value that was produced by another nonterminal. This turns out to be pretty common. So common, in fact, that LALRPOP offers some shorthand notation for it. Here is the definition of Term from the calculator2 demo:

pub Term = { Num, "(" <Term> ")" };

Here, we have no action code at all. If there is no action code, LALRPOP synthesizes action code which just takes the value of the things being matched. In the case of the first alternative, Num, there is only one thing being matched, so that means that Term will produce the same value as the Num we parsed, whatever that was.

In the case of the second alternative, "(" <Term> ")", there are three things being matched. Here we use the angle brackets to select which item(s) we want to take the value of --- we selected only one, so the result is that we take the value of the Term we parsed. If we selected more than one, the result would be a tuple of all the selected items. If we did not select any (i.e., "(" Term ")"), the result would be a tuple of all the items, and hence the result would be of type (&'input str, i32, &'input str).

Speaking of types, you may have noticed that Term has no type annotation. Since we didn't write out own action code, we can omit the type annotation and let LALRPOP infer it for us. In this case, LALRPOP can see that Term must have the same type as Num, and hence that the type must be i32.

OK, let's look at the definition of Num we saw before from calculator1:

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();

This definition too can be made somewhat shorter. In calculator2, you will find:

Num: i32 = r"[0-9]+" => i32::from_str(<>).unwrap();

Here, instead of giving the regular expression a name s, we modified the action code to use the funky expression <>. This is a shorthand that says "synthesize names for the matched values and insert a comma-separated list here". In this case, there is only one matched value, r"[0-9]+", and it produces a &'input str, so LALRPOP will insert a synthetic variable for that value. Note that we still have custom action code, so we still need a type annotation.

To control what values are selected when you use the <> expression in your action code, you can use angle brackets as we saw before. Here are some examples of alternatives and how they are expanded to give you the idea:

AlternativeEquivalent to
A => bar(<>)<a:A> => bar(a)
A B => bar(<>)<a:A> <b:B> => bar(a, b)
A B => (<>)<a:A> <b:B> => (a, b)
<A> B => bar(<>)<a:A> B => bar(a)
<p:A> B => bar(<>)<p:A> B => bar(p)
<A> <B> => bar(<>)<a:A> <b:B> => bar(a, b)
<p:A> <q:B> => bar(<>)<p:A> <q:B> => bar(p, q)
<p:A> B => Foo {<>}<p:A> B => Foo {p:p}
<p:A> <q:B> => Foo {<>}<p:A> <q:B> => Foo {p:p, q:q}

The <> expressions also works with struct constructors (like Foo {...} in examples above). This works out well if the names of your parsed values match the names of your struct fields.

Handling full expressions

Now we are ready to extend our calculator to cover the full range of arithmetic expressions (well, at least the ones you learned in elementary school). Here is the next calculator example, calculator3:

use std::str::FromStr;

grammar;

pub Expr: i32 = {
    <l:Expr> "+" <r:Factor> => l + r,
    <l:Expr> "-" <r:Factor> => l - r,
    Factor,
};

Factor: i32 = {
    <l:Factor> "*" <r:Term> => l * r,
    <l:Factor> "/" <r:Term> => l / r,
    Term,
};

Term: i32 = {
    Num,
    "(" <Expr> ")",
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap(),
};

Perhaps the most interesting thing about this example is the way it encodes precedence. The idea of precedence of course is that in an expression like 2+3*4, we want to do the multiplication first, and then the addition. It is pretty straightforward to express precedence in your grammar by structuring it in tiers -- for example, here we have the nonterminal Expr, which covers all expressions. It consists of a series of factors that are added or subtracted from one another. A Factor is then a series of terms that are multiplied or divided. Finally, a Term is either a single number or, using parenthesis, an entire expr.

Abstracting from this example, the typical pattern for encoding precedence is to have one nonterminal per precedence level, where you begin with the operators of lowest precedence (+, -), add in the next highest precedence level (*, /), and finish with the bare "atomic" expressions like Num. Finally, you add in a parenthesized version of your top-level as an atomic expression, which lets people reset.

To see why this works, consider the two possible parse trees for something like 2+3*4:

2 + 3   *    4          2   +  3   *    4
| | |   |    |          |   |  |   |    |
| | +-Factor-+    OR    +-Expr-+   |    |
| |     |                   |      |    |
+-Expr -+                   +----Factor-+

In the first one, we give multiplication higher precedence, and in the second one, we (incorrectly) give addition higher precedence. If you look at the grammar now, you can see that the second one is impossible: a Factor cannot have an Expr as its left-hand side. This is the purpose of the tiers: to force the parser into the precedence you want.

Tiered expressions can also be generated by Lalrpop using the precedence and assoc attribute macros. These macros generate the same thing as tiered expressions do, but they can reduce code complexity when working with many levels of precedence. The above Expr grammar can be rewritten to the following using them:

pub Expr: i32 = {
    #[precedence(level="0")] // Highest precedence
    Term,
    #[precedence(level="1")] #[assoc(side="left")]
    <l:Expr> "*" <r:Expr> => l * r,
    <l:Expr> "/" <r:Expr> => l / r,
    #[precedence(level="2")] #[assoc(side="left")]
    <l:Expr> "+" <r:Expr> => l + r,
    <l:Expr> "-" <r:Expr> => l - r,
};

The precedence level specifies the order of operations starting from zero. In this example it means that 13 + 7 * 3 would be evaluated as (13 + (7 * 3)) because multiplication has a lower precedence level than addition.

By using assoc you can specify if the expression is left-associative or right-associative. This is required to make the grammar unambiguous, otherwise 1 + 2 + 3 could both be interpreted as (1 + (2 + 3)) and ((1 + 2) + 3).

Finally, note that we only write pub before the nonterminal we're interested in parsing (Expr) and not any of the helpers. Nonterminals marked pub have extra code generated, like the new() method used to access the parser from other modules. If you get a warning about an unused new() method on FooParser, drop the pub from nonterminal Foo.

Building ASTs

Of course, most of the time, when you're parsing you don't want to compute a value, you want to build up some kind of data structure. Here's a quick example to show how that is done in LALRPOP. First, we need to define the data structure we will build. We're going to use a very simple enum:

#![allow(unused)]
fn main() {
pub enum Expr {
    Number(i32),
    Op(Box<Expr>, Opcode, Box<Expr>),
}

pub enum Opcode {
    Mul,
    Div,
    Add,
    Sub,
}
}

We put this code into an ast.rs module in our project, along with some Debug impls so that things pretty-print nicely. Now we will create the calculator4 example, which will build up this tree. To start, let's just look at the Expr nonterminal, which will show you most everything of how it is done (the most interesting lines have been flagged with comments):

use std::str::FromStr;
use ast::{Expr, Opcode}; // (0)

grammar;

pub Expr: Box<Expr> = { // (1)
    Expr ExprOp Factor => Box::new(Expr::Op(<>)), // (2)
    Factor,
};

ExprOp: Opcode = { // (3)
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};

First off, we have to import these new names into our file by adding a use statement (0). Next, we want to produce Box<Expr> values, so we change the type of Expr (and Factor and Term) to Box<Expr> (1). The action code changes accordingly in (2); here we've used the <> expansion to supply three arguments to Expr::Op. Finally, just for concision, we introduced an ExprOp nonterminal (3) to cover the two opcodes, which now trigger the same action code (before they triggered different action code, so we could do an addition vs a subtraction).

The definition of Factor is transformed in a similar way:

Factor: Box<Expr> = {
    Factor FactorOp Term => Box::new(Expr::Op(<>)),
    Term,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

And finally we adjust the definitions of Term and Num. Here, we convert from a raw i32 into a Box<Expr> when we transition from Num to Term (4):

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)), // (4)
    "(" <Expr> ")"
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap()
};

And that's it! Now we can test it by adding some code to our main.rs file that parses an expression and formats it using the Debug impl:

#![allow(unused)]
fn main() {
lalrpop_mod!(pub calculator4);
pub mod ast;

#[test]
fn calculator4() {
    let expr = calculator4::ExprParser::new()
        .parse("22 * 44 + 66")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "((22 * 44) + 66)");
}
}

Macros

Frequently when writing grammars we encounter repetitive constructs that we would like to copy-and-paste. A common example is defining something like a "comma-separated list". Imagine we wanted to parse a comma-separated list of expressions (with an optional trailing comma, of course). If we had to write this out in full, it would look something like:

Exprs: Vec<Box<Expr>> = {
    Exprs "," Expr => ...,
    Expr => vec![<>],
}

Of course, this doesn't handle trailing commas, and I've omitted the action code. If we added those, it would get a bit more complicated. So far, this is fine, but then what happens when we later want a comma-separated list of terms? Do we just copy-and-paste everything?

LALRPOP offers a better option. You can define macros. In fact, LALRPOP comes with four macros builtin: *, ?, +, and (...). So you can write something like Expr? to mean "an optional Expr". This will have type Option<Box<Expr>> (since Expr alone has type Box<Expr>). Similarly, you can write Expr* or Expr+ to get a Vec<Expr> (with minimum length 0 and 1 respectively). The final macro is parentheses, which is a shorthand for creating a new nonterminal. This lets you write things like (<Expr> ",")? to mean an "optionally parse an Expr followed by a comma". Note the angle brackets around Expr: these ensures that the value of the (<Expr> ",") is the value of the expression, and not a tuple of the expression and the comma. This means that (<Expr> ",")? would have the type Option<Box<Expr>> (and not Option<(Box<Expr>, &'input str)>).

Using these operations we can define Exprs in terms of a macro Comma<T> that creates a comma-separated list of T, whatever T is (this definition appears in calculator5):

pub Exprs = Comma<Expr>; // (0)

Comma<T>: Vec<T> = { // (1)
    <mut v:(<T> ",")*> <e:T?> => match e { // (2)
        None => v,
        Some(e) => {
            v.push(e);
            v
        }
    }
};

The definition of Exprs on line (0) is fairly obvious, I think. It just uses a macro Comma<Expr>. Let's take a look then at the definition of Comma<T> on line (1). This is sort of dense, so let's unpack it. First, T is some terminal or nonterminal, but note that we can also use it as a type: when the macro is expanded, the T in the type will be replaced with "whatever the type of T is".

Next, on (2), we parse <mut v:(<T> ",")*> <e:T?>. That's a lot of symbols, so let's first remove all the angle brackets, which just serve to tell LALRPOP what values you want to propagate and which you want to discard. In that case, we have: (T ",")* T?. Hopefully you can see that this matches a comma-separated list with an optional trailing comma. Now let's add those angle-brackets back in. In the parentheses, we get (<T> ",")* -- this just means that we keep the value of the T but discard the value of the comma when we build our vector. Then we capture that vector and call it v: <mut v:(<T> ",")*>. The mut makes v mutable in the action code. Finally, we capture the optional trailing element e: <e:T?>. This means the Rust code has two variables available to it, v: Vec<T> and e: Option<T>. The action code itself should then be fairly clear -- if e is Some, it appends it to the vector and returns the result.

As another example of using macros, you may recall the precedence tiers we saw in calculator4 (Expr, Factor, etc), which had a sort of repetitive structure. You could factor that out using a macro. In this case, it's a recursive macro:

Tier<Op,NextTier>: Box<Expr> = {
    Tier<Op,NextTier> Op NextTier => Box::new(Expr::Op(<>)),
    NextTier
};

Expr = Tier<ExprOp, Factor>;
Factor = Tier<FactorOp, Term>;

ExprOp: Opcode = { // (3)
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

And, of course, we have to add some tests to main.rs file:

#![allow(unused)]
fn main() {
use lalrpop_util::lalrpop_mod;

lalrpop_mod!(pub calculator5);

#[test]
fn calculator5() {
    let expr = calculator5::ExprsParser::new().parse("").unwrap();
    assert_eq!(&format!("{:?}", expr), "[]");

    let expr = calculator5::ExprsParser::new()
        .parse("22 * 44 + 66")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[((22 * 44) + 66)]");

    let expr = calculator5::ExprsParser::new()
        .parse("22 * 44 + 66,")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[((22 * 44) + 66)]");

    let expr = calculator5::ExprsParser::new()
        .parse("22 * 44 + 66, 13*3")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[((22 * 44) + 66), (13 * 3)]");

    let expr = calculator5::ExprsParser::new()
        .parse("22 * 44 + 66, 13*3,")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[((22 * 44) + 66), (13 * 3)]");
}
}

Returning errors from actions

Sometimes it can be useful to have action code that is able to return an error instead of being expected to produce a value of type T directly. This happens because we usually cannot reject all invalid input just by using grammar rules, or rather, the work to do so would be too much.

Even in our calculator example, you can see that we're "cheating" the system: Our grammar accepts an unlimited number of digits in the input, but the result is parsed as a i32. This is an issue because the maximum number that i32 can represent is 2147483647. Try giving it a bigger number and it will panic because it always expects the i32 conversion to succeed.

If you are familiar with Rust's error handling story, you might think that we can just make Num return an Option<i32> or even Result<i32, ...>, and you would be right. However, that is not necessary, because if we look at the type of ExprParser::parse(), we can see that it already returns a Result<i32, ParseError>. So the goal is to "hook" into this existing error machinery and create action code that can return errors.

LALRPOP supports this very easily by defining action code with =>? instead of =>. The returned value is then assumed to be a Result<T, ParseError> instead of a plain T:

Num: i32 = {
    r"[0-9]+" =>? i32::from_str(<>)
        .map_err(|_| ParseError::User {
            error: "number is too big"
        })
};

In addition, we have to add use lalrpop_util::ParseError; to the top of the file so that we have access to the ParseError type. You can find the full source as calculator6.lalrpop. This allows you to nicely handle the errors:

#![allow(unused)]
fn main() {
use lalrpop_util::lalrpop_mod;

lalrpop_mod!(pub calculator6);

#[test]
fn calculator6() {
    // Number is one bigger than std::i32::MAX
    let expr = calculator6::ExprsParser::new().parse("2147483648");
    assert!(expr.is_err());
}
}

No panics!

You can even go a step further and define your own error type, for example an enum with all possible errors. This allows you to distinguish between different errors more easily, without relying on strings.

For that, let's say we want to define two errors: One if the input number was too big, and another one if the input number was not even - we're changing the calculator to only accept even numbers for now.

We first define our error enum in main.rs:

#![allow(unused)]
fn main() {
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum Calculator6Error {
    InputTooBig,
    OddNumber,
}
}

Then we import it into our grammar and tell LALRPOP to use it as the user error type, so we change the top of the file to:

use std::str::FromStr;
use ast::{Expr, Opcode};

use super::Calculator6Error;

use lalrpop_util::ParseError;

grammar;

extern {
    type Error = Calculator6Error;
}
...

We can also change the rule for Num to make use of our new error:

Num: i32 = {
    r"[0-9]+" =>? i32::from_str(<>)
        .map_err(|_| ParseError::User {
            error: Calculator6Error::InputTooBig
        })
        .and_then(|i| if i % 2 == 0 {
            Ok(i)
        } else {
            Err(ParseError::User {
                error: Calculator6Error::OddNumber
            })
        })
};

And finally we can see if it works:

#![allow(unused)]
fn main() {
use lalrpop_util::lalrpop_mod;

lalrpop_mod!(pub calculator6b);

#[test]
fn calculator6b() {
    use lalrpop_util::ParseError;

    let expr = calculator6b::ExprsParser::new().parse("2147483648");
    assert!(expr.is_err());
    assert_eq!(expr.unwrap_err(), ParseError::User { error: Calculator6Error::InputTooBig });

    let expr = calculator6b::ExprsParser::new().parse("3");
    assert!(expr.is_err());
    assert_eq!(expr.unwrap_err(), ParseError::User { error: Calculator6Error::OddNumber });
}
}

There we go! You can find the full grammar in calculator6b.lalrpop.

Error recovery

By default, the parser will stop as soon as it encounters an error. Sometimes though we would like to try and recover and keep going. LALRPOP can support this, but you have to help it by defining various "error recovery" points in your grammar. This is done by using a special ! token: this token only occurs when the parser encounters an error in the input. When an error does occur, the parser will try to recover and keep going; it does this by injecting the ! token into the stream, executing any actions that it can, and then dropping input tokens until it finds something that lets it continue.

Let's see how we can use error recovery to attempt to find multiple errors during parsing. First we need a way to return multiple errors as this is not something that LALRPOP does by itself so we add a Vec storing the errors we found during parsing. Since the result of ! contains a token, error recovery requires that tokens can be cloned. We need to replace the begin "grammar" line of the LALRPOP file with this:

use lalrpop_util::ErrorRecovery;

grammar<'err>(errors: &'err mut Vec<ErrorRecovery<usize, Token<'input>, &'static str>>);

The ErrorRecovery struct wraps ParseError to add a second field referencing the skipped characters.

Since an alternative containing ! is expected to return the same type of value as the other alternatives in the production we add an extra variant to Expr to indicate that an error was found.

#![allow(unused)]
fn main() {
pub enum Expr {
    Number(i32),
    Op(Box<Expr>, Opcode, Box<Expr>),
    Error,
}
}

Finally we modify the grammar, adding a third alternative containing ! which simply stores the ErrorRecovery value received from ! in errors and returns an Expr::Error. The value of the error token will be a ParseError value. You can find the full source in calculator7.

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    "(" <Expr> ")",
    ! => { errors.push(<>); Box::new(Expr::Error) },
};

Now we can add a test that includes various errors (e.g., missing operands). Note that now the parse method takes two arguments instead of one, which is caused by that we rewrote the "grammar" line in the LALRPOP file. You can see that the parser recovered from missing operands by inserting this ! token where necessary.

#![allow(unused)]
fn main() {
#[test]
fn calculator7() {
    let mut errors = Vec::new();

    let expr = calculator7::ExprsParser::new()
        .parse(&mut errors, "22 * + 3")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[((22 * error) + 3)]");

    let expr = calculator7::ExprsParser::new()
        .parse(&mut errors, "22 * 44 + 66, *3")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[((22 * 44) + 66), (error * 3)]");

    let expr = calculator7::ExprsParser::new()
        .parse(&mut errors, "*")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "[(error * error)]");

    assert_eq!(errors.len(), 4);
}
}

Passing state parameter

By default, the parser doesn't take any argument other than the input. When building the AST, it might be useful to pass parameters to the parser, which might be needed to the construction of the tree.

Going back to the calculator4 example it is possible to pass an argument to the parser :

#![allow(unused)]
fn main() {
grammar(scale: i32);
}
#![allow(unused)]
fn main() {
Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap()*scale,
};
}

Here the parser will accept a scale parameter that will scale every number encountered.

We can then call the parser with the state parameter :

#![allow(unused)]
fn main() {
#[test]
fn calculator8() {
    let scale = 2;
    let expr = calculator8::ExprParser::new()
        .parse(scale,"11 * 22 + 33")
        .unwrap();
    assert_eq!(&format!("{:?}", expr), "((22 * 44) + 66)");
}
}

For a more practical example with a custom tree structure, check out this parser using this structure to build the AST.

Note: The state parameter must implement the Copy trait. For types that don't implement Copy, you should pass them as a reference instead.

Fine control over the lexer

This part is about controlling the inner workings of LALRPOP's built-in lexer generator and using your own hand written parser.

LALRPOP's lexer generator

This example dives a bit deeper into how LALRPOP works. In particular, it dives into the meaning of those strings and regular expression that we used in the previous tutorial, and how they are used to process the input string (a process which you can control). This first step of breaking up the input using regular expressions is often called lexing or tokenizing.

If you're comfortable with the idea of a lexer or tokenizer, you may wish to skip ahead to the calculator3 example, which covers parsing bigger expressions, and come back here only when you find you want more control. You may also be interested in the tutorial on writing a custom lexer.

Terminals vs nonterminals

You may have noticed that our grammar included two distinct kinds of symbols. There were the nonterminals, Term and Num, which we defined by specifying a series of symbols that they must match, along with some action code that should execute once they have matched:

   Num: i32 = r"[0-9]+" => i32::from_str(<>).unwrap();
// ~~~  ~~~   ~~~~~~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~
// |    |     |                Action code
// |    |     Symbol(s) that should match
// |    Return type
// Name of nonterminal

But there are also terminals, which consist of the string literals and regular expressions sprinkled throughout the grammar. (Terminals are also often called tokens, and I will use the terms interchangeably.)

This distinction between terminals and nonterminals is very important to how LALRPOP works. In fact, when LALRPOP generates a parser, it always works in a two-phase process. The first phase is called the lexer or tokenizer. It has the job of figuring out the sequence of terminals: so basically it analyzes the raw characters of your text and breaks them into a series of terminals. It does this without having any idea about your grammar or where you are in your grammar. Next, the parser proper is a bit of code that looks at this stream of tokens and figures out which nonterminals apply:

          +-------------------+    +---------------------+
  Text -> | Lexer             | -> | Parser              |
          |                   |    |                     |
          | Applies regex to  |    | Consumes terminals, |
          | produce terminals |    | executes your code  |
          +-------------------+    | as it recognizes    |
                                   | nonterminals        |
                                   +---------------------+

LALRPOP's default lexer is based on regular expressions. By default, it works by extracting all the terminals (e.g., "(" or r"\d+") from your grammar and compiling them into one big list. At runtime, it will walk over the string and, at each point, find the longest match from the literals and regular expressions in your grammar and produces one of those. As an example, let's look again at our example grammar:

pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();

This grammar in fact contains three terminals:

  • "(" -- a string literal, which must match exactly
  • ")" -- a string literal, which must match exactly
  • r"[0-9]+" -- a regular expression

When we generate a lexer, it is effectively going to be checking for each of these three terminals in a loop, sort of like this pseudocode:

let mut i = 0; // index into string
loop {
    skip whitespace; // we do this implicitly, at least by default
    if (data at index i is "(") { produce "("; }
    else if (data at index i is ")") { produce ")"; }
    else if (data at index i matches regex "[0-9]+") { produce r"[0-9]+"; }
}

Note that this has nothing to do with your grammar. For example, the tokenizer would happily tokenize a string like this one, which doesn't fit our grammar:

  (  22   44  )     )
  ^  ^^   ^^  ^     ^
  |  |    |   |     ")" terminal
  |  |    |   |
  |  |    |   ")" terminal
  |  +----+
  |  |
  |  2 r"[0-9]+" terminals
  |
  "(" terminal

When these tokens are fed into the parser, it would notice that we have one left paren but then two numbers (r"[0-9]+" terminals), and hence report an error.

Precedence of fixed strings

Terminals in LALRPOP can be specified (by default) in two ways. As a fixed string (like "(") or a regular expression (like r[0-9]+). There is actually an important difference: if, at some point in the input, both a fixed string and a regular expression could match, LALRPOP gives the fixed string precedence. To demonstrate this, let's modify our parser. If you recall, the current parser parses parenthesized numbers, producing a i32. We're going to modify if to produce a string, and we'll add an "easter egg" so that 22 (or (22), ((22)), etc) produces the string "Twenty-two":

pub Term = {
    Num,
    "(" <Term> ")",
    "22" => "Twenty-two!".to_string(),
};

Num: String = r"[0-9]+" => <>.to_string();

If we write some simple unit tests, we can see that in fact an input of 22 has matched the string literal. Interestingly, the input 222 matches the regular expression instead; this is because LALRPOP prefers to find the longest match first. After that, if there are two matches of equal length, it prefers the fixed string:

#![allow(unused)]
fn main() {
#[test]
fn calculator2b() {
    let result = calculator2b::TermParser::new().parse("33").unwrap();
    assert_eq!(result, "33");

    let result = calculator2b::TermParser::new().parse("(22)").unwrap();
    assert_eq!(result, "Twenty-two!");

    let result = calculator2b::TermParser::new().parse("(222)").unwrap();
    assert_eq!(result, "222");
}
}

Ambiguities between regular expressions

In the previous section, we saw that fixed strings have precedence over regular expressions. But what if we have two regular expressions that can match the same input? Which one wins? For example, consider this variation of the grammar above, where we also try to support parenthesized identifiers like ((foo22)):

pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    r"\w+" => format!("Id({})", <>), // <-- we added this
};

Num: String = r"[0-9]+" => <>.to_string();

Here I've written the regular expression r\w+. However, if you check out the docs for regex, you'll see that \w is defined to match alphabetic characters but also digits. So there is actually an ambiguity here: if we have something like 123, it could be considered to match either r"[0-9]+" or r"\w+". If you try this grammar, you'll find that LALRPOP helpfully reports an error:

error: ambiguity detected between the terminal `r#"\w+"#` and the terminal `r#"[0-9]+"#`

      r"\w+" => <>.to_string(),
      ~~~~~~

There are various ways to fix this. We might try adjusting our regular expression so that the first character cannot be a number, so perhaps something like r"[[:alpha:]]\w*". This will work, but it actually matches something different than what we had before (e.g., 123foo will not be considered to match, for better or worse). And anyway it's not always convenient to make your regular expressions completely disjoint like that. Another option is to use a match declaration, which lets you control the precedence between regular expressions.

Simple match declarations

A match declaration lets you explicitly give the precedence between terminals. In its simplest form, it consists of just ordering regular expressions and string literals into groups, with the higher precedence items coming first. So, for example, we could resolve our conflict above by giving r"[0-9]+" precedence over r"\w+", thus saying that if something can be lexed as a number, we'll do that, and otherwise consider it to be an identifier.

match {
    r"[0-9]+"
} else {
    r"\w+",
    _
}

Here the match contains two levels; each level can have more than one item in it. The top-level contains only r"[0-9]+", which means that this regular expression is given highest priority. The next level contains r\w+, so that will match afterwards.

The final _ indicates that other string literals and regular expressions that appear elsewhere in the grammar (e.g., "(" or "22") should be added into that final level of precedence (without an _, it is illegal to use a terminal that does not appear in the match declaration).

If we add this match section into our example, we'll find that it compiles, but it doesn't work exactly like we wanted. Let's update our unit test a bit to include some identifier examples::

#![allow(unused)]
fn main() {
#[test]
fn calculator2b() {
    // These will all work:

    let result = calculator2b::TermParser::new().parse("33").unwrap();
    assert_eq!(result, "33");

    let result = calculator2b::TermParser::new().parse("foo33").unwrap();
    assert_eq!(result, "Id(foo33)");

    let result = calculator2b::TermParser::new().parse("(foo33)").unwrap();
    assert_eq!(result, "Id(foo33)");

    // This one will fail:

    let result = calculator2b::TermParser::new().parse("(22)").unwrap();
    assert_eq!(result, "Twenty-two!");
}
}

The problem comes about when we parse 22. Before, the fixed string 22 got precedence, but with the new match declaration, we've explicitly stated that the regular expression r"[0-9]+" has full precedence. Since the 22 is not listed explicitly, it gets added at the last level, where the _ appears. We can fix this by adjusting our match to mention 22 explicitly:

match {
    r"[0-9]+",
    "22"
} else {
    r"\w+",
    _
}

This raises the interesting question of what the precedence is within a match rung -- after all, both the regex and "22" can match the same string. The answer is that within a match rung, fixed literals get precedence over regular expressions, just as before, and all regular expressions must not overlap.

With this new match declaration, we will find that our tests all pass.

Renaming match declarations

There is one final twist before we reach the final version of our example that you will find in the repository. We can also use match declarations to give names to regular expressions, so that we don't have to type them directly in our grammar. For example, maybe instead of writing r"\w+", we would prefer to write ID. We could do that by modifying the match declaration like so:

match {
    r"[0-9]+",
    "22"
} else {
    r"\w+" => ID, // <-- give a name here
    _
}

And then adjusting the definition of Term to reference ID instead:

pub Term = {
    Num,
    "(" <Term> ")",
    "22" => "Twenty-two!".to_string(),
    ID => format!("Id({})", <>), // <-- changed this
};

In fact, the match declaration can map a regular expression to any kind of symbol you want (i.e., you can also map to a string literal or even a regular expression). Whatever symbol appears after the => is what you should use in your grammar. As an example, some languages have case-insensitive keywords; if you wanted to write "BEGIN" in the grammar itself, but have that map to a regular expression in the lexer, you might write:

match {
    r"(?i)begin" => "BEGIN",
    ...
}

And now any reference in your grammar to "BEGIN" will actually match any capitalization.

Customizing skipping between tokens

If we want to support comments we will need to skip more than just whitespace in our lexer. To this end ignore patterns can be specified.

match {
    r"\s*" => { }, // The default whitespace skipping is disabled if an `ignore pattern` is specified
    r"//[^\n\r]*[\n\r]*" => { }, // Skip `// comments`
    r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/" => { },  // Skip `/* comments */`
}

Unicode compatibility

LALRPOP is capable of lexing tokens that match the full unicode character set, or those that just match ASCII. If you need unicode matching, you should enable features = [ "unicode" ] in your Cargo.toml. Because lexing unicode requires loading the full unicode character set, enabling this feature will increase binary size, so you may wish to avoid it if you do not need unicode support.

It's important to note that certain character classes from perl regex extensions are "unicode friendly", and require unicode support. For example, "\s" matches unicode whitespace characters, not just ASCII ones, and likewise "\d" matches unicode digits (such as numerals in non-latin character sets). If you use those patterns in your lexer, you will require unicode.

You may wish to match only the ASCII subset of these characters, in which case, you can use the ASCII only character classes described here as substitutes and avoid adding unicode dependencies.

Lexing raw delimited content

Our calculator example operated on numbers and arithmetic operators. There is no overlap between the characters for numeric digits (0, 1, ...), the characters representing operators (+, -, ...) and parentheses ((, )), so it was easy to embed those tokens directly in the grammar, as we saw in the earlier sections.

However, clean lexical separations can be hard to identify in some languages.

Consider parsing a language with string literals. We will define a simple one; all it can do is bind variables, which are always single characters, like this:

x = "a"
y = "bc"

Using what we have learned so far, we might try a grammar like the following one:

use super::{Var, Lit, Eql};

grammar;

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

Unfortunately, this does not work; attempting to process the above grammar yields:

error: ambiguity detected between the terminal `r#"[x-z]"#` and the terminal `r#"[a-z]*"#`

We saw the explanation for why this happens in the previous section: the two regular expressions overlap, and the generated lexer does not know how to resolve the ambiguity between them.

Cut to the chase?

If you want to know "the right way" to solve this problem, you can skip straight to the end.

But if you want to understand why it is the right answer, you may benefit from taking the detour that starts now.

Exploring our options

A match declaration here, as suggested in the previous chapter, might seem like it fixes the problem:

use super::{Var, Lit, Eql};

grammar;

match {
   r"[x-z]"
} else {
   r"[a-z]*",
   _
}

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

With that match declaration in place we can successfully run a test like this one:

#![allow(unused)]
fn main() {
#[test]
fn fair_ball() {
    assert_eq!(nobol2::EqlParser::new().parse(r#"z = "xyz""#), Ok(('z', "xyz").into()));
}
}

Unfortunately, the match is actually only papering over the fundamental problem here. Consider this variant of the previous test:

#![allow(unused)]
fn main() {
#[test]
fn foul_ball() {
    assert_eq!(nobol2::EqlParser::new().parse(r#"z = "x""#), Ok(('z', "x").into()));
}
}

The above produces:

---- foul_ball stdout ----
thread 'foul_ball' panicked at 'assertion failed: `(left == right)`
  left: `Err(UnrecognizedToken { token: (5, Token(3, "x"), 6), expected: ["r#\"[a-z]*\"#"] })`,
 right: `Ok(Eql(Var('z'), Lit("x")))`', doc/nobol/src/main.rs:43:5

What is the problem?

Merely specifying a precedence to favor tokenizing r"[x-z]" over r"[a-z]*" does not address the real problem here. That precedence rule causes an input like z = "x" to be split into tokens such that the x only matches the regular expression for the Var. It will not match the r"[a-z]*" in the Lit rule, even if it intuitively seems like it should; they have already been lexically categorized as different tokens at this point.

One could add further workarounds to deal with this. For example, one could change the Lit production to explicitly handle the r"[x-z]" regular expression as its own case:

pub Lit: Lit = {
    "\"" <r"[x-z]"> "\"" => <>.into(),
    "\"" <r"[a-z]*"> "\"" => <>.into(),
};

But this is a fragile workaround.

Specifically, this workaround is only applicable because we put artificial limits on this language.

If we wanted to generalize string literals to be able to contain other characters (such as whitespace), the technique described so far does not work out well. Consider this grammar:

match {
   r"[x-z]"
} else {
   r"[a-z ]*",
   _
}

pub Var: Var = <r"[x-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = {
    "\"" <r"[x-z]"> "\"" => <>.into(),
    "\"" <r"[a-z ]*"> "\"" => <>.into(),
};

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

Now, if we run the same test as before:

#![allow(unused)]
fn main() {
#[test]
fn spaceballs() {
    assert_eq!(nobol4::EqlParser::new().parse(r#"z = "x""#), Ok(('z', "x").into()));
}
}

we get the following error output:

thread 'spaceballs' panicked at 'assertion failed: `(left == right)`
  left: `Err(UnrecognizedToken { token: (0, Token(2, "z "), 2), expected: ["r#\"[x-z]*\"#"] })`,
 right: `Ok(Eql(Var('z'), Lit("x")))`', doc/nobol/src/main.rs:58:5

Our attempt to generalize what strings can contain has caused problems for how the rest of the input is tokenized.

The right way to do this

Let us revisit the original rule in the grammar for string literals, from our first version:

pub Lit: Lit = "\"" <r"[a-z]*"> "\"" => <>.into();

The heart of our problem is that we have implicitly specified distinct tokens for the string delimiter ("\"") versus the string content (in this case, r"[a-z]*").

Intuitively, we only want to tokenize string content when we are in the process of reading a string. In other words, we only want to apply the r"[a-z]*" rule immediately after reading a "\"". But the generated lexer does not infer this from our rules; it just blindly looks for something matching the string content regular expression anywhere in the input.

You could solve this with a custom lexer (treated in the next section).

But a simpler solution is to read the string delimiters and the string content as a single token, like so:

pub Var: Var = <r"[a-z]"> => <>.chars().next().unwrap().into();

pub Lit: Lit = <l:r#""[a-z ]*""#> => l[1..l.len()-1].into();

pub Eql: Eql = <Var> "=" <Lit> => (<>).into();

(Note that this form of the grammar does not require any match statement; there is no longer any ambiguity between the different regular expressions that drive the tokenizer.)

With this definition of the grammar, all of these tests pass:

#[test]
fn homerun() {
    assert_eq!(nobol5::VarParser::new().parse("x"), Ok('x'.into()));
    assert_eq!(nobol5::LitParser::new().parse(r#""abc""#), Ok("abc".into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"x = "a""#), Ok(('x', "a").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"y = "bc""#), Ok(('y', "bc").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"z = "xyz""#), Ok(('z', "xyz").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"z = "x""#), Ok(('z', "x").into()));
    assert_eq!(nobol5::EqlParser::new().parse(r#"z = "x y z""#), Ok(('z', "x y z").into()));
}

Furthermore, we can now remove other artificial limits in our language. For example, we can make our identifiers more than one character:

pub Var: Var = <r"[a-z]+"> => <>.into()

which, with suitable changes to the library code, works out fine.

Escape sequences

Our current string literals are allowed to hold a small subset of the full space of characters.

If we wanted to generalize it to be able to hold arbitrary characters, we would need some way to denote the delimiter character " in ths string content.

The usual way to do this is via an escape sequence: \", which is understood by the lexical analyzer as not ending the string content.

We can generalize the regular expression in our new Lit rule to handle this:

pub Lit: Lit = <l:r#""(\\\\|\\"|[^"\\])*""#> => l[1..l.len()-1].into();

However, depending on your data model, this is not quite right. In particular: the produced string still has the escaping backslashes embedded in it.

As a concrete example, with the above definition for Lit, this test:

#![allow(unused)]
fn main() {
#[test]
fn popfly() {
    assert_eq!(nobol6::EqlParser::new().parse(r#"z = "\"\\""#), Ok(('z', "\"\\").into()));
}
}

yields this output:

thread 'popfly' panicked at 'assertion failed: `(left == right)`
  left: `Ok(Eql(Var('z'), Lit("\\\"\\\\")))`,
 right: `Ok(Eql(Var('z'), Lit("\"\\")))`', doc/nobol/src/main.rs:91:5

This can be readily addressed by adding some code to post-process the token to remove the backslashes:

pub Lit: Lit = <l:r#""(\\\\|\\"|[^"\\])*""#> => Lit(apply_string_escapes(&l[1..l.len()-1]).into());

where apply_string_escapes is a helper routine that searches for backslashes in the content and performs the corresponding replacement with the character denoted by the escape sequence.

Writing a custom lexer

Let's say we want to parse the Whitespace language, so we've put together a grammar like the following:

pub Program = <Statement*>;

Statement: ast::Stmt = {
    " " <StackOp>,
    "\t" " " <MathOp>,
    "\t" "\t" <HeapOp>,
    "\n" <FlowCtrl>,
    "\t" "\n" <Io>,
};

StackOp: ast::Stmt = {
    " " <Number> => ast::Stmt::Push(<>),
    "\n" " " => ast::Stmt::Dup,
    "\t" " " <Number> => ast::Stmt::Copy(<>),
    "\n" "\t" => ast::Stmt::Swap,
    "\n" "\n" => ast::Stmt::Discard,
    "\t" "\n" <Number> => ast::Stmt::Slide(<>),
};

MathOp: ast::Stmt = {
    " " " " => ast::Stmt::Add,
    " " "\t" => ast::Stmt::Sub,
    " " "\n" => ast::Stmt::Mul,
    "\t" " " => ast::Stmt::Div,
    "\t" "\t" => ast::Stmt::Mod,
};

// Remainder omitted

Naturally, it doesn't work. By default, LALRPOP generates a tokenizer that skips all whitespace -- including newlines. What we want is to capture whitespace characters and ignore the rest as comments, and LALRPOP does the opposite of that.

At the moment, LALRPOP doesn't allow you to configure the default tokenizer. In the future it will become quite flexible, but for now we have to write our own.

Let's start by defining the stream format. The parser will accept an iterator where each item in the stream has the following structure:

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

Loc is typically just a usize, representing a byte offset into the input string. Each token is accompanied by two of them, marking the start and end positions where it was found. Error can be pretty much anything you choose. And of course Tok is the meat of the stream, defining what possible values the tokens themselves can have. Following the conventions of Rust iterators, we'll signal a valid token with Some(Ok(...)), an error with Some(Err(...)), and EOF with None.

(Note that the term "tokenizer" normally refers to a piece of code that simply splits up the stream, whereas a "lexer" also tags each token with its lexical category. What we're writing is the latter.)

Whitespace is a simple language from a lexical standpoint, with only three valid tokens:

pub enum Tok {
    Space,
    Tab,
    Linefeed,
}

Everything else is a comment. There are no invalid lexes, so we'll define our own error type, a void enum:

pub enum LexicalError {
    // Not possible
}

Now for the lexer itself. We'll take a string slice as its input. For each token we process, we'll want to know the character value, and the byte offset in the string where it begins. We can do that by wrapping the CharIndices iterator, which yields tuples of (usize, char) representing exactly that information.

use std::str::CharIndices;

pub struct Lexer<'input> {
    chars: CharIndices<'input>,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices() }
    }
}

(The lifetime parameter 'input indicates that the Lexer cannot outlive the string it's trying to parse.)

Let's review our rules:

  • For a space character, we output Tok::Space.
  • For a tab character, we output Tok::Tab.
  • For a linefeed (newline) character, we output Tok::Linefeed.
  • We skip all other characters.
  • If we've reached the end of the string, we'll return None to signal EOF.

Writing a lexer for a language with multi-character tokens can get very complicated, but this is so straightforward, we can translate it directly into code without thinking very hard. Here's our Iterator implementation:

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next() {
                Some((i, ' ')) => return Some(Ok((i, Tok::Space, i+1))),
                Some((i, '\t')) => return Some(Ok((i, Tok::Tab, i+1))),
                Some((i, '\n')) => return Some(Ok((i, Tok::Linefeed, i+1))),

                None => return None, // End of file
                _ => continue, // Comment; skip this character
            }
        }
    }
}

That's it. That's all we need.

Updating the parser

To use this with LALRPOP, we need to expose its API to the parser. It's pretty easy to do, but also somewhat magical, so pay close attention. Pick a convenient place in the grammar file (I chose the bottom) and insert an extern block:

extern {
    // ...
}

Now we tell LALRPOP about the Location and Error types, as if we're writing a trait:

extern {
    type Location = usize;
    type Error = lexer::LexicalError;
    
    // ...
}

We expose the Tok type by kinda sorta redeclaring it:

extern {
    type Location = usize;
    type Error = lexer::LexicalError;

    enum lexer::Tok {
        // ...
    }
}

Now we have to declare each of our terminals. For each variant of Tok, we pick what name the parser will see, and write a pattern of the form name => lexer::Tok::Variant, similar to how action code works in grammar rules. The name can be an identifier, or a string literal. We'll use the latter.

Here's the whole thing:

extern {
    type Location = usize;
    type Error = lexer::LexicalError;

    enum lexer::Tok {
        " " => lexer::Tok::Space,
        "\t" => lexer::Tok::Tab,
        "\n" => lexer::Tok::Linefeed,
    }
}

From now on, the parser will take a Lexer as its input instead of a string slice, like so:

#![allow(unused)]
fn main() {
    let lexer = lexer::Lexer::new("\n\n\n");
    match parser::parse_Program(lexer) {
        ...
    }
}

And any time we write a string literal in the grammar, it'll substitute a variant of our Tok enum. This means we don't have to change any of the rules we already wrote! This will work as-is:

FlowCtrl: ast::Stmt = {
    " " " " <Label> => ast::Stmt::Mark(<>),
    " " "\t" <Label> => ast::Stmt::Call(<>),
    " " "\n" <Label> => ast::Stmt::Jump(<>),
    "\t" " " <Label> => ast::Stmt::Jz(<>),
    "\t" "\t" <Label> => ast::Stmt::Js(<>),
    "\t" "\n" => ast::Stmt::Return,
    "\n" "\n" => ast::Stmt::Exit,
};

The complete grammar is available in whitespace/src/parser.lalrpop.

Where to go from here

Things to try that apply to lexers in general:

  • Longer tokens
  • Tokens that require tracking internal lexer state

Things to try that are LALRPOP-specific:

  • Persuade a lexer generator to output the Spanned format
  • Make this tutorial better

Using tokens with references

When using a custom lexer, you might want tokens to hold references to the original input. This allows to use references to the input when the grammar can have arbitrary symbols such as variable names. Using references instead of copying the symbols can improve performance and memory usage of the parser.

The Lexer

We can now create a new calculator parser that can deal with symbols the same way an interpreter would deal with variables. First we need the corresponding AST :

#![allow(unused)]
fn main() {
pub enum ExprSymbol<'input>{
    NumSymbol(&'input str),
    Op(Box<ExprSymbol<'input>>, Opcode, Box<ExprSymbol<'input>>),
    Error,
}
}

Then, we need to build the tokens:

#![allow(unused)]
fn main() {
#[derive(Copy, Clone, Debug)]
pub enum Tok<'input> {
    NumSymbol(&'input str),
    FactorOp(Opcode),
    ExprOp(Opcode),
    ParenOpen,
    ParenClose,
}
}

Notice the NumSymbol type holding a reference to the original input. It represents both numbers and variable names as a slice of the original input.

Then, we can build the lexer itself.

#![allow(unused)]
fn main() {
use std::str::CharIndices;

pub struct Lexer<'input> {
    chars: std::iter::Peekable<CharIndices<'input>>,
    input: &'input str,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer {
            chars: input.char_indices().peekable(),
            input,
        }
    }
}
}

It needs to hold a reference to the input to put slices in the tokens.

#![allow(unused)]
fn main() {
impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok<'input>, usize, ()>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next() {
                Some((_, ' '))  | Some((_, '\n')) | Some((_, '\t')) => continue,
                Some((i, ')')) => return Some(Ok((i, Tok::ParenClose, i + 1))),
                Some((i, '(')) => return Some(Ok((i, Tok::ParenOpen, i + 1))),
                Some((i, '+')) => return Some(Ok((i, Tok::ExprOp(Opcode::Add), i + 1))),
                Some((i, '-')) => return Some(Ok((i, Tok::ExprOp(Opcode::Sub), i + 1))),
                Some((i, '*')) => return Some(Ok((i, Tok::FactorOp(Opcode::Mul), i + 1))),
                Some((i, '/')) => return Some(Ok((i, Tok::FactorOp(Opcode::Div), i + 1))),

                None => return None, // End of file
                Some((i,_)) => {
                    loop {
                        match self.chars.peek() {
                            Some((j, ')'))|Some((j, '('))|Some((j, '+'))|Some((j, '-'))|Some((j, '*'))|Some((j, '/'))|Some((j,' '))
                            => return Some(Ok((i, Tok::NumSymbol(&self.input[i..*j]), *j))),
                            None => return Some(Ok((i, Tok::NumSymbol(&self.input[i..]),self.input.len()))),
                            _ => {self.chars.next();},
                        }
                    }
                }
            }
        }
    }
}
}

It's quite simple, it returns any operator, and if it detects any other character, stores the beginning then continues to the next operator and sends the symbol it just parsed.

The parser

We can then take a look at the corresponding parser with a new grammar:

#![allow(unused)]
fn main() {
Term: Box<ExprSymbol<'input>> = {
    "num" => Box::new(ExprSymbol::NumSymbol(<>)),
    "(" <Expr> ")"
};
}

We need to pass the input to the parser so that the input's lifetime is known to the borrow checker when compiling the generated parser.

#![allow(unused)]
fn main() {
grammar<'input>(input: &'input str);
}

Then we just need to define the tokens the same as before :

#![allow(unused)]
fn main() {
extern {
    type Location = usize;
    type Error = ();
    
    enum Tok<'input> {
        "num" => Tok::NumSymbol(<&'input str>),
        "FactorOp" => Tok::FactorOp(<Opcode>),
        "ExprOp" => Tok::ExprOp(<Opcode>),
        "(" => Tok::ParenOpen,
        ")" => Tok::ParenClose,
    }
}
}

Calling the parser

We can finally run the parser we built:

#![allow(unused)]
fn main() {
let input = "22 * pi + 66";
let lexer = Lexer::new(input);
let expr = calculator9::ExprParser::new()
    .parse(input,lexer)
    .unwrap();
assert_eq!(&format!("{:?}", expr), "((\"22\" * \"pi\") + \"66\")");
}

Using an external library

Writing a lexer yourself can be tricky. Fortunately, you can find many libraries on crates.io to generate a lexer for you.

In this tutorial, we will use Logos to build a simple lexer for a toy programming language. Here is an example of what we will be able to parse:

var a = 42;
var b = 23;

# a comment
print (a - b);

Setup

In your Cargo.toml, add the following dependency:

logos = "0.14"

This will provide the logos crate and the Logos trait.

The AST

We will use the following abstract syntax tree (AST) as a representation of our expressions:

#![allow(unused)]
fn main() {
#[derive(Clone, Debug, PartialEq)]
pub enum Statement {
  Variable { name: String, value: Box<Expression> },
  Print { value: Box<Expression> },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Expression {
  Integer(i64),
  Variable(String),
  BinaryOperation {
    lhs: Box<Expression>,
    operator: Operator,
    rhs: Box<Expression>,
  },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Operator {
  Add,
  Sub,
  Mul,
  Div,
}
}

Implement the tokenizer

In a file named tokens.rs (or any other name you want), create an enumeration for your tokens, as well as a type for lexing errors:

#![allow(unused)]
fn main() {
use std::fmt;  // to implement the Display trait later
use std::num::ParseIntError;
use logos::Logos;

#[derive(Default, Debug, Clone, PartialEq)]
pub enum LexicalError {
    InvalidInteger(ParseIntError),
    #[default]
    InvalidToken,
}

impl From<ParseIntError> for LexicalError {
    fn from(err: ParseIntError) -> Self {
        LexicalError::InvalidInteger(err)
    }
}

#[derive(Logos, Clone, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+", skip r"#.*\n?", error = LexicalError)]
pub enum Token {
  #[token("var")]
  KeywordVar,
  #[token("print")]
  KeywordPrint,

  #[regex("[_a-zA-Z][_0-9a-zA-Z]*", |lex| lex.slice().to_string())]
  Identifier(String),
  #[regex("[1-9][0-9]*", |lex| lex.slice().parse())]
  Integer(i64),

  #[token("(")]
  LParen,
  #[token(")")]
  RParen,
  #[token("=")]
  Assign,
  #[token(";")]
  Semicolon,

  #[token("+")]
  OperatorAdd,
  #[token("-")]
  OperatorSub,
  #[token("*")]
  OperatorMul,
  #[token("/")]
  OperatorDiv,
}
}

An exact match is specified using the #[token(...)] attribute.

For example, #[token("+")] makes the OperatorAdd token to be emitted only when a literal "+" appears in the input (unless it is part of another match, see below).

On the other hand, #[regex(...)] will match a regular expression.

The #[logos(...)] attribute around the enum defines regexes to skip when parsing the input. We've chosen to skip common whitespace characters, and single-line comments of the form # .... It also allows us to specify our custom error type, LexicalError if an unexpected token was encountered or if parsing an integer fails.

A few things to note about how Logos works:

When several sequences of tokens can match the same input, Logos uses precise rules to make a choice. Rule of thumb is:

  • Longer beats shorter.
  • Specific beats generic.

This means the "printa" input string will generate the following token:

  • Token::Identifier(String::new("printa"))

And not:

  • Token::KeywordPrint
  • Token::Identifier(String::new("a"))

This is because printa is longer than print, therefore the Identifier rule has priority.

Finally, we implement the Display trait:

#![allow(unused)]
fn main() {
impl fmt::Display for Token {
  fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
    write!(f, "{:?}", self)
  }
}
}

This is required because the token is included in the error message that LALRPOP generates if it fails at parsing.

Implement the lexer

This part is very similar to the previous tutorials. In a file lexer.rs (or any other name), we will implement the Lexer as required by LALRPOP.

First, we define our types and structures:

#![allow(unused)]
fn main() {
use logos::{Logos, SpannedIter};

use crate::tokens::{Token, LexicalError}; // your Token enum, as above

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

pub struct Lexer<'input> {
  // instead of an iterator over characters, we have a token iterator
  token_stream: SpannedIter<'input, Token>,
}
}

Then, we create the constructor for our Lexer:

#![allow(unused)]
fn main() {
impl<'input> Lexer<'input> {
  pub fn new(input: &'input str) -> Self {
    // the Token::lexer() method is provided by the Logos trait
    Self { token_stream: Token::lexer(input).spanned() }
  }
}
}

Finally, we implement the Iterator trait:

#![allow(unused)]
fn main() {
impl<'input> Iterator for Lexer<'input> {
  type Item = Spanned<Token, usize, LexicalError>;

  fn next(&mut self) -> Option<Self::Item> {
    self.token_stream
      .next()
      .map(|(token, span)| Ok((span.start, token?, span.end)))
  }
}
}

Update the grammar

Next, in our grammar.lalrpop file (or any other name), we can integrate our lexer as follows:

use crate::tokens::{Token, LexicalError};
use crate::ast;

grammar;

// ...

extern {
  type Location = usize;
  type Error = LexicalError;

  enum Token {
    "var" => Token::KeywordVar,
    "print" => Token::KeywordPrint,
    "identifier" => Token::Identifier(<String>),
    "int" => Token::Integer(<i64>),
    "(" => Token::LParen,
    ")" => Token::RParen,
    "=" => Token::Assign,
    ";" => Token::Semicolon,
    "+" => Token::OperatorAdd,
    "-" => Token::OperatorSub,
    "*" => Token::OperatorMul,
    "/" => Token::OperatorDiv,
  }
}

NB: This part allows us to give a precise name to the tokens emitted by our Lexer. We can then use those names ("identifier", "var", ...) in our grammar rules to reference the desired token.

Finally, we can build our rules:

#![allow(unused)]
fn main() {
pub Script: Vec<ast::Statement> = {
  <stmts:Statement*> => stmts
}

pub Statement: ast::Statement = {
  "var" <name:"identifier"> "=" <value:Expression> ";" => {
    ast::Statement::Variable { name, value }
  },
  "print" <value:Expression> ";" => {
    ast::Statement::Print { value }
  },
}

pub Expression: Box<ast::Expression> = {
  #[precedence(level="1")]
  Term,

  #[precedence(level="2")] #[assoc(side="left")]
  <lhs:Expression> "*" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Mul,
      rhs
    })
  },
  <lhs:Expression> "/" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Div,
      rhs
    })
  },

  #[precedence(level="3")] #[assoc(side="left")]
  <lhs:Expression> "+" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Add,
      rhs
    })
  },
  <lhs:Expression> "-" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Sub,
      rhs
    })
  },
}

pub Term: Box<ast::Expression> = {
  <val:"int"> => {
    Box::new(ast::Expression::Integer(val))
  },
  <name:"identifier"> => {
    Box::new(ast::Expression::Variable(name))
  },
  "(" <Expression> ")",
}
}

Our grammar is now complete.

Running your parser

The last step is to run our parser:

#![allow(unused)]
fn main() {
let source_code = std::fs::read_to_string("myscript.toy")?;
let lexer = Lexer::new(&source_code);
let parser = ScriptParser::new();
let ast = parser.parse(lexer)?;

println!("{:?}", ast);
}

Customizing the Build Process

When you setup LALRPOP, you create a build.rs file that looks something like this:

fn main() {
    lalrpop::process_root().unwrap();
}

This process_root() call simply applies the default configuration: so it will transform .lalrpop files into .rs files in-place (in your src directory), and it will only do so if the .lalrpop file has actually changed. But you can also use the Configuration struct to get more detailed control.

For example, to force the use of colors in the output (ignoring the TTY settings), you might make your build.rs file look like so:

fn main() {
    lalrpop::Configuration::new()
        .always_use_colors()
        .process_current_dir();
}

Rerun Directives

Cargo will rerun the build script on each compilation even if the lalrpop file has not changed. To disable this behavior, use the emit_rerun_directives function when setting up your lalrpop Configuration.

fn main() {
    lalrpop::Configuration::new()
        .emit_rerun_directives(true)
        .process_current_dir();
}

By default, this is set to false in case other parts of the build script or compilation code expects build.rs to be run unconditionally.

Using the Legacy LALR Parser

By default, LALRPOP uses the lane table algorithm which is LR(1) but creates much smaller tables. There is no longer any clear benefit to using the previous LALR implementation but it is still available.

To enable it, build with the LALRPOP_LANE_TABLE=disabled environment variable by setting std::env::set_var in your build.rs and add the #[LALR] attribute above the grammar; declaration in your lalrpop grammar file.

Up to version 0.15, LALRPOP was generating its files in the same directory of the input files. Since 0.16, files are generated in the Cargo's output directory.

If you want to keep the previous behaviour, you can use generate_in_source_tree in your configuration:

fn main() {
    lalrpop::Configuration::new()
        .generate_in_source_tree()
        .process().unwrap();
}

For each foo.lalrpop file you can simply have mod foo; in your source tree. The lalrpop_mod macro is not useful in this mode.

LALRPOP support conditional compilation of public non-terminal declarations via #[cfg(feature = "FEATURE")] attributes. If run in a build script LALRPOP will automatically pickup the features from cargo and use those. Alternatively an explicit set of features can be set using the Configuration type.

#![allow(unused)]
fn main() {
#[cfg(feature = "FEATURE")]
pub MyRule : () = {
    ...
};
}

Contributors

Here is a list of the contributors who have helped improving LALRPOP. This list may be incomplete. The "canonical list" can be found in the git history; if you have contributed but you are not in this list, please add yourself!