# Using an external library
Writing a lexer yourself can be tricky. Fortunately, crates.io hosts many libraries that will generate a lexer for you.
In this tutorial, we will use Logos to build a simple lexer for a toy programming language. Here is an example of what we will be able to parse:
```
var a = 42;
var b = 23;
# a comment
print (a - b);
```
## Setup
In your `Cargo.toml`, add the following dependency:

```toml
logos = "0.12.0"
```
This will provide the `logos` crate and the `Logos` trait.
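If you are following on from the earlier tutorials, your dependency section will also contain LALRPOP itself. As a rough sketch (the version numbers here are illustrative, not prescriptive):

```toml
[dependencies]
lalrpop-util = "0.19"
logos = "0.12.0"

[build-dependencies]
lalrpop = "0.19"
```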
## The AST
We will use the following abstract syntax tree (AST) to represent our programs:
```rust
#[derive(Clone, Debug, PartialEq)]
pub enum Statement {
    Variable { name: String, value: Box<Expression> },
    Print { value: Box<Expression> },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Expression {
    Integer(i64),
    Variable(String),
    BinaryOperation {
        lhs: Box<Expression>,
        operator: Operator,
        rhs: Box<Expression>,
    },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Operator {
    Add,
    Sub,
    Mul,
    Div,
}
```
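To make the mapping concrete, the statement `var a = 42;` from the example corresponds to the following value (a small sketch, assuming the types above are in scope):

```rust
// `var a = 42;` as an AST value
let stmt = Statement::Variable {
    name: "a".to_string(),
    value: Box::new(Expression::Integer(42)),
};
```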
## Implement the tokenizer
In a file named `tokens.rs` (or any other name you want), create an enumeration for your tokens:
```rust
use std::fmt; // to implement the Display trait

use logos::Logos;

#[derive(Logos, Clone, Debug, PartialEq)]
pub enum Token {
    #[token("var")]
    KeywordVar,
    #[token("print")]
    KeywordPrint,

    #[regex("[_a-zA-Z][_0-9a-zA-Z]*", |lex| lex.slice().parse())]
    Identifier(String),
    #[regex(r"\d+", |lex| lex.slice().parse())]
    Integer(i64),

    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
    #[token("=")]
    Assign,
    #[token(";")]
    Semicolon,

    #[token("+")]
    OperatorAdd,
    #[token("-")]
    OperatorSub,
    #[token("*")]
    OperatorMul,
    #[token("/")]
    OperatorDiv,

    // comments and whitespace are skipped entirely
    #[regex(r"#.*\n?", logos::skip)]
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,
}
```
An exact match is specified using the `#[token()]` annotation. For example, `#[token("+")]` makes the `OperatorAdd` token be emitted whenever a literal `+` appears in the input (unless it is part of a longer match, see below).
On the other hand, `#[regex()]` will match a regular expression. For example, combined with the `logos::skip` callback, `#[regex(r"#.*\n?", logos::skip)]` causes everything from a `#` character up to the end of the line (a comment in our language) to be ignored.
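You can verify the skipping behaviour directly, without LALRPOP in the picture. A minimal sketch, assuming the `Token` enum above is in scope:

```rust
use logos::Logos;

// comments and whitespace never show up in the token stream
let tokens: Vec<Token> = Token::lexer("# a comment\nvar a = 42;").collect();
assert_eq!(
    tokens,
    vec![
        Token::KeywordVar,
        Token::Identifier("a".to_string()),
        Token::Assign,
        Token::Integer(42),
        Token::Semicolon,
    ]
);
```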
A few things to note about how Logos works:

When several token rules can match the same input, Logos uses precise rules to make a choice. The rule of thumb is:

- Longer beats shorter.
- Specific beats generic.
This means the input string `printa` will generate the following token:

```rust
Token::Identifier(String::from("printa"))
```

And not:

```rust
Token::KeywordPrint
Token::Identifier(String::from("a"))
```

This is because `printa` is longer than `print`, so the `Identifier` rule has priority.
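The contrast is easy to check; a small sketch, again assuming the `Token` enum from above:

```rust
use logos::Logos;

// "printa" is a single identifier...
let mut lex = Token::lexer("printa");
assert_eq!(lex.next(), Some(Token::Identifier("printa".to_string())));

// ...but "print a" is the keyword followed by an identifier
let mut lex = Token::lexer("print a");
assert_eq!(lex.next(), Some(Token::KeywordPrint));
assert_eq!(lex.next(), Some(Token::Identifier("a".to_string())));
```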
Finally, we implement the `Display` trait:
```rust
impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}
```
This is required because the token is included in the error message that LALRPOP generates when parsing fails.
## Implement the lexer
This part is very similar to the previous tutorials. In a file `lexer.rs` (or any other name), we will implement the Lexer as required by LALRPOP.
First, we define our types and structures:
```rust
use logos::{Logos, SpannedIter};

use crate::path::to::tokens::Token; // your enum

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

#[derive(Debug)] // so parse errors containing this type can be printed
pub enum LexicalError {
    InvalidToken,
}

pub struct Lexer<'input> {
    // instead of an iterator over characters, we have a token iterator
    token_stream: SpannedIter<'input, Token>,
}
```
Then, we create the constructor for our Lexer:
```rust
impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        // the Token::lexer() method is provided by the Logos trait
        Self {
            token_stream: Token::lexer(input).spanned(),
        }
    }
}
```
Finally, we implement the `Iterator` trait:
```rust
impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Token, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.token_stream.next().map(|(token, span)| {
            match token {
                // an invalid token was met
                Token::Error => Err(LexicalError::InvalidToken),
                _ => Ok((span.start, token, span.end)),
            }
        })
    }
}
```
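Each item the lexer yields is the `(start, token, end)` triple that LALRPOP expects. A quick sketch of what that looks like in practice, assuming the types above:

```rust
let lexer = Lexer::new("var a = 42;");
for item in lexer {
    match item {
        Ok((start, token, end)) => println!("{}..{}: {}", start, end, token),
        Err(_) => eprintln!("invalid token"),
    }
}
// prints, e.g.:
// 0..3: KeywordVar
// 4..5: Identifier("a")
// ...
```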
## Update the grammar
Next, in our `grammar.lalrpop` file (or any other name), we can integrate our lexer as follows:
```
use crate::path::to::{
    tokens::Token,
    lexer::LexicalError,
    ast,
};

grammar;

// ...

extern {
    type Location = usize;
    type Error = LexicalError;

    enum Token {
        "var" => Token::KeywordVar,
        "print" => Token::KeywordPrint,
        "identifier" => Token::Identifier(<String>),
        "int" => Token::Integer(<i64>),
        "(" => Token::LParen,
        ")" => Token::RParen,
        "=" => Token::Assign,
        ";" => Token::Semicolon,
        "+" => Token::OperatorAdd,
        "-" => Token::OperatorSub,
        "*" => Token::OperatorMul,
        "/" => Token::OperatorDiv,
    }
}
```
NB: This part allows us to give a precise name to the tokens emitted by our Lexer. We can then use those names (`"identifier"`, `"var"`, ...) in our grammar rules to reference the desired token.
Finally, we can build our rules:
```
pub Script: Vec<Box<ast::Statement>> = {
    <stmts:Statement*> => stmts
}

pub Statement: Box<ast::Statement> = {
    "var" <name:"identifier"> "=" <value:Expression> ";" => {
        Box::new(ast::Statement::Variable { name, value })
    },
    "print" <value:Expression> ";" => {
        Box::new(ast::Statement::Print { value })
    },
}

pub Expression: Box<ast::Expression> = {
    #[precedence(level="1")]
    Term,

    #[precedence(level="2")] #[assoc(side="left")]
    <lhs:Expression> "*" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Mul,
            rhs
        })
    },
    <lhs:Expression> "/" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Div,
            rhs
        })
    },

    #[precedence(level="3")] #[assoc(side="left")]
    <lhs:Expression> "+" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Add,
            rhs
        })
    },
    <lhs:Expression> "-" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Sub,
            rhs
        })
    },
}

pub Term: Box<ast::Expression> = {
    <val:"int"> => {
        Box::new(ast::Expression::Integer(val))
    },
    <name:"identifier"> => {
        Box::new(ast::Expression::Variable(name))
    },
    "(" <Expression> ")",
}
```
Our grammar is now complete.
## Running your parser
The last step is to run our parser:
```rust
let source_code = std::fs::read_to_string("myscript.toy")?;
let lexer = Lexer::new(&source_code[..]);
let parser = ScriptParser::new();
let ast = parser.parse(lexer)?;

println!("{:?}", ast);
```
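For completeness, here is one way to wrap this in a `main` function. This sketch assumes the build setup from the earlier tutorials, with the generated parser exposed through `lalrpop_mod!`; the module and file names are illustrative:

```rust
use lalrpop_util::lalrpop_mod;

mod ast;
mod lexer;
mod tokens;

// generated by LALRPOP from grammar.lalrpop
lalrpop_mod!(grammar);

fn main() {
    let source_code = std::fs::read_to_string("myscript.toy")
        .expect("cannot read myscript.toy");
    let lexer = lexer::Lexer::new(&source_code);
    let parser = grammar::ScriptParser::new();

    match parser.parse(lexer) {
        Ok(ast) => println!("{:?}", ast),
        Err(error) => eprintln!("parse error: {:?}", error),
    }
}
```

For the sample script at the top of this chapter, this should print something like `[Variable { name: "a", value: Integer(42) }, Variable { name: "b", value: Integer(23) }, Print { value: BinaryOperation { lhs: Variable("a"), operator: Sub, rhs: Variable("b") } }]`.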