For the first post in this series, read here.
In this post, I’ll start to get into the specifics of how to actually write a translator. I wouldn’t recommend doing the entire thing by hand. Most of the complicated and time-intensive work can be avoided by using free tools. These types of tools are called ‘parser generators’ or ‘compiler compilers’.
My preferred tool is called ANTLR (Another Tool for Language Recognition). ANTLR is a powerful tool with lots of easy solutions for common problems, but it can also be easy to trip yourself up. Let’s go over some of the concepts you’ll need to understand before digging in.
ANTLR doesn’t use regular expressions directly, but its syntax borrows many of the same concepts. You’ll want to make sure you have an understanding of basic regex concepts before moving forward. If you understand how the +, *, ?, |, and () operators work, then you should be fine.
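If you want a quick refresher, here is a short sketch in Python. ANTLR itself doesn’t use Python regexes, and this pattern is purely hypothetical, but the five operators behave the same way in both worlds:

```python
import re

# + : one or more      * : zero or more      ? : zero or one
# | : alternation      () : grouping
pattern = re.compile(r"(ab)+c*d?|xyz")

print(bool(pattern.fullmatch("ababccd")))  # (ab)+ matches "abab", c* matches "cc", d? matches "d"
print(bool(pattern.fullmatch("xyz")))      # the | operator lets "xyz" match instead
print(bool(pattern.fullmatch("ac")))       # fails: (ab)+ needs at least one full "ab"
```

If you can predict all three results before running this, you know enough regex to read ANTLR rules comfortably.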
Parser generators make heavy use of a syntax called Backus-Naur Form, or BNF for short. An ANTLR statement expressed in BNF might look something like this:
WHITE_SPACE : ( ' ' | '\t' | '\r' | '\n' )+ ;
In this example, we’re describing how to tokenize the input. Our lexer will treat any run of space, tab, or line-break characters as a single WHITE_SPACE token. In the previous post, we examined the three steps of translation:
This WHITE_SPACE example is used in the lexing step. In the parsing step we would have a statement with similar BNF syntax:
whileLoop : LOOP_WHILE PAREN_OPEN booleanExpression PAREN_CLOSE statement ;
Unlike the WHITE_SPACE statement, this whileLoop statement doesn’t define a token. Instead, it describes how the parser should create a node in a tree data structure (more detail below).
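ANTLR generates the actual parser code for you, but it can help to see roughly what that generated code does. Here is a hand-rolled Python sketch for the whileLoop rule above. The token names mirror the rule, and the token stream is hypothetical, as if a lexer had already run (with the boolean expression and statement simplified to single pre-parsed units):

```python
# A hand-rolled sketch of what a generated parser does for the
# whileLoop rule. Everything here is illustrative, not ANTLR output.

class Node:
    def __init__(self, rule, children):
        self.rule = rule          # which parser rule created this node
        self.children = children  # child tokens/subtrees, in order

def parse_while_loop(tokens, pos):
    """Match: LOOP_WHILE PAREN_OPEN booleanExpression PAREN_CLOSE statement"""
    expected = ["LOOP_WHILE", "PAREN_OPEN", "booleanExpression",
                "PAREN_CLOSE", "statement"]
    children = []
    for want in expected:
        kind, text = tokens[pos]
        if kind != want:
            raise SyntaxError(f"expected {want}, found {kind}")
        children.append((kind, text))
        pos += 1
    return Node("whileLoop", children), pos

# "while (x < 10) x = x + 1;" after lexing (simplified)
tokens = [("LOOP_WHILE", "while"), ("PAREN_OPEN", "("),
          ("booleanExpression", "x < 10"), ("PAREN_CLOSE", ")"),
          ("statement", "x = x + 1;")]

node, _ = parse_while_loop(tokens, 0)
print(node.rule)           # whileLoop
print(len(node.children))  # 5
```

The key idea is the return value: matching the rule produces a whileLoop node whose children are the matched pieces, and that node becomes part of the tree.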
If we can describe the syntactic structures of an entire language using BNF statements, we call that series of statements a BNF grammar. When using ANTLR, we can describe the entire grammar of a language in a single file. The BNF statements for both our lexer and parser will be in this file, and we can distinguish between them using a simple naming convention: uppercase vs lowercase.
For example, only the lexer will use our WHITE_SPACE statement because it starts with an uppercase letter. Only the first character matters for this distinction, so something like White_space would still be interpreted as a lexer statement; all caps is simply the convention because it makes the distinction more obvious. Conversely, parser statements must start with a lowercase letter, and typically use camel case.
Thinking with Trees
The goal of our ANTLR lexer / parser is to generate an abstract syntax tree, or AST for short. This is the name of the data structure that I alluded to above. In the previous post, we looked at an AST for the English language as an example:
However, most programming languages look nothing like English. It’s more likely we will be interested in dealing with math. Let’s say we have this mathematical expression:
If our parser understands PEMDAS, it should create an AST that looks like this:
Notice that if we were to evaluate this expression, the first calculations would happen at the bottom of the tree, and the last at the top. This is because the tree is evaluated with a post-order, depth-first traversal: each node’s children are evaluated before the node itself.
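To make the traversal concrete, here is a minimal sketch of an AST evaluator in Python. Since the figure isn’t reproduced here, take a hypothetical expression like 1 + 2 * 3: because * binds tighter than +, a PEMDAS-aware parser puts the multiplication lower in the tree, so it runs first.

```python
# A minimal AST evaluator. Leaves are plain numbers; interior
# nodes are (operator, left, right) tuples.

def evaluate(node):
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    # Post-order: evaluate both children before applying the operator.
    lhs, rhs = evaluate(left), evaluate(right)
    return {"+": lhs + rhs, "-": lhs - rhs,
            "*": lhs * rhs, "/": lhs / rhs}[op]

# AST for 1 + 2 * 3:  the '*' node sits below the '+' node
ast = ("+", 1, ("*", 2, 3))
print(evaluate(ast))  # 7, not 9: the multiplication at the bottom ran first
```

If the parser had instead grouped the expression as (1 + 2) * 3, the + node would sit at the bottom and the result would be 9, which is exactly why the shape of the tree encodes operator precedence.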
That was a lot of information to soak up! If you’ve understood all of the concepts presented in this post, you’re almost ready to start creating your own lexers and parsers. In the next post we will look at some simplified examples of ANTLR grammars.