As a programmer, I often find myself succumbing to the allure of algorithms. There’s nothing quite like expanding your mental horizon by learning a new concept or pattern.
Some of the most powerful and elegant algorithms are the ones used in language translation. This is the same process your compiler uses to translate a human-readable programming language into machine code. You can learn to harness these same algorithms to create your own domain-specific language.
The overall task of translating languages is so intimidating that most people just think of their compilers as a ‘black box’ that accomplishes its job using magic. In this series of posts, I’ll attempt to break this process down into understandable chunks and explain each one.
Steps of the Process
Translation algorithms almost always follow a three-step process:

1. Lexing – breaking the input into tokens
2. Parsing – arranging those tokens into a tree
3. Compiling – traversing that tree and emitting the target language
First, we need to tackle an issue of terminology. When programmers use the word ‘compiler’, they’re referring to the whole tool that translates a human-readable programming language into machine code. In this series, however, translating and compiling are two separate concepts: what is colloquially called a compiler will here be referred to as a translator, and compiling is specifically the third step of the translation process, a process which most programmers refer to as compilation.
Let’s take a look at the three steps in more detail…
A lexer is an augmented tokenizer. Like a tokenizer, a lexer takes a string as input and breaks it up into chunks. Let’s use the English language as an analogy: if your input is an English sentence, both a lexer and a tokenizer would return a list of words and punctuation marks. A lexer takes this idea one step further by attaching extra metadata to each token. It would return not just the word, but also tell you, ‘this word is a noun’. For example:
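As a sketch of the idea in code, here is a minimal lexer for simple arithmetic expressions. The token names (`NUMBER`, `PLUS`, and so on) and the expression language are illustrative choices, not something prescribed by any particular compiler:

```python
import re

# Each rule pairs a token name (the metadata) with a regex (the chunking).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),   # whitespace: matched, but not emitted as a token
]

def lex(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                if name != "SKIP":
                    tokens.append((name, match.group()))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"Unexpected character: {source[pos]!r}")
    return tokens

print(lex("1 + 2 * 3"))
# [('NUMBER', '1'), ('PLUS', '+'), ('NUMBER', '2'), ('TIMES', '*'), ('NUMBER', '3')]
```

Where a plain tokenizer would hand back only the chunks (`'1'`, `'+'`, `'2'`, …), the lexer labels each one, just as our English analogy labeled a word as a noun.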
Once we’ve got a stream of tokens from our lexer, it’s on to the next step.
Parsers are the next logical step after lexers. The parser is responsible for taking that linear stream of tokens from the lexer and arranging them into more abstract, higher-level data structures. The data structures are hierarchical, and therefore take the shape of a tree. These ‘trees’ represent the syntactic structures of the language. The root of the tree is the most abstract grouping, while the ‘leaves’ are tokens from the lexer.
Let’s return to our English analogy. If a lexer outputs a list of words, then the parser outputs groupings of abstractions that arise from those words. Examples of these abstractions could be clauses, sentences, paragraphs, and so on. If we were to continue our example from the previous step, we might get something like this:
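To make the tree-building concrete, here is a sketch of a tiny recursive-descent parser. It consumes `(name, value)` token pairs for arithmetic expressions and groups them into nested tuples, which stand in for a proper syntax tree; the node labels (`"add"`, `"mul"`, `"num"`) are illustrative, not standard:

```python
def parse(tokens):
    pos = 0

    def expr():
        # expr -> term ('+' term)*  : addition is the most abstract grouping
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos][0] == "PLUS":
            pos += 1
            node = ("add", node, term())
        return node

    def term():
        # term -> number ('*' number)*  : multiplication binds tighter
        nonlocal pos
        node = number()
        while pos < len(tokens) and tokens[pos][0] == "TIMES":
            pos += 1
            node = ("mul", node, number())
        return node

    def number():
        # A leaf of the tree: a single token from the lexer
        nonlocal pos
        kind, value = tokens[pos]
        assert kind == "NUMBER", f"expected a number, got {kind}"
        pos += 1
        return ("num", int(value))

    return expr()

tokens = [("NUMBER", "1"), ("PLUS", "+"), ("NUMBER", "2"),
          ("TIMES", "*"), ("NUMBER", "3")]
print(parse(tokens))
# ('add', ('num', 1), ('mul', ('num', 2), ('num', 3)))
```

Notice the shape of the result: the root (`"add"`) is the most abstract grouping, and the leaves (`"num"` nodes) come straight from the lexer’s tokens.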
After the parsing step, we would have a data structure that represents all abstractions that can be derived from the list of words.
The previous two steps of the translation process involve lots of boilerplate that is conceptually the same for any translation. However, the compile step requires an implementation that is specifically tailored to the languages being translated.
The compiler is the last part of the translation process. At this point, we’ve got our input converted into a convenient data structure. The compiler’s job is to traverse that data and translate it into the new language.
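As a sketch of that traversal, here is a compile step that walks a tree of nested tuples and emits text in a made-up target: Lisp-style prefix notation. The node labels and the choice of target language are assumptions for illustration:

```python
def compile_node(node):
    # Recursively walk the tree, emitting target-language text for each node.
    kind = node[0]
    if kind == "num":
        return str(node[1])
    if kind == "add":
        return f"(+ {compile_node(node[1])} {compile_node(node[2])})"
    if kind == "mul":
        return f"(* {compile_node(node[1])} {compile_node(node[2])})"
    raise ValueError(f"Unknown node type: {kind}")

tree = ("add", ("num", 1), ("mul", ("num", 2), ("num", 3)))
print(compile_node(tree))  # (+ 1 (* 2 3))
```

Swapping in a different target language means changing only the strings this traversal emits; the lexing and parsing stages are untouched, which is exactly why those earlier steps feel like reusable boilerplate.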
Just a Primer
So far, this has been just a high-level overview. In the next post, we will examine the concepts you’ll need when creating a lexer and parser.
For the next post in this series, read here.