July 24, 2017
As a developer without a computer science background, I have some fundamental knowledge gaps to fill. With this in mind, the decision process behind writing my own compiler was fairly straightforward. It went something like this:
I write Java at work.
Java is a compiled language.
Do I know what that means?
Maybe (but probably not really).
The goal of this project was to gain an understanding of the compilation process, and a broader perspective on language design and implementation.
Here, we compile a Lisp-like language (Emerald) to Ruby, using a compiler written in Ruby. Two high level languages were used, because it enabled me to concentrate on the learning process of understanding how a compiler works. It also gave me the opportunity to play around with language design in creating my own Lisp-like Ruby, called Emerald.
Zoom out: What is a compiler?
In a nutshell, a compiler is a program that converts code written in one programming language to another. Usually, this involves converting a higher level language (e.g. Java) to a lower level form (e.g. Java bytecode). However compilers can also convert one high level language to another high level language (such as this one).
Compiling a Lisp
Lisps can look intimidating on first glance, but a Lisp was chosen because they typically possess a very simple, uniform syntax. Lisps usually have limited data structures, consisting two basic data types: atoms and lists.
Lists are enclosed within parentheses and are actually a form of binary tree data structure. An atom is every other data type e.g symbols, numbers. The tree structure of the example above will be seen later. In Emerald, the atom data structure does not include strings and numbers.
Compiling to Ruby
There was no particular reason for selecting Ruby as both the target language and the language of implementation. Any language could have been chosen for either of these. Ruby seemed like a sensible choice because it’s a language that I was comfortable with, which allowed me to maintain focus on understanding the compilation process. The next compiler I write will probably be written in, or compile to, something different!
The compilation process can be broken down into five steps:
1. Read source code from a file 2. Carry out lexical analysis and parsing 3. Convert the abstract syntax tree (AST) into Ruby code 4. Optimise 5. Write to a file and evaluate
1. Read source code from a file
For this program, an input file is specified in the command line. Its contents are read as a string. This is an example input Emerald function:
2. Lexical analysis and parsing
The tree structure of the input function specified above can be viewed as:
Or, in other words:
To convert the input Emerald content into an AST of Ruby Objects there are two key steps: dividing the text into nodes (lexical analysis) and categorising the nodes (parsing).
In order to achieve this, the input code is read as a string. The first node or token encountered is categorised as a string, number, or atom, depending on the first character of the node. If the character is a bracket e.g.
(, the bracket is exchanged for a symbol and handled at a later stage. This categorisation takes place as follows:
The parsed node is then added to an empty list, and the rest of the source string is returned, minus the first node. The next node in the string is parsed in an identical fashion. When whitespace or a newline is encountered, then the source text is simply returned with the whitespace/newline removed. This process continues, until each node in the input has been parsed. Finally, a flat list of Ruby objects is returned:
In order to then create the nested list structure, a recursive depth-first search algorithm is used on the array of nodes. This begins with an empty list, and iterates through each element. Previously, all of the left and right brackets were substituted with
:right_bracket symbols. When
:left_bracket is encountered (signaling the start of a new list), the method is recursively called, with the index incremented by one. The result of this is then pushed onto the list. When
:right_bracket is encountered (signaling the end of a list), we return the incremented index, and a List object of all the elements that have been pushed onto the list. When any other character is encountered (list contents), the character is pushed onto the list, and the index is incremented.
This subsequently creates the AST of Ruby objects:
3. Convert the abstract syntax tree (AST) into Ruby code
Generating Ruby code from the AST is carried out based on semantic analysis and Ruby syntax. Here, the AST is taken as the input source and is comprised of lists of strings, numbers, or atoms. Using this to generate Ruby code is primarily influenced by the list structures and the contents of each atom:
Identifying the meaning of each atom (lexical analysis), allows the construction of Ruby code from the source AST, based on known Ruby syntax.
Optimisation steps in the compilation process can be implemented for the purpose of enhancing performance, such as to use less memory or to run the program faster. Optimisation has yet to be carried out for this project.
5. Write to a file and evaluate
When running the application, the input lisp:
is compiled to:
which is written to a file and evaluated. The result is then output to the command line.
Parsing the input and generating Ruby code are the core components of the compiler process. Reading the input and writing the output, are used to facilitate compilation and to produce a meaningful output to the process. Evaluating the output was not a necessary inclusion for the compiler, but was nice to have!
Limitations and Areas for Improvement
This compiler has numerous limitations, and plenty of potential functionality that has yet to be developed! After a certain amount of time however, I felt that I had learned what I had initially set out to. As rewarding as it is to compile new and more complicated programs, the learnings started to feel like they were plateauing for this project. There remain several aspects of its functionality that I would like to develop and improve in future however. These primarily are:
In addition to gaining an understanding of how compilers work, some other key learnings from this project are:
Particularly when you’re exploring different approaches to carrying out a particular task, and need a safety net. TDD enabled me to build up the parser incrementally, and when I decided that I wanted to refactor my implementation, the tests enabled me to do so freely.
It can be as comprehensive as you want, and the amount of time you put into it really depends on its eventual purpose.
I’m very lucky. I’ve had the opportunity to work with two amazing developers in the past few months.
Louis is essentially the Pied Piper of the coding world. I learn something new every time I speak to him. This wonderful person reviewed PRs, gave feedback, and shaped my vision, particularly when getting started with this project.
Andrei is a compulsive-learner. His enthusiasm is quite infectious. He likes algorithms (a lot) and was rather excited when I asked him to review my approach to parsing s-expressions. He was very insightful when it came to producing a more preferable recursive one (which is used above).
Source code for this project and Emerald language specifications can be found here. PRs and feedback are always welcome ;)