Rue Lang is a new programming language I’ve been working on for quite a while. It’s based on Rust, and aims to provide an alternative to Chialisp for writing puzzles for smart coins on the Chia blockchain.
When I first started working on Rue a couple years ago, I knew it would be a time consuming project. But what I didn’t prepare for was the sheer amount of time I’d spend writing and rewriting the same code over and over again. I’ve come about as close as you can get to going insane without actually going insane (at least, not yet).
So was it worth it? Well, I’ll let you be the judge of that.
What is Chialisp?
It’s important to start with where we are today. All puzzles on the Chia blockchain are currently written using a language named Chialisp. It is, as the name suggests, based on Lisp.
Chialisp is pure - given the same input, a program will always provide the same output. Programs do not rely on any external state, and in fact are themselves stateless (there are no variables that can be mutated).
It’s also a functional language, which means that functions aren’t given special treatment and are just values that can be passed around like any other. The combination of these two traits makes it straightforward to audit the logic of the program if you can understand the syntax.
What is CLVM?
CLVM (the Chialisp Virtual Machine) is a low level bytecode language. You can think of it like the Chia equivalent of assembly language, or similar to the EVM on Ethereum. The bytecode gets executed on a fully sandboxed virtual machine when you spend a coin.
Every value in CLVM is either a pair of two values or an atom (a sequence of bytes). This is enough to represent anything, including lists and operators. However, these constructs are extremely low level and hard to read without having access to the pre-compilation source code. Additionally, the number of available operators is quite limited compared to what you may expect.
The idea is that Chialisp compiles down into CLVM, rather than us writing the CLVM bytecode itself. However, Chialisp is missing several abstractions we come to expect from programming languages:
It does not support lexical scoping, where you can declare bindings in multiple nested scopes and reference them from anywhere within. You can only reference constants or parameters of the current function (not parameters of the outer module).
It does not have a type checker. All values are either a pair or an atom (like in CLVM) and there’s no way to enforce at compile time that the correct value is passed into a function.
There’s no standard library, aside from a few helper macros like list and if.
Classic Chialisp lacks the ability to declare variable bindings, though the less commonly used modern compiler does support them.
The syntax is often hard to understand and is missing niceties like statements. This reduces readability, and therefore auditability.
And thus we end up writing a lot of CLVM opcodes directly in our Chialisp programs, which defeats much of the point of a higher level language.
Introducing Rue
Rue is a high level programming language that compiles to CLVM just like Chialisp does. The compiler might sound like a simple translation between Rue and CLVM. But that couldn’t be further from the truth. Instead, it requires several stages to convert the syntax into something that can be interpreted by the CLVM runtime.
Lexer
The lexer is responsible for breaking the source code into small pieces called “tokens”. The reason for doing this is primarily to make it easier to parse one token at a time efficiently rather than needing to backtrack when you reach a character you don’t expect. It also improves the parser’s error messages considerably.
This means turning code like this:
3 + 2
Into the following tokens: Int, Whitespace, Int
Parser
Once we’ve produced a list of tokens, the parser will convert these into “parse tokens”, which are slightly simplified versions of the lexer tokens. Things like missing end quotes are turned into parser errors at this stage.
Then, a hand written grammar is applied to the parse tokens recursively to convert them into a lossless parse tree. All whitespace is preserved in the document, to make it possible to reconstruct the source code later (which is useful for language servers and formatters, for example).
Instead of a linear list of tokens, we now have a tree of syntax nodes for language constructs like functions, statements, and expressions.
AST
The parse tree is useful, but it’s loosely typed and lacking concrete structure. It’s possible for subtrees to be missing due to parsing errors. We need an abstraction on top of it that provides a simpler interface to work with for the compiler.
This is where the Abstract Syntax Tree (AST) comes in. We can lazily walk the parse tree and find relevant nodes, using AST types with methods such as expr()
and args()
.
Thankfully it’s pretty easy to convert from the parse tree to the AST, since they’re both in a tree structure. This is the final step before the actual compilation happens.
HIR
The next step is to compile from the AST to the HIR (high-level intermediate representation). This is when the bulk of the work of the compiler happens:
We declare types
We declare symbols that can reference declared types
We compile types that can reference other types
We compile symbols and check the resolved types
We compile expressions into HIR nodes
It’s done in this order to ensure that the order of items (functions and constants) in the program doesn’t matter. You can reference an item from another regardless of the order (even recursively) as long as they’re both in scope. However, because of this, every item must have a well defined return type and can’t rely on inference (because the body might not have been resolved by the time the type is needed).
The type system is still a work in progress, so I won’t dive too much into it. But it contains types like Bytes, Bytes32, Int, and so on. Everything must be either explicitly typed or have the type inferred from its value. This reduces the number of runtime errors you run into while debugging programs written in Rue.
LIR
There used to be a MIR phase between the HIR and LIR, but that was removed for simplicity. The LIR (low-level intermediate representation) is a single step above CLVM and represents all of the operators that you can compile code into. There aren’t many abstractions at this level and it’s perfect for performing final optimizations before generating the final CLVM bytecode.
The most challenging aspect of converting between the HIR and LIR is the environment. Rue has a concept of lexical scoping, if statements, and immutable variable bindings. On the other hand, CLVM only understands the “environment”, which is a non-hierarchical tree of values that are available to the program contained within.
If you want to add a binding to the environment, you have to effectively curry it into the rest of the program. This means taking the program (which is a function) and creating a new one with some of its arguments pre-applied. However, when you do this, you have to also shift to the right the environment path for all other symbols referenced inside the program (since the new symbols are inserted on the left).
An if statement or assert statement gets translated into an if expression with the rest of the program as one of its branches. These constructs are just syntactic sugar for expressions.
When you lower a function into the LIR, you need to make sure that any symbols it references that are in an enclosing scope (captures) get explicitly passed in when the function is called. Additionally, if you try to use a function as a value, it must first be turned into a closure that has its captured arguments curried in at runtime.
This translation step is the hardest to get right, and now that I’ve figured it out I don’t plan on touching it more than I have to (since it works). The last time I built the Rue compiler, this roadblock was the reason I stopped working on it for about 8 months. The HIR => MIR conversion I had at the time was fundamentally broken and I didn’t know how to fix it.
So what now?
Now that I’ve solved the HIR => LIR conversion process, I can focus on the type system and adding language features like I was before. This is the exciting part since I have a clear picture of how the language should work (it will be slightly different than previous designs).
I expect to have a usable and well tested implementation of the Rue compiler in the coming month(s). Until then, stay tuned!