RC28. Notes Towards an Assembler

June 22, 2023 3 minute read

Daily dispatches from my 12 weeks at the Recurse Center in Summer 2023

Today I started thinking about a plan for implementing an assembler. Here are a few notes and musings I had for the lexer component, which will be responsible for reading in and tokenizing each line of assembly. I may have to dispense with some of the formalism here in the name of getting a working parser by next week, however I’m sensing that this is going to be a meaty project that I may want to develop further, in which case this preparatory work may come in handy.

Here’s my thinking thus far:

A program consists of one of more instructions. An instruction consists of an A instruction (address instruction),a C instruction (compute instruction), or a L instruction (loop instruction – demarcates a point for the program to loop to).

An A Instruction consists of an @ symbol followed by either a number literal or a symbol. A number literal consists of digits. A symbol conists of one or more letters followed by any number of letters, numbers, and certain valid other characters. Valid A-instructions: @100, @i, @SCREEN, etc.

A C Instruction consists of a destination and a computation OR a computation and a jump command OR all three. A destination consists of some combination of M, A, and D (specifying whether to load the value to the A register, D register, or a register in RAM, or some combination) plus =. A jump command consists of ; followed by one of the following jump codes: JGT, JEQ, JGE, JLT, JNE, JLE, or JMP. The most complex piece is the middle computation component, which can by one of 18 valid arithmetic options. Valid C-instructions: MD=D+A, M;JGT, D=-1, M=D&M;JLE, etc.

An L instruction lastly consists of an opening and closing parentheses enclosing a symbol. Valid L-instructions: (LOOP), (END), etc.

In an attempt to formalize this in terms of (loose) RegEx, I arrived at the following:

program:        <instruction>+
instruction:    <a-instruction> | <c-instruction> | <l-instruction>

a-instruction:  @ [<number> <symbol>]
c-instruction:  <destination><computation> | <computation><jump> | <destination><computation><jump>
l-instruction:  '(' <symbol> ')'

number:         [0-9]+
symbol:         [a-zA-Z]+[a-zA-Z0-9_\.\$:]*
destination:    [MAD]{1,3}=
computation:    [AMD]?[-!|+&]?[10AMD]
jump:           JGT | JEQ | JGE | GLT | JNE | JLE | JMP

Not all of these RegEx will result in valid commands – for instance, MMM would be a match for destination, or &1 would be a match for computation – but I believe it is the job of the parser to check for validity? In any case, my focus here is just to identify tokens for each incoming line, and I think this should do it.

Other things

Three pairing sessions today!
- Paired with a data scientist on a regression problem I’m working on, which left me feeling encouraged that I’m on the right track and not behaving too badly when it comes to data science best practices
- Paired with someone from the new batch, ostensibly on a Rust text editor project but actually on a wide range of material. Got exposed to so much – new CLI tools like Starship, project ideas like writing a text editor in C or building an interpreter, Rust, Elixir, text editors I’d never heard of, Polars, best practices for learning, etc etc. This convo could not have come at a better time since, among other things, it gave me some rad project ideas to carry me through the latter half of RC
- Paired on nand2tetris CPU
Also had a meeting with an RC career counselor and walked away feeling like I’m doing it (RC) right. Yes! The theme of the week has been repeatedly soliciting and receiving this confirmation in spite of my belief that I “have no projects”.

Zach Rottman, PhD

Other things