7 minute read

Daily dispatches from my 12 weeks at the Recurse Center in Summer 2023

A few weeks back I built an assembler in Python as a project for Nand2Tetris. When I was first plotting my approach and feeling ambitious, I thought, hey, maybe I should write this in C. But every time I thought about how to actually do that, it felt, well, impossible, so in the end I just used Python to do my dirty work.

Today, however, was Impossible Stuff Day, which meant it was time for me to dive in and see how far I could get.

Assembler Preamble(r)

The assembler’s job is to take an assembly program like this one from n2t

@R0
D=M
@R1
D=D-M
@OUTPUT_FIRST
D;JGT
@R1
D=M
@OUTPUT_D
0;JMP
(OUTPUT_FIRST)
@R0
D=M
(OUTPUT_D)
@R2
M=D
(INFINITE_LOOP)
@INFINITE_LOOP
0;JMP

and translate it into machine language like this

0000000000000000
1111110000010000
0000000000000001
1111010011010000
0000000000001010
1110001100000001
0000000000000001
1111110000010000
0000000000001100
1110101010000111
0000000000000000
1111110000010000
0000000000000010
1110001100001000
0000000000001110
1110101010000111

In the case of nand2tetris, we’re working with a 16-bit machine, which means each instruction is 16-bits long.

I wrote about assembly basics a few weeks ago when building the CPU to this computer, which is the thing that actually does stuff with the 16-bit codes above. It’s worth revisiting some of the key ideas here.

In the nand2tetris dialect of assembly, there are two types of commands:

  • address commands (they start with @ and load a value into the CPU’s A Register)
  • compute commands (the other stuff, which tells the CPU which registers to read from, what to do with those values, and where to store the result)

(There are also labels – the things in parentheses – which aren’t really commands but label line-numbers so that the program can jump around later on.)

Address commands are the simplest, since converting them to 16-bit strings is a matter of converting the number to a 15-bit integer and sticking a 0 on the front end, which is what identifies the instruction as an A Instruction.

0 v v v v v v v v v v v v v v v
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
  • the most significant bit (set to 0) identifies the instruction as an A instruction
  • the least significant 15 bits represent a 15-bit unsigned integer

C Instructions are trickier, since they have three component parts that are encoded into various chunks of the 16-bit instruction string. For sample C-instructions like “AD=M+1” or “D;JEQ” or “D=1” or “D=M-1;JLT”:

  • everything to the left of the = is the destination part of the command – i.e., where the ALU’s output needs to go. (D is the destination in the command D=M+1.) This piece of information gets encoded as a 3-bit sequence and spans bits 3-5 of the resulting instruction string.
  • everything to the right of the = and to the left of the ; is the computation command – i.e., what computation the ALU performs on its inputs. (M+1 is the computation in the command D=M+1.) There are a lot of options here, so this piece of information gets encoded as a 7-bit sequence and spans bits 6-12 of the resulting instruction string.
  • everything right of the ; is the jump command, which, if present, loads the value in the A Register (in this case representing a program line number) into the Program Counter, thus “jumping” to another line in the program. (JGT specifies the jump in the command D;JGT.) This piece of information gets encoded as a 3-bit sequence and spans bits 0-2 in the resulting instruction string.

That’s complicated, but the whole thing looks like this:

1 x x a c c c c c c d d d j j j
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
  • the most significant bit (set to 1) identifies the computation instruction as such
  • bits 13 and 14 aren’t used
  • the a bit (bit 12) dictates whether the ALU should accept its y input from the A Register or from somewhere in Memory. This is effectively part of the computation instruction, which is why I grouped it that way in describing it above.
  • the 6 c bits (bits 6-11) specify the ALU function and are fed into its 6 function inputs zx, nx, zy, ny, f, and no.
  • the 3 d bits (bits 3-5) specify the destination for the ALU’s output, either M (memory), A (A register), or D (D register), or some combination of the three.
  • the 3 j bits (bits 0-2) specify any jump command based on whether the ALU’s output is < 0, <= 0, == 0, >= 0, or > 0.

Assembler in C

Right. So how far did I get porting my assembler to C today? Well, not really that far, I suppose. Basically, I handled the A-Instructions (and only in the case where A-Instructions consist of Number literals, at that) – i.e., taking a command like @53 and emitting 0000000000110101 or a command like @9 and emitting 0000000000001001.

Here’s what my code in action looks like at the moment:

 1:    @2 --> 0000000000000010
 2:   D=A --> C Command
 3:    @3 --> 0000000000000011
 4: D=D+A --> C Command
 5:    @0 --> 0000000000000000
 6:   M=D --> C Command

Of course the C Commands aren’t yet implemented, so just placeholders for now. But the A Instructions are doing their thing and doing it right.

It doesn’t seem like much, and maybe it isn’t. But getting here meant bumping into many, MANY walls, which meant making HUGE leaps.

Here are some highlights:

Reading in a file

Here’s how.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    // exit if argc is not as expected
    if (argc != 2) {
        printf("usage: assembler <path>\n");
        return EXIT_FAILURE; // macro defined in stdlib.h
    }

    FILE *f;
    char line_in[256];
    int linecount = 0;

    // open file passed as CL arg
    f = fopen(argv[1], "r");

    // check for errors opening file
    if (f == NULL) {
        printf("Error opening file %s\n", argv[1]);
        return EXIT_FAILURE;
    }

    // loop through input file and parse
    while (fgets(line_in, sizeof line_in, f) != NULL) {

        // remove trailing newline or carriage return
        line_in[strcspn(line_in, "\n\r")] = '\0';

        // strip comments
        char *comment_pos = strstr(line_in, "//");
        if (comment_pos != NULL)
            *comment_pos = '\0';

        // do stuff if line not blank
        if (line_in[0] != '\0')
            printf("%2d: %5s --> ", ++linecount, line_in); // print line_in
        }
    }

    // close file
    fclose(f);

    return 0;
}

All this does is read in the file passed along as a command-line argument and print it back out to stdout (along with a line number), stripping comments and removing blank lines as it does so.

fgets() is the bread-and-butter of this version of a file reader, although I’ve been advised to consider getline(). To remove trailing newlines and carriage returns, I’m using strcspn() to reset instances of those with a string terminator. To strip comments, I’m using strstr() in a somewhat similar way to return a pointer to the location where a comment is found and reassigning that index a string terminator. As long as there’s something in the resulting line, (i.e., the first character isn’t \0), then I print it!

Converting integers to bitstrings

Then there’s the issue of converting integers to bitstrings. Here’s how.

void itob(uint16_t num, char *b, int len)
{
    for (int i=0; i<len; ++i)
    {
        b[len-i-1] = ((num & 1) == 1) ? '1' : '0';
        num >>= 1;
    }
}

void build_A_COMMAND(char *line_in, char *line_out)
{
    uint16_t i;

    i = atoi(line_in + 1);  // convert line[1:] to int
    i = i & 0x7fff;         // set MSB to 0 if i>32767
    itob(i, line_out, 16);  // convert i to 15+1-bit string and save to output
}

First, the build_A_COMMAND() function, which I call only after determining a line is an A Instruction (i.e., it starts with @). This function takes two strings as arguments – the assembly line it’s parsing and the binary line it’s building. It converts the input line (after the initial @ character) to an 16-bit integer, then it resets the most significant bit to 0 (a hack-y way of ensuring that the most significant bit is 0, but some better error-handling is needed here to validate that the input integer is max 15-bits to begin with). Lastly it calls itob(), which is my fun little integer-to-bitstring converter.

itob() checks to see if the least significant bit of the 16-bit integer above is 1, and then sets output string b[15] to 1 or 0 accordingly. Then it shifts the 16-bit integer right one bit and continues along for the full 16 bits, writing the bitstring from least [15] to most [0] significant bit as it goes.

Next steps

I’ve already scaffolded out the various functions for building C Instructions. That means the next big step is sorting out some sort of lookup table to connect a command component (e.g., a jump command like JNE) to its binary equivalen (in this case 101). In Python that was a simple matter of dictionaries. Here, I’m considering implementing a simple hash table to do that.

After that, it’s a matter of implementing functionality to handle symbols and labels, which requires another hash table.