Porting my assembler to C, Part II

July 14, 2023 9 minute read

Daily dispatches from my 12 weeks at the Recurse Center in Summer 2023

I’ve been keeping that good Impossible Stuff Day energy going today and made excellent progress on my assembler project. It can’t handle labels or symbols yet, however it is happily parsing A instructions with number literals as well as the whole gamut of C instructions. It reads from a file, but for the sake of convenience it currently just emits machine code translations to stdout like so:

   @2 --> 0000000000000010
  D=A --> 1110110000010000
   @3 --> 0000000000000011
D=D+A --> 1110000010010000
   @0 --> 0000000000000000
  M=D --> 1110001100001000

Hey, cool!

Here are some of my solutions to the many little challenges along the way.

Trimming leading spaces from each line

This is a feature my previous iteration did not have, which meant that the assembler assembled very little of sample .asm files that indented with reckless abandon.

Here’s what I did:

 // trim leading spaces from `line_in[]`
 i = 0; j = 0;
 while (line_in[i] == ' ' || line_in[i] == '\t')
     i++;
 if (i > 0) {
     while (line_in[i] != '\0')
         line_in[j++] = line_in[i++];
     line_in[j] = '\0';
 }

I am using two pointers, i and j, which both start at zero. I march i along past every damn space and tab it sees and then stop. Presuming i has indeed done some marching, which is to say that it is no longer 0, I then march i (from wherever it is) and j (which is at zero) along until line[i] == '\0', and as I go I reassign whatever non-space character is at line_in[i] to line_in[j], in effect shifting the non-space portion of the string left.

Tokenizing C Instructions

Identifying A instructions is easy enough – they start with @. And identifying L instructions (i.e., labels) is also easy enough – they start with (. Everything else is a C instruction. Tokenizing a C instruction turned out to be a bit tricky, though, since they can consist of two or three parts.

The general format is <destination>=<computation>;<jump>, however a valid command need only have <destination> or <jump> (or both, although this is rare in practice). That means I’m often faced with a command that looks like M=M-1 (just the destination and computation commands) or that look like D;JNE (just the computation and jump commands).

At first I played around with strtok() and strsep(). strtok() in particular seemed promising, since you can pass it multiple characters to split on (in this case =;), and on successive iterations it returns the next chunk of the input string. The problem I encountered, though, was how to efficiently deal with the results. If all C Instructions had three tokens, that would have been one thing, but most of the time they end up consisting of just two tokens. How, then, to know if you’re getting a destination/computation pair or a computation/jump pair?

Eventually I abandoned this (although I’m sure there’s a good way of doing what I wanted to do) in favor of strchr(), which takes in a string and a search character and returns the index of the first occurrence of that character if it’s there or NULL if not. Here’s what the code looked like:

void tokenize(char *line, char *comp, char *dest, char *jump)
{
    char *equal_sign, *semicolon;

    // if there's an equal sign, we must have dest. and comp. tokens at least
    if ((equal_sign = strchr(line, '='))) {
        *equal_sign = '\0';
        strcpy(dest, line);             // copy everything up to the equal sign to dest
        strcpy(comp, equal_sign + 1);   // copy everything else to comp
        // if comp contains a semi-colon, we have to extract the jump command
        if ((semicolon = strchr(comp, ';'))) {
            *semicolon = '\0';          // terminate comp where the semicolon was
            strcpy(jump, semicolon + 1);// copy everything else to jump
        } else {
            strcpy(jump, "");           // else jump is NULL
        }
    // if there's no equal sign, we just have comp. and jump tokens
    } else {
        semicolon = strchr(line, ';');
        *semicolon = '\0';
        strcpy(dest, "");               // dest. is NULL
        strcpy(comp, line);             // copy everything up to semi-colon to comp
        strcpy(jump, semicolon + 1);    // copy everything else to jump
    }
}

I pass this function my line_in string, as well as empty strings to hold each of the three tokens. It is this function’s job to put stuff in those.

First, I check to see if the line has an equal sign. If so, then we assume it has destination and computation tokens at the very least, so we copy everything up to the = to dest and everything after it to comp. Then, we check comp for the presence of a semicolon. If there is one, we terminate comp where the semicolon was and copy everything after to jump. If not, then comp is complete and jump just needs to be set to NULL.

If, on the other hand, the input line does not have an equal sign, then we just go ahead and assume it only consists of computation and jump instructions. Same process here, except we set dest to NULL, comp to everything up to where the semicolon was, and jump to everything after.

When I was trying to figure this out, I made this little REPL version of the tokenizer, which looks like this:

Tokenizer Demo

Lookups

Once we have our tokens, we need some way of tranlsating them to their binary equivalents. For example, a 3-bit jump command like JLT is encoded as 100, and a 3-bit destination command like MD is encoded as 011. There are 8 permutations of both jump and destination commands, which is fitting for 3-bit numbers (\(2^3=8\)).

The computation commands consist of 7 bits, but mercifully there are not \(2^7\) permutations to deal with! Just 28 permutations in this version of assembly.

As another reminder, here’s what the C instruction encoding looks like:

1	x	x	a	c	c	c	c	c	c	d	d	d	j	j	j
15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0

C instruction bit: identifies the computation instruction as such
bits 13 and 14 aren’t used
a bit: dictates whether the ALU should accept its y input from the A Register or from somewhere in Memory. Effectively part of the computation instruction.
c bits: comutation instruction
d bits: destination instruction
j bits: jump instruction

When I initially wrote my assembler in Python, my approach was to use dictionaries as lookup tables to match each of the 8 destination instructions with their 3-bit binary counterparts, each of the 8 jump instructions with their 3-bit binary counterparts, and each of the 28 computation instructions with their 7-bit counterparts, sort of like this:

dest_table = {
   'M'  :'001',
   'D'  :'010',
   'MD' :'011',
   'A'  :'100',
   'AM' :'101',
   'AD' :'110',
   'AMD':'111'
   }
jump_table = {
   'JGT':'001',
   'JEQ':'010',
   'JGE':'011',
   'JLT':'100',
   'JNE':'101',
   'JLE':'110',
   'JMP':'111'
   }
comp_table = {
   '0'  :'0101010',
   '1'  :'0111111',
   '-1' :'0111010',
   'D'  :'0001100',
   'A'  :'0110000',
   'M'  :'1110000',
   '!D' :'0001101',
   '!A' :'0110001',
   '!M' :'1110001',
   '-D' :'0001111',
   '-A' :'0110011',
   '-M' :'1110011',
   'D+1':'0011111',
   'A+1':'0110111',
   'M+1':'1110111',
   'D-1':'0001110',
   'A-1':'0110010',
   'M-1':'1110010',
   'D+A':'0000010',
   'D+M':'1000010',
   'D-A':'0010011',
   'D-M':'1010011',
   'A-D':'0000111',
   'M-D':'1000111',
   'D&A':'0000000',
   'D&M':'1000000',
   'D|A':'0010101',
   'D|M':'1010101',
   }

Then it was just a matter of concatenating the strings that resulted.

dest = dest_table.get(d_token, '000')
jump = jump_table.get(j_token, '000')
comp = comp_table.get(c_token, '0000000')

c_instruction = ''.join(['111', comp, dest, jump])

C doesn’t have hashmaps built in, of course. I considered building the data structure, but then thought it was overkill for lookups from lists of just a handful of items. Instead, I decided to use the ol’ parallel arrays method (although truth be told this concept was brand new to me!). If dictionaries/hashmaps have key/value pairs, this other method uses one array to hold the keys and the other to hold the values. To find what you need, you do a linear-time search on the keys array, hold on to the index once you find it, and then index in to the values array in constant time to get your match. It’s a far cry from the awesome O(1) time-complexity that a hashmap boasts, but, you know what? It’s fine.

Here, then, is an example of one of my parallel arrays:

 char *dest_keys[] = { "M", "D", "MD", "A", "AM", "AD", "AMD" };
 int   dest_vals[] = { 1, 2, 3, 4, 5, 6, 7 };

And here is the matching lookup function:

int parse_dest(char *dest_command)
{
    int len = sizeof(dest_vals)/sizeof(dest_vals[0]);

    for (int i=0; i<len; ++i) {
        if (!strcmp(dest_command, dest_keys[i])) {
            return dest_vals[i];
        }
    }

    return 0;
}

(I’m realizing all of a sudden that, while I can’t consolidate the three pairs of parallel arrays into a single parralel array – since, for instance, some destination instructions and computation instructions are identical – I can get away with a single parsing function, since each one does the same thing! Duh.)

Anyway, the parsing function does the thing I described: it searches through the keys array for the destination token, and if it finds a match it returns the value at that same index of the values array. Else it returns 0.

Building the C Instruction bitwise

You may have noticed that the values array above did not consist of 3-bit strings but of integers. Aha! This is my final breakthrough of the day (which, incidentally, was a direct beneficiary of my excursions into network programming). In Python, as I mentioned, I just concatenated each of my little bitstrings together to form my C-instruction. Sounds tedious in C, though. Fortunately I realized I can build up the 16-bit integer I need using bitwise operations.

Here’s how I’d do it in Python:

# tokens
d_token = 5   # --> 101 for dest. token AM
c_token = 114 # --> 1110010 for comp. token M-1
j_token = 0   # --> 000 when jump is NULL

# c-instruction under construction
c_inst = 7   # --> 111, which is what our three most signficant bits will need to be

c_inst <<= 7      # make room for the 7-bit computation token
c_inst |= c_token # add in computation token
c_inst <<= 3      # make room for the 3-bit dest token
c_inst |= d_token # add in the destination token
c_inst <<= 3      # make room for the 3-bit jump token
c_inst |= j_token # add in the jump token

print(c_inst)                   # => 64680
print("{:016b}".format(c_inst)) # => 1111110010101000

Voila! Now our 16-bit C instruction is represented by a single 16-bit unsigned integer. Yes, I’ll eventually need to convert this to a 16-bit bitstring, but I already built that conversion function.

Here’s my C version:

void build_C_COMMAND(char *line_in, char *line_out)
{
    uint16_t out = 7;
    int dest, comp, jump;
    char dest_command[4] = {0};
    char comp_command[4] = {0};
    char jump_command[4] = {0};

    tokenize(line_in, comp_command, dest_command, jump_command);

    dest = parse_dest(dest_command);
    comp = parse_comp(comp_command);
    jump = parse_jump(jump_command);

    // set output bits
    out = 7; out <<= 7;         // set most signifiant 3 bits to 111
    out |= comp; out <<= 3;     // set 7 comp bits
    out |= dest; out <<= 3;     // set 3 dest bits
    out |= jump;                // set 3 jump bits
    itob(out, line_out, 16);    // convert to binstring
}

Next on the list is handling symbols and labels.

Zach Rottman, PhD

Trimming leading spaces from each line

Tokenizing C Instructions

Lookups

Building the C Instruction bitwise