8 minute read

Daily dispatches from my 12 weeks at the Recurse Center in Summer 2023

Yesterday and today I worked on my n2t assembler. It ended up being about 300 lines of Python, but may well have been the largest and project I’ve tackled thus far at RC since my focus has been on smaller things.

I took a two-phase approach. Phase 1: Just parse the assembly without dealing with symbols and labels. Phase 2: Deal with symbols and labels.

Phase 1 went fine – mostly just a matter of converting components of various assembly commands into their bitstring equivalents using lookup tables.

Phase 2? Well, Phase 2 turned out to have two phases. Phase 1 of Phase 2 consisted of dealing with symbols, which in my case meant creating a SymbolsTable class to manage a dictionary of symbols and and handle insertions of new symbols at the next available addresses in memory, etc. If in assembly one writes @my_var (thus loading address my_var into the A register), my assembler would need to 1) figure out that we were looking at a symbol; 2) check to see if my_var existed in the SymbolsTable table, in which case it would return the address associated with the my_var key; and 3) if the symbols table didn’t already include my_var, insert it at the next available address, and then return that. Excellent!

I thought I was done with Phase 2, but Phase 2 was not done with me. I forgot about labels! A label is something like (loop) or (reset) – a line of pseudocode that marks a particular location in the program so that the program can jump back to that point later using a jump command.

This required a bit of thought, since it was seeming like I’d need to do a first pass over the assembly code to 1) identify labels; 2) add the labels to the SymbolsTable lookup table with the number of the next line of the program as their values; and 3) remove this line from the code entirely, so that on the next pass, we wouldn’t have to deal with these lines again.

My first approach looked something liked this:

for line_num, line in enumerate(program):
    if command_type(program) == commands.L_COMMAND:
        label = parse_label(line)
        add_symbol(label, line_num)
        program.pop(line_num)

I looped through program (a list of assembly commands), which I enumerated so that I had access to both the index (line_num) and the the command (line) of this list. If a given command was identified as an L-command (meaning it was a label), then I needed to:

  • parse the line (in this case just get rid of the parentheses, so that (loop) -> loop)
  • add that parsed label to the SymbolTable dictionary
  • and then pop that list item off the program list

This last part I thought was a particularly clever solution to the main problem I was facing: get rid of loop commands once you’ve seen them, and ensure that loop commands are added to the SymbolTable with a value representing the program line to which they should point, which needed to shift according to the deletions of all the labels. I figured that popping off label lines as we went would (magically) affect the list enumeration in the correct way, so that line_num would effectively refer to the index of the mutated program list at any given moment. The solution wasn’t too helplessly naive, I don’t think, but did result in some unexpected behavior.

Before continuing, let me make clear what I mean with an example.

Here’s one of the sample nt2 assembly programs our assembler is tasked with asssembling:

   @R0
   D=M              // D = first number
   @R1
   D=D-M            // D = first number - second number
   @OUTPUT_FIRST
   D;JGT            // if D>0 (first is greater) goto output_first
   @R1
   D=M              // D = second number
   @OUTPUT_D
   0;JMP            // goto output_d
(OUTPUT_FIRST)
   @R0
   D=M              // D = first number
(OUTPUT_D)
   @R2
   M=D              // M[2] = D (greatest number)
(INFINITE_LOOP)
   @INFINITE_LOOP
   0;JMP            // infinite loop

This program returns the max of the values in at address @R0 and @R1. How it does that is unimportant. My main interest here are the three labels: (OUTPUT_FIRST), (OUTPUT_D), and (INFINITE_LOOP). (OUTPUT_FIRST) is at line 11, which is where we’d want to jump if, at line 6, D > 0. But once the l-command (OUTPUT_FIRST) is removed from the code, (OUTPUT_D) (which is at line 14) would need to point to line 13. And (INFINITE_LOOP), at line 17, would then need to point to line 15.

Let’s say, then, that we have the above program in the list format we’re expecting by this point.

for i, line in enumerate(program):
    print("{:<3}:{}".format(i + 1, line))
1  :R0
2  :D=M
3  :@R1
4  :D=D-M
5  :@OUTPUT_FIRST
6  :D;JGT
7  :@R1
8  :D=M
9  :@OUTPUT_D
10 :0;JMP
11 :(OUTPUT_FIRST)
12 :@R0
13 :D=M
14 :(OUTPUT_D)
15 :@R2
16 :M=D
17 :(INFINITE_LOOP)
18 :@INFINITE_LOOP
19 :0;JMP

There’s our program pretty-printed with 1-indexed line numbers.

Now let’s do the thing we were talking about in an even more simplified form:

labels = {}

for line_num, line in enumerate(program):
    if line.startswith('('):
    labels[line[1:-1]] = line_num + 1
    program.pop(line_nume)

# print labels dict
print(labels)

# pretty print mutated list
for i, line in enumerate(program):
    print("{:<3}:{}".format(i + 1, line))
{'OUTPUT_FIRST': 11, 'OUTPUT_D': 13, 'INFINITE_LOOP': 15}

1  :R0
2  :D=M
3  :@R1
4  :D=D-M
5  :@OUTPUT_FIRST
6  :D;JGT
7  :@R1
8  :D=M
9  :@OUTPUT_D
10 :0;JMP
11 :@R0
12 :D=M
13 :@R2
14 :M=D
15 :@INFINITE_LOOP
16 :0;JMP

And hey! That’s actually . . . what I was hoping for? Each time we pop off a bad line, the program list shifts by one, which means that the next index enumerated is . . . where we want it to be. I mean, it seems to work, doesn’t it?

That’s what I thought until it came time to assemble a 27,000-line test program, and at that point it became clear (after some debugging) that there were some label-related hijinks.

I ended up pairing with a more experienced recurser, who immediately zero’ed in on this weird list mutation part of the code, since trying to mutate lists in place I guess can have funky consequences, which in retrospect makes sense. Here’s a test we devised to see what’s happening:

x = ['a', 'b', 'c', 'd', 'e', 'f']

for i, v in enumerate(x):
    print("Loop # {}".format(i + 1))
    print("i = {}, v = {}".format(i, v))
    print(x)
    print("Popping index {}".format(i))
    x.pop(i)
    print()

print("Final list")
print(x)

Loop # 1
i = 0, v = a
['a', 'b', 'c', 'd', 'e', 'f']
Popping index 0

Loop # 2
i = 1, v = c
['b', 'c', 'd', 'e', 'f']
Popping index 1

Loop # 3
i = 2, v = e
['b', 'd', 'e', 'f']
Popping index 2

Final list
['b', 'd', 'f']

Now, this was not what I expected at first, but it does make sense. We can see that our index value is marching on as expected – it starts at 0, then 1, then 2. But something funny happens, since the list is shrinking at the same time that the i indexer is incrementing. After our third loop, i should be 3, but at this point our mutated list – currently ['b', 'd', 'f'] – only has three elements in it, and x[3] is out of range. So weirdly we end up iterating through the entire list (in the senes that i exceeds the list’s length), but we’ve done so without having actually seen everything.

In the case of the assembler, though, there’s a bigger issue, which is that back-to-back labels (which can occur) will not both be removed. (Also, the last item of the list will never be seen, although that shouldn’t be a problem in this case, since labels would never occur on the last line of an assembly program.)

Here’s a short example of the unexpected behavior:

program = ['a', '(b)', '(c)', 'd', 'e', '(f)', 'g', '(h)', 'i']

for i, v in enumerate(program):
    print("Loop # {}".format(i + 1))
    print(program)
    print("i = {}, v = {}".format(i, v))
    if v.startswith('('):
        print("Popping index {}".format(i))
        program.pop(i)

    print()

print(program)
Loop # 1
['a', '(b)', '(c)', 'd', 'e', '(f)', 'g', '(h)', 'i']
i = 0, v = a

Loop # 2
['a', '(b)', '(c)', 'd', 'e', '(f)', 'g', '(h)', 'i']
i = 1, v = (b)
Popping index 1

Loop # 3
['a', '(c)', 'd', 'e', '(f)', 'g', '(h)', 'i']
i = 2, v = d

Loop # 4
['a', '(c)', 'd', 'e', '(f)', 'g', '(h)', 'i']
i = 3, v = e

Loop # 5
['a', '(c)', 'd', 'e', '(f)', 'g', '(h)', 'i']
i = 4, v = (f)
Popping index 4

Loop # 6
['a', '(c)', 'd', 'e', 'g', '(h)', 'i']
i = 5, v = (h)
Popping index 5

['a', '(c)', 'd', 'e', 'g', 'i']

Here our label (c) was skipped.

Perhaps my solution would have worked if back-to-back labeling was invalid, but as my pairing bud pointed out, it’s probably not best practice to mutate a list while you’re iterating through it because of weird, non-intuitive behaviour like this.

So instead of trying to pop l-commands off, I created a list no_labels and appended non-l-commands to it while processing the original program. This required a separate line-counter that incremented only on non-l-command lines so that we could associate the correct line number with the label.

lc = 0
no_labels = []

for line in program:
    if command_type(program) == commands.L_COMMAND:
        label = parse_label(line)
        add_symbol(label, lc)
    else:
        no_labels.append(line)
        lc += 1     # only increment on non-l-command lines

Other stuff

A slumpy week thus far with Spring 2 batch folk around less. Plus tackling larger, more intimidating projects (like the the assembler) has been a rough reality check. Plus SICP got really hard all of a sudden.

But also feeling like I’m turning it around today thanks to other recursers’ generosity and enthusiasm.

  • Made it to checkins this AM, which was a great reminder that there are plenty of familiar faces around in addition to the new RCers I’m excited to meet
  • Made it through Part 2 of #🧑‍💻 current batches > implement dns in a weekend with the DNS crew
  • Finished assembler
  • Met with a recurser and got a Django group going – starting to pivot towards some sort of web app project in the second half of batch.
  • Met with another recurser who walked me through some git and GitHub workflow stuff
  • Started the slow process of translating the DNS implementation to C