Writing a JIT, Part 5 – A New Machine

In attempting to write the next portion of the JIT article series, I realized that there was a roadblock: our primitive, two instruction machine was simply too limited. I decided to design an entirely new one, which we’ll dub “M3″. This is where the series will start getting a little faster paced.

M3 has a few fundamental differences from M2:

  • M3 is 32bit, M2 was 8bit. This gives us more mathematical power.
  • M3 has five registers, r1-r5, giving us more flexibility for temporary storage.
  • M3 has a stack, which gives us theoretically infinite storage.
  • M3 has conditional branching, which means we are turing complete (we can make decisions based on conditions).

The rest of this article will contain the design of M3 and why I made the design choices I did. Don’t worry — M3 is still very small! However, the bits of complexity we’ve added will make the problem of writing a JIT much more interesting.

The convention for listing opcodes for M3 will be:
OP_NAME [param1, param2] - [type1/type2]
Where type1/type2 are the encodings of the parameters expected, as one of:

  • r – register (r1-r5)
  • i – immediate (32bit value)
  • j – jump offset

My goal with M3 was to be able to write a better benchmark that wasn’t totally trivial. If you recall with M1, our benchmark could be reduced to a single number. So, I designed the language around a simple four function calculator. My test program, seen at the end, computes the quadratic formula. To save space, I’m only listing the important opcodes in M3.

M3 needed to be a calculator, so I started off with arithmetic operators.

  • OP_ADD [dst, src] [r/r] – Add src to dst.
  • OP_SUB [dst, src] [r/r] – Subtracts src from dst. Result will be saved in dst.
  • OP_MUL [dst, src] [r/r] – Multiply src with dst. Result will be saved in dst.
  • OP_DIV [dst, src] [r/r] – Divide dst with src. Result will be saved in dst.

Next, we needed basic storage operators to set values:

  • OP_CPY [dst,src] – [r/r] – Copies src to dst.
  • OP_SET [dst,imm] – [r/i] – Sets dst to an immediate value.
  • OP_PUR [reg] – [r] – Pushes a register onto the stack.
  • OP_POR [reg] – [r] – Pops a value off the stack and into a register.
  • OP_PAR [reg,num] – [r,i] – Gets an input parameter off the stack frame.

Lastly, for branching, I decided to skimp and make things a bit tougher for the programmer. There’s no COMPARE instruction, so branches are based off the value in R1. Also, branches are “PC relative”. This means that if the branch instruction is at position X, then the “PC” (program counter/instruction pointer), which points to the next instruction, will be at X+1. If the branch is taken, the offset will be added to the PC’s value of X+1. This is a minor technicality I’ve hit in previous articles, and is really only important if you’re looking at raw numbers.

  • OP_BRE/L/G [offs] – [j] – Branches to j if R1 is equal to, less than, or greater to 0, respectively.
  • OP_JMP [offs] – [j] – Branches to j unconditionally.
  • OP_END [i] – [i] – Halts the machine with code i.

Armed with this design, I wrote:

  • A quick machine for M3.
  • A very primitive assembler. It takes in a text file of M3 instructions, and outputs the binary encoding (where an opcode is 1 byte, register is 1 byte, jump is 2 bytes, and immediate value is 4 bytes).
  • A more elaborate test program, which allows the user to specify the initial parameters pushed on the VM’s stack, a binary M3 file to execute, and a number of times to run the benchmark.
  • An improved API over M2 by separating the concepts of “bytecode structures” from “execution contexts”. A binary file is read into a bytecode structure, then the user initializes an execution context by allocating their own stack space. This means one script can have multiple contexts.

Now I needed a reference benchmark – I chose the quadratic formula. After a bit of research I figured out how to find an approximate integer square root by using simple division and addition. This gave the computation a bit of looping. My final program “quad.m3″ takes in three numbers (a,b,c) and computes the roots of their quadratic equation. It returns the roots in r1 and r2, unless an error is encountered (imaginary root or divide by zero).

As a trick to this program, I made two versions that only differ by two instructions. One version uses a temporary register during the square root computation. Another pushes/pops on the stack. The difference will not be surprising: the stack version is considerably slower. For an input equation, I chose 3000*x^2 -12000*x -15000, looped 500,000 times. It returns the two roots in r1 and r2, which should be 5 and -1.

dvander@malice:~/_machine_m$ make && make clean
dvander@malice:~/_machine_m$ ./m3asm quad_reg.m3 quad_reg.m3x
dvander@malice:~/_machine_m$ ./m3asm quad_stack.m3 quad_stack.m3x
dvander@malice:~/_machine_m$ ./test quad_reg.m3x 500000 3000 -12000 -15000
TIME: 6589
Register values:
        r1=5    r2=-1
dvander@malice:~/_machine_m$ ./test quad_stack.m3x 500000 3000 -12000 -15000
TIME: 7488
Register values:
        r1=5    r2=-1

As expected, the stack version is much slower. Even this two instruction difference is important – unexpected little areas can be large bottlenecks to something that is being computed hundreds of thousands of times. Although interesting, we’re not really interested in this difference – we’re only interested in the speed of the JIT’s transformation. As such, we’ll use quad_stack.m3 as our reference benchmark, as it has stack operations.

Next week, we’ll take a look at the issues involved in building a JIT for this new beast. We’ll see that it’s not too different – until we get to branching.

The source code for today’s article (machine, assembler, benchmark suite) can be downloaded at:
http://www.bailopan.net/articles/jit/5/

Answer to last bonus question: The short answer is that you’d need to store the extra data somewhere else — or even all of it, depending on how simple you want to make your JIT. If you wanted to get really complicated, you could browse the code and determine ways to optimize the output register usage. However, a one to one mapping isn’t possible.

Bonus Question: BRE, BRL, and BRG, are the only inequality test instructions we have. Why is this all we need?

Update 4/30/2006 3:30PM: Fixed opcode listing reported by core|Greenberet.

Leave a Reply

You must be logged in to post a comment.