Archive for April, 2006

Writing a JIT, Part 5 – A New Machine

Wednesday, April 26th, 2006

In attempting to write the next portion of the JIT article series, I realized that there was a roadblock: our primitive, two-instruction machine was simply too limited. I decided to design an entirely new one, which we’ll dub “M3”. This is where the series will start getting a little faster paced.

M3 has a few fundamental differences from M2:

  • M3 is 32-bit; M2 was 8-bit. This gives us more mathematical power.
  • M3 has five registers, r1-r5, giving us more flexibility for temporary storage.
  • M3 has a stack, which gives us theoretically infinite storage.
  • M3 has conditional branching, which makes it Turing complete (we can make decisions based on conditions).

The rest of this article will contain the design of M3 and why I made the design choices I did. Don’t worry — M3 is still very small! However, the bits of complexity we’ve added will make the problem of writing a JIT much more interesting.

The convention for listing opcodes for M3 will be:

OP_NAME  [param1, param2] - [type1/type2]

Where type1/type2 are the encodings of the parameters expected, as one of:

  • r – register (r1-r5)
  • i – immediate (32bit value)
  • j – jump offset

My goal with M3 was to be able to write a better benchmark that wasn’t totally trivial. If you recall with M1, our benchmark could be reduced to a single number. So, I designed the language around a simple four function calculator. My test program, seen at the end, computes the quadratic formula. To save space, I’m only listing the important opcodes in M3.

M3 needed to be a calculator, so I started off with arithmetic operators.

  • OP_ADD [dst, src] [r/r] – Add src to dst.
  • OP_SUB [dst, src] [r/r] – Subtracts src from dst. Result will be saved in dst.
  • OP_MUL [dst, src] [r/r] – Multiply src with dst. Result will be saved in dst.
  • OP_DIV [dst, src] [r/r] – Divide dst with src. Result will be saved in dst.

Next, we needed basic storage operators to set values:

  • OP_CPY [dst,src] – [r/r] – Copies src to dst.
  • OP_SET [dst,imm] – [r/i] – Sets dst to an immediate value.
  • OP_PUR [reg] – [r] – Pushes a register onto the stack.
  • OP_POR [reg] – [r] – Pops a value off the stack and into a register.
  • OP_PAR [reg,num] – [r/i] – Gets an input parameter off the stack frame.
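To make the convention concrete, here’s a toy listing that strings a few of these opcodes together. The textual syntax is hypothetical – the real syntax is whatever my assembler accepts in the download – but the semantics follow the definitions above:

```
OP_SET r1, 6     ; r1 = 6
OP_SET r2, 7     ; r2 = 7
OP_ADD r1, r2    ; r1 = 13
OP_PUR r1        ; push r1 onto the stack
OP_POR r3        ; pop the value back off, into r3
OP_END 0         ; halt the machine with code 0
```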

Lastly, for branching, I decided to skimp and make things a bit tougher for the programmer. There’s no COMPARE instruction, so branches are based off the value in R1. Also, branches are “PC relative”. This means that if the branch instruction is at position X, then the “PC” (program counter/instruction pointer), which points to the next instruction, will be at X+1. If the branch is taken, the offset will be added to the PC’s value of X+1. This is a minor technicality I’ve hit in previous articles, and is really only important if you’re looking at raw numbers.

  • OP_BRE/L/G [offs] – [j] – Branches by j if R1 is equal to, less than, or greater than 0, respectively.
  • OP_JMP [offs] – [j] – Branches by j unconditionally.
  • OP_END [i] – [i] – Halts the machine with code i.
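Since PC-relative offsets tend to trip people up, here is a minimal C sketch of how a branch like OP_BRE could be decoded. All the names (M3, read_j, do_bre) and the little-endian 2-byte encoding are my own for illustration – the actual interpreter is in the download. The key point is that the offset is added to the PC only after the PC has already moved past the branch’s operands:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t r1;          /* branches test r1, per the design above */
    const uint8_t *pc;   /* program counter: next byte to execute */
} M3;

/* jump offsets are encoded as 2 bytes (little-endian, signed) */
static int16_t read_j(M3 *m) {
    int16_t offs = (int16_t)(m->pc[0] | (m->pc[1] << 8));
    m->pc += 2;          /* pc now points past the branch operand */
    return offs;
}

/* OP_BRE: branch if r1 == 0, relative to the already-advanced pc */
static void do_bre(M3 *m) {
    int16_t offs = read_j(m);
    if (m->r1 == 0)
        m->pc += offs;
}
```

With the opcode byte already consumed before do_bre runs, a taken branch lands at (end of operand) + offs, which matches the “PC is at X+1” rule described above.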

Armed with this design, I wrote:

  • A quick machine for M3.
  • A very primitive assembler. It takes in a text file of M3 instructions, and outputs the binary encoding (where an opcode is 1 byte, register is 1 byte, jump is 2 bytes, and immediate value is 4 bytes).
  • A more elaborate test program, which allows the user to specify the initial parameters pushed on the VM’s stack, a binary M3 file to execute, and a number of times to run the benchmark.
  • An improved API over M2 by separating the concepts of “bytecode structures” from “execution contexts”. A binary file is read into a bytecode structure, then the user initializes an execution context by allocating their own stack space. This means one script can have multiple contexts.
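The bytecode/context split can be sketched like this. The struct layout and names are my guesses at the shape of the API, not the actual M3 headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* read-only program image, shared between contexts */
typedef struct {
    uint8_t *code;
    size_t   code_len;
} M3Bytecode;

/* per-execution state: registers plus caller-allocated stack */
typedef struct {
    const M3Bytecode *bc;
    int32_t  regs[5];    /* r1-r5 */
    int32_t *stack;
    size_t   stack_size;
} M3Context;

static void M3_InitContext(M3Context *ctx, const M3Bytecode *bc,
                           int32_t *stack, size_t stack_size) {
    ctx->bc = bc;
    ctx->stack = stack;
    ctx->stack_size = stack_size;
    for (int i = 0; i < 5; i++)
        ctx->regs[i] = 0;
}
```

Because a context owns no code of its own, one script can drive any number of contexts at once.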

Now I needed a reference benchmark – I chose the quadratic formula. After a bit of research I figured out how to find an approximate integer square root using simple division and addition, which gave the computation a bit of looping. My final program, “quad.m3”, takes in three numbers (a,b,c) and computes the roots of their quadratic equation. It returns the roots in r1 and r2, unless an error is encountered (imaginary root or divide by zero).
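The article doesn’t show the square-root routine itself, but Newton’s method is one scheme that uses only division and addition, so a plausible sketch of it (a guess at the approach, not the actual quad.m3 code) looks like:

```c
#include <assert.h>

/* approximate integer square root via Newton's method:
 * repeatedly average the guess x with n/x until it stops shrinking */
static int isqrt(int n) {
    int x = n;
    int y = (x + 1) / 2;
    while (y < x) {
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}
```

For the benchmark’s input equation, the discriminant is b^2 - 4ac = 12000^2 + 4*3000*15000 = 324,000,000, whose exact root is 18,000 – so the loop converges on a clean value.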

As a trick to this program, I made two versions that differ by only two instructions. One version uses a temporary register during the square root computation; the other pushes/pops on the stack. The difference will not be surprising: the stack version is considerably slower. For an input equation, I chose 3000*x^2 - 12000*x - 15000, looped 500,000 times. It returns the two roots in r1 and r2, which should be 5 and -1.

~/_machine_m$ make && make clean
~/_machine_m$ ./m3asm quad_reg.m3 quad_reg.m3x
~/_machine_m$ ./m3asm quad_stack.m3 quad_stack.m3x
~/_machine_m$ ./test quad_reg.m3x 500000 3000 -12000 -15000
TIME: 6589
Register values:
        r1=5    r2=-1
~/_machine_m$ ./test quad_stack.m3x 500000 3000 -12000 -15000
TIME: 7488
Register values:
        r1=5    r2=-1

As expected, the stack version is much slower. Even a two-instruction difference matters – unexpected little areas can become large bottlenecks in something that is computed hundreds of thousands of times. Although interesting, this difference isn’t our focus – we only care about the speed of the JIT’s transformation. As such, we’ll use quad_stack.m3 as our reference benchmark, since it has stack operations.

Next week, we’ll take a look at the issues involved in building a JIT for this new beast. We’ll see that it’s not too different – until we get to branching.

The source code for today’s article (machine, assembler, benchmark suite) can be downloaded at:

Answer to last bonus question: The short answer is that you’d need to store the extra data somewhere else — or even all of it, depending on how simple you want to make your JIT. If you wanted to get really complicated, you could browse the code and determine ways to optimize the output register usage. However, a one to one mapping isn’t possible.

Bonus Question: BRE, BRL, and BRG, are the only inequality test instructions we have. Why is this all we need?

Update 4/30/2006 3:30PM: Fixed opcode listing reported by core|Greenberet.

Beware of your Casts

Monday, April 24th, 2006

Even after two years of non-stop C++ coding, I still can’t see casting problems right away. Today my roommate asked me to debug a C++ problem he was having with a homework assignment.

The function below was returning the same output no matter what input it was given. He was printing h to try and identify the problem. After an hour or so, he gave up and asked me what was going wrong.

double Hhho(double T)
{
    double cp = 255/5000;
    double k = 500/6;
    double h = cp*T + k;
    cout << "H is: " << h << endl;
    return h;
}

So, I compiled it with -ggdb3 and ran it under gdb. The session went something like this:

> br Hhho
> run
(Breakpoint hit)
> n
double cp = 255/5000;
> n
double k = 500/6;
> print cp
$1 = 0
At this point, I knew exactly what had happened – the compiler had divided 255/5000 as an integer and gotten 0, then assigned it to the double. Oops.

Two lessons learned: First, it’s better to break into the debugger right away rather than randomly printing out values. Second, if you are going to print values, it’s wise to print all related values rather than just one. The result here was computed from several inputs, so those should have been printed as well.

Nonetheless, what the compiler did was rather unintuitive, and I wouldn’t have assumed that behavior myself. Whether it’s a good idea or not, I’m not qualified to say. Just beware of your casts.
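For the record, here is the behavior distilled into a few lines (shown in C; the rule is the same in C++): both operands of / are int, so the division truncates before the result is ever converted to double. Making either operand floating-point is the fix:

```c
#include <assert.h>
#include <math.h>

/* the trap: int / int truncates to 0, then 0 is widened to double */
static double cp_wrong(void) { return 255 / 5000; }

/* the fix: one floating-point operand forces floating-point division
 * (a cast works just as well: (double)255 / 5000) */
static double cp_right(void) { return 255.0 / 5000; }
```

The same applies to k = 500/6, which silently became 83 instead of 83.33.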

Writing a JIT, Part 4 – x86 Improvements

Friday, April 21st, 2006

Continuing with our JIT problem, we saw last time that by reorganizing the VM, and thus the JIT, we could gain a huge amount of speed. Today, we’ll look at how to more efficiently map the generated code to an x86 processor.

In the problem of mapping M to N we should get as close as possible to mimicking what a compiler would generate. We have to change how we think of the JIT again. The JIT isn’t just a VM reassembler – it’s a compiler. It should generate optimized, efficient code based on the input it’s given. It should work as closely to the native processor as it can.

So, our first step is to consider how we can bring M close to N, that is, x86. Let’s take a look at the basic state machine storage: the registers. x86 has eight general-purpose registers, and each one has different “intended” purposes. For example, ECX is commonly used in looping, while EDI is used as a destination pointer. While confusing and archaic, we’re stuck with it. Below is a table of which registers are commonly used for arithmetic operations. (Recall that on x86, dropping the E gives the 16-bit form of a register. Furthermore, the arithmetic registers have 8-bit high/low portions; for example, AH/AL comprise AX.)

x86 Reg   Arithmetic   Number      M Reg   Arithmetic   Number
EAX       Yes          0           A       Yes          0
EBX       No           3           B       Yes          1
ECX       Yes          1
EDX       Yes          2
ESI       No           6
EDI       No           7
EBP       No           5
ESP       No           4

So, if we wanted to map M directly onto an x86 processor, we should map the A and B registers to x86’s EAX, ECX, or EDX. How do we choose from there? Let’s consider the x86 instructions we need: ADD, IMUL, and MOV.

ADD is fairly uninteresting. While it has a special short form for ADD AL, that form only takes immediate values, which isn’t very useful to us right now. However, we’ll put a little star on mapping A to AL.

Next, IMUL takes in a parameter, multiplies it by AL, and stores the result in AX (with the low byte in AL). This is almost identical to what OP_MUL does, except OP_MUL is hardcoded to register B. Thus, it seems pretty clear that using EAX for A is a good idea.

What about B? Since IMUL destroys DX in modes other than 8-bit, it’s probably a good idea to avoid using EDX. Furthermore, ECX has the same numeric code as B. Let’s use EAX=A and ECX=B. Now let’s implement each opcode, again keeping the pointer to ContextM in EBP.


 ;add a and b, result in a
 0x00 0xC8       add al, cl


 ;multiply a, b, result in a
 0xF6 0xE9       imul cl


 ;move correct value in
 ;note the +?, opcode 0xB0 requires the register number
 ; added to it.  luckily, our reg numbers are the same,
 ; so we can just do 0xB0+?
 0xB0+? 0x??     mov al/cl, ??


 ;set error code
 0xC7 0x45 0x?? 0xFF   mov [ebp+ERR], -1
  0xFF 0xFF 0xFF
 ;store result
 0x88 0x45 0x??        mov [ebp+A], al
 ;pop and return
 0x5D                  pop ebp
 0xC3                  ret

We’ll call this implementation JIT3. Note that our new implementation (not counting OP_END) emits at most 2 bytes per instruction. That’s a huge step forward from 11 and 32 bytes, and M itself has a maximum 3-byte instruction.

Running the test:

~/_machine_m$ ./test3
Allocating 4915200 size code
Codesize: 4915218 bytes
JIT ran in 168 milliseconds
Error: -1, Value: 24
Execution time: 16 milliseconds
Error: -1, Value: 24
Execution time: 74 milliseconds

JIT3 is definitely the best so far. Not only is the code size nearly identical to the original bytecode, but we cut the generation time AND the run time down again – it runs about three times faster. The code size result is mainly due to our test program’s opcode distribution: it happens to use the right mix of opcode sizes so that nothing is lost in conversion. A different program’s generated code could just as easily be much larger or much smaller than the original.

Re-running our Maple tests again from last time:

> j3 := x-> 168 + 16*x;
> m2 := x -> 73*x;
> evalf(solve(j3(x)=m2(x),x));
                          2.947368421

After only about three runs, JIT3 becomes more efficient. A graph shows quite clearly how much better JIT3 is over its competitors and the original VM. Green is JIT1, red is M2, blue is JIT2, and black is JIT3.


Can we go further? In fact, yes. Next article we’ll take a look at the next phase of a JIT: optimization by browsing. If a JIT is a compiler, then like any other compiler, it should be able to make intelligent dynamic changes or decisions, rather than a static mapping. By sacrificing browse time, we can make important optimizations to the generated code.

You can download today’s source code at:

Answer to previous bonus question: Our input program generates a static answer. The result of ‘24’ will never change. We could simply return 24 in one instruction after an initial browsing for the result. Eventually we’ll change the VM slightly to make this fact less meaningful.
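In JIT3’s encoding, the fully folded program the answer describes would shrink to something like the following listing (the ?? displacement is the offset of ContextM::regs[REG_A], as in the opcode templates above):

```
 0xB0 0x18        mov al, 24
 0x88 0x45 0x??   mov [ebp+A], al
 0x5D             pop ebp
 0xC3             ret
```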

Bonus Question: In mapping M to N, what could we do if M had more registers than N?

Writing a JIT, Part 3 – Faster Designs

Thursday, April 20th, 2006

Let’s continue with our problem of optimizing JIT(P) | JIT(P(M)) = P(N). Last time we found that our VM design M was flawed, and as such, so was JIT(M). By separating our VM into independent function calls, we lost an important optimization: a virtual machine is a state machine, not an external dispatcher.

The simplest way to make a state machine in C is to use a large switch() table. The idea is to encapsulate the entire functionality of the VM in one state of execution. This removes both memory accesses and calls on each opcode, which eases the processor’s branch prediction and localizes the important states.

By removing our oplist entry from m.h, we’ll make m2.h and m2.c. Our new Execute_M() looks like:

int Execute_M(ContextM *m, int *err)
{
   unsigned char *cip = m->bytecode;
   unsigned char reg;
   char reg_a = 0;
   char reg_b = 0;

   m->err = 0;
   while (!m->err)
   {
      switch (*cip++)
      {
         case OP_ADD:
            reg_a += reg_b;
            break;
         case OP_MUL:
            reg_a *= reg_b;
            break;
         case OP_SET:
            reg = *cip++;
            if (reg == REG_A)
               reg_a = *cip++;
            else
               reg_b = *cip++;
            break;
         case OP_END:
            m->err = ERR_HALT;
            break;
      }
   }

   /* copy the local registers back into the context */
   m->regs[REG_A] = reg_a;
   m->regs[REG_B] = reg_b;

   if (err)
      *err = m->err;

   /* return register A as output */
   return m->regs[REG_A];
}

It’s easy to see that this is more re-entrant and compact. By treating the context-specific variables as local rather than external, we’ve made a tight, fast little function for computing M2(). How much better does it run? As you’ll see near the end, it runs in about 74 milliseconds on the same machine as before. That’s an almost threefold improvement!

We now have a much easier way to visualize a JIT (we’ll call it JIT2). The last JIT was rather hard to conceptualize — mostly because of how backwards and awkward it was. However, here it’s fairly easy to see: If we unroll the switch() statement and simply inline those little instructions, we could gain a lot of speed – a lot more than simply inlining calls to C functions. Furthermore, since we’re not inlining calls, we can do our own optimization on the assembly.

So, let’s make a tiny bit of assembly for each opcode we want to implement. For these instructions, we’ll keep the pointer to ContextM in EBP. With that, A and B are the offsets of ContextM::regs[REG_A] and ContextM::regs[REG_B], and ERR is the offset of ContextM::err. This leaves us with most registers free. Note that we use 8-bit instructions since M2() is an 8-bit machine.


 ;get value of B
 0x8A 0x45 0x??     mov al, [ebp+B]
 ;add it into A
 0x00 0x45 0x??     add [ebp+A], al


 ;get value of A
 0x8A 0x45 0x??     mov al, [ebp+A]
 ;get value of B
 0x8A 0x55 0x??     mov dl, [ebp+B]
 ;signed multiply
 0xF6 0xEA          imul dl
 ;move result back into A
 0x88 0x45 0x??     mov [ebp+A], al


 ;move the value in.  first ?? is reg, second ?? is value.
 0xC6 0x45 0x?? 0x??  mov [ebp+??], ??


 ;store -1 in the err
 0xC7 0x45 0x?? 0xFF  mov [ebp+ERR], -1
  0xFF 0xFF 0xFF
 ;pop ebp
 0x5D                 pop ebp
 0xC3                 ret

As before, we implement this by browsing every opcode in P(M), and writing out, byte for byte, the JIT(M) implementation for each instruction. Just as before, we’ll end up with a list of x86 instructions that are identical to running the switch loop – except with no looping.

Once finished, we’ve reduced JIT1’s 32 bytes per opcode to at most 11 bytes per instruction in JIT2. Second, we’ve made much more efficient unrolled output code. The JIT2-generated code is almost exactly what an optimized VM implementation would look like, complete with atomic implementations of the instructions. Even more important, we’ve completely eliminated branching! Our JIT output, unlike the original VM, no longer has any jumps, which is the real benefit of unrolling.
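As a sketch of what “writing out, byte for byte” means in C, here is one way an emitter for OP_ADD could look. The template bytes come from the OP_ADD listing above; the function name and the patch offsets are mine, not the article’s actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* emit the JIT2 translation of OP_ADD into 'out', patching the two
 * ?? displacement bytes; returns the new write position */
static uint8_t *emit_add(uint8_t *out, uint8_t off_a, uint8_t off_b) {
    static const uint8_t tpl[] = {
        0x8A, 0x45, 0x00,   /* mov al, [ebp+B] */
        0x00, 0x45, 0x00,   /* add [ebp+A], al */
    };
    memcpy(out, tpl, sizeof(tpl));
    out[2] = off_b;          /* fill in [ebp+B]'s displacement */
    out[5] = off_a;          /* fill in [ebp+A]'s displacement */
    return out + sizeof(tpl);
}
```

The JIT proper is then a loop over P(M)’s opcodes that dispatches to one such emitter per instruction – and how that dispatch is done is exactly the JIT2_CALL versus JIT2_SWITCH difference.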

I’ve made two implementations of JIT2: JIT2_CALL, which uses separate functions for writing, and JIT2_SWITCH, which writes using a switch table. They both generate the same code, the difference is in generation time. First let’s run my original implementation of JIT2, JIT2_CALL:

~/_machine_m$ ./test2_call
Allocating 4915200 size code
Codesize: 14336015 bytes
JIT ran in 529 milliseconds
Error: -1, Value: 24
Execution time: 45 milliseconds
Error: -1, Value: 24
Execution time: 73 milliseconds

Wow! Our original JIT, which ran a large factor slower than the state machine, is now almost twice as fast as the VM! On top of that, the generated code size went from ~75MB to ~14MB, and the parse time was cut as well. Clearly, our small redesign of the virtual machine has immense benefits.

Now, running my switch implementation:

~/_machine_m$ cc jit2_switch.c m2.c test2.c -O3 -o test2_switch
~/_machine_m$ ./test2_switch
Allocating 4915200 size code
Codesize: 14336015 bytes
JIT ran in 243 milliseconds
Error: -1, Value: 24
Execution time: 46 milliseconds
Error: -1, Value: 24
Execution time: 73 milliseconds

The generation time was cut in half by using switch() instead of external calls (the run time itself is unchanged). For our purposes, we’re going to ignore the generated code size when benchmarking. We’re going for expensive optimizations, and don’t mind using a little more memory. With that in mind, how much more efficient is it to run the JIT than the VM? Let’s do some quick calculations using Maple:

> jit1 := x -> 703 + 288*x;
> jit2 := x -> 529 + 45*x;
> jit3 := x -> 243 + 45*x;
> m2 := x -> 73*x;
> evalf(solve(jit2(x)=m2(x),x));
                          18.89285714
> evalf(solve(jit3(x)=m2(x),x));
                          8.678571429

We have to run JIT2 about 19 times before it becomes more efficient, and JIT3 only about 9 times. A quick graph of all four machines:


I apologize for the poor quality; I can’t seem to get Maple to work right. The X axis is iterations, and the Y axis is total time. You can see the fastest-rising (i.e. slowest) line is our original mistake, JIT1. The second fastest-rising is our vanilla VM, M2. The third and fourth are our identical JIT2 implementations, the only difference being their initial generation time.

From this point onward, we’re going to ignore our original machine M1. We’ve established that the design was a mistake, and that M2 is probably the best clean, standardized implementation we can have in plain C. As an answer to last week’s bonus question, we could speed this up with a GCC-specific feature: implementing the case switches as labels and storing the label addresses in an array. GCC then allows you to jump to the addresses in the array. Unfortunately, this is an extension and not part of the C standard. Lacking both arbitrary jumps and nested functions, ANSI C is limited to switching. So, M2 is now our reference implementation of M.
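That GCC extension (“labels as values”) looks roughly like the following. This is a sketch only – the two-opcode machine and names are made up, and it compiles under GCC/Clang but not under strict ANSI C:

```c
#include <assert.h>

/* opcode values index directly into the label table */
enum { OP_INC = 0, OP_HALT = 1 };

static int run(const unsigned char *cip) {
    /* && takes the address of a label (GCC extension) */
    static void *dispatch[] = { &&do_inc, &&do_halt };
    int a = 0;

    goto *dispatch[*cip++];   /* jump straight to the first handler */
do_inc:
    a++;
    goto *dispatch[*cip++];   /* no loop, no switch: direct threading */
do_halt:
    return a;
}
```

Each handler jumps directly to the next one, so the central while/switch disappears entirely – which is why this trick helps branch prediction.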

Can we do better? Indeed we can. The redesign clearly gave a nice boost to the speed of M2, but we can optimize even further. Next time we’ll take a look at using processor features to directly implement an unrolled VM, taking advantage of the way instructions work and the register layout. In the end, we’ll have a very close mapping of P(M) to P(N).

Today’s source code can be downloaded at:

Bonus Question: In fact, we can reduce the running time of our VM to zero seconds for this specific program. Why is that?

Gordon Freeman Plaza

Monday, April 3rd, 2006

As my roommate and I were walking to the Sin Lab, he noticed this little gem marked up on a school building. This is what you get in an engineering school. Sorry about the poor camera quality.

That is all. The JIT articles will continue in a few days; they take a while to write.