Let’s continue with our problem of optimizing JIT(P) | JIT(P(M)) = P(N). Last time we found that our VM design M was flawed, and as such, so was JIT(M). By separating our VM into independent function calls, we lost an important optimization: a virtual machine is a state machine, not an external dispatcher.
The simplest way to make a state machine in C is to use a large switch() table. The idea is to encapsulate the entire functionality of the VM in one state of execution. This removes both memory accesses and calls on each opcode, which eases the processor’s branch prediction and localizes the important states.
By removing our oplist entry from m.h, we’ll make m2.h and m2.c. Our new Execute_M() looks like:
int Execute_M(ContextM *m, int *err)
{
unsigned char *cip = m->bytecode;
unsigned char reg;
char reg_a;
char reg_b;
m->err = 0;
while (!m->err)
{
switch (*cip++)
{
case OP_ADD:
reg_a += reg_b;
break;
case OP_MUL:
reg_a *= reg_b;
break;
case OP_SET:
reg = *cip++;
if (reg == REG_A)
reg_a = *cip++;
else
reg_b = *cip++;
break;
case OP_END:
m->err = ERR_HALT;
break;
}
}
if (err)
*err = m->err;
/* return register A as output */
return m->regs[REG_A];
}
It’s easy to see that this is more re-entrant and compact. By treating the context-specific variables as local rather than external, we’ve made a tight, fast little function for computing M2(). How much better does it run? As you’ll see near the end, it runs in about 74 seconds on the same machine as before. That’s almost three times improved!
We now have a much easier way to visualize a JIT (we’ll call it JIT2). The last JIT was rather hard to conceptualize — mostly because of how backwards and awkward it was. However, here it’s fairly easy to see: If we unroll the switch() statement and simply inline those little instructions, we could gain a lot of speed – a lot more than simply inlining calls to C functions. Furthermore, since we’re not inlining calls, we can do our own optimization on the assembly.
So, let’s make a tiny bit of assembly for each opcode we want to implement. For these instructions, we’ll make EBPContextM. With that, A,B = Offsets of ContextM::regs[REG_A,REG_B], and ERR = Offset of ContextM::err. This leaves us with most registers free. Note that we use 8bit instructions since M2() is an 8bit machine.
OP_ADD:
;get value of B 0x8A 0x45 0x?? mov al, [ebp+B] ;add it into A 0x00 0x45 0x?? add [ebp+A], al
OP_MUL:
;get value of A 0x8A 0x45 0x?? mov al, [ebp+A] ;get value of B 0x8A 0x55 0x?? mov dl, [ebp+B] ;signed multiply 0xF6 0xEA imul dl ;move result back into A 0x88 0x45 0x?? mov [ebp+A], al
OP_SET:
;move the value in. first ?? is reg, second ?? is value. 0xC6 0x45 0x?? 0x?? mov [ebp+??], ??
OP_END:
;store -1 in the err 0xC7 0x45 0x?? 0xFF mov [ebp+ERR], -1 0xFF 0xFF 0xFF ;pop ebp 0x5D pop ebp ;return 0xC3 ret
As before, we implement this by browsing every opcode in P(M), and writing out, byte for byte, the JIT(M) implementation for each instruction. Just as before, we’ll end up with a list of x86 instructions that are identical to running the switch loop – except with no looping.
Once finished, we’ve reduced our 32 byte per opcode JIT1 to at most 11 bytes per instruction for JIT2. Second, we’ve made a much more efficient unrolled output code. The JIT2 generated code is almost exactly what an optimized VM implementation would look like, complete with atomic implementations of the instructions. Even more important, we’ve completely eliminated branching! Our JIT, unlike the original VM, no longer has any jumps, which is the real benefit from unrolling.
I’ve made two implementations of JIT2: JIT2_CALL, which uses separate functions for writing, and JIT2_SWITCH, which writes using a switch table. They both generate the same code, the difference is in generation time. First let’s run my original implementation of JIT2, JIT2_CALL:
dvander@malice:~/_machine_m$ ./test2_call
Allocating 4915200 size code
Codesize: 14336015 bytes
JIT ran in 529 milliseconds
Error: -1, Value: 24
Execution time: 45 milliseconds
Error: -1, Value: 24
Execution time: 73 milliseconds
Wow! Our original JIT, running a large factor slower than the state machine, is now almost twice as fast as the VM! On top of that, the generated code size went from ~75MB to ~12MB, and the parse time was cut as well. Clearly, our small redesign of the virtual machine has immense benefits.
Now, running my switch implementation:
dvander@malice:~/_machine_m$ cc jit2_switch.c m2.c test2.c -O3 -otest2_switch dvander@malice:~/_machine_m$ ./test2_switch
Allocating 4915200 size code
Codesize: 14336015 bytes
JIT ran in 243 milliseconds
Error: -1, Value: 24
Execution time: 46 milliseconds
Error: -1, Value: 24
Execution time: 73 milliseconds
The run time was cut in half using switch() instead of external calls. For our purposes, we’re going to ignore the generated codesize when benchmarking. We’re going for expensive optimizations, and don’t mind using a little more memory. With that in mind, how much more efficient is it to run the JIT than the VM? Let’s do some quick calculations using Maple:
> jit1 := x -> 703 + 288*x;
> jit2 := x -> 529 + 45*x;
> jit3 := x -> 243 + 45*x;
> m2 := x -> 73*x;
> evalf(solve(jit2(x)=m2(x),x));
18.89285714
> evalf(solve(jit3(x)=m2(x),x));
8.678571429
We have to run JIT2 about 19 times before it becomes more efficient, and JIT3 only about 9 times. A quick graph of all four machines:
plot({jit1(x),jit2(x),jit3(x),m2(x)},x=0..75,y=0..7500,color=[red,blue,green,orange]);

I apologize about the poor quality, I can’t seem to get Maple to work right. The X axis is iterations, and the Y axis is total time. You can see the fastest line is our original mistake JIT1. The second fastest is our vanilla VM, M2. The third and fourth are our identical JIT2 implementations, the only difference being in their initial run time.
From this point onward, we’re going to ignore our original machine M1. We’ve established that the design was a mistake, and that M2 is probably the best clean, standardized implementation we can have in plain C. As an answer to last week’s bonus question, we could speed this up with a GCC-specific feature: implementing the case switches as labels, and storing the labels in an array. GCC then allows you to jump to the addresses in the array. Unfortunately, this is an extension and not part of the C standard. Without arbitrary jumps and lack of nested functions, ANSI C is limited to switching. So, M2 is now our reference implementation of M1.
Can we do better? For that work, we were able to clearly give a nice boost to the speed of M2. However, we can indeed optimize this even further. Next time we’ll take a look at using processor features to directly implement an unrolled VM, taking advantages of the way instructions work and the register layout. In the end, we’ll have a very close mapping of P(M) to P(N).
Today’s source code can be downloaded at: http://www.bailopan.net/articles/jit/3/
Bonus Question: In fact, we can reduce the running time of our VM to zero seconds for this specific program (bigtest.inc). Why is that?