Writing a JIT, Part 2 – Unrolling

Last time, I introduced a very simple emulator. It could set the values of two registers, then add or multiply them. Today, we’ll see why this VM implementation was flawed, and why converting it directly to a JIT is also wrong. Our JIT target is x86.

The idea behind the VM from the last article was simple: "We've encountered an opcode. Call the function to decode it, and give it access to our local state." The first thing that should strike you about this is that our ContextM handle contains execution-specific state. We don't have a separate data structure for runtime info, which means our VM isn't re-entrant. We'll patch this up in M2.

So, assuming that this is the basis for implementing a VM, let’s benchmark it and get an idea of where we are. To benchmark it, I copied our little program a few thousand times over, then wrote a routine to duplicate it 50 times. Our final program is a whopping 5MB in size, although it repeats the same dozen or so instructions over and over.

The benchmark source code is included at the end of the article. For now, I'll just give the results. On my P2 2x500MHz machine, running program P on machine M -- P(M) -- took 217ms. Not bad for 5MB of code, eh? We can do better. Specifically, the first thing we want to do is "unroll" the VM's while loop. That way we give the processor a single stream of operations to cache. Our idea is to generate a series of function calls like this:


Let's write a primitive JIT that simply unrolls the while loop into a series of instructions in this manner. Our conjecture is that this is faster than looping through each opcode, looking up its function, and then calling it. Doing this requires knowing each opcode's function ahead of time, which the JIT currently has no way to look up. So we add an opcode table to ContextM:


typedef struct ContextM_s
{
   OPCODEM *oplist;
   [ ... ]
} ContextM;

int Initialize_M(ContextM *m);

This change to M lets the JIT look up each opcode's function. Now, if a ContextM is properly initialized, the JIT will be able to simply browse the opcode list and write out each call to memory.

Remember that the JIT is dynamically generating a function which is equivalent to running the VM written in C. As such, we need a prologue and an epilogue. I'm going to take a shortcut here: we'll save the EBP register and use it to store the address of the ContextM. Our function will thus be hardcoded to one ContextM instance, but in exchange we barely have to deal with the stack.

   55              push ebp
   BD 78 56 34 12  mov ebp, 012345678h  ;?val? - store ContextM address here
   5D              pop ebp
   C3              ret

I've run the assembly through NASM (nasm -f elf test.asm -otest.o ; objdump -d test.o) and placed the exact encoding with the text. (The NASM documentation tends to be a helpful supplement to the Intel documentation.) Note that I used a bogus value in mov ebp, ?val?. This is the notation I'll be using to signify a value that will be generated by the JIT; it also helps show how the data is encoded. Cache this fact in your head: our prologue and epilogue are 8 bytes total.

Now, let's mock up how an opcode call will look. We need to give our ContextM to each opcode function, so that means we'll be doing some basic stack ops. We'll also need to call the function, which we'll implement as an EIP-relative call. Recall – EIP relative means the immediate value will be added to the EIP. The immediate value can be negative (backwards jump) or positive (forwards jump). Also, we'll need to check the error state of our ContextM after EACH call, in case that call decided to abort (remember, the while loop checked this after every opcode, and we're just unrolling). Lastly, and most importantly, we also need to keep ContextM's cip up to date. If the code instruction pointer is invalid, every single opcode could fail to read correctly (in our case, only OP_SET reads from the instruction pointer). Our assembly per opcode will be 32 bytes, like this:

    ;push ContextM onto the stack
   0:   55                      push   ebp
    ;call the opcode - we'll write this ?val? in later.
   1:   e8 74 56 34 12          call   1234567a
    ;correct the stack
   6:   81 c4 04 00 00 00       add    esp,0x4
    ;Add 1 to ContextM::cip -- fill in the correct ?offset? later
   c:   81 45 55 01 00 00 00    add    DWORD PTR [ebp+55],0x1
    ;Compare ContextM::err to 0 -- fill in the correct ?offset? later
  13:   81 7d 55 00 00 00 00    cmp    DWORD PTR [ebp+55],0x0
    ;Jump to end if not equal.  fill in correct ?offset? to end later.
  1a:   0f 85 00 00 00 00       jne    20 <main+x>

In C, each 32-byte block is roughly equivalent to:

opcode(m);        /* push ebp; call ?val?; add esp, 4 */
m->cip += 1;      /* add [ebp+?offset?], 1 */
if (m->err != 0)  /* cmp [ebp+?offset?], 0 */
   goto end;      /* jne end */
/* etc, one of these blocks for every opcode in the source program */

From this disassembly, we now know exactly the size of ONE opcode, and we know the size of our header/footer. We are ready to generate a function which is equivalent to our while loop, except completely unrolled! First, we have to browse the input of P(M) and calculate how many opcodes there are, so we know how much memory to allocate. This is fairly trivial: we just walk the opcodes and count them. Our machine design lets us stop when the ith instruction is OP_END.

Now, we know C = 8. ∀ c ∈ P(M) | c ∈ I(M), C = C + 32. Okay, I admit, I just wanted to play with Unicode. Our code size will be: C = 8 + 32 * opcodes. Now we get to allocate our executable code and then write our assembly. We're going to do this in the worst possible way: simply assigning hex values to a pointer. Rather than spend time developing some sort of JITWriter structure, I'm going to assume that you know how to copy bitstreams into a character array.

The tricky part here will be generating EIP relative jumps. Recall that the EIP of a processor is always the NEXT instruction. So if we’re writing an EIP relative call at position N, the EIP will actually be relative to N+5. Therefore, our 0xE8 calls will be (Target – (N+5)). Similarly, our end jumps will be (Target – (N+6)). With the jumps, the target will always be the same: (N+O-2), where N is the function start, O is the code size, and -2 accounts for the final two instructions. Therefore, this gets us to the exact position of where we want to jump to: ((N+O-2) – (N+6)).

To spare your scroll wheel, read my implementation here. As said before, I’m simply doing byte copying – we’re uninterested in pretty code right now.

Phew! We're done. We have a non-optimizing JIT that unrolls our bytecode into a series of external opcode calls. Now, to benchmark it – we want to know the time the JIT takes to compile, its execution time, and the normal VM's execution time. My benchmark ran:

~/_machine_m$ cc jit.c m.c test2.c -O3 -otest2
~/_machine_m$ ./test2
Allocating 4915200 size code
Codesize: 78643240 bytes
JIT ran in 703 milliseconds
Error: -1, Value: 24
Execution time: 288 milliseconds
Error: -1, Value: 24
Execution time: 217 milliseconds

First of all, notice that unrolling led to a 70ms INCREASE in execution time. Second, if we take our two execution models as functions of time – jit1(x) = 288*x + 703, m1(x) = 217*x – it's apparent that our JIT will never be as fast as our VM. We also somehow managed to use over 75MB of memory and 1.5 seconds of real time. Not good. Why did this travesty occur? We spent all this time on a JIT, trying to make the VM as fast as possible, and utterly failed.

Simply put, we didn't do anything. The heart of the virtual machine is not, in fact, the loop – it's the instruction set. By unrolling the loop and not caring about the instructions, the only thing we did was vastly increase the codesize by a factor of 15. In fact, our JIT is so simple, it's as if we compiled the VM ourselves and wrote the code out. With high levels of optimization, the compiler is smarter than us, especially since we have to emit a poorly predicted branch after every opcode. Loop or not, the VM is still making a branch (call or jump) for every opcode.

Our concept of unrolling isn't wrong – the design of the VM is. Next time, I'll introduce a switch()ing VM, and this redesign will lead to a JIT that's about four to five times faster than the VM itself.

Answer to last article’s bonus question: PM correctly asserted that it needs conditional jumping (a delta function that can move around), and memory. Specifically, it needs infinite memory, and although it doesn’t have to be random access, a stack isn’t sufficient.

This week’s bonus question: In fact, the VM design I chose was quite peculiar in that it wasn’t exactly practical in terms of efficiency. I picked it as a contrived demonstration. What important feature is the C standard lacking that would allow us to compartmentalize opcode implementations without using an entire static function? (There are a few answers).
