Continuing with our JIT problem, we saw last time that by reorganizing the VM, and thus the JIT, we could gain a huge amount of speed. Today, we’ll look at how to more efficiently map the generated code to an x86 processor.
In the problem of mapping M to N we should get as close as possible to mimicking what a compiler would generate. We have to change how we think of the JIT again. The JIT isn’t just a VM reassembler – it’s a compiler. It should generate optimized, efficient code based on the input it’s given. It should work as closely to the native processor as it can.
So, our first step is to consider how we can bring M close to N, that is, x86. Let’s take a look at the basic state machine storage: the registers. In x86 there are eight registers. Each one has different “intended” purposes. For example, ECX is commonly used in looping, while EDI is used as a destination pointer. While confusing and archaic, we’re stuck with it. Below is a table of which registers are commonly used for arithmetic operations. (Recall that on x86, a register is 16bit by removing the E. Furthermore, the arithmetic registers have 8bit high/low portions, for example, AH/AL comprise AX.)
| x86 Reg | Arithmetic | Number | M Reg | Arithmetic | Number |
| EAX | Yes | 0 | A | Yes | 0 |
| EBX | No | 3 | B | Yes | 1 |
| ECX | Yes | 1 | |||
| EDX | Yes | 2 | |||
| ESI | No | 6 | |||
| EDI | No | 7 | |||
| EBP | No | 5 | |||
| ESP | No | 4 |
So, if we wanted to map M directly onto an x86 processor, we should map the A and B registers to x86’s EAX, ECX, or EDX. How do we choose from there? Let’s consider the x86 instructions we need: ADD, IMUL, and MOV.
ADD is fairly uninteresting. While it has a specific instruction for ADD AL, it only operates on immediate values, which isn’t very useful to us right now. However, we’ll put a little star on AL mapping to A.
Next, IMUL takes in a parameter, multiplies it by AL, and stores the result in AL. This is almost identical to what OP_MUL does, except OP_MUL is hardcoded to register B. Thus, it seems pretty clear that using EAX for A is a good idea.
What about B? Since IMUL destroys DX in modes other than 8bit, it’s probably a good idea to avoid using EDX. Furthermore, ECX has the same numeric code as B. Let’s use EAX=A, and ECX=B. Now let’s implement each opcode, again considering EBP as the ContextM.
OP_ADD:
;add a and b, result in a 0x00 0xC8 add al, cl
OP_MUL:
;multiply a, b, result in a 0xF6 0xE9 imul cl
OP_SET:
;move correct value in ;note the +?, opcode 0xB0 requires the register number ; added to it. luckily, our reg numbers are the same, ; so we can just do 0xB0+? 0xB0+? 0x?? mov al/cl, ??
OP_END:
;set error code 0xC7 0x45 0x?? 0xFF mov [ebp+ERR], -1 0xFF 0xFF 0xFF ;store result 0x88 0x45 0x?? mov [ebp+A], al ;pop and return 0x5D pop ebp 0xC3 ret
We’ll call this implementation JIT3. Note that our new implementation (not considering OP_END) has at most 2 bytes per instruction. That’s a huge step forward from 11 and 32, and M itself has a max of 3 byte instructions.
Running the test:
dvander@malice:~/_machine_m$ ./test3
Allocating 4915200 size code
Codesize: 4915218 bytes
JIT ran in 168 milliseconds
Error: -1, Value: 24
Execution time: 16 milliseconds
Error: -1, Value: 24
Execution time: 74 milliseconds
JIT3 is definitely the best so far. Not only is our codesize nearly identical, but we cut the generation time AND runtime down again. It runs about three times faster. The codesize is better mainly from our test program’s opcode distribution – it uses the right combination sized opcodes such that there’s no loss when converting. It’s just as possible that a program’s generated code could be much greater or much less than the original size.
Re-running our Maple tests again from last time:
> j3 := x-> 168 + 16*x;
> m2 := x -> 73*x;
> evalf(solve(j3(x)=m2(x),x));
2.947368421
After only about three runs, JIT3 becomes more efficient. A graph shows quite clearly how much better JIT3 is over its competitors and the original VM. Green is JIT1, red is M2, blue is JIT2, and black is JIT3.
plot({j1(x),j2(x),m2(x),j3(x)},x=0..75,y=0..10000,color=[black,red,green,blue]);

Can we go further? In fact, yes. Next article we’ll take a look at the next phase of a JIT: optimization by browsing. If a JIT is a compiler, then like any other compiler, it should be able to make intelligent dynamic changes or decisions, rather than a static mapping. By sacrificing browse time, we can make important optimizations to the generated code.
You can download today’s source code at: http://www.bailopan.net/articles/jit/4/
Answer to previous bonus question: Our input program generates a static answer. The result of ‘24′ will never change. We could simply return 24 in one instruction after an initial browsing for the result. Eventually we’ll change the VM slightly to make this fact less meaningful.
Bonus Question: In mapping M to N, what could we do if M had more registers than N?