Writnig a JIT, Part 1 – VM Introduction

One of the most difficult things I had to do in my short programming career, before I even knew assembly, was understanding how a JIT works (specifically, the Small/Pawn JIT), so I decided to write a little article on some basic VM/JIT theory and how our exact implementation works. (Warning: I’m certainly no JIT expert but the basics are easy once you see them). I’m going to do this by going over some basics of emulation/virtual machines and then go forward. You should know C and the basics of a processor to understand this best.

Almost everyone knows what an emulator is (as in, “I have an emulator for NES games”). Under the hood, an emulator is a “Virtual Machine”, or “Abstract Machine”. It’s built to solve the problem of running a Program P, written for architecture M (in this case, the NES processor), on machine N (usually an x86/IA32 processor). For concreteness, that is: we want to translate P(M) to P(N), so we can run any program that could run on M, to another program that can run on N. The simplest way to do this is to build an emulator E(N), where E(N) can run P(M) natively. For implementing languages, you get to design L(M) (the language/makeup of M) as well as the actual implementation of M yourself. For machines you know nothing about, it requires reverse engineering.

As you can imagine, an emulator is usually much slower than its native machine, even if N is much faster than M. Why? Because an emulator E is not real hardware, and thus each instruction for M can map to dozens of instructions for N. This is where a JIT comes in. A JIT takes in P(M), and rather than constructing E(N) | P(M) is accepted by E(N), it constructs P(N) = JIT(P(M)). In other words, a JIT takes a program for M and outputs a program for N. Thus, rather than executing E(P(M)) (running P through an emulator), you can run P directly on N. This idea is extremely powerful, and the optimization is huge – far more than simply writing E(N) as efficiently as possible (say, in raw assembly of N). In even simpler terms, you could take any program and simply output a native executable.

Therefore, our goal with any programming language is not to build an emulator or machine for it – it’s to build a real, fully functional JIT, which translates the language from a virtual machine, to native code.

Enough of that, and time for some C. For the rest of this series I’ll be using a very simple, imaginary calculator called “M”. You should know basic processor layout coming into this, but a quick refresher: almost all processors have “registers”, or ultra high speed, temporary storage. The IA32 (x86) line of processors have eight general registers, that are 32bit each. Our machine M will only have two: A, and B, and they’re eight bit.

For M, we’ll have four instructions (opcodes): SET, ADD, MULTIPLY, and END. Each opcode is encoded as a single byte. SET accepts two parameters, a register number as a byte (0 for A, 1 for B) and another byte as a value to set. ADD accepts no parameters and adds B to A. MULTIPLY does the same for multiplication. END stops the machine. The most this machine can do is be a very primitive calculator. Let’s build a simple C program as an emulator for this machine.

/* Function call required for an opcode */
typedef void (*OPCODEM)(struct ContextM_s *);
/* Defines/Macros for machine */
#define REG_A 0
#define REG_B 1
#define OP_SET 0
#define OP_ADD 1
#define OP_MUL 2
#define OP_END 3
#define ERR_INVALID_INSTRUCTION 1
#define ERR_HALT -1

/* Structure of machine state */
typedef struct ContextM_s
{
   unsigned char regs[2]; /* registers A and B */
   int err;  /* error state */
   unsigned char *bytecode; /* opcodes */
   unsigned char *cip /* instruction pointer */
} ContextM;

/* The SET opcode */
void OpcodeM_Set(ContextM *m)
{
   /* get the register # and advance the code pointer */
   unsigned char reg = *(m->cip++);
   /* in a more rigorous  machine we would do bounds checking */
   /* get value off bytecode stream and advance code pointer */
   m->regs[reg] = *(m->cip++);
}

/* The ADD opcode */
void OpcodeM_Add(ContextM *m)
{
   m->regs[REG_A] += m->regs[REG_B];
}

/* The MULTIPLY opcode */
void OpcodeM_Multiply(ContextM *m)
{
   m->regs[REG_A] *= m->regs[REG_B];
}

/* The END opcode */
void OpcodeM_End(ContextM *m)
{
   /* set a halt state */
   m->err = ERR_HALT;
}

/* The opcode call table */
static OpcodeTable[4] = 
{
   OpcodeM_Set, 
   OpcodeM_Add,
   OpcodeM_Multiply,
   OpcodeM_End
};

int Execute_M(ContextM *m, int *err)
{
   OPCODEM opfunc;
   m->cip = m->bytecode;
   while (!m->err)
   {
      opfunc = OpcodeTable[*m->cip++];
      opfunc(m);
   }
   if (err)
      *err = m->err;

   /* return register A as output */
   return m->regs[REG_A];
}

This code should be fairly obvious. We’ve constructed a table which has a function call for every opcode. The Emulator, or Virtual Machine for M, follows these steps:

  1. Start the machine in any state, except set the “instruction pointer” to the beginning of the bytecode.
  2. Find the byte at the instruction pointer, and lookup its function call in the opcode table.
  3. Call the function. The function will either set an error state, or consume its parameters and align to the next instruction.
  4. If the function accepted, go back to step 2. Otherwise, halt the machine.

How can we test this? Split our code into

m.h
,
m.c
, and write a quick
test1.c
:

#include <stdio .h>
#include "m.h"

static unsigned char test_pcode[] =
{
        OP_SET, REG_A, 5,  /* A = 5 */
        OP_SET, REG_B, 2,  /* B = 2 */
        OP_ADD,            /* A += B, A = 7 */
        OP_SET, REG_B, 3,  /* B = 3 */
        OP_MUL,            /* A *= B, A = 21 */
        OP_ADD,            /* A += B, A = 24 */
        OP_END
};

int main()
{
        ContextM m;
        unsigned char pcode[50];
        int err, result;

        memset(&m, 0, sizeof(ContextM));
        m.bytecode = test_pcode;

        result = Execute_M(&m, &err);

        printf("Error: %d, Value: %d\n", err, result);
        printf("Expected: (%d,%d)\n", -1, 24);
}

Our simple input code is a simple, silly algorithm given in the comments. There, we’ve constructed an emulator for a simple machine! However, you’ll notice this is a lot of extra code. For each opcode, we’re making an entire function call – that is, pushing parameters on the stack, branching, popping — we’re also doing byte addressing which is slow. Lastly, there’s no error checking here for simplicity. An invalid opcode will surely crash our emulator. Correcting this would remove even more speed.

Next part in the series, we’ll inspect two methods of improving our machine. The first will be trying to JIT our machine by replacing only the

Execute_M()
function, which we’ll see is little better than
Execute_M()
itself. The second method will be replacing
Execute_M()
with a switch statement, which will lead to our eventual final JIT.

You can download today’s source code from: http://www.bailopan.net/articles/jit/1/

Bonus Question: This machine is, in fact, too simple to be meaningful. It needs two important abilities to be far more powerful (Turing Complete). What are they?

2 Responses to “Writnig a JIT, Part 1 – VM Introduction”

  1. PM says:

    It needs (un)conditional jumps!

    It would also be helpful to have more storage than 2 registers; for example a stack!

    Will I get a cookie? :(

Leave a Reply

You must be logged in to post a comment.