Archive for March, 2006

Writing a JIT, Part 2 – Unrolling

Friday, March 31st, 2006

Last time, I introduced a very simple emulator. It could set the values of two registers, then add or multiply them. Today, we’ll see why this VM implementation was flawed, and why converting it directly to a JIT is also wrong. Our JIT target is x86.

The idea behind the VM from last article was simply: “We’ve encountered an opcode. Call the function to decode it, and give it access to our local states.” The first thing that should strike you about this is that our ContextM handle contains execution-specific states. We don’t have a separate data structure for runtime info, and this means our VM isn’t re-entrant. We’ll patch this up in M2.

So, assuming that this is the basis for implementing a VM, let’s benchmark it and get an idea of where we are. To benchmark it, I copied our little program a few thousand times over, then wrote a routine to duplicate it 50 times. Our final program is a whopping 5MB in size, although it repeats the same dozen or so instructions over and over.

The benchmark source code is included at the end of the article. For now, I’ll just give the results. On my P2 2x500MHz machine, Program P as a function of MP(M) took 217ms. Not bad for 5MB of code, eh? We can do better. Specifically, the first thing we want to do is “unroll” the while loop of

Execute_M()
. That way we give the processor a single stream of operations to cache. Our idea is to generate a series of function calls like this:

Op_Set(m);
Op_Set(m);
Op_Add(m);
Op_End(m);

Let’s write a primitive JIT that simply unrolls the while loop to a series of instructions in this manner. Our conjecture is that this is faster than looping through each opcode, looking up its function, and then calling it. For the JIT, this requires knowing the opcode’s function ahead of time, which the JIT does not. So we add this to

m2.h
:

typedef struct ContextM_s
{
   OPCODEM *oplist;
[ ... ]

int Initialize_M(ContextM *m);

This change to M lets the JIT lookup the opcode function. Now, if a ContextM is properly initialized, the JIT will be able to simply browse the opcode list and write out each call to memory.

Remember that the JIT is dynamically generating a function call which is equivalent to running the VM written in C. As such, we need an epilogue and prologue. I’m going to take a shortcut here: we’ll save the EBP register, and use it store the address of the ContextM. So, our function will be hardcoded to a ContextM instance, but for simplicity we don’t have to deal with the stack much.

   55              push ebp
   BD 78 56 34 12  mov ebp, 012345678h  ;?val? - store ContextM address here
   [...]
   5D              pop ebp
   C3              ret

I’ve run the assembly through NASM (nasm -f elf test.asm -otest.o ; objdump -d test.o) and placed the exact encoding with the text. (The NASM documentation tends to be a helpful supplement to the Intel documentation.) Note that I used a bogus value for

mov ebp, ?val?
. This is a notation I’ll be using to signify that the value will be generated by the JIT. It also helps see how the data is encoded. Cache this fact in your head: our epilogue and prologue are 8 bytes total.

Now, let’s mockup how an opcode call will look. We need to give our ContextM to each opcode function, so that means we’ll be doing some basic stack ops. We’ll also need to call the function, which we’ll implement as an EIP relative call. Recall – EIP relative means the immediate value will be added to the EIP. The immediate value can be negative (backwards jump) or positive (forwards jump). Also, we’ll need to check the error state of our ContextM after EACH call, in case that call decided to abort (remember, the while loop checked this after every opcode, and we’re just unrolling). Lastly, and most importantly, we also need to keep ContextM’s

cip
up to date. If its code instruction pointer is invalid, every single opcode could fail to read correctly (in our case, only OP_SET reads from the instruction pointer). Our assembly per opcode will be 32 bytes like this:

    ;push ContextM onto the stack
   0:   55                      push   ebp
    ;call the opcode - we'll write this ?val? in later.
   1:   e8 74 56 34 12          call   1234567a
    ;correct the stack
   6:   81 c4 04 00 00 00       add    esp,0x4
    ;Add 1 to ContextM::cip -- fill in the correct ?offset? later
   c:   81 45 55 01 00 00 00    add    DWORD PTR [ebp+55],0x1
    ;Compare ContextM::err to 0 -- fill in the correct ?offset? later
  13:   81 7d 55 00 00 00 00    cmp    DWORD PTR [ebp+55],0x0
    ;Jump to end if not equal.  fill in correct ?offset? to end later.
  1a:   0f 85 00 00 00 00       jne    20 <main+x>

In C, this is equivalent to (roughly):

opfunc(m);
m->cip++;
if (m->err != 0)
   goto end;
/* etc, one of these blocks for every opcode in the source program */

From this disassembly, we now know exactly the size of ONE opcode, and we know the size our header/footer. We are ready to generate a function which is equivalent to our while loop, except completely unrolled! First, we have to browse the input of P(M) and calculate how many opcodes there are, so we know how much memory to allocate. This is fairly trivial, as we can just browse the opcodes and count how many they are. Our machine design allows us to stop when the ith instruction is OP_END.

Now, we know C = 8. ∀ cP(M) | cI(M), C = C + 32. Okay, I admit, I just wanted to play with Unicode. Our code size will be: C = 8 + 32 * opcodes. Now we get to allocate our executable code and then write our assembly. We’re going to do this in the worst possible way: simply assigning hex values to a pointer. Rather than spend time developing some sort of JITWriter structure, I’m going to assume that you know how to copy bitstreams into a character array.

The tricky part here will be generating EIP relative jumps. Recall that the EIP of a processor is always the NEXT instruction. So if we’re writing an EIP relative call at position N, the EIP will actually be relative to N+5. Therefore, our 0xE8 calls will be (Target – (N+5)). Similarly, our end jumps will be (Target – (N+6)). With the jumps, the target will always be the same: (N+O-2), where N is the function start, O is the code size, and -2 accounts for the final two instructions. Therefore, this gets us to the exact position of where we want to jump to: ((N+O-2) – (N+6)).

To spare your scroll wheel, read my implementation here. As said before, I’m simply doing byte copying – we’re uninterested in pretty code right now.

Phew! We’re done. We have a non-optimizing JIT that unrolls our bytecode in a series of external opcode calls. Now, to benchmark it – we want to know both the time the JIT takes to compile, its execution time, and the normal VM’s execution time. My benchmark ran:

[email protected]:~/_machine_m$ cc jit.c m.c test2.c -O3 -otest2
[email protected]:~/_machine_m$ ./test2
Allocating 4915200 size code
Codesize: 78643240 bytes
JIT ran in 703 milliseconds
Error: -1, Value: 24
Execution time: 288 milliseconds
Error: -1, Value: 24
Execution time: 217 milliseconds

First of all, notice that unrolling led to a 70ms INCREASE in speed. Second, that if we take our two execution models as functions of time: jit1(x) = 288*x + 703, m1(x) = 210*x, it’s apparent that our JIT will never be as fast as our VM. We also somehow managed to use over 75MB of memory and 1.5 realtime seconds. Not good. Why did this trajesty occur? We spent all this time on a JIT, trying to make the VM as fast as possible, and utterly failed.

Simply, we didn’t do anything. The heart of the virtual machine is not, in fact, the loop – it’s the instruction set. By unrolling the loop and not caring about the instructions, the only thing we did was vastly increase the codesize by a factor of 15. In fact, our JIT is so simple, it’s as if we compiled the VM ourselves and wrote the code out. With high levels of optimization, the compiler is smarter us, especially since we have to specifically use an unpredicted branch after every opcode. Loop or not, the VM is still making a branch (call or jump) for every opcode.

While our concept of unrolling isn’t wrong, in fact, the design of the VM is. Next time, I’ll introduce a switch()ing VM, and this redesign will lead to a JIT that’s about four-five times faster than the VM itself.

Answer to last article’s bonus question: PM correctly asserted that it needs conditional jumping (a delta function that can move around), and memory. Specifically, it needs infinite memory, and although it doesn’t have to be random access, a stack isn’t sufficient.

This week’s bonus question: In fact, the VM design I chose was quite peculiar in that it wasn’t exactly practical in terms of efficiency. I picked it as a contrived demonstration. What important feature is the C standard lacking that would allow us to compartmentalize opcode implementations without using an entire static function? (There are a few answers).

Writnig a JIT, Part 1 – VM Introduction

Tuesday, March 28th, 2006

One of the most difficult things I had to do in my short programming career, before I even knew assembly, was understanding how a JIT works (specifically, the Small/Pawn JIT), so I decided to write a little article on some basic VM/JIT theory and how our exact implementation works. (Warning: I’m certainly no JIT expert but the basics are easy once you see them). I’m going to do this by going over some basics of emulation/virtual machines and then go forward. You should know C and the basics of a processor to understand this best.

Almost everyone knows what an emulator is (as in, “I have an emulator for NES games”). Under the hood, an emulator is a “Virtual Machine”, or “Abstract Machine”. It’s built to solve the problem of running a Program P, written for architecture M (in this case, the NES processor), on machine N (usually an x86/IA32 processor). For concreteness, that is: we want to translate P(M) to P(N), so we can run any program that could run on M, to another program that can run on N. The simplest way to do this is to build an emulator E(N), where E(N) can run P(M) natively. For implementing languages, you get to design L(M) (the language/makeup of M) as well as the actual implementation of M yourself. For machines you know nothing about, it requires reverse engineering.

As you can imagine, an emulator is usually much slower than its native machine, even if N is much faster than M. Why? Because an emulator E is not real hardware, and thus each instruction for M can map to dozens of instructions for N. This is where a JIT comes in. A JIT takes in P(M), and rather than constructing E(N) | P(M) is accepted by E(N), it constructs P(N) = JIT(P(M)). In other words, a JIT takes a program for M and outputs a program for N. Thus, rather than executing E(P(M)) (running P through an emulator), you can run P directly on N. This idea is extremely powerful, and the optimization is huge – far more than simply writing E(N) as efficiently as possible (say, in raw assembly of N). In even simpler terms, you could take any program and simply output a native executable.

Therefore, our goal with any programming language is not to build an emulator or machine for it – it’s to build a real, fully functional JIT, which translates the language from a virtual machine, to native code.

Enough of that, and time for some C. For the rest of this series I’ll be using a very simple, imaginary calculator called “M”. You should know basic processor layout coming into this, but a quick refresher: almost all processors have “registers”, or ultra high speed, temporary storage. The IA32 (x86) line of processors have eight general registers, that are 32bit each. Our machine M will only have two: A, and B, and they’re eight bit.

For M, we’ll have four instructions (opcodes): SET, ADD, MULTIPLY, and END. Each opcode is encoded as a single byte. SET accepts two parameters, a register number as a byte (0 for A, 1 for B) and another byte as a value to set. ADD accepts no parameters and adds B to A. MULTIPLY does the same for multiplication. END stops the machine. The most this machine can do is be a very primitive calculator. Let’s build a simple C program as an emulator for this machine.

/* Function call required for an opcode */
typedef void (*OPCODEM)(struct ContextM_s *);
/* Defines/Macros for machine */
#define REG_A 0
#define REG_B 1
#define OP_SET 0
#define OP_ADD 1
#define OP_MUL 2
#define OP_END 3
#define ERR_INVALID_INSTRUCTION 1
#define ERR_HALT -1

/* Structure of machine state */
typedef struct ContextM_s
{
   unsigned char regs[2]; /* registers A and B */
   int err;  /* error state */
   unsigned char *bytecode; /* opcodes */
   unsigned char *cip /* instruction pointer */
} ContextM;

/* The SET opcode */
void OpcodeM_Set(ContextM *m)
{
   /* get the register # and advance the code pointer */
   unsigned char reg = *(m->cip++);
   /* in a more rigorous  machine we would do bounds checking */
   /* get value off bytecode stream and advance code pointer */
   m->regs[reg] = *(m->cip++);
}

/* The ADD opcode */
void OpcodeM_Add(ContextM *m)
{
   m->regs[REG_A] += m->regs[REG_B];
}

/* The MULTIPLY opcode */
void OpcodeM_Multiply(ContextM *m)
{
   m->regs[REG_A] *= m->regs[REG_B];
}

/* The END opcode */
void OpcodeM_End(ContextM *m)
{
   /* set a halt state */
   m->err = ERR_HALT;
}

/* The opcode call table */
static OpcodeTable[4] = 
{
   OpcodeM_Set, 
   OpcodeM_Add,
   OpcodeM_Multiply,
   OpcodeM_End
};

int Execute_M(ContextM *m, int *err)
{
   OPCODEM opfunc;
   m->cip = m->bytecode;
   while (!m->err)
   {
      opfunc = OpcodeTable[*m->cip++];
      opfunc(m);
   }
   if (err)
      *err = m->err;

   /* return register A as output */
   return m->regs[REG_A];
}

This code should be fairly obvious. We’ve constructed a table which has a function call for every opcode. The Emulator, or Virtual Machine for M, follows these steps:

  1. Start the machine in any state, except set the “instruction pointer” to the beginning of the bytecode.
  2. Find the byte at the instruction pointer, and lookup its function call in the opcode table.
  3. Call the function. The function will either set an error state, or consume its parameters and align to the next instruction.
  4. If the function accepted, go back to step 2. Otherwise, halt the machine.

How can we test this? Split our code into

m.h
,
m.c
, and write a quick
test1.c
:

#include <stdio .h>
#include "m.h"

static unsigned char test_pcode[] =
{
        OP_SET, REG_A, 5,  /* A = 5 */
        OP_SET, REG_B, 2,  /* B = 2 */
        OP_ADD,            /* A += B, A = 7 */
        OP_SET, REG_B, 3,  /* B = 3 */
        OP_MUL,            /* A *= B, A = 21 */
        OP_ADD,            /* A += B, A = 24 */
        OP_END
};

int main()
{
        ContextM m;
        unsigned char pcode[50];
        int err, result;

        memset(&m, 0, sizeof(ContextM));
        m.bytecode = test_pcode;

        result = Execute_M(&m, &err);

        printf("Error: %d, Value: %d\n", err, result);
        printf("Expected: (%d,%d)\n", -1, 24);
}

Our simple input code is a simple, silly algorithm given in the comments. There, we’ve constructed an emulator for a simple machine! However, you’ll notice this is a lot of extra code. For each opcode, we’re making an entire function call – that is, pushing parameters on the stack, branching, popping — we’re also doing byte addressing which is slow. Lastly, there’s no error checking here for simplicity. An invalid opcode will surely crash our emulator. Correcting this would remove even more speed.

Next part in the series, we’ll inspect two methods of improving our machine. The first will be trying to JIT our machine by replacing only the

Execute_M()
function, which we’ll see is little better than
Execute_M()
itself. The second method will be replacing
Execute_M()
with a switch statement, which will lead to our eventual final JIT.

You can download today’s source code from: http://www.bailopan.net/articles/jit/1/

Bonus Question: This machine is, in fact, too simple to be meaningful. It needs two important abilities to be far more powerful (Turing Complete). What are they?

To All Interested Parties

Wednesday, March 22nd, 2006

Tomorrow, 3/23/2006, at 5PM EST (GMT-5), I will be hosting a formal IRC chat session on the future of SourceMod. This was originally intended for just us as an internal group but I’ve decided to extend it to the public.

The chat will be at irc.gamesurge.net, channel: #sourcemod.dev

It might seem self serving but the result of the meeting will decide the future of the project. It depends on the input we get from the community and how it lines up to our own goals.

The meeting agenda:

  • Introduce the major players in the “hardcoded” SM team
  • Discuss our real goals behind SourceMod and why, as a project, it initially failed
  • Discuss the problem of programming language
  • Discuss the timeframe scenarios involved in a final product
  • Discuss the problem of working with the HL2SDK
  • Give project lifecycle projections (timelines and scenarios)
  • Introduce the options there are to continue the project
  • Have a Q&A Session
  • Open the floor to public input
  • Whether or not anyone shows up, this will be taking place amongst the remaining developers. You’re welcome to join in.

    Note: chat will be moderated. We will be posting a log of the chat session after we’re done.

    Directions:

  • /server irc.gamesurge.net, /join #sourcemod.dev in your IRC client.
  • If you wish to request to speak openly during the chat, PM “mark-” to get the floor.
  • During the Q&A session, we’ll take off moderation, put it back on to answer the first question given, then put take it off again. (Rinse, lather, repeat)
  • If it’s determined that you’re there to cause trouble you’ll be removed.
  • Bail’s Inbox, 2

    Wednesday, March 22nd, 2006

    Today’s selection:

    Feb 2005

    Hey! How to be coder? How know commands?
    How i learn coding? Can you help me?
    &lt;snip&gt;@hotmail.com

    How to be coder? Ingredients:

    • One (1) keyboard
    • The Internet
    • Lots and lots of {, }, (, and )
    • Place ingredients in a compiler. Let bake for 30 minutes. When done, link.

    Or, the command to be a coder is: amx_codeitup

    I hope this helps you learning to be a coder, as it’s usually easier to ask for personal tutoring than, say, reading the small book we wrote on scripting. Which, of course, is stickied.

    (Not surprisingly, this type of question is frequent.)

    Bail’s Inbox

    Sunday, March 19th, 2006

    Since SourceMod is on temporary hold and I’ve been using this area as a personal blog, I’m going to just continue with the flow of posting random things here. In that vein, here is a new feature, called “Bail’s Inbox”. I get a lot of crazy/nonsensical messages and people have found them funny. I’ll keep all senders anonymous. To start it off I’ll pick three from the AMX Mod X forums:

    Dec, 2005

    Subject: Help ME!
    Hey Help Me Installing Ur Deathmatch plugin in my server THX! :(

    For whatever reason, this user felt the need to capitalize every first letter, until their shift key strength ran out about half way through. This type of PM will never get a reply. The user doesn’t describe any sort of problem, and doesn’t bother to say at what point in the documentation (which they definitely didn’t read) they had problems. Hint to this person: Documentation isn’t there so people can coach you through it with one on one help sessions. It’s there for reading the instructions.

    Jan, 2006

    Hey, do you have an extra steam account with 1.6? My bro put a password onhis and its really annoying. Im so bored. I need to play CS 1.6 :,(

    A lot of people get the idea since I do a lot of coding and I have low SteamIDs, that for some reason I’m either connected to Valve or I have some stash of Valve accounts. I’m not, and I don’t. Furthermore, why would you message a random person and ask them to give you access to a paid game for free, under a violation of the TOS? Lastly, I’m pretty sure Steam accounts have a password by default — meaning this guy probably just got VAC banned.

    Jan, 2006

    hey dude !! nice work with the plugin :D but... i can't make it work...can you help me..?? please...can u write all the steps to instal the CSDM 2.00 ? i cant do it...please.... my msn is &lt;snip&gt;@msn.com 
    ty dude 4 the attention... have a nice day :D

    Another “I didn’t bother to read documentation, and I expect you to write it out for me again”. However what annoys me about this one in particular is the ever increasing trend to say “Here’s my MSN/AIM/etc”. I don’t use MSN/Yahoo/whatnot, don’t assume that I do, and don’t assume that if I did, I’d want to talk to you over it.

    Lastly, as a bonus:

    Feb, 2006

    U need beta tester? i can do it for you i got 2 server for spare even tho i dont like and do not support amxx but ill give it a try

    “I don’t like your product and I don’t support it, but can I beta test it?” Odd, for sure, but there’s no point in denying a potential customer. Unfortunately after a few weeks the user still hadn’t bothered to even open the reply PM so I removed it.

    More to come…

    Better Signature Scanning

    Monday, March 6th, 2006

    The other day, stat() was returning a filesize too large for an SO’s mapped memory on AMD64. I had to make a new function to calculate the size. This is what I came up with. Given “allocBase” is a pointer to the allocation base and “memSize” is the output, you can parse /proc/<pid>/maps/. Note that you could also use /proc/self/maps so the getpid() portion isn’t entirely necessary.

    	pid_t pid = getpid();
    	char file[255];
    	char buffer[2048];
    	snprintf(file, sizeof(file)-1, "/proc/%d/maps", pid);
    	FILE *fp = fopen(file, "rt");
    	if (!fp)
    		return false;
    	void *start=NULL;
    	void *end=NULL;
    	void *found=NULL;
    	while (!feof(fp))
    	{
    		fgets(buffer, sizeof(buffer)-1, fp);
    #if defined AMD64
    		sscanf(buffer, "%Lx-%Lx", &start, &end);
    #else
    		sscanf(buffer, "%lx-%lx", &start, &end);
    #endif
    
    		if (start == allocBase)
    		{
    			found = end;
    			break;
    		}
    	}
    	fclose(fp);
    
    	if (!found)
    		return false;
    
    	memSize = (unsigned long)end - (unsigned long)start;