Text Parsing in SourceMod – Out-performing Valve

For SourceMod, we decided to use Valve’s KeyValues file format for simplicity and familiarity. However, we didn’t use the provided KeyValues class. Why? To answer this question, let’s first take a look at the two major ways of writing text parsing API. There are “event based” parsers, and “tree based” parsers. Specifically, these are terms most commonly used with XML — but the KeyValues format has similar properties, such as nested sections, and so these parser patterns apply here.

A tree-based parser creates a big tree data structure out of the input text stream. The programmer then receives a pointer to the root node, and he can use this to recursively enumerate the tree (or just search for specific entries, depending on the contents). KeyValues is a tree based parser.

An event-based parser has no data structures. It reads the input character stream and fires user-provided callbacks for each major “event” encountered. It is then up to the programmer how to use these events. For example, events will be fired for:

  • Finding a new section starting or ending.
  • Find a new line.
  • Finding a key and value pair.

There are serious disadvantages to tree-based parsers. The immediate red flag you should see is that the tree structure must be allocated and built. If you don’t have a good allocator, you will wind up with a ton of small, unnecessary allocations for each node. On top of that, you waste memory. While both of these problems can be solved with a decent allocator, you also run into the problem of: what exactly does the programmer do with the tree structure? There are usually a few cases:

  1. You iterate over the tree and do a one-time operation/computation on all its contents. Then you delete the tree.
  2. You cache the tree and peek at its contents at a later time.
  3. You use the contents immediately by looking for a specific node, then delete the tree.
  4. You use the contents immediately by storing the data into a secondary structure, then delete the tree.

An event based parser is usually more ideal in all of these situations. Observe how for each method:

  1. All you’ve done is built a giant temporary structure that you discard. You could simply perform the operations as you receive the events, and remove the entire middle step of creating the structure. Iterating over the tree is fast, but creating the tree is unnecessary
  2. An event based parser isn’t always better here. However, if you are using a generic tree structure, peeking at it for a single entry may be very slow.
  3. This is even a bigger waste than #1. With an event based parser, you can stop parsing once you’ve hit the node you want. On top of that, you’ve removed the unnecessary tree structure AND eliminated a one-time recursive lookup in the tree.
  4. This is the worst of all. If you are going to use your own data structure, why bother building the tree in the first place? A better idea is to build the structure as you receive events.

So, why didn’t we use Valve’s KeyValues class? It’s a recursive tree parser. Worse yet, it’s generic. Everything and its mother in the SDK uses KeyValues, so it cannot possibly be optimized for each usage. A good example of this is in the Game Events system, which has name based lookup on event properties (a bad idea, as indexing them is not only more logical, but much, much faster).

At SourceMod, we wanted to show that not only were event based parsers better, but they allow you to quickly code alternative, faster data structures. At our arsenal, we had:

  • A highly optimized lookup class called a “double array trie.”
  • An optimized event-based parser for KeyValue-formatted files (source code provided below).
  • A fast lump allocator.

We took modevents.res from the cstrike folder of Counter-Strike: Source and wrote up a quick data structure using our pre-written, custom code library. Then we performed three benchmarks:

  1. Time to parse the file and load it into the respective data structure.
  2. Time to lookup two events in the structure – one in the middle, one at the beginning.
  3. Time to lookup a property in the event.

Our parser API: ITextParsers.h, CTextParsers.h, CTextParsers.cpp
Benchmark: benchmark.cpp

What happened?

[SOURCEMOD] Valve Read benchmark: 746622 clock ticks
[SOURCEMOD] SourceMod Read benchmark: 333864 clock ticks
[SOURCEMOD] Valve Lookup benchmark 1: 7740 clock ticks
[SOURCEMOD] SourceMod Lookup benchmark 1: 3339 clock ticks
[SOURCEMOD] Valve Lookup benchmark 2: 5391 clock ticks
[SOURCEMOD] SourceMod Lookup benchmark 2: 1827 clock ticks

In all three benchmarks, our code was more than twice as fast as Valve’s. Lesson: Always tune your data structures. One size does not fit all. Event-based parsers give you this full flexibility. As a bonus, you can implement a tree parser using an event parser (by building the tree during the events). Likewise, you can build an event parser using a tree parser (by firing an event for each iterated node), but that doesn’t make much sense.

Leave a Reply

You must be logged in to post a comment.