Probing Encarta 2007

published: Thu, 17-Aug-2006   |   updated: Fri, 18-Aug-2006

Last time I was visiting Microsoft, the organizers for the event I was attending had provided a trip to the company store, together with a voucher allowing us to spend up to $120 on software and hardware. Before you start thinking that's way too little to buy anything of import, realize that the Microsoft Company Store has deep discounts. Amongst other things, I picked up a copy of Encarta Premium 2007 for $20. I like Encarta (I'd bought my previous version of it at the company store too), but this time, it came with a little frisson. If you like algorithms (duh, that's why you're here) and you program in Visual Studio 2005 and .NET 2.0 and want a little fun, read on.

The side of the box gives us a sly hint: Requires Microsoft .NET Framework 2.0. Yep, at least part of Encarta 2007 is managed code, even though the main application is not.

So I took a gander at what was installed. Something caught my eye immediately: a DLL called MICROSOFT.ENCARTA.BTREE.DLL. Gosh, oo-er, way cool, a B-tree implementation. And then I just had to fire up Reflector to see how the B-tree was implemented. The assembly wasn't obfuscated, so I could see the types and disassemble the methods.

And then I just had to fire up VS2005 to write a program that uses it. Here's what I found.

First, I did a little investigation into what I perceived as the main type BTreeFile. Obviously a file that holds a B-tree (never let it be said that I don't catch on quickly).

BTreeFile file = new BTreeFile();
file.Create("test.btf", new BTreeKey_UInt32(), 100);
file.Close();

The call to Create() took a little surfing with Reflector to get right. The first parameter is merely the file name. The second parameter is an instance of a descendant of BTreeKey. It's called keyFactory and isn't an actual key per se. It's used internally for calling the virtual BTreeKey.CreateNew() instance method to generate new instances of the same type without having to go through reflection gyrations to do so. The third parameter is the maximum number of keys per B-tree node.

Running this application created a 103-byte file on disk, containing just the header block for the B-tree.

00000000  28 43 6F 70  79 72 69 67  68 74 20 28  63 29 20 4D  (Copyright (c) M
00000010  69 63 72 6F  73 6F 66 74  20 43 6F 72  70 6F 72 61  icrosoft Corpora
00000020  74 69 6F 6E  20 32 30 30  34 01 05 00  00 00 00 00  tion 2004·······
00000030  01 00 00 00  00 00 00 00  00 64 00 27  4D 69 63 72  ·········d·'Micr
00000040  6F 73 6F 66  74 2E 45 6E  63 61 72 74  61 2E 42 54  osoft.Encarta.BT
00000050  72 65 65 2E  42 54 72 65  65 4B 65 79  5F 55 49 6E  ree.BTreeKey_UIn
00000060  74 33 32 00  00 00 00                               t32····

Nothing too interesting, and if I really wanted to I could trace through the code to see what it all means (well, apart from the copyright statement and the type name for the key factory, which are fairly obvious). I imagine the 64 hex in there is the number of keys per node.

Let's see if we can add a key. At first I tried to write this:

BTreeFile file = new BTreeFile();
file.Create("test.btf", new BTreeKey_UInt32(), 100);
try {
  BTreeKey key = new BTreeKey_UInt32();
  file.InsertKey(key);
}
finally {
  file.Close();
}

But it didn't make any sense. How do I set the value of the key? The constructor to BTreeKey_UInt32 has no parameters. Typing key. and looking at Intellisense didn't reveal anything in particular. There is an OriginalKey() method but it returns an object, it doesn't set it. Back to Reflector.

It turns out that casting to the ancestor type was the wrong thing to do. The actual type has a method called SetKey():

BTreeFile file = new BTreeFile();
file.Create("test.btf", new BTreeKey_UInt32(), 100);
try {
  BTreeKey_UInt32 key = new BTreeKey_UInt32();
  key.SetKey(12345);
  file.InsertKey(key);
}
finally {
  file.Close();
}

This didn't work, at least in the sense I was thinking of. The file size was still 103 bytes and looked exactly the same as before. Back to Reflector. Ah ha! Close() closes the file, sure, but there's a cache involved. We need to save our changes first.

BTreeFile file = new BTreeFile();
file.Create("test.btf", new BTreeKey_UInt32(), 100);
try {
  BTreeKey key = new BTreeKey_UInt32();
  key.SetKey(12345);
  file.InsertKey(key);
  file.Save();
}
finally {
  file.Close();
}

There are two overloads of Save(): one taking no parameters and the other taking a modify bool parameter. I'm not sure yet what the modify parameter does (it involves the cache in some way but I'll have to do more investigation with Reflector), but the parameterless Save() calls the other using the value false.

This still leaves a problem though. A B-tree that stores only keys is not that interesting. There must be a value associated with the key; the record, if you will. More tracing with Reflector reveals that there is a _fileOffset field in the ancestor BTreeKey class, but nothing seemed to be setting it. And then it hit me: the member was public: you could just set the field to whatever you wanted.

I'm sorry but at first glance this seems really daft: there's a method called SetKey(), which merely sets the _key member of the object, but there's no method for the file offset. Party on, it seems.

So, now I can see how this BTreeFile class is supposed to be used. We have to assume an external file containing the data. This file's "records" are accessed by a (full 32-bit) file offset and I imagine that each "record" has a length which is encoded at that offset with the data following, or that the file merely contains fixed-length records. When we want to index those records, we get their keys (in my code we would have 32-bit unsigned integers as a key) and the file offset to the data, and then add the key/offset data to the B-tree.

BTreeFile file = new BTreeFile();
file.Create("test.btf", new BTreeKey_UInt32(), 100);
try {
  BTreeKey_UInt32 key = new BTreeKey_UInt32();
  key.SetKey(12345);
  key._fileOffset = 32768;
  file.InsertKey(key);
  file.Save();
}
finally {
  file.Close();
}

So in my code above: I'm associating key 12345 with an offset into some other file of 32768.

Obviously, creating a B-tree file every time is silly; we have to be able to open an existing one. And once it's open we have to be able to find the key we'd already inserted.

file = new BTreeFile();
file.Open("test.btf", new PriorityCache(0x4e2000));
try {
  BTreeKey_UInt32 searchKey = new BTreeKey_UInt32();
  searchKey.SetKey(12345);
  BTreeKey_UInt32 foundKey = file.Find(searchKey) as BTreeKey_UInt32;
  if (foundKey._fileOffset != 32768) 
    throw new Exception("file offset in found key is wrong");
}
finally {
	file.Close();
}

The weird thing here is the magic number passed as a parameter to the PriorityCache constructor. It's called maxSize, and the value I use is the value used by BTreeFile.Create(). No, I don't know why this particular number (it's 5120000 in decimal, so 10000 times 512?).

Couple of observations. First is that this assembly is non-redistributable. It's part and parcel of Encarta 2007, so to play around with it, you've got to buy the product. And that in turn indicates all this post can be: just playing around with a niftily named DLL. Second is that, although the assembly is .NET 2.0, there is nary a glimpse of a generic type or method. I kept wanting to write BTreeKey<int> or something.

Still it's interesting to see inside a Microsoft .NET 2.0 assembly for a large widely-distributed product. I'll have some more comments about it in another post, but for now, I must call it a day.