Probing Encarta 2007
published: Thu, 17-Aug-2006 | updated: Fri, 18-Aug-2006
Last time I was visiting Microsoft, the organizers for the event I was attending had provided a trip to the company store, together with a voucher allowing us to spend up to $120 on software and hardware. Before you start thinking that's way too little to buy anything of import, realize that the Microsoft Company Store has deep discounts. Amongst other things, I picked up a copy of Encarta Premium 2007 for $20. I like Encarta (I'd bought my previous version of it at the company store too), but this time, it came with a little frisson. If you like algorithms (duh, that's why you're here) and you program in Visual Studio 2005 and .NET 2.0 and want a little fun, read on.
The side of the box gives us a sly hint: Requires Microsoft .NET Framework 2.0. Yep, at least part of Encarta 2007 is managed code, even though the main application is not.
So I took a gander at what was installed. Something caught my eye immediately: a DLL called MICROSOFT.ENCARTA.BTREE.DLL. Gosh, oo-er, way cool, a B-tree implementation. And then I just had to fire up Reflector to see how the B-tree was implemented. The assembly wasn't obfuscated, so I could see the types and disassemble the methods.
And then I just had to fire up VS2005 to write a program that uses it. Here's what I found.
First, I did a little investigation into what I perceived as the main
type BTreeFile
. Obviously a file that holds a B-tree
(never let it be said that I don't catch on quickly).
BTreeFile file = new BTreeFile(); file.Create("test.btf", new BTreeKey_UInt32(), 100); file.Close();
The call to Create()
took a little surfing with Reflector
to get right. The first parameter is merely the file name. The second
parameter is an instance of a descendant of BTreeKey
.
It's called keyFactory
and isn't an actual key per se.
It's used internally for calling the virtual
BTreeKey.CreateNew()
instance method to generate new
instances of the same type without having to go through reflection
gyrations to do so. The third parameter is the maximum number of keys
per B-tree node.
Running this application created a 103-byte file on disk, containing just the header block for the B-tree.
00000000 28 43 6F 70 79 72 69 67 68 74 20 28 63 29 20 4D (Copyright (c) M 00000010 69 63 72 6F 73 6F 66 74 20 43 6F 72 70 6F 72 61 icrosoft Corpora 00000020 74 69 6F 6E 20 32 30 30 34 01 05 00 00 00 00 00 tion 2004······· 00000030 01 00 00 00 00 00 00 00 00 64 00 27 4D 69 63 72 ·········d·'Micr 00000040 6F 73 6F 66 74 2E 45 6E 63 61 72 74 61 2E 42 54 osoft.Encarta.BT 00000050 72 65 65 2E 42 54 72 65 65 4B 65 79 5F 55 49 6E ree.BTreeKey_UIn 00000060 74 33 32 00 00 00 00 t32····
Nothing too interesting, and if I really wanted to I could trace through the code to see what it all means (well, apart from the copyright statement and the type name for the key factory, which are fairly obvious). I imagine the 64 hex in there is the number of keys per node.
Let's see if we can add a key. At first I tried to write this:
BTreeFile file = new BTreeFile(); file.Create("test.btf", new BTreeKey_UInt32(), 100); try { BTreeKey key = new BTreeKey_UInt32(); file.InsertKey(key); } finally { file.Close(); }
But it didn't make any sense. How do I set the value of the key? The
constructor to BTreeKey_UInt32
has no parameters. Typing
key.
and looking at Intellisense didn't reveal anything
in particular. There is an OriginalKey()
method but it
returns an object, it doesn't set it. Back to Reflector.
It turns out that casting to the ancestor type was the wrong thing to
do. The actual type has a method called SetKey()
:
BTreeFile file = new BTreeFile(); file.Create("test.btf", new BTreeKey_UInt32(), 100); try { BTreeKey_UInt32 key = new BTreeKey_UInt32(); key.SetKey(12345); file.InsertKey(key); } finally { file.Close(); }
This didn't work, at least in the sense I was thinking of. The file
size was still 103 bytes and looked exactly the same as before. Back
to Reflector. Ah ha! Close()
closes the file, sure, but
there's a cache involved. We need to save our changes first.
BTreeFile file = new BTreeFile(); file.Create("test.btf", new BTreeKey_UInt32(), 100); try { BTreeKey key = new BTreeKey_UInt32(); key.SetKey(12345); file.InsertKey(key); file.Save(); } finally { file.Close(); }
There are two overloads of Save()
: one taking no
parameters and the other taking a modify
bool
parameter. I'm not sure yet what the
modify
parameter does (it involves the cache in some way
but I'll have to do more investigation with Reflector), but the
parameterless Save()
calls the other using the value
false
.
This still leaves a problem though. A B-tree that stores only keys is
not that interesting. There must be a value associated with the key;
the record, if you will. More tracing with Reflector reveals that
there is a _fileOffset
field in the ancestor
BTreeKey
class, but nothing seemed to be setting it. And
then it hit me: the member was public: you could just set the field to
whatever you wanted.
I'm sorry but at first glance this seems really daft: there's a method
called SetKey()
, which merely sets the _key member of the
object, but there's no method for the file offset. Party on, it seems.
So, now I can see how this BTreeFile
class is supposed to
be used. We have to assume an external file containing the data. This
file's "records" are accessed by a (full 32-bit) file offset and I
imagine that each "record" has a length which is encoded at that
offset with the data following, or that the file merely contains
fixed-length records. When we want to index those records, we get
their keys (in my code we would have 32-bit unsigned integers as a
key) and the file offset to the data, and then add the key/offset data
to the B-tree.
BTreeFile file = new BTreeFile(); file.Create("test.btf", new BTreeKey_UInt32(), 100); try { BTreeKey_UInt32 key = new BTreeKey_UInt32(); key.SetKey(12345); key._fileOffset = 32768; file.InsertKey(key); file.Save(); } finally { file.Close(); }
So in my code above: I'm associating key 12345 with an offset into some other file of 32768.
Obviously, creating a B-tree file every time is silly; we have to be able to open an existing one. And once it's open we have to be able to find the key we'd already inserted.
file = new BTreeFile(); file.Open("test.btf", new PriorityCache(0x4e2000)); try { BTreeKey_UInt32 searchKey = new BTreeKey_UInt32(); searchKey.SetKey(12345); BTreeKey_UInt32 foundKey = file.Find(searchKey) as BTreeKey_UInt32; if (foundKey._fileOffset != 32768) throw new Exception("file offset in found key is wrong"); } finally { file.Close(); }
The weird thing here is the magic number passed as a parameter to the
PriorityCache
constructor. It's called
maxSize
, and the value I use is the value used by
BTreeFile.Create()
. No, I don't know why this particular
number (it's 5120000 in decimal, so 10000 times 512?).
Couple of observations. First is that this assembly is
non-redistributable. It's part and parcel of Encarta 2007, so to play
around with it, you've got to buy the product. And that in turn
indicates all this post can be: just playing around with a niftily
named DLL. Second is that, although the assembly is .NET 2.0, there is
nary a glimpse of a generic type or method. I kept wanting to write
BTreeKey<int>
or something.
Still it's interesting to see inside a Microsoft .NET 2.0 assembly for a large widely-distributed product. I'll have some more comments about it in another post, but for now, I must call it a day.