Fixing EZDSL for Unicode

published: Wed, 26-Dec-2007   |   updated: Mon, 31-Dec-2007

And so I got bored during the down times between two performances of A Christmas Carol and decided to take a look at EZDSL with regard to the promised Unicode support that should appear in the next version of Delphi (codenamed Tiburón).

Of course, at present there is no official word on what is going to be made available, but there have been enough hints from various CodeGear people that I can make some educated assumptions about what will change.

  1. The char type will change from AnsiChar to a Unicode character type like WideChar. In particular, sizeof(char) will no longer equal 1, but will almost certainly become 2, just like in .NET.
  2. The string type will change from the current AnsiString to a Unicode string type. I think I've seen this latter type called UnicodeString, but it may just be a different version of WideString. In particular, again, the elements of the string will be Unicode characters. I've no idea what the internal representation of the new string type will be, but I'm pretty sure I have no code that makes any assumptions about it.
  3. PChar will still be a pointer to a char, but in particular the size of the underlying type will become 2. This is an almost certain gotcha, as I can well imagine I've used PChars in places and assumed that the size of the underlying type is 1.

In essence the char type will become two bytes in size and the rest pretty much flows from that.
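A small console program makes these assumptions checkable rather than guessed at. This is only a sketch: it reports whatever the compiler in use says about char, so under the assumptions above the first figure would move from 1 to 2 with no change to the source. The program name is made up for illustration.

```pascal
program CharSizeCheck;
{$APPTYPE CONSOLE}
{Sketch only: reports the sizes the compiler actually uses}
var
  S : string;
begin
  S := 'EZDSL';
  {AnsiChar and WideChar have fixed sizes, whatever char maps to}
  Assert(SizeOf(AnsiChar) = 1);
  Assert(SizeOf(WideChar) = 2);
  writeln('sizeof(char) = ', SizeOf(Char));
  {Length counts characters, not bytes; multiply to get the byte count}
  writeln('characters   = ', Length(S));
  writeln('bytes        = ', Length(S) * SizeOf(Char));
end.
```

Anywhere legacy code uses Length(S) as a byte count (in a Move or a GetMem, say) is a candidate for the same multiplication.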

As I had some time, I thought I'd have a look at the EZDSL code and see what I could find given these assumptions. This may help you in looking at your own legacy code and seeing what needs to be done.

First thing: search for "string" and see what comes up. EZDSLBSE.PAS had this:

 EZVersionNumber : string[4] = '3.03';

Er, well, dunno what this is going to do. A string declared like this is a short string, and I'm fairly positive they'll still exist, otherwise lots of code will get broken. I presume though that short strings will remain non-Unicode. Maybe it's time just to declare this identifier like this instead (and I'm fairly sure that the original code was a space-optimization hangover from Delphi 1):

 EZVersionNumber = '3.03';

The next really interesting thing I found was this case block in the hash table code in EZDSLHSH.PAS (I've tidied it up by removing the $IFDEF Windows bits):

 hesInUse   : begin
                {the state is 'in use', we check to see if it's
                 our string, if it is, exit returning true and
                 the index}
                if IgnoreCase then begin
                  if (AnsiCompareText(heString, aKey) = 0) then begin
                    aIndex := KeyHash;
                    Result := true;
                    Exit;
                  end;
                end
                else begin
                  if (heString = aKey) then begin
                    aIndex := KeyHash;
                    Result := true;
                    Exit;
                  end;
                end;
              end;

Here we're comparing a key being passed in (the hash table in EZDSL only uses strings as keys) with a key inside the hash table. There are two cases since the hash table can use its keys in either a case-insensitive or a case-sensitive manner. The issue for me here (apart from the fact that I wouldn't write code like this these days) is this: what is going to happen to AnsiCompareText? Its name seems to imply ANSI, or single-byte, characters. Will it be modified to work with Unicode strings, but still keep the same name? I don't know.

Now, since this is the only place in EZDSL that I used the AnsiCompareXxx functions, my best bet is to isolate this code in a separate method for now, and deal with the reality of what happens in Tiburón when I get it.
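As a rough sketch of that isolation, both comparisons could live in one small routine so that only one place needs revisiting later. KeysMatch is a hypothetical name, not something in EZDSL, and in the real unit it would more likely be a private method of the hash table class than a standalone function.

```pascal
program KeyCompareSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

{KeysMatch is a hypothetical helper: the only place that knows which
 comparison routine is used for hash table keys}
function KeysMatch(const aStored, aKey : string;
                         aIgnoreCase   : boolean) : boolean;
begin
  if aIgnoreCase then
    Result := AnsiCompareText(aStored, aKey) = 0
  else
    Result := aStored = aKey;
end;

begin
  Assert(KeysMatch('Delphi', 'DELPHI', true));
  Assert(not KeysMatch('Delphi', 'DELPHI', false));
  Assert(KeysMatch('Delphi', 'Delphi', false));
  writeln('ok');
end.
```

If AnsiCompareText turns out to need replacing, the change is then confined to the body of this one routine.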

Moving on I came to EZDSLSUP.PAS, a set of supplementary routines for EZDSL users. In this unit I have a set of simple routines for short strings (allocating them on the heap and so on). Since I'm assuming that short strings will continue to exist, I think I can just assume that these routines will continue to work. Which is a good decision to make since some of them are written in BASM, and I don't want to have to fiddle around with that.

Next up in my search is EZLNGSTK.PAS, an example stack that stores strings (or as I call them here, long strings). Unfortunately here I'm deliberately assuming how a long string works (that is, it is reference counted and casting to a pointer fools the reference counting) and I'm manipulating the reference count through dodgy code.

procedure TStringStack.Push(const S : string);
var
  NewS : string;
begin
  {increment the reference count}
  NewS := S;
  {push the string as pointer}
  {the compiler will set NewS to '' for us at the end statement and
   decrement the ref count, but we don't want it to: the stack now
   has one reference to the string; so fool the compiler}
  pointer(NewS) := nil;
end;

Yuk. This unit will just have to be changed to be more explicit about what I'm doing (that is, I'll have to allocate a node on the heap that comprises a single string field). Either that or I will wait until I can see how the new string type is implemented. Thinking about it, plan A is best: I no longer get a vicarious thrill from writing code that relies on internal structures and behavior.
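Plan A might look something like the following sketch. The names (TStringNode, PushString, PopString) are invented for illustration, and a plain TList stands in for the EZDSL stack so that the example is self-contained. The point is that New and Dispose on a typed pointer initialize and finalize the string field, so the compiler does all the reference counting and nothing depends on how the string is laid out internally.

```pascal
program StringNodeSketch;
{$APPTYPE CONSOLE}
uses
  Classes;

type
  {hypothetical node type: a heap record holding a single string}
  PStringNode = ^TStringNode;
  TStringNode = record
    snValue : string;
  end;

procedure PushString(aStack : TList; const S : string);
var
  Node : PStringNode;
begin
  New(Node);            {New initializes the string field}
  Node^.snValue := S;   {ordinary assignment; the compiler counts refs}
  aStack.Add(Node);
end;

function PopString(aStack : TList) : string;
var
  Node : PStringNode;
begin
  Node := PStringNode(aStack.Last);
  aStack.Delete(pred(aStack.Count));
  Result := Node^.snValue;
  Dispose(Node);        {Dispose finalizes the string field}
end;

var
  Stack : TList;
begin
  Stack := TList.Create;
  try
    PushString(Stack, 'first');
    PushString(Stack, 'second');
    Assert(PopString(Stack) = 'second');
    Assert(PopString(Stack) = 'first');
    Assert(Stack.Count = 0);
  finally
    Stack.Free;
  end;
end.
```

The cost is one extra heap allocation per pushed string, which seems a fair price for code that no longer fools the compiler.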

Next up is EZSTRSTK.PAS, which just needs rewriting (with Delphi32, I pass long strings in and I copy short strings onto the stack; just nasty and I deserve a quick slap for it).

That's it for strings. Mostly the code should compile and work just fine with Unicode strings, but as indicated I do have some unanswerable questions at the moment.

Let's see how I do with the char type. In searching for 'char' I came across many uses of FillChar and I'm assuming that this won't change since all it requires is a pointer, a size in bytes, and a byte value to use. Nothing to do with chars per se. Apart from that all I hit were instances of PChar, so let's look at those.

First thing I got was a classic "assume a character is one byte" issue in EZDSLBSE.PAS:

procedure TNodeStore.nsGrowSpareNodeStack;
var
  i : integer;
  Temp : PNode;
  Node : PNode;
  WalkerNode : PChar absolute Node; {for pointer arithmetic}
begin
  SafeGetMem(Temp, nsNodeSize * cNumNodes);
  Temp^.Link := nsBlock;
  nsBlock := Temp;
  Node := nsBlock;
  WalkerNode := WalkerNode + nsNodeSize; {alters Node}
  for i := 1 to pred(cNumNodes) do begin
    Node^.Link := nsNodeStack;
    nsNodeStack := Node;
    WalkerNode := WalkerNode + nsNodeSize; {alters Node}
  end;
  inc(nsSpareNodeCount, pred(cNumNodes));
end;

This is a typical example of too-clever-by-half code, but there's nothing for it because of the way I've defined TNode. TNode is a record that has different sizes depending on how it is used, so it doesn't have a set size and sizeof() will typically produce the wrong answer. In this particular method, I'm allocating an array of TNodes (but I can't declare it as an array) and then walking through the "logical" nodes, pushing them on the free list. To do the walking I'm using a PChar typecast and adding the size of the node to this every cycle through the loop. And, implicitly, it assumes that the size of a char is one byte. The fix is simple, as it happens: declare a pointer-to-a-byte type (I really can't remember when PByte was first declared in the VCL), and then use that instead of PChar. The alternative is to divide the size of the node by 2 (the size of a Tiburón char) in the code that increments the pointer, but this is getting silly: the code has nothing to do with characters, so let's not pretend that it does.

The next use of PChar came up in EZDSLTHD.PAS, where I use it to "typecast" a string to a PChar. All this is doing in effect is to get a pointer to the first character in the string so that it can be passed to an external Windows routine (in this case, CreateMutex et al). Since this type of code occurs throughout the VCL, I'm going to assume that the compiler/VCL in Tiburón will call the "W" versions of the Windows routines rather than the "A" versions that have been used until now. In other words: I shall assume that this code will just work without change.
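For reference, the pattern in question looks roughly like this. It is a Windows-only sketch, not the EZDSLTHD code: the mutex name is made up, and whether CreateMutex resolves to the "A" or the "W" routine is decided by the Windows unit, which is exactly the part I'm assuming will be switched over for me.

```pascal
program MutexNameSketch;
{$APPTYPE CONSOLE}
uses
  Windows;
var
  MutexName : string;
  hMutex    : THandle;
begin
  MutexName := 'EZDSL.Sketch.Mutex'; {hypothetical name}
  {PChar(MutexName) yields a pointer to the first character of the
   string; the element size follows whatever char is}
  hMutex := CreateMutex(nil, false, PChar(MutexName));
  if (hMutex <> 0) then begin
    writeln('mutex created');
    CloseHandle(hMutex);
  end
  else
    writeln('CreateMutex failed, error ', GetLastError);
end.
```

Because the string, the PChar cast, and the imported routine all change character type together, the source itself has nothing that needs editing.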

Update: 31-Dec-2007

Hallvard Vassbotn was kind enough to drop me a line to say that the proposed solution to my questionable use of PChar wouldn't work. I'd forgotten that PChar is treated specially by the compiler for pointer arithmetic. Sigh. So the solution is to use PAnsiChar instead or to divide the increment by sizeof(char).
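The PAnsiChar variant can be sketched like this. It is a standalone illustration rather than the EZDSL code: the node size and count are made-up constants, and a raw GetMem block stands in for the node store. Since AnsiChar stays one byte wide, adding the node size to a PAnsiChar advances by exactly that many bytes.

```pascal
program NodeWalkSketch;
{$APPTYPE CONSOLE}
const
  cNodeSize = 12;  {hypothetical node size in bytes}
  cNumNodes = 4;   {hypothetical number of nodes per block}
var
  Block  : pointer;
  Walker : PAnsiChar;
  i      : integer;
begin
  GetMem(Block, cNodeSize * cNumNodes);
  try
    Walker := PAnsiChar(Block);
    {step from the first logical node to the last one}
    for i := 1 to pred(cNumNodes) do
      Walker := Walker + cNodeSize;
    {the distance walked is in bytes because sizeof(AnsiChar) is 1}
    Assert((Walker - PAnsiChar(Block)) = cNodeSize * pred(cNumNodes));
    writeln('walked ', Walker - PAnsiChar(Block), ' bytes');
  finally
    FreeMem(Block);
  end;
end.
```

The other option, dividing the increment by sizeof(char), keeps PChar but quietly requires the node size to be a multiple of the character size.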

...Which brings up another point. I had a look at the code from my book a couple of days later. In it, I'd been very clever: I'd used AnsiChar and PAnsiChar throughout. Unfortunately, I'd also used string throughout too. Double sigh.