Fixing EZDSL for Unicode
published: Wed, 26-Dec-2007 | updated: Mon, 31-Dec-2007
And so I got bored during the down times between two performances of A Christmas Carol and decided to take a look at EZDSL with regard to the promised Unicode support that should appear in the next version of Delphi (codenamed Tiburón).
Of course, at present there is no official word on what is going to be made available, but there have been enough hints from various CodeGear people that I can make some educated assumptions about what will change.
- The char type will change from AnsiChar to a Unicode character type like WideChar. In particular, sizeof(char) will no longer equal 1, but will almost certainly become 2, just like in .NET.
- The string type will change from the current AnsiString to a Unicode string type. I think I've seen this latter type called UnicodeString, but it may just be a different version of WideString. In particular, again, the elements of the string will be Unicode characters. I've no idea what the internal representation of the new string type will be, but I'm pretty sure I have no code that makes any assumptions about it.
- PChar will still be a pointer to a char, but in particular the size of the underlying type will become 2. This is an almost certain gotcha, as I can well imagine I've used PChars in places and assumed that the size of the underlying type is 1.
In essence the char type will become two bytes in size
and the rest pretty much flows from that.
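To make the assumption concrete, here is a minimal sketch (not part of EZDSL) that reports the character sizes the compiler is using. AnsiChar and WideChar have fixed sizes; the value printed for Char is the one that is assumed to change in Tiburón. The program name and the warning text are made up for illustration.

```pascal
program CharSizes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

begin
  {AnsiChar is always one byte and WideChar always two}
  WriteLn('SizeOf(AnsiChar) = ', SizeOf(AnsiChar));
  WriteLn('SizeOf(WideChar) = ', SizeOf(WideChar));
  {Char is whatever the compiler maps it to: 1 with AnsiChar,
   presumably 2 once Char becomes a Unicode character}
  WriteLn('SizeOf(Char)     = ', SizeOf(Char));
  if SizeOf(Char) <> 1 then
    WriteLn('Code that treats a Char as a byte needs reviewing');
end.
```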
As I had some time, I thought I'd have a look at the EZDSL code and see what I could find given these assumptions. This may help you in looking at your own legacy code and seeing what needs to be done.
First thing: search for "string" and see what comes up. EZDSLBSE.PAS had this:
EZVersionNumber : string[4] = '3.03';
Er, well, dunno what this is going to do. A string declared like this is a short string, and I'm fairly positive they'll still exist, otherwise lots of code will get broken. I presume though that short strings will remain non-Unicode. Maybe it's time just to declare this identifier like this instead (and I'm fairly sure that the original code was a space-optimization hangover from Delphi 1):
EZVersionNumber = '3.03';
The next really interesting thing was this case block in the hash
table code in EZDSLHSH.PAS (I've tidied it up by removing the
$IFDEF Windows bits):
hesInUse : begin
  {the state is 'in use', we check to see if it's
   our string, if it is, exit returning true and
   the index}
  if IgnoreCase then begin
    if (AnsiCompareText(heString, aKey) = 0) then begin
      aIndex := KeyHash;
      Result := true;
      Exit;
    end;
  end
  else begin
    if (heString = aKey) then begin
      aIndex := KeyHash;
      Result := true;
      Exit;
    end;
  end;
end;
Here we're comparing a key being passed in (the hash table in EZDSL
only uses strings as keys) with a key inside the hash table. There are
two cases since the hash table can either use its keys in a
case-insensitive manner or a case-sensitive one. The issue for me here
(apart from the fact I wouldn't write code like this these days), is
this: what is going to happen to AnsiCompareText? Its
name seems to imply ANSI characters, or single-byte characters. Will
this be modified to work with Unicode strings, but still keep the same
name? I don't know.
Now, since this is the only place in EZDSL that I used the
AnsiCompareXxx functions, my best bet is to isolate this
code in a separate method for now, and deal with the reality of what
happens in Tiburón when I get it.
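As a rough sketch of what that isolation might look like: KeysMatch below is a hypothetical helper (the name and signature are mine, not something in EZDSL), written as a standalone function so the example compiles on its own. In the real unit it would be a method of the hash table class. It uses only AnsiCompareText from SysUtils, as the original code does.

```pascal
program IsolateCompare;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

{KeysMatch is a hypothetical helper: it is the single place that would
 need revisiting if AnsiCompareText changes or has to be replaced}
function KeysMatch(const aStored, aKey : string;
                   aIgnoreCase : boolean) : boolean;
begin
  if aIgnoreCase then
    Result := AnsiCompareText(aStored, aKey) = 0
  else
    Result := aStored = aKey;
end;

begin
  {case-insensitive comparison of two keys differing only in case}
  WriteLn(KeysMatch('Tiburon', 'TIBURON', true));
  {case-sensitive comparison of the same two keys}
  WriteLn(KeysMatch('Tiburon', 'TIBURON', false));
end.
```

With a helper like this, the hesInUse branch would shrink to a single test of KeysMatch(heString, aKey, IgnoreCase) followed by the three lines that set aIndex and Result and exit.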
Moving on I came to EZDSLSUP.PAS, a set of supplementary routines for EZDSL users. In this unit I have a set of simple routines for short strings (allocating them on the heap and so on). Since I'm assuming that short strings will continue to exist, I think I can just assume that these routines will continue to work. Which is a good decision to make since some of them are written in BASM, and I don't want to have to fiddle around with that.
Next up in my search is EZLNGSTK.PAS, an example stack that stores strings (or as I call them here, long strings). Unfortunately, here I'm deliberately assuming how a long string works (that is, it is reference counted and casting to a pointer fools the reference counting) and I'm manipulating the reference count through dodgy code.
procedure TStringStack.Push(const S : string);
var
  NewS : string;
begin
  {increment the reference count}
  NewS := S;
  {push the string as pointer}
  Stack.Push(pointer(NewS));
  {the compiler will set NewS to '' for us at the end statement and
   decrement the ref count, but we don't want it to: the stack now
   has one reference to the string; so fool the compiler}
  pointer(NewS) := nil;
end;
Yuk. This unit will just have to be changed to be more explicit about what I'm doing (that is, I'll have to allocate a node on the heap that comprises a single string field). Either that or I will wait until I can see how the new string type is implemented. Thinking about it, plan A is best: I no longer get a vicarious thrill from writing code that relies on internal structures and behavior.
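For what it's worth, plan A might look something like the following sketch. TStringNode, PushString and PopString are hypothetical names, and a TList from the Classes unit stands in for the EZDSL stack so that the example is self-contained. The point is that the string lives in a proper record field, so the compiler does all the reference counting and nothing depends on how the string type is implemented.

```pascal
program StringNodeStack;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

type
  {hypothetical node type: a heap record with a single string field}
  PStringNode = ^TStringNode;
  TStringNode = record
    Value : string;
  end;

var
  Stack : TList; {stand-in for the EZDSL stack}

procedure PushString(const S : string);
var
  Node : PStringNode;
begin
  New(Node);         {New initializes the managed string field}
  Node^.Value := S;  {plain assignment: the compiler adjusts the ref count}
  Stack.Add(Node);
end;

function PopString : string;
var
  Node : PStringNode;
begin
  Node := PStringNode(Stack[Stack.Count - 1]);
  Stack.Delete(Stack.Count - 1);
  Result := Node^.Value;
  Dispose(Node);     {Dispose finalizes the string field, then frees the node}
end;

begin
  Stack := TList.Create;
  try
    PushString('first');
    PushString('second');
    WriteLn(PopString); {last in, first out}
    WriteLn(PopString);
  finally
    Stack.Free;
  end;
end.
```

The cost is one extra heap allocation per pushed string, which seems a fair price for not caring what the new string type looks like inside.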
Next up is EZSTRSTK.PAS, which just needs rewriting (with Delphi32, I pass long strings in and I copy short strings onto the stack; just nasty and I deserve a quick slap for it).
That's it for strings. Mostly the code should compile and work just fine with Unicode strings, but as indicated I do have some unanswerable questions at the moment.
Let's see how I do with the char type. In searching for
'char' I came across many uses of FillChar and I'm
assuming that this won't change since all it requires is a pointer, a
size in bytes, and a byte value to use. Nothing to do with
chars per se. Apart from that all I hit were instances of
PChar, so let's look at those.
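A small sketch of why I'd expect FillChar calls to survive, under the assumption that its count parameter stays a byte count: as long as the count comes from SizeOf, it is already in bytes whatever the width of a char. The buffer here is made up for illustration.

```pascal
program FillCharBytes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

var
  Buffer : array [0..15] of Char; {hypothetical buffer}
begin
  {SizeOf(Buffer) is the size in bytes, so the whole buffer is cleared
   whether Char is one byte or two}
  FillChar(Buffer, SizeOf(Buffer), 0);
  WriteLn('Elements      : ', Length(Buffer));
  WriteLn('Bytes cleared : ', SizeOf(Buffer));
end.
```

The calls to check by hand would be any that pass a character count (such as a string length) where a byte count is expected.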
First thing I got was a classic "assume a character is one byte" issue in EZDSLBSE.PAS:
procedure TNodeStore.nsGrowSpareNodeStack;
var
  i : integer;
  Temp : PNode;
  Node : PNode;
  WalkerNode : PChar absolute Node; {for pointer arithmetic}
begin
  SafeGetMem(Temp, nsNodeSize * cNumNodes);
  Temp^.Link := nsBlock;
  nsBlock := Temp;
  Node := nsBlock;
  WalkerNode := WalkerNode + nsNodeSize; {alters Node}
  for i := 1 to pred(cNumNodes) do begin
    Node^.Link := nsNodeStack;
    nsNodeStack := Node;
    WalkerNode := WalkerNode + nsNodeSize; {alters Node}
  end;
  inc(nsSpareNodeCount, pred(cNumNodes));
end;
This is a typical example of too-clever-by-half code, but there's
nothing for it because of the way I've defined TNode.
TNode is a record that has different sizes depending on
how it is used, so it doesn't have a set size and
sizeof() will typically produce the wrong answer. In this
particular method, I'm allocating an array of TNodes (but
I can't declare it as an array) and then walking through the "logical"
nodes, pushing them on the free list. To do the walking I'm using a
PChar typecast and adding the size of the node to this
every cycle through the loop. And, implicitly, it assumes that the size
of a char is one byte. The fix is simple as it happens: declare a
pointer-to-a-byte type (I really can't remember when
PByte was first declared in the VCL), and then use that
instead of PChar. The alternative is to divide the size
of the node by 2 (the size of a Tiburón char) in the
code that increments the pointer, but this is getting silly: the code
has nothing to do with characters, so let's not pretend that it does.
The next use of PChar came up in EZDSLTHD.PAS, where I
use it to "typecast" a string to a PChar. All this is
doing in effect is to get a pointer to the first character in the
string so that it can be passed to an external Windows routine (in
this case, CreateMutex et al). Since this type of code occurs
throughout the VCL, I'm going to assume that the compiler/VCL in
Tiburón will use the "W" versions of the Windows routines
rather than the "A" versions, as they did in the past. In other words:
I shall assume that this code will just work without change.
Update: 31-Dec-2007
Hallvard Vassbotn was kind
enough to drop me a line to say that the proposed solution to my
questionable use of PChar wouldn't work. I'd forgotten that
PChar is treated specially by the compiler for pointer
arithmetic. Sigh. So the solution is to use PAnsiChar instead
or to divide the increment by sizeof(char).
...Which brings up another point. I had a look at the code from
my book a couple of days later. In it, I'd
been very clever: I'd used AnsiChar and PAnsiChar
throughout. Unfortunately, I'd also used string throughout too.
Double sigh.
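For completeness, the PAnsiChar fix from the update can be sketched along these lines. This is a cut-down stand-in, not the EZDSL code: TNode, NodeSize and NumNodes are simplified placeholders, and every node in the block goes onto the free list. The walker is declared as PAnsiChar, which is assumed to stay a pointer to a one-byte type, so adding NodeSize to it moves Node on by exactly that many bytes.

```pascal
program WalkNodes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  PNode = ^TNode;
  TNode = record
    Link : PNode;
  end;

const
  NodeSize = 16; {hypothetical logical node size in bytes}
  NumNodes = 4;  {hypothetical number of nodes per block}

var
  Block    : pointer;
  Node     : PNode;
  Walker   : PAnsiChar absolute Node; {byte-wise view of Node}
  FreeList : PNode;
  i, Count : integer;
begin
  GetMem(Block, NodeSize * NumNodes);
  try
    FreeList := nil;
    Node := Block;
    for i := 1 to NumNodes do begin
      Node^.Link := FreeList;      {push the node on the free list}
      FreeList := Node;
      Walker := Walker + NodeSize; {alters Node: moves on NodeSize bytes}
    end;
    {count the nodes to check that the walk visited each one}
    Count := 0;
    Node := FreeList;
    while Node <> nil do begin
      inc(Count);
      Node := Node^.Link;
    end;
    WriteLn('Nodes on free list: ', Count);
  finally
    FreeMem(Block);
  end;
end.
```

The other option from the update, keeping PChar and adding NodeSize div SizeOf(Char), would also work as long as the node size is a multiple of the char size, but it keeps pretending the code has something to do with characters.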
