[elephant-devel] new elephant and unicode troubles

Ties Stuij cjstuij at gmail.com
Sun Feb 25 12:58:38 UTC 2007


> Hi Ties! Nothing beats a sunday morning bughunt!
Ain't that the truth!

To clarify:
As I understand it, the code just wants to circumvent the whole
implementation-dependent unicode trouble by relying on char codes.
This is fine, but it means that Elephant databases encoded by one
implementation in one character set cannot be properly decoded by
other implementations, or by the same implementation configured for a
different encoding. It also forces a variable-width format into a
fixed byte length per character, doubling a string's size if just one
char is above the mean, so to speak.
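
To make the doubling concrete (a hypothetical REPL transcript; the
character name syntax is SBCL's, other lisps spell Unicode names
differently):

    (char-code #\a)                         ; => 97, fits in one byte
    (char-code #\GREEK_SMALL_LETTER_ALPHA)  ; => 945, needs two bytes
    ;; one alpha in an otherwise-ASCII string pushes the whole string
    ;; into the two-byte format, roughly doubling its serialized size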

I for one would rather reintroduce the unicode troubles by dispatching
on string type and then using something similar to, if not exactly the
same as, arnesi's string-to-octets and octets-to-string to encode and
decode strings to and from bytes instead of using char-code and
code-char. This shouldn't be too hard to implement and would have the
added bonus of making the code a lot cleaner.
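
Something along these lines should do (a minimal sketch; I'm assuming
arnesi's string-to-octets and octets-to-string take the encoding as a
second argument, and the serialize-string/deserialize-string wrappers
are just placeholder names):

    ;; encode to one fixed external format, so any implementation
    ;; can decode what another implementation wrote
    (defun serialize-string (string)
      (arnesi:string-to-octets string :utf-8))

    (defun deserialize-string (octets)
      (arnesi:octets-to-string octets :utf-8))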

But then of course I just don't have the time at the moment to
implement it, and it doesn't have much priority for me, but if you
wait half a year I'll send a patch.

In the meantime, in addition to Henrik's suggestion, I would suggest
raising the upper limit for char codes in serialize-to-utf8 from #x7f
to #xff. E.g. this part:

           (etypecase string
             (simple-string
              (loop for i fixnum from 0 below characters do
                   (let ((code (char-code (schar string i))))
                     (declare (type fixnum code))
                     ;; punt to the two-byte encoding if this char
                     ;; code doesn't fit in a single byte
                     (when (> code #xff) (fail))
                     (setf (uffi:deref-array buffer 'array-or-pointer-char
                                             (+ i size))
                           code))))
             (string
              (loop for i fixnum from 0 below characters do
                   (let ((code (char-code (char string i))))
                     (declare (type fixnum code))
                     (when (> code #xff) (fail))
                     (setf (uffi:deref-array buffer 'array-or-pointer-char
                                             (+ i size))
                           code)))))

This will reduce the times you would need the two-byte encoding to
almost never, from where I'm sitting, and I don't see the point in
having an upper limit of #x7f. If your lisp doesn't support char codes
higher than #x7f it won't try to encode them, so why not use all the
room the byte has to offer? Or am I missing something? That's usually
the case.
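
For instance, on a Unicode lisp where characters map to their Unicode
code points:

    (char-code #\é)  ; => 233: above #x7f, but well within a byte
    ;; with the limit at #x7f this common Latin-1 character would
    ;; already force the two-byte encoding, for no real gain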

greets,
Ties
