[tbnl-devel] UTF-8 problems -- patch

Edi Weitz edi at agharta.de
Thu Jul 28 17:37:20 UTC 2005


Hi!

On Thu, 28 Jul 2005 03:11:52 +0400, Ivan Shvedunov <ivan4th at gmail.com> wrote:

> Well, I've promised this patch somewhat earlier, but I didn't have
> time to complete it...

Thanks for the patch.  See my comments below.

> I've discovered several problems with TBNL's handling of
> UTF-8. Namely, there was a problem with url-decode in util.lisp
> which was turning UTF-8 urlencoded strings into something
> incomprehensible,

Note that you're calling COERCE twice in your version of URL-DECODE.
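
As a rough sketch of the approach I have in mind (not code from TBNL
or from your patch; DECODE-UTF-8 and URL-DECODE-UTF-8 are made-up
names), one could collect all the octets first and decode them in a
single pass.  This assumes well-formed UTF-8, ASCII-only unescaped
characters, and a Lisp where CODE-CHAR accepts Unicode code points:

```lisp
;;; Hypothetical helpers for illustration only.

(defun decode-utf-8 (octets)
  "Decode OCTETS, a vector of (unsigned-byte 8), as UTF-8.
Assumes well-formed input; no error checking is done."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length octets)))
      (loop while (< i n)
            do (let* ((lead (aref octets i))
                      ;; number of continuation octets after the lead octet
                      (extra (cond ((< lead #x80) 0)
                                   ((< lead #xE0) 1)
                                   ((< lead #xF0) 2)
                                   (t 3)))
                      ;; payload bits carried by the lead octet
                      (code (case extra
                              (0 lead)
                              (1 (logand lead #x1F))
                              (2 (logand lead #x0F))
                              (t (logand lead #x07)))))
                 (incf i)
                 (loop repeat extra
                       do (setf code (logior (ash code 6)
                                             (logand (aref octets i) #x3F)))
                          (incf i))
                 (write-char (code-char code) out))))))

(defun url-decode-utf-8 (string)
  "Decode a URL-encoded STRING whose %XX escapes denote UTF-8 octets."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0))
    (loop while (< i (length string))
          do (let ((char (char string i)))
               (case char
                 (#\% ;; two hex digits follow the percent sign
                  (vector-push (parse-integer string
                                              :start (+ i 1)
                                              :end (+ i 3)
                                              :radix 16)
                               octets)
                  (incf i 3))
                 (#\+ ;; a plus sign stands for a space
                  (vector-push 32 octets)
                  (incf i))
                 (t ;; assumed to be plain ASCII
                  (vector-push (char-code char) octets)
                  (incf i)))))
    (decode-utf-8 octets)))
```

With that, (url-decode-utf-8 "a+b%20c") should yield "a b c", and a
sequence like "%D0%9F" becomes the single character with code #x41F.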

> and also there was a problem with Content-Length in modlisp.lisp which
> was causing UTF-8 content to be truncated.
>
> The attached patch works only with SBCL. I mean that it shouldn't
> break other Lisps, but proper Unicode handling is implemented only
> for SBCL. I've tried to make it work with Allegro demo/LispWorks
> Personal Edition, but with no luck. Well, concerning Allegro, the
> problem here is that sockets that are used to talk to mod_lisp are
> set to latin-1 encoding for some reason; most likely KMRCL needs to
> be fixed a bit, but again, unfortunately, I just have no time to
> complete this. As for LispWorks, I just don't know how to turn a
> string into a series of octets and vice versa using the current
> encoding - i.e. I didn't find anything like Allegro/SBCL
> octets-to-string/string-to-octets there.

The file test/test.lisp demonstrates the usage of

  external-format:encode-lisp-string

for LispWorks.  See also

  <http://thread.gmane.org/gmane.lisp.lispworks.general/3481>

> Concerning implementation - I've introduced :tbnl-unicode feature
> that is set for supported Unicode-aware Lisps in specials.lisp (I'm
> setting it for Allegro and SBCL, though it doesn't help Allegro
> much).

My main concern is that at the moment the external format is kind of
hard-coded into TBNL (or relying on some global setting), so if for
example you use UTF-8 you can't serve binary content like JPGs
anymore.  Wouldn't it be better if content were always sent as a
sequence of octets?  (That would also solve the AllegroCL problem you
mention above.)
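
To illustrate the idea (a sketch only, not TBNL code; ENCODE-UTF-8 and
CONTENT-LENGTH are hypothetical names, and UTF-8 is just an assumed
choice of encoding): if the body is turned into octets before it is
sent, Content-Length can be taken from the octet vector rather than
from the number of characters, and binary content passes through
untouched.

```lisp
;;; Hypothetical helpers for illustration only.

(defun encode-utf-8 (string)
  "Return a vector of octets holding STRING encoded as UTF-8.
Assumes CHAR-CODE returns Unicode code points."
  (let ((octets (make-array (* 4 (length string)) ; at most 4 octets per char
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0)))
    (loop for char across string
          for code = (char-code char)
          do (cond ((< code #x80)
                    (vector-push code octets))
                   ((< code #x800)
                    (vector-push (logior #xC0 (ash code -6)) octets)
                    (vector-push (logior #x80 (logand code #x3F)) octets))
                   ((< code #x10000)
                    (vector-push (logior #xE0 (ash code -12)) octets)
                    (vector-push (logior #x80 (logand (ash code -6) #x3F)) octets)
                    (vector-push (logior #x80 (logand code #x3F)) octets))
                   (t
                    (vector-push (logior #xF0 (ash code -18)) octets)
                    (vector-push (logior #x80 (logand (ash code -12) #x3F)) octets)
                    (vector-push (logior #x80 (logand (ash code -6) #x3F)) octets)
                    (vector-push (logior #x80 (logand code #x3F)) octets))))
    octets))

(defun content-length (body)
  "Return the number of octets BODY occupies on the wire.
BODY is either a string (encoded as UTF-8 here) or a vector of octets,
e.g. the contents of a JPG file, which is counted as is."
  (length (if (stringp body)
              (encode-utf-8 body)
              body)))
```

For a string of six Cyrillic letters LENGTH returns 6 while
CONTENT-LENGTH returns 12, which is exactly the kind of mismatch that
leads to truncated output.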

> Also I've added supporting funcs, bytes-to-string and
> string-to-bytes (defined only when #+tbnl-unicode) that do the dirty
> job of string conversion.

I'd prefer it if they were called "octets" and not "bytes" because a
byte doesn't necessarily have 8 bits.  They should also be exported
from the TBNL package, shouldn't they?

Thanks,
Edi.


