[tbnl-devel] UTF-8 problems -- patch

Ivan Shvedunov ivan4th at gmail.com
Thu Jul 28 18:03:56 UTC 2005


Hi.

On 7/28/05, Edi Weitz <edi at agharta.de> wrote:
> Hi!
> 
> On Thu, 28 Jul 2005 03:11:52 +0400, Ivan Shvedunov <ivan4th at gmail.com> wrote:
> 
> > Well, I've promised this patch somewhat earlier, but I didn't have
> > time to complete it...
> 
> Thanks for the patch.  See my comments below.

You're welcome :)

> > I've discovered several problems with TBNL's handling of
> > UTF-8. Namely, there was a problem with url-decode in util.lisp
> > which was turning UTF-8 urlencoded strings into something
> > incomprehensible,
> 
> Note that you're calling COERCE twice in your version of URL-DECODE.

Well, I hope that (coerce bytes '(vector (unsigned-byte 8))) in
bytes-to-string doesn't add much overhead when bytes is already a
(vector (unsigned-byte 8)), but it allows one to pass a plain vector of
numbers there without making Lisp complain about it.
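
As a minimal sketch of why I hope the extra call is cheap: the standard
says COERCE returns the object itself when it is already of the
requested type, so only plain vectors should get copied. The helper name
below is made up for illustration and isn't part of the patch.

```lisp
;; Hypothetical helper, not part of TBNL or the patch.
(defun coerce-returns-same-object-p (vector)
  "True if COERCE hands back VECTOR itself instead of a fresh copy."
  (eq vector (coerce vector '(vector (unsigned-byte 8)))))

;; Already specialized: COERCE returns the very same object.
(coerce-returns-same-object-p
 (make-array 3 :element-type '(unsigned-byte 8)
               :initial-contents '(72 105 33)))

;; Plain vector of numbers: still accepted; on implementations with
;; specialized octet vectors (SBCL, for one) a new vector is built.
(coerce-returns-same-object-p (vector 72 105 33))
```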

> 
> > and also there was a problem with Content-Length in modlisp.lisp which
> > was causing UTF-8 content to be truncated.
> >
> > The attached patch works only with SBCL. I mean that it shouldn't
> > break other Lisps, but proper Unicode handling is implemented only
> > for SBCL. I've tried to make it work with Allegro demo/LispWorks
> > Personal Edition, but with no luck. Well, concerning Allegro, the
> > problem here is that sockets that are used to talk to mod_lisp are
> > set to latin-1 encoding for some reason, most likely KMRCL needs to
> > be fixed a bit, again, unfortunately I just have no time to
> > complete this. As for LispWorks, I just don't know how to turn a
> > string into a series of octets and vice versa using the current encoding -
> > i.e. I didn't find something like Allegro/SBCL
> > octets-to-string/string-to-octets there.
> 
> The file test/test.lisp demonstrates the usage of
> 
>   external-format:encode-lisp-string
> 
> for LispWorks.  See also
> 
>   <http://thread.gmane.org/gmane.lisp.lispworks.general/3481>

Thanks for the pointer, I'll look at it.
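
Coming back to the Content-Length truncation mentioned above: my guess
is that it shows up whenever the header counts characters while the
client counts octets. A rough, portable sketch of the difference
follows. The name utf-8-octet-length is invented here, and the code
assumes CHAR-CODE returns Unicode code points, as it does on SBCL with
Unicode support.

```lisp
;; Hypothetical helper, not part of TBNL or the patch.
;; Assumes CHAR-CODE yields Unicode code points.
(defun utf-8-octet-length (string)
  "Number of octets STRING occupies once encoded as UTF-8."
  (loop for char across string
        for code = (char-code char)
        sum (cond ((< code #x80) 1)       ; ASCII
                  ((< code #x800) 2)      ; e.g. Latin-1 supplement, Cyrillic
                  ((< code #x10000) 3)    ; rest of the BMP
                  (t 4))))                ; supplementary planes

;; One character with code #xE9, but two octets on the wire.
(let ((s (string (code-char #xE9))))
  (list (length s) (utf-8-octet-length s)))
```

A Content-Length taken from LENGTH would be too small for such a string,
so the tail of the body would be cut off.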

> 
> > Concerning implementation - I've introduced :tbnl-unicode feature
> > that is set for supported Unicode-aware Lisps in specials.lisp (I'm
> > setting it for Allegro and SBCL, though it doesn't help Allegro
> > much).
> 
> My main concern is that at the moment the external format is kind of
> hard-coded into TBNL (or relying on some global setting), so if for
> example you use UTF-8 you can't serve binary content like JPGs
> anymore.  Wouldn't it be better if content were always sent as a
> sequence of octets?  (That would also solve the AllegroCL problem you
> mention above.)

I think this will be DEFINITELY better. I just haven't studied the TBNL
sources enough and don't know whether this will require a lot of
changes. Well, it's possible to make simple versions of the
bytes-to-string and string-to-bytes functions for non-Unicode Lisps
(using char-code/code-char) and then convert the code to binary
output mode.
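
Just to illustrate what I mean by the simple versions, something along
these lines might do. These are hypothetical fallbacks with made-up
names, not the definitions from the patch, and they only make sense for
characters whose codes fit into one octet.

```lisp
;; Hypothetical fallback definitions, not the ones from the patch.
;; Each character code is treated as one octet, so no real encoding
;; is done and only codes below 256 can be handled.
(defun simple-string-to-bytes (string)
  "Turn STRING into a vector of octets, one octet per character."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun simple-bytes-to-string (bytes)
  "Turn a sequence of octets back into a string, one character per octet."
  (map 'string #'code-char bytes))

;; Round trip for plain ASCII data.
(simple-bytes-to-string (simple-string-to-bytes "GET /"))
```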

> > Also I've added supporting funcs, bytes-to-string and
> > string-to-bytes (defined only when #+tbnl-unicode) that do the dirty
> > job of string conversion.
> 
> I'd prefer if they were called "bytes" and not "octets" because a byte
> doesn't necessarily have 8 bits.

? They _are_ called "bytes"...

> They should also be exported from the TBNL package, shouldn't they?

Yes, I think they can be useful.

I'll try to build a more elaborate patch, but probably this will happen
no earlier than next week.

Ivan.


