[tbnl-devel] Re: TBNL: URL-DECODE and UTF-8 parameters

Edi Weitz edi at agharta.de
Thu Sep 22 12:30:55 UTC 2005


[Please use the mailing list - see Cc (and register first).]

Hi Will!

On Wed, 21 Sep 2005 23:52:41 -0700, Will <will at glozer.net> wrote:

> CafeSpot came across a problem in the URL-DECODE function of TBNL: it
> doesn't decode UTF-8 encoded URLs correctly.  I see there was a
> thread on this in July,
> http://common-lisp.net/pipermail/tbnl-devel/2005-July/000358.html,
> but apparently no resolution.  Enclosed is a new version of the
> function, I'm a lisp newbie so it may not be ideal =)
>
> This particular function only works in Allegro, but it would work in
> any lisp that has a function to convert a UTF-8 encoded octet array
> to a string.  I believe SBCL has a similar OCTETS-TO-STRING function,
> I didn't see anything really obvious for LispWorks though.  At the
> moment I only have ACL.
>
> ----
>
> (defun url-decode (string)
>   (let ((string-length (length string)))
>     (flet ((parse-hex-escape (start)
>              (if (<= (+ start 3) string-length)
>                  (parse-integer string
>                                 :start (+ start 1)
>                                 :end (+ start 3)
>                                 :radix 16)
>                  (error "invalid hex encoding in string '~A'" string))))
>       (let ((vector (make-array string-length
>                                 :adjustable t
>                                 :element-type '(unsigned-byte 8)
>                                 :fill-pointer 0)))
>         (loop
>            for i below string-length
>            for char = (aref string i)
>            do (vector-push-extend
>                (case char
>                  ((#\+) (char-code #\Space))
>                  ((#\%) (parse-hex-escape (prog1 i (incf i 2))))
>                  (otherwise (char-code char)))
>                vector))
>         #+allegro (excl:octets-to-string vector
>                                          :external-format :utf-8)))))

Thanks for that.  I admit that the current version of URL-DECODE is
not ideal but your version will break existing code.  Note that
browsers will use different URL encodings based on the charset of the
HTML document they're responding to.  For example, if the charset is
ISO-8859-1 (which AFAIK is the default charset for Apache) the string
"äöü" (that's umlaut a, umlaut o, umlaut u in case it doesn't make it
through email) will be sent as

  %E4%F6%FC

which the version of URL-DECODE above won't decode correctly - it'll
expect

  %C3%A4%C3%B6%C3%BC

instead.  Unfortunately, the browsers don't tell you which charset
they're using... :(
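
To make the difference concrete, here is a minimal sketch (not part of
TBNL) that percent-encodes the same string under both charsets.  The
names PERCENT-ENCODE, STRING-TO-UTF-8-OCTETS and *UMLAUTS* are
hypothetical, and the code assumes a Lisp where CHAR-CODE returns the
Unicode code point of a character.

```lisp
(defun string-to-utf-8-octets (string)
  "Hypothetical helper: encode STRING as a list of UTF-8 octets.
Assumes CHAR-CODE returns Unicode code points."
  (loop for char across string
        for code = (char-code char)
        append (cond ((< code #x80) (list code))
                     ((< code #x800)
                      (list (logior #xC0 (ash code -6))
                            (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (list (logior #xE0 (ash code -12))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F))))
                     (t
                      (list (logior #xF0 (ash code -18))
                            (logior #x80 (logand (ash code -12) #x3F))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F)))))))

(defun percent-encode (string charset)
  "Hypothetical helper: percent-encode every octet of STRING.
CHARSET is :LATIN-1 or :UTF-8.  For :LATIN-1 the characters are
assumed to have codes below 256; this is not checked."
  (let ((octets (ecase charset
                  (:latin-1 (map 'list #'char-code string))
                  (:utf-8 (string-to-utf-8-octets string)))))
    (format nil "~{%~2,'0X~}" octets)))

;; Umlaut a, umlaut o, umlaut u, built from code points so that the
;; encoding of the source file doesn't matter.
(defparameter *umlauts*
  (map 'string #'code-char '(228 246 252)))
```

With these definitions (PERCENT-ENCODE *UMLAUTS* :LATIN-1) should give
the three-octet form shown above and (PERCENT-ENCODE *UMLAUTS* :UTF-8)
the six-octet one.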

The right way to do it would be to add a second optional argument for
the charset to URL-DECODE and make the default value user-configurable
on a per-request basis.  Does that sound OK?  I'll probably add
something like this in the next few days.
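
Roughly, and only as a sketch of the idea rather than what will end up
in TBNL: the names URL-DECODE*, *URL-DECODE-CHARSET* and
UTF-8-OCTETS-TO-STRING below are made up for illustration, only
:LATIN-1 and :UTF-8 are handled, the UTF-8 decoder is hand-written so
the example doesn't depend on any vendor function, and it assumes that
the input string is plain ASCII and that CODE-CHAR accepts Unicode code
points.

```lisp
(defvar *url-decode-charset* :latin-1
  "Hypothetical special variable: default charset for URL-DECODE*.
A handler can rebind it with LET for the duration of one request.")

(defun utf-8-octets-to-string (octets)
  "Decode a vector of UTF-8 octets.  Malformed input is not validated."
  (with-output-to-string (out)
    (let ((i 0) (n (length octets)))
      (flet ((next () ; low six bits of the next continuation octet
               (prog1 (logand (aref octets i) #x3F) (incf i))))
        (loop while (< i n)
              do (let ((lead (aref octets i)))
                   (incf i)
                   (write-char
                    (code-char
                     (cond ((< lead #x80) lead)
                           ((< lead #xE0)
                            (logior (ash (logand lead #x1F) 6) (next)))
                           ((< lead #xF0)
                            (let* ((a (next)) (b (next)))
                              (logior (ash (logand lead #x0F) 12)
                                      (ash a 6) b)))
                           (t
                            (let* ((a (next)) (b (next)) (c (next)))
                              (logior (ash (logand lead #x07) 18)
                                      (ash a 12) (ash b 6) c)))))
                    out)))))))

(defun url-decode* (string &optional (charset *url-decode-charset*))
  "Sketch of URL-DECODE with an optional CHARSET (:LATIN-1 or :UTF-8).
First collects the raw octets, then decodes them according to CHARSET."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0)
        (n (length string)))
    (loop while (< i n)
          do (let ((char (char string i)))
               (case char
                 (#\+ (vector-push (char-code #\Space) octets)
                      (incf i))
                 (#\% (unless (<= (+ i 3) n)
                        (error "invalid hex encoding in string '~A'" string))
                      (vector-push (parse-integer string
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                 (t (vector-push (char-code char) octets)
                    (incf i)))))
    (ecase charset
      (:latin-1 (map 'string #'code-char octets))
      (:utf-8 (utf-8-octets-to-string octets)))))
```

Existing callers that pass one argument keep the ISO-8859-1 behaviour,
while a handler that knows its pages are served as UTF-8 could wrap its
body in (LET ((*URL-DECODE-CHARSET* :UTF-8)) ...) or pass :UTF-8
explicitly.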

Cheers,
Edi.

PS: For LispWorks use EXTERNAL-FORMAT:DECODE-EXTERNAL-STRING and
    EXTERNAL-FORMAT:ENCODE-LISP-STRING but see the recent discussion
    on the LW mailing list w.r.t. delivered applications:

      <http://thread.gmane.org/gmane.lisp.lispworks.general/4524>


