[cl-ppcre-devel] Questions regarding cl-unicode

Mon Jan 12 12:56:03 UTC 2009

Hi,

I just subscribed to this mailing list, which I believe is not only
for cl-ppcre but also for cl-unicode. If I am wrong, please point me
in the right direction :-)

My name is Juanjo and I am the maintainer of ECL
(http://ecls.sourceforge.net). I am currently interested in completing
the support for Unicode in ECL, which is, more or less, at the level
of what SBCL provides and, in my opinion, far from optimal.

I have been pondering several options, but all of them seem like
reinventing the wheel, so I finally came to the conclusion that the
most sensible strategy would be to turn cl-unicode into a full
(optional) replacement of the ANSI Common Lisp functions for dealing
with characters and strings, and hope that this would become a
de facto standard. Perhaps that is too ambitious a goal, or maybe it
is even futile, given the level of adoption of Unicode among lispers.

My concerns now center on several questions.

1) Optimize the database information that is built into cl-unicode.
ECL currently uses the SBCL procedure for compressing the database and
I believe this can be optimized even further. Instead of binary trees
or hashes, this procedure yields a two-stage byte table that encodes
the 209 combinations of properties that currently exist. This is
important for ECL because we need it to stay lean and simple and
because our procedures for exporting data structures in compiled code
are not efficient, due to constraints in C compilers. One possibility
is that CL-UNICODE reuses the SBCL and ECL databases. Another
possibility is to look for even more efficient data structures.
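
A minimal sketch of the two-stage table idea, written in Python purely
for illustration (it is not the SBCL or ECL code). The property tuple
used here, general category plus combining class, is an assumption
chosen to keep the example small, as are the 256-entry block size and
the names build_tables and lookup. Each distinct combination of
properties gets a small integer, and identical blocks of code points
are stored only once.

```python
import sys
import unicodedata

BLOCK_BITS = 8                  # 256 code points per block (assumed size)
BLOCK_SIZE = 1 << BLOCK_BITS

def properties(cp):
    """Toy property combination: (general category, combining class)."""
    ch = chr(cp)
    return (unicodedata.category(ch), unicodedata.combining(ch))

def build_tables(limit=sys.maxunicode + 1):
    """Build the tables for code points below limit.

    limit must be a multiple of BLOCK_SIZE. bytes() raises ValueError
    if more than 256 combinations appear, so one byte per entry is a
    checked assumption rather than a silent one.
    """
    combos = {}            # property combination -> small integer id
    blocks = {}            # block contents -> index of the block in stage 2
    stage1 = []            # one entry per block of code points
    stage2 = bytearray()   # concatenation of the distinct blocks
    for start in range(0, limit, BLOCK_SIZE):
        block = bytes(
            combos.setdefault(properties(cp), len(combos))
            for cp in range(start, start + BLOCK_SIZE))
        if block not in blocks:
            blocks[block] = len(blocks)
            stage2 += block
        stage1.append(blocks[block])
    # dicts keep insertion order, so position in this list equals the id
    return stage1, bytes(stage2), list(combos)

def lookup(tables, cp):
    """Return the property combination of code point cp."""
    stage1, stage2, combo_list = tables
    block = stage1[cp >> BLOCK_BITS]
    offset = (block << BLOCK_BITS) | (cp & (BLOCK_SIZE - 1))
    return combo_list[stage2[offset]]

if __name__ == "__main__":
    tables = build_tables()
    print(len(tables[0]), "stage-1 entries,",
          len(tables[1]), "stage-2 bytes,",
          len(tables[2]), "property combinations")
    print(lookup(tables, ord("A")))
```

A lookup costs two array reads and no hashing. Large uniform ranges
such as the CJK ideographs collapse into a single shared block, which
is where the space saving comes from.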

2) Add support for the most important Unicode algorithms, which are
canonical decomposition of strings, string upper/lower/titlecasing,
and string collation. Ideally this should be transparently
incorporated into new Common Lisp functions that can be used to
replace the old ones, such as char-upcase, string-equal, etc. Of
course, due to the differences between Unicode and ANSI CL, the
specifications would change.
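
To illustrate why the specifications would have to change, here is a
small sketch in Python (used only because its standard library
exposes these Unicode operations; nothing here is cl-unicode API).
The helper names canonical_equal and caseless_equal are hypothetical.
The point is that Unicode case mapping works on strings rather than
on single characters, and that equality has to look through canonical
decomposition.

```python
import unicodedata

def canonical_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def caseless_equal(a, b):
    """Case-insensitive comparison via full case folding.

    Decomposes, folds case, and decomposes again, because folding can
    itself produce characters that decompose further.
    """
    def fold(s):
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return fold(a) == fold(b)

if __name__ == "__main__":
    # Uppercasing one character can yield two, so a character-to-character
    # function in the style of char-upcase cannot express this mapping.
    print(len("\u00df".upper()))                     # sharp s
    # Precomposed e-acute versus e followed by a combining acute accent.
    print(canonical_equal("\u00e9", "e\u0301"))
    print(caseless_equal("Stra\u00dfe", "STRASSE"))
```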

3) Add support for the locales database provided by the Unicode
consortium. This is essential for implementing string collation, since
the ordering of characters is locale dependent.
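
A toy example of that locale dependence, again in Python and only as
a sketch. It does not use the Unicode locale data or the Unicode
Collation Algorithm; both key functions are hand-written assumptions
that mimic two well-known conventions: German dictionary order sorts
"ä" together with "a", while Swedish places "å", "ä" and "ö" after
"z".

```python
import unicodedata

def german_key(word):
    """Dictionary-style order: ignore diacritics, then break ties."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Rank letters by position in the Swedish alphabet.

    Raises ValueError for characters outside that alphabet; a real
    implementation would need a fallback ordering.
    """
    composed = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(c) for c in composed]

if __name__ == "__main__":
    words = ["zebra", "\u00e4pple", "apple"]
    print(sorted(words, key=german_key))
    print(sorted(words, key=swedish_key))
```

The same three words come out in a different order under each key,
which is why a single built-in ordering cannot serve every locale.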

4) Integration and shipping of cl-unicode with different
implementations, if possible. I would be interested in having
CL-UNICODE as a contributed package in the ECL source tree, so that it
can be activated with a simple configuration option. I believe there
are no license issues, and the only problem is that CL-UNICODE
depends on CL-PPCRE (is this dependency essential? could it be
eliminated?)

Well, maybe this is all BS, but I would like to read your opinions on the topic.

Juanjo

--
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28009 (Spain)
http://juanjose.garciaripoll.googlepages.com