From alex.mizrahi at gmail.com Sun Nov 26 12:55:21 2006
From: alex.mizrahi at gmail.com (Alex Mizrahi)
Date: Sun, 26 Nov 2006 14:55:21 +0200
Subject: [cl-ppcre-devel] *regex-char-code-limit*
Message-ID:

i have an implementation that reports char-code-limit less than
actual -- it's ABCL (working on top of Java), only 256 codes are
officially supported, but it uses Java strings, so there's no problem
with handling Unicode strings -- i set *regex-char-code-limit* to
some 10000 (thanks, Edi!). however, there are characters like
0xFEFF (the BOM), so i should set *regex-char-code-limit* to
65535. i think it's overkill to do that -- i see ppcre creates array
of that size to do matching.

how do people cope with it on unicode-enabled lisps? (afaik
SteelBank uses UCS-4 char codes, so there's definitely no sane
char-code-limit)

does ppcre create that for each scanner? if there's one global array
that's ok, but array for each scanner is too much..

does *use-bmh-matchers* affect usage of this array? if so, would it
be much slower if i disable it?

From edi at agharta.de Sun Nov 26 22:09:04 2006
From: edi at agharta.de (Edi Weitz)
Date: Sun, 26 Nov 2006 23:09:04 +0100
Subject: [cl-ppcre-devel] *regex-char-code-limit*
In-Reply-To: (Alex Mizrahi's message of "Sun, 26 Nov 2006 14:55:21 +0200")
References:
Message-ID:

On Sun, 26 Nov 2006 14:55:21 +0200, "Alex Mizrahi" wrote:

> i have an implementation that reports char-code-limit less than
> actual -- it's ABCL (working on top of Java), only 256 codes are
> officially supported, but it uses Java strings, so there's no problem
> with handling Unicode strings -- i set *regex-char-code-limit* to
> some 10000 (thanks, Edi!). however, there are characters like
> 0xFEFF (the BOM), so i should set *regex-char-code-limit* to
> 65535. i think it's overkill to do that -- i see ppcre creates array
> of that size to do matching.
>
> how do people cope with it on unicode-enabled lisps?
> (afaik SteelBank uses UCS-4 char codes, so there's definitely no sane
> char-code-limit)
>
> does ppcre create that for each scanner? if there's one global array
> that's ok, but array for each scanner is too much..
>
> does *use-bmh-matchers* affect usage of this array?

Yes. If you set it to NIL, you don't create BMH matchers and that's
where the arrays are needed. The limit is also used in a few cases
related to hash tables for character classes, but I think this is not
really important.

> if so, would it be much slower if i disable it?

BMH matchers will only help you if your regular expression starts or
ends with constant strings (the longer, the better) /and/ if your
target strings are very long.

HTH,
Edi.

From alex.mizrahi at gmail.com Sun Nov 26 20:30:20 2006
From: alex.mizrahi at gmail.com (Alex Mizrahi)
Date: Sun, 26 Nov 2006 22:30:20 +0200
Subject: [cl-ppcre-devel] *regex-char-code-limit*
In-Reply-To: <1164572134.8361.4.camel@localhost.localdomain>
References: <1164572134.8361.4.camel@localhost.localdomain>
Message-ID:

> Hmm ... no! I can't think of a single use case where i would need to
> treat the BOM as part of the content. Actually, i can only come to the
> conclusion that a BOM within the content would be a serious bug. After
> all, your application should _never_ deal with the binary representation,
> only with code points. What _code point_ do you get for BOM?

i just download HTML pages using Java functions into Java strings.
then i use CL-PPCRE to extract some information from it. certainly, i
don't care about BOM, but CL-PPCRE crashes on it trying to aref array
beyond char-code-limit.

i can pre-filter data removing BOM, but i'm not guaranteed that i
won't get some other wild character.

well, there are better ways to tokenize HTML, but i've made a quick and
dirty solution via CL-PPCRE :)
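
For readers unfamiliar with BMH (Boyer-Moore-Horspool) matching, here is a minimal, language-neutral sketch in Python of why such a matcher wants an array with one slot per character code, and why a character above the configured limit fails on the array lookup. This is not CL-PPCRE's code. The function names and the `char_code_limit` parameter are made up for illustration, and the dictionary-based variant is only one possible way to avoid the large array, not something the library is known to do.

```python
def build_skip_table(pattern, char_code_limit):
    """Horspool bad-character table: a flat list with one slot per char code."""
    m = len(pattern)
    skip = [m] * char_code_limit            # default shift: whole pattern length
    for i, ch in enumerate(pattern[:-1]):   # last pattern char keeps the default
        skip[ord(ch)] = m - 1 - i           # distance from rightmost occurrence to end
    return skip


def bmh_search(pattern, text, char_code_limit=256):
    """Return the index of the first match, or -1.

    Raises IndexError when a character with code >= char_code_limit is
    looked up in the table (hypothetical analogue of the aref failure).
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    skip = build_skip_table(pattern, char_code_limit)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the table entry for the text char under the pattern's end.
        pos += skip[ord(text[pos + m - 1])]
    return -1


def bmh_search_sparse(pattern, text):
    """Same algorithm, but the table only stores chars that occur in the pattern."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        pos += skip.get(text[pos + m - 1], m)   # unknown chars shift by m
    return -1
```

With a text such as `"<html>\ufeff<title>x"` and the pattern `"<title>"`, the flat-table version with a limit of 256 fails as soon as the BOM lands under the end of the pattern window, while a limit of 65536 or the sparse table both find the match. The sketch also shows why the table only pays off for constant strings: the shift distances come from the literal pattern, and the savings grow with pattern length and text length.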