From robert.brown at gmail.com Mon Mar 12 15:10:04 2012
From: robert.brown at gmail.com (Robert Brown)
Date: Mon, 12 Mar 2012 11:10:04 -0400
Subject: [cl-ppcre-devel] behavior of \w
Message-ID:

Some folks I work with are using cl-ppcre. They've run into an
incompatibility between cl-ppcre and the PCRE library that boils
down to cl-ppcre's handling of \w. The behavior is documented in
cl-ppcre's manual:

  CL-PPCRE uses ALPHANUMERICP to decide whether a character
  matches Perl's "\w", so depending on your CL implementation you
  might encounter differences between Perl and CL-PPCRE when
  matching non-ASCII characters.

This reliance on ALPHANUMERICP may be a misfeature. It means
that cl-ppcre behaves differently depending on the Lisp
implementation it's running on.

My co-workers desire compatibility between cl-ppcre on SBCL
(where ALPHANUMERICP follows Unicode) and PCRE for matching
Latin-1 encoded strings. They patched the cl-ppcre code to make
\w match a-z, A-Z, 0-9, and underscore. Is there a better
workaround for them?

bob

From edi at agharta.de Mon Mar 12 15:18:38 2012
From: edi at agharta.de (Edi Weitz)
Date: Mon, 12 Mar 2012 16:18:38 +0100
Subject: [cl-ppcre-devel] behavior of \w
In-Reply-To:
References:
Message-ID:

If they insist on using "\w", there's no portable way to change this
except for patching the code.

Otherwise, they could of course use a character class or add their own
property resolver.

Cheers,
Edi.

_______________________________________________
cl-ppcre-devel site list
cl-ppcre-devel at common-lisp.net
http://common-lisp.net/mailman/listinfo/cl-ppcre-devel

From robert.brown at gmail.com Mon Mar 12 15:59:28 2012
From: robert.brown at gmail.com (Robert Brown)
Date: Mon, 12 Mar 2012 11:59:28 -0400
Subject: [cl-ppcre-devel] behavior of \w
In-Reply-To:
References:
Message-ID:

Thanks very much for the property resolver suggestion. There was some
feeling, I think, that using character classes would be messy. It also
looks like the property resolver solution might allow the compiler to
inline a custom matcher, if it has been decorated with the right
declarations.

Thanks again.

bob

From madscience at google.com Wed Mar 28 00:52:39 2012
From: madscience at google.com (Moshe Looks)
Date: Tue, 27 Mar 2012 17:52:39 -0700
Subject: [cl-ppcre-devel] possible bug with non-word boundaries and greedy matching
Message-ID:

I have an odd little regular expression \B.* that seems to generate
inconsistent behavior between CL-PPCRE and other regular expression
engines that I've tried it with. Using CL-PPCRE:

CL-USER> (cl-ppcre::scan "\\B.*" "foo bar baz")
NIL

This matches in Perl, PCRE, and RE2. Am I missing something, or should
CL-PPCRE be matching this as well? Oddly enough, changing * to + leads
CL-PPCRE to match:

CL-USER> (cl-ppcre::scan "\\B.+" "foo bar baz")
1
11
#()
#()

It seems like + matching implies that * should as well...

Thanks,
Moshe
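
For the earlier "behavior of \w" thread, the property resolver workaround that Edi mentions might look roughly like the sketch below. It assumes CL-PPCRE 2.x, where \p{...} is resolved through the special variable *PROPERTY-RESOLVER*. The property name "AsciiWord" and the functions ASCII-WORD-CHAR-P and ASCII-WORD-RESOLVER are made up for illustration and are not part of cl-ppcre. The pattern then uses \p{AsciiWord} where it would otherwise have used \w.

```lisp
;; Sketch only; assumes CL-PPCRE 2.x is already loaded.

;; Hypothetical helper: the ASCII-only notion of a "word" character
;; (a-z, A-Z, 0-9, underscore). Assumes the implementation's character
;; codes keep each of these ranges contiguous, as on SBCL.
(declaim (inline ascii-word-char-p))
(defun ascii-word-char-p (char)
  (declare (type character char))
  (or (char<= #\a char #\z)
      (char<= #\A char #\Z)
      (char<= #\0 char #\9)
      (char= char #\_)))

;; Hypothetical resolver: maps the made-up property name "AsciiWord"
;; to the test function above and returns NIL for any other name.
(defun ascii-word-resolver (name)
  (when (string= name "AsciiWord")
    #'ascii-word-char-p))

;; The resolver has to be bound while the scanner is created, so the
;; scanner is built explicitly inside the LET.
(defparameter *ascii-word-scanner*
  (let ((cl-ppcre:*property-resolver* #'ascii-word-resolver))
    (cl-ppcre:create-scanner "\\p{AsciiWord}+")))

;; Usage: find the first run of ASCII word characters in a string.
(cl-ppcre:scan-to-strings *ascii-word-scanner* "naïve_1 café")
```

Creating the scanner explicitly, rather than passing the regex string straight to SCAN, is a deliberate choice here: it makes sure the pattern is parsed while the resolver binding is in effect, and the same scanner can then be reused on any Lisp implementation without patching cl-ppcre.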