[cl-ppcre-devel] Using (unsigned-byte 8) instead of a (string) as TARGET-STRING

Philipp Marek philipp at marek.priv.at
Thu Apr 26 19:29:28 UTC 2012


Hello everybody,

I've got a first (unoptimized) patch that allows an (unsigned-byte 8) vector to
be used instead of a string as TARGET-STRING. My motivation is searching in big
binary files, which might not fit into RAM after byte => character conversion
(that would be a 1:4 expansion, as the binaries would have to be read as latin1
or similar).
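
To illustrate the memory argument, here is a minimal sketch in standard Common
Lisp of reading a file as raw octets rather than as characters. The names
READ-FILE-AS-OCTETS and ESTIMATED-TARGET-SIZES are made up for this example and
are not part of the patch or of CL-PPCRE. The four bytes per character are an
assumption taken from the 1:4 ratio above; the real figure depends on the Lisp
implementation.

```lisp
;; Hypothetical helper, not part of CL-PPCRE or of the patch.
(defun read-file-as-octets (path)
  "Return the contents of PATH as a (SIMPLE-ARRAY (UNSIGNED-BYTE 8) (*))."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((buffer (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      ;; READ-SEQUENCE fills BUFFER from the stream; for a regular file
      ;; the whole buffer should be filled.
      (read-sequence buffer in)
      buffer)))

;; Rough estimate for a file of N-BYTES bytes: one byte per element for
;; the octet vector, and (assumed) four bytes per character for a string.
(defun estimated-target-sizes (n-bytes)
  "Return two values: estimated octet-vector size and string size in bytes."
  (values n-bytes (* 4 n-bytes)))
```

Under that assumption, a file of 3176746 bytes would need about 3 MB as an
octet vector and about 12.7 MB as a string.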


Using that patch on a ~3MB file, with the match located at about 80% of the way
through the file, shows a nice speedup: only a half to a third of the CPU time,
and much lower memory usage (for the target data).


Details:
-rw-r--r-- 1 root root 3176746 Mai  3  2010 /usr/share/doc/gcc-4.4-doc/gccint.html

string, case-sensitive:
  0.360022 seconds of total run time (0.360022 user, 0.000000 system)
  907,556,608 processor cycles
string, case-insensitive:
  0.492031 seconds of total run time (0.492031 user, 0.000000 system)
  1,239,853,076 processor cycles

(unsigned-byte 8), case-sensitive:
  0.108006 seconds of total run time (0.108006 user, 0.000000 system)
  274,043,748 processor cycles
(unsigned-byte 8), case-insensitive:
  0.220013 seconds of total run time (0.220013 user, 0.000000 system)
  553,027,836 processor cycles
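
The "half to a third" figure can be checked directly from the run times above.
The following is only a small arithmetic sketch; the variable and function
names are invented for this example.

```lisp
;; Run times in seconds, copied from the figures above:
;; (label string-seconds octet-seconds)
(defparameter *timings*
  '(("case-sensitive"   0.360022 0.108006)
    ("case-insensitive" 0.492031 0.220013)))

(defun time-ratio (string-seconds octet-seconds)
  "Fraction of the string run time that the (unsigned-byte 8) run needed."
  (/ octet-seconds string-seconds))

;; Print one ratio per line, rounded to two decimals.
(loop for (label string-seconds octet-seconds) in *timings*
      do (format t "~a: ~,2f~%"
                 label
                 (time-ratio string-seconds octet-seconds)))
```

The ratios come out at about 0.30 for the case-sensitive run and about 0.45
for the case-insensitive one.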


The small "problem" is this (one long line):

  $ time perl -e '$/=undef; $_=<>; print $1,$2,"\n" if
    /acr([o0] i)s not defined,\s+the default(\Dvalue,\s*\d+, i)s used/'
    < /usr/share/doc/gcc-4.4-doc/gccint.html
  o i value, 1, i
  real    0m0.016s
  user    0m0.004s
  sys     0m0.008s

  $ perl -v
  This is perl 5, version 14, subversion 2 (v5.14.2)
  built for x86_64-linux-gnu-thread-multi

I.e. (this) perl5 is still ~8 times faster, and its time includes file reading
etc. (which the Lisp measurement did not include).
(With the /i modifier it's 0.020s.)


I've not yet tried to run the whole test suite against the patched version.
There are quite a few warnings (unused variable UB8-MODE etc.), but with higher
SAFETY the original CL-PPCRE gave a lot of them, too.


I'd like to ask for a quick look at the patch to get some feedback. With all
the duplication I don't really like the result, but the duplicated accesses
("schar" etc.) are so deeply integrated in cl-ppcre that I couldn't easily
factor them out into a single macro or something like that.
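
For discussion, here is a toy sketch of the kind of macro meant here: one body
that is expanded once per target type, with the element accessor passed in.
This is not code from cl-ppcre or from the patch. DEFINE-SEARCH,
SEARCH-IN-STRING, SEARCH-IN-OCTETS and OCTETS are hypothetical names, and the
naive subsequence search merely stands in for the real matcher closures.

```lisp
;; Toy illustration only: none of these names exist in CL-PPCRE.
(defmacro define-search (name sequence-type accessor)
  "Define NAME as a naive subsequence search specialised for SEQUENCE-TYPE,
reading elements with ACCESSOR (e.g. SCHAR or AREF)."
  `(defun ,name (needle haystack)
     (declare (type ,sequence-type needle haystack))
     ;; Try every start position at which NEEDLE could still fit.
     (loop for start from 0 to (- (length haystack) (length needle))
           when (loop for i from 0 below (length needle)
                      always (eql (,accessor needle i)
                                  (,accessor haystack (+ start i))))
             return start)))

;; One body, two specialised functions.
(define-search search-in-string simple-string schar)
(define-search search-in-octets (simple-array (unsigned-byte 8) (*)) aref)

(defun octets (&rest bytes)
  "Build a specialised octet vector from BYTES."
  (make-array (length bytes)
              :element-type '(unsigned-byte 8)
              :initial-contents bytes))

(search-in-string "cd" "abcdef")                                 ; => 2
(search-in-octets (octets 99 100) (octets 97 98 99 100 101 102)) ; => 2
```

Whether this scales to the closures cl-ppcre actually builds is exactly the
open question; the sketch only shows the shape of the idea.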


Cyrus, Edi, could you help me clean up the changes so that they
could be taken upstream?


The other attachment (ub8-test.lisp) is the file I'm using for testing.


Regards,

Phil

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ub8-test.lisp
Type: application/octet-stream
Size: 1139 bytes
Desc: not available
URL: <https://mailman.common-lisp.net/pipermail/cl-ppcre-devel/attachments/20120426/f0c7142c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cl-ppcre-ub8-mode.patch.gz
Type: application/x-gzip
Size: 10048 bytes
Desc: not available
URL: <https://mailman.common-lisp.net/pipermail/cl-ppcre-devel/attachments/20120426/f0c7142c/attachment.bin>

