[cl-ppcre-devel] Re: Programming Problem - Need your CL-PPCRE Experience

Edi Weitz edi at agharta.de
Wed May 19 10:19:00 UTC 2004


Hi Thomas!

Sorry for the delay.

On Sat, 15 May 2004 21:52:25 +0200, Thomas Schilling <tjs_ng at yahoo.de> wrote:

> I'm currently writing a pattern matcher for trees. This is
> (hopefully) going to become a tool for easily extracting information
> from tree-structured documents, specifically XML/HTML files. (Yes, I
> know of SAX and XSLT and XPath but I think this can become more
> powerful--I plan to allow embeddable code. Nevertheless it could be
> added to e.g. CL-XML.)
>
> In fact it's a regular expression matcher that two-dimensionally
> matches on lists. (car is considered to be the node's name,
> i.e. tag-name, and the cdr it's children). So that's the reason why
> I'm asking you.
>
> So here's my problem:
>   I, well, stole some ideas from you. Because we need heavy
> backtracking and looking ahead I implemented it (just like you, I
> think) largely via closures, i.e. each rule is matched via a
> function, and the rules are linked via a next-rule (function-)
> variable that is put inside each closure.  So e.g. the repetion
> function can check by itself if the rest of the pattern matches when
> the number of repetitions is tried. This works great except in the
> folloing case:
>
> When I have (in perl-notation) "abaa" =~ /^(a|ab)*a/ the match
> failes.  That's because the alternative which has only the
> information of the two choices "a" and "ab" but no successor
> information (because it can have two--the alternative itself or the
> "a" after the repetition) succeeds when matching the first
> alternative and never tries to match the second, although then the
> repetition could match.
>
> I currently have no real solution to this problem. How did you solve
> this? I looked it up in your code but I couldn't find it (maybe it's
> masked by the many optimizations?).
>
> I would be very thankful if you could help me.
>
> Ok, so here's the code: http://www.inf.tu-dresden.de/~s3815210/lisp
> tree-match.lisp. You can simply C-c C-k it. and then C-x C-e the
> (test-all) at the end of the file. It should print the one failing
> test.  The interesting functions are probably OR-M and
> GREEDY-REPEAT-M.  The problem is mentioned in line 286 in the
> PARSE-PATTERN function.

I only looked briefly at your code 'cause I'm busy with other things
but I /think/ the problem is in OR-M:

  (defun or-m (options next-rule)
    (declare (type function next-rule))
    (matcher-fn
      (do ((r options (cdr r)))
          ((null r) nil)
        (multiple-value-bind (pos inp)
                             (funcall (the function (car r)) input position)
          (when pos
            (multiple-value-bind (pos2 inp2)
                                 (funcall next-rule inp pos)
              (when pos2
                (return (values position inp2)))))))))

Here you try each function in OPTIONS in turn and if one of them
succeeds you immediately call NEXT-RULE so there's no chance for
backtracking. What you have to do is to construct a new function for
each element in OPTIONS which itself knows how to proceed (namely via
NEXT-RULE) and try these new functions instead of the ones in OPTIONS.

FWIW, here's the relevant code from CL-PPCRE:

  (defmethod create-matcher-aux ((alternation alternation) next-fn)
    ;; first create closures for all alternations of ALTERNATION
    (let ((all-matchers (mapcar #'(lambda (choice)
                                    (create-matcher-aux choice next-fn))
                                (choices alternation))))
      ;; now create a closure which checks if one of the closures
      ;; created above can succeed
      (lambda (start-pos)
        (declare (type fixnum start-pos))
        (loop for matcher in all-matchers
              thereis (funcall (the function matcher) start-pos)))))

I don't think this is masked by too many optimizations... :)

The important part is the creation of ALL-MATCHERS from (CHOICES
ALTERNATION) (which would be your OPTIONS).

Does that help?

This is just a wild guess from looking at your code for a couple of
minutes. If that doesn't help feel free to ask me again.

> Please ask me if you have any questions.
>
> If you think of a place to which should post this please tell me.

I've sent a copy to the cl-ppcre-devel mailing list. We should take
further discussion there so they'll be archived (if you don't mind).

HTH,
Edi.




More information about the Cl-ppcre-devel mailing list