From scaekenberghe at common-lisp.net  Tue Jan 31 11:57:46 2006
From: scaekenberghe at common-lisp.net (Sven Van Caekenberghe)
Date: Tue, 31 Jan 2006 12:57:46 +0100
Subject: [s-xml-devel] Changes
Message-ID: <08FC73CF-EE8E-43E7-91D4-701641DB3635@common-lisp.net>

Hi,

This month, David Tolpin contributed a set of interesting changes to
S-XML, which have been integrated into CVS head, awaiting inclusion
into the released version later on (when nobody protests).

Thanks a lot David! You obviously took a good look at the source code
and know a lot about XML.

From the Changelog:

2006-01-19 Sven Van Caekenberghe

    * added a set of patches contributed by David Tolpin
      dvd at davidashen.net : we're now using char of type Character
      and #\Null instead of null, read/unread instead of peek/read and
      some more declarations for more efficiency - added hooks for
      customizing parsing attribute names and values

Copied and pasted from email conversations with David:

> attached are my patches to s-xml, against the current CVS versions.
> Most changes are type fixes and optimizations: char is declared as
> character and uses #\Null as exceptional value instead of nil (XML
> cannot contain #\Null character). This allows to declare char as
> character, as well as fixes type errors in case of faulty XML
> files: in a few places the original code contains
>
>   (char= (read-char stream nil nil) #\SomeChar)
>
> which will yield error if end-of-file is actually met. There is
> also a change in parse-identifier (and probably in other similar
> functions) that replaces peek->read with read->unread sequences.
> The thing is that an XML identifier is probably much more than a
> single character, and thus peek+read requires twice as many
> function calls as read+unread.
>
> One other fix, and I will understand you if you reject it is
> defining callbacks (with fallback to the current behavior)
> *attribute-name-parser* and *attribute-value-parser*.
> They allow to parse attribute instream, without reconsing the
> attribute list. This has been important for me, I use S-XML to read
> multimegabyte files and need to spend at most a second on it.
>
> It helps decrease memory consumption, too, the current call in my
> code is:
>
> (let ((s-xml:*ignore-namespaces* t)
>       (s-xml:*attribute-name-parser* #'attn-by-name)
>       (s-xml:*attribute-value-parser*
>        #'(lambda (name string)
>            (declare (type attn name))
>            (funcall (attn-parse name) string))))
>   (s-xml:start-parse-xml
>    input
>    (make-instance 's-xml:xml-parser-state
>                   :seed seed
>                   :new-element-hook #'new-element-hook
>                   :finish-element-hook #'finish-element-hook)))
>
> that is, attribute names and values are parsed before being added
> to the attribute list.
>
> I've also changed processing of the attribute list when namespaces
> are turned on so that it is patched in place and not reconsed.

And some clarifications later on:

>> - aren't you misusing *ignore-namespaces* as a toggle for your
>> attribute-[name|value]-parse functionality ?
>
> they are called in different places with and without namespaces.
> Without namespaces, name/value calls can be applied immediately
> when each attribute is read. With namespaces, they must be delayed
> until all attributes are resolved.
>
>> - couldn't we move some of the tests surrounding the
>> attribute-[name|value]-parser funcalls to the (default)
>> implementations ?
>
> I've looked again and don't think so, otherwise non-default
> implementations won't be transparent.
>
>> - isn't
>>     (defun parse-attribute-value (name string)
>>       "Default parser for the attribute value"
>>       (declare (ignore name)
>>                (special *ignore-namespace*))
>>       (if *ignore-namespaces*
>>           (copy-seq string)
>>           string))
>> wrong ?
>
> Without namespaces, parse-attribute-value is called on every
> attribute.
> This means that the default implementation must copy the
> value, but a non-default one does not have to do so, instead, it
> can convert the value into an integer or a symbol. This saves about
> 10 megabytes of consed memory on a 3 Mb source.
>
> With namespaces, the value is already copied before the default
> implementation is called, and there is no sense to copy it again -
> that would, again lose 10 Megabytes on the same 3 Mb file.
>
>> I mean, I think the string should always be copied or never, no ?
>> How does this depend on namespaces being used ?
>
> Because with namespaces, attribute values are always copied in
> parse-*-attributes. Without namespaces, the copying can be avoided.
>
>> - isn't the attribute-[name|value]-parser called twice for each
>> attribute ? I am confused with my own code ! It has been a while
>> since I looked at it.
>
> It is either called when each attribute is parsed (when
> *ignore-attributes* is nil) or when the element is composed, when
> *ignore-attributes* is t. This is purely an efficiency issue, I
> wanted to preserve the performance which I got from S-XML before
> introduction of namespace handling.

It was a long time since I looked at the source code of S-XML and
David had a better view on it ;-)

Sven

--
Sven Van Caekenberghe - http://homepage.mac.com/svc
Beta Nine - software engineering - http://www.beta9.be

"Lisp isn't a language, it's a building material." - Alan Kay
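
For readers of the archive, here is a minimal sketch of the two reader
techniques David describes: #\Null as the end-of-file value, and
read+unread in place of peek+read. It is not the S-XML source. The
names next-char, open-tag-p, identifier-char-p and read-identifier are
hypothetical, the identifier test is a simplification and not the XML
grammar, and #\Null is assumed to be a character name your Lisp
implementation accepts.

```lisp
;;; Hypothetical helpers for illustration; none of these names come from S-XML.

(defun next-char (stream)
  "Return the next character of STREAM, or #\\Null at end of file."
  (read-char stream nil #\Null))

(defun open-tag-p (stream)
  "True if the next character is #\\<. Safe at end of file, because
NEXT-CHAR always hands CHAR= a character. Consumes the character it tests."
  (char= (next-char stream) #\<))

(defun identifier-char-p (char)
  "Simplified test for identifier characters (an assumption, not the XML grammar)."
  (declare (type character char))
  (or (alphanumericp char)
      (find char "_-.:")))

(defun read-identifier (stream)
  "Collect identifier characters with one READ-CHAR each, then push back
the single character that ended the identifier with UNREAD-CHAR."
  (with-output-to-string (out)
    (loop for char of-type character = (next-char stream)
          while (identifier-char-p char)
          do (write-char char out)
          finally (unless (char= char #\Null)
                    (unread-char char stream)))))
```

Calling read-identifier on a stream holding xml:lang='en' collects
xml:lang and leaves the = sign as the next character to be read. The
loop makes one read-char call per identifier character plus a single
unread-char at the end, whereas a peek+read loop would make two calls
per character.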