[s-xml-devel] Changes

Tue Jan 31 11:57:46 UTC 2006

Hi,

This month, David Tolpin contributed a set of interesting changes to
S-XML. They have been integrated into CVS head and will be included
in a released version later on (if nobody protests).

Thanks a lot David! You obviously took a good look at the source code  
and know a lot about XML.

From the Changelog:

2006-01-19 Sven Van Caekenberghe <svc at mac.com>

	* added a set of patches contributed by David Tolpin
	dvd at davidashen.net : we're now using char of type
	Character and #\Null instead of null, read/unread instead of
	peek/read and some more declarations for more efficiency -
	added hooks for customizing parsing attribute names and values

Copied and pasted from email conversations with David:

> attached are my patches to s-xml, against the current CVS versions.
> Most changes are type fixes and optimizations: char is declared as
> character and uses #\Null as the exceptional value instead of nil
> (XML cannot contain the #\Null character). This allows char to be
> declared as character and also fixes type errors on faulty XML
> files: in a few places the original code contains
>
>   (char= (read-char stream nil nil) #\SomeChar)
>
> which will signal an error if end-of-file is actually met. There is
> also a change in parse-identifier (and probably in other similar
> functions) that replaces peek->read with read->unread sequences.
> The thing is that an XML identifier is probably much longer than a
> single character, and thus peek+read requires about twice as many
> function calls as read+unread.
>
> One other change, and I will understand if you reject it, is
> defining the callbacks *attribute-name-parser* and
> *attribute-value-parser* (with fallback to the current behavior).
> They allow attributes to be parsed in-stream, without reconsing the
> attribute list. This has been important for me: I use S-XML to read
> multimegabyte files and need to spend at most a second on it.
>
> It helps decrease memory consumption, too. The current call in my
> code is:
>
> (let ((s-xml:*ignore-namespaces* t)
>       (s-xml:*attribute-name-parser* #'attn-by-name)
>       (s-xml:*attribute-value-parser*
>        #'(lambda (name string)
>            (declare (type attn name))
>            (funcall (attn-parse name) string))))
>   (s-xml:start-parse-xml
>    input
>    (make-instance 's-xml:xml-parser-state
>                   :seed seed
>                   :new-element-hook #'new-element-hook
>                   :finish-element-hook #'finish-element-hook)))
>
> that is, attribute names and values are parsed before being added  
> to the attribute list.
>
> I've also changed processing of the attribute list when namespaces  
> are turned on so that it is patched in place and not reconsed.
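
To make the end-of-file and read/unread points above concrete, here is
a minimal sketch. It is not the actual S-XML code: parse-name and
name-char-p are hypothetical names, and the set of identifier
characters is simplified. Note that #\Null is not a standard character
name, although most implementations accept it.

```lisp
;; Sketch only. PARSE-NAME and NAME-CHAR-P are hypothetical stand-ins,
;; not the actual S-XML functions.

(defun name-char-p (char)
  "Simplified test for characters allowed in an identifier."
  (declare (type character char))
  (or (alphanumericp char) (find char "_-.:")))

(defun parse-name (stream)
  "Read an identifier from STREAM with one READ-CHAR per character and a
single UNREAD-CHAR at the end, instead of a PEEK-CHAR/READ-CHAR pair
for every character."
  (with-output-to-string (out)
    (loop
      ;; #\Null stands in for end of file, so CHAR is always a
      ;; character and CHAR= never receives NIL.
      (let ((char (read-char stream nil #\Null)))
        (declare (type character char))
        (cond ((char= char #\Null) (return))
              ((name-char-p char) (write-char char out))
              ;; First non-identifier character: push it back and stop.
              (t (unread-char char stream) (return)))))))
```

Called on a stream positioned at foo:bar="1", parse-name returns
"foo:bar" and leaves the = character unread; on an empty stream it
returns an empty string instead of signalling an error.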

And some clarifications later on:

>> - aren't you misusing *ignore-namespaces* as a toggle for your  
>> attribute-[name|value]-parse functionality ?
>
> they are called in different places with and without namespaces.  
> Without namespaces, name/value calls can be applied immediately  
> when each attribute is read. With namespaces, they must be delayed  
> until all attributes are resolved.
>
>> - couldn't we move some of the tests surrounding the attribute- 
>> [name|value]-parser funcalls to the (default) implementations ?
>
> I've looked again and don't think so, otherwise non-default  
> implementations won't be transparent.
>
>> - isn't
>> (defun parse-attribute-value (name string)
>>   "Default parser for the attribute value"
>>   (declare (ignore name)
>>            (special *ignore-namespaces*))
>>   (if *ignore-namespaces*
>>       (copy-seq string)
>>       string))
>> wrong ?
>
> Without namespaces, parse-attribute-value is called on every
> attribute. This means that the default implementation must copy the
> value, but a non-default one does not have to do so; instead, it
> can convert the value into an integer or a symbol. This saves about
> 10 megabytes of consed memory on a 3 MB source.
>
> With namespaces, the value is already copied before the default
> implementation is called, and there is no point in copying it again:
> that would waste another 10 megabytes on the same 3 MB file.
>
>>  I mean, I think the string should always be copied or never, no ?  
>> How does this depend on namespaces being used ?
>
> Because with namespaces, attribute values are always copied in  
> parse-*-attributes. Without namespaces, the copying can be avoided.
>
>> - isn't the attribute-[name|value]-parser called twice for each  
>> attribute ? I am confused with my own code ! It has been a while  
>> since I looked at it.
>
> It is either called when each attribute is parsed (when
> *ignore-attributes* is nil) or when the element is composed (when
> *ignore-attributes* is t). This is purely an efficiency issue: I
> wanted to preserve the performance which I got from S-XML before the
> introduction of namespace handling.
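
The copying argument can be illustrated with a small sketch. It assumes
that the value handed to the callback lives in a buffer which the
parser overwrites for the next attribute; *buffer*,
read-value-into-buffer and both parser functions are hypothetical and
only mimic the behaviour discussed above.

```lisp
;; Sketch only: every name here is hypothetical, not part of S-XML.

(defvar *buffer* (make-array 16 :element-type 'character
                                :adjustable t :fill-pointer 0)
  "A single buffer reused for every attribute value.")

(defun read-value-into-buffer (string)
  "Overwrite *BUFFER* with STRING, as a parser reusing its buffer would."
  (setf (fill-pointer *buffer*) 0)
  (loop for char across string
        do (vector-push-extend char *buffer*))
  *buffer*)

(defun default-value-parser (name string)
  "Must copy: STRING is the shared buffer and will be overwritten."
  (declare (ignore name))
  (copy-seq string))

(defun integer-value-parser (name string)
  "Needs no copy: the resulting integer shares nothing with the buffer."
  (declare (ignore name))
  (parse-integer string))
```

With these definitions, a value kept through default-value-parser
survives the next call to read-value-into-buffer, while a reference to
the raw buffer would change under the caller's feet. A converting
parser such as integer-value-parser avoids the intermediate string
altogether, which is where the memory saving comes from.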

It had been a long time since I last looked at the source code of
S-XML, and David had a better view of it ;-)

Sven

--
Sven Van Caekenberghe - http://homepage.mac.com/svc
Beta Nine - software engineering - http://www.beta9.be

"Lisp isn't a language, it's a building material." - Alan Kay