XML Parsing

Normally people don't need to worry about XML parsing, since they typically use expat or a Java parser. I wrote my own, which wasn't very difficult.

The standard XML parser for Tcl developers is TclXML, which has several backends that do the actual parsing. So far I have used the pure Tcl backend, which I patched several years ago to handle fragmented XML. It is likely very slow, and the expat backend would be a better choice, but its C-to-Tcl glue code doesn't properly handle a parser reset issued from within a callback, which I need. That is probably not hard to fix.

Some time ago I got complaints about very slow history parsing when starting chats. It turned out that the history code was very stupid; it is now much better. As a side effect I also looked into the XML parsing, since history files are stored as XML and the parsing was a bottleneck. So I wrote my own parser, which is heavily simplified: minimal error checking, only the basic entities, no processing instructions, and so on. In other words, it is optimized for XMPP streams.

The basic code looks like this:

    variable tokRE <(/?)([cl ^$Wsp>/]+)([cl $Wsp]*[cl ^>]*)>
    variable substRE "\} {\\2} {\\1} {\\3} \{"
    ...
    regsub -all $tokRE $xml $substRE xml

As you can see, the core is a single regular expression substitution. Simple! There are, of course, a few more details to consider. The basic data structure in Tcl is the list, and this code takes a stream of mixed character data and tags and turns it into one long Tcl list in which each group of four elements holds the tag name, a flag telling whether it is a closing tag, the attributes, and the character data that follows the tag. It is then fairly straightforward to loop through this list and invoke the callbacks.
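To make the four-element layout concrete, here is a minimal, self-contained sketch of the idea. It is not the actual qdxml code. It assumes that `cl` is a small helper that wraps its argument in square brackets (as in TclXML), that `Wsp` holds the XML whitespace characters, and that the input contains no braces, backslashes, comments, or processing instructions. The names `tokenize`, `parse`, `decodeEntities` and the three callbacks are made up for this illustration, and `{*}` requires Tcl 8.5 or later.

```tcl
# Assumed helper: wrap a character set in [ ] so it can be embedded in a regexp.
proc cl {x} {
    return "\[$x\]"
}

# Turn XML text into a flat list of {tag closeFlag attributes text} groups.
proc tokenize {xml} {
    set Wsp " \t\r\n"
    set tokRE <(/?)([cl ^$Wsp>/]+)([cl $Wsp]*[cl ^>]*)>
    set substRE "\} {\\2} {\\1} {\\3} \{"
    regsub -all $tokRE $xml $substRE xml
    # The leading "{} {} {} {" and trailing "}" close the text slots that the
    # substitution leaves open before the first tag and after the last one.
    return "{} {} {} \{$xml\}"
}

# Decode only the five predefined XML entities. string map scans the input
# once and never rescans its own output, so "&amp;lt;" becomes "&lt;".
proc decodeEntities {text} {
    return [string map [list "&lt;" "<" "&gt;" ">" "&quot;" "\"" \
                             "&apos;" "'" "&amp;" "&"] $text]
}

# Walk the token list four elements at a time and fire the callbacks.
proc parse {xml startCmd endCmd textCmd} {
    foreach {tag close attrs text} [tokenize $xml] {
        if {$tag ne ""} {
            if {$close eq "/"} {
                {*}$endCmd $tag
            } else {
                # An empty element such as <x/> leaves a trailing "/" in attrs.
                set empty [regsub {/$} $attrs {} attrs]
                {*}$startCmd $tag [string trim $attrs]
                if {$empty} {
                    {*}$endCmd $tag
                }
            }
        }
        if {$text ne ""} {
            {*}$textCmd [decodeEntities $text]
        }
    }
}

# Example callbacks that simply record what they are told.
set events {}
proc onStart {tag attrs} { lappend ::events [list start $tag $attrs] }
proc onEnd   {tag}       { lappend ::events [list end $tag] }
proc onText  {text}      { lappend ::events [list text $text] }

parse {<message to="a@b"><body>1 &lt; 2</body><x/></message>} onStart onEnd onText
```

The attributes are handed to the callback as a raw string here; splitting them into name/value pairs is one of the details a real parser has to add, along with protecting braces and backslashes before the substitution.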

I must admit that most parts of my parser (qdxml, for Quick & Dirty) are taken from TclXML, but the trick above is much older than that. Now for the timings:

    parsing jabber.ru.utf8.xml, 40774 bytes
    qdxml: 518133 microseconds per iteration
    tclxml: 1513549 microseconds per iteration
    parsing long.xml, 40133 bytes
    qdxml: 525725 microseconds per iteration
    tclxml: 1892864 microseconds per iteration
    parsing test.xml, 685 bytes
    qdxml: 8911 microseconds per iteration
    tclxml: 45444 microseconds per iteration

That is a factor of roughly 3 to 5 in favour of qdxml. Much better. I don't dare switch to it right now, but testing hasn't shown any problems.