Tuesday, December 13, 2005

4Suite closing in on a release candidate

Uche wrote that 4Suite XML, the mostly-pure-Python package for XML, XSLT, et al., is getting close to a release candidate.

Congratulations to FourThought on their many years of work on this set of technologies. I recently did a few tasks using the latest-and-greatest 4Suite, including the Amara toolkit, and really enjoyed the experience. In particular, the user-contributed guide from Mike Brown helps sort through the different parsers and processors and shows, very quickly, the important steps for even the advanced stuff.

Amara was quite fun to work with. It is a layer that puts a more Pythonic interface on XML, similar to the goals of ElementTree (and thus lxml). I was particularly interested in Amara's type inference system, so that XML data could be "taught" how to become integers etc. when bound into Python.
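
To make the idea concrete, here is a minimal sketch of type inference written against the standard library's ElementTree. It is not Amara's API or its actual rules; the helper names `infer` and `bind` and the sample document are made up for illustration, and the inference is just "try int, then float, else keep the string".

```python
import xml.etree.ElementTree as ET

def infer(text):
    """Guess a Python value from element text: int, then float, else the stripped string."""
    if text is None:
        return None
    text = text.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass  # not this type, try the next one
    return text

def bind(element):
    """Map each child tag to an inferred Python value (hypothetical helper, leaf children only)."""
    return {child.tag: infer(child.text) for child in element}

# Made-up sample data
doc = ET.fromstring("<item><qty>3</qty><price>9.50</price><name>widget</name></item>")
record = bind(doc)
print(record)
```

A real binding would want the rules to be declared per element rather than guessed, since a value such as a zip code looks like an integer but should stay a string.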

I originally worked exclusively with lxml. Although it was missing a few features, most of the important things got added just when I needed them. Still, I was interested in looking at the XSLT extension functions (so XSLT can call Python), and lxml didn't have that wired in yet. In the bargain, I got to learn more about Amara.

This weekend I revisited lxml, which has been very active since I last looked at 0.6. The 0.7 and 0.8 releases seem to address the big thing that gave me concern last time: the stability of moving nodes between documents. And in fact, Stefan Behnel has a branch with initial support for extension functions.

I'm not a real developer, more like an XML scripter, so I'm not sure I'm the right person to outline what is missing. But, in order of importance, here are some things I'd like to see in lxml, to keep my wandering eye from looking around:

  1. XSLT extension functions. It looks like it is on its way, although it is part of some changes that might take a while to digest. I've wanted this as a way for templates to pull data out of Zope, on-demand, with parameters. However, now that I've learned more about XQuery via dbxml, this need has diminished.
  2. EXSLT. Support for the EXSLT extensions looks to be on a branch, so that might be done soon.
  3. HTML Parser. libxml2 has a nice HTML parser. That is, you can give it tag soup (markup that is not well-formed XML) and get back an XML DOM. Sure, this can be done with other libraries, but that means...shipping other libraries. It can also be a performance hit: you parse into an XML DOM once, serialize to a string, then re-parse into an lxml etree.
  4. Type inference. I think this could make a big difference in people's appreciation of the "pythonic" binding of ElementTree. I presume this requires discussion with effbot regarding the ElementTree API.
  5. Custom resolvers. I used to think I really, really wanted this. libxml2's Python binding supports this. Basically, it is like adding another protocol to the XML parser. With this, both XInclude and XSLT can refer to a body of content which is implemented as a custom Python function. For example, you could make all of Zope traversable. However, XQuery and dbxml have also made me rethink this one.
  6. SimpleXMLWriter. ElementTree has a little module for building up an XML document, from Python, in a simple/fast way. I recently did an experiment (serializing a CMF portal_catalog into an XML DOM) and found that doing it via lxml was 2.5 times slower than SimpleXMLWriter. (I made a mistake in my comparison. On an equal test, lxml is slightly faster.)
  7. Windows. Windows support for lxml has apparently been intermittent. If we could get this simplified and reliable, we could lobby Enfold to use lxml, which would bring some Windows firepower into the lxml project. (They are currently using libxml2 directly.)
  8. dbxml support. Yeah, I'm really reaching the bottom of the Christmas list with this one. [wink] I've played a bit recently with the 2.x series of Berkeley DB XML, Sleepycat's open source XML database. It's a fascinating and transformative model for content management, IMO.
    Anyway, the next release promises support for libxml2 XmlDocument results, instead of pyana (albeit with caveats). It would surely be nice if dbxml's Python support was lxml instead of what they currently have. I realize this is an intersection of events that is unfair to ask for.
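
For a sense of what the experiment in item 6 involves, here is a rough sketch of turning catalog-like records into an XML tree with the standard library's ElementTree. It is not the code I timed: the `records_to_xml` function, the tag names, and the sample records are all invented for illustration, and a real portal_catalog query would supply the data instead.

```python
import xml.etree.ElementTree as ET

def records_to_xml(records, root_tag="catalog", row_tag="entry"):
    """Build an element tree with one row element per record and one child per field."""
    root = ET.Element(root_tag)
    for record in records:
        row = ET.SubElement(root, row_tag)
        for key, value in record.items():
            field = ET.SubElement(row, key)  # assumes keys are valid XML names
            field.text = str(value)
    return root

# Made-up records standing in for catalog results
sample = [
    {"id": 1, "title": "Front page"},
    {"id": 2, "title": "News & events"},
]
tree = records_to_xml(sample)
xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

Because lxml follows the ElementTree API, the same function should in principle run with the import swapped for lxml's etree module, which is what makes a like-for-like timing possible.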

Nothing on this list is currently a show-stopper. lxml has come a long way, quickly. Martijn should be congratulated, not just for the hard work of starting it, but for cultivating a community as well.

I'm interested in XML technologies in the Zope stack. Both 4Suite and lxml are wonderful choices for investigating how objects and content can intersect.
9:43:23 AM