Paul Brown's Weblog

ATTENTION: This blog has moved to a new location. Please update your bookmarks, browsing habits, and aggregators.

Sunday, September 28, 2003

Some Thoughts on StAX

StAX is a new pull-oriented API for streaming XML that is part of the JCP as JSR-173, and XML.com has an introductory article from Elliotte Rusty Harold.

A pull-oriented API allows an application to request events from the underlying parser, while a push-oriented API (e.g., SAX) feeds events to listeners or handlers. Both approaches have advantages and disadvantages. Push-oriented APIs typically require awkward and complex programming patterns to manage state, and pull-oriented APIs typically require big switch statements and if/else blocks.
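To make the contrast concrete, here is a minimal sketch that extracts the same data both ways. It assumes the javax.xml.stream API as it later shipped with Java SE (the draft under review may differ in details), and the class name, method names, and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of StAX or SAX.
final class PullVsPush {

    // Pull: the application drives the loop and asks the parser for each event.
    static List<String> titlesByPull(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("title".equals(reader.getLocalName())) {
                        titles.add(reader.getElementText());
                    }
                    break;
                default:
                    break; // every other event type is ignored here
            }
        }
        reader.close();
        return titles;
    }

    // Push: the parser drives, and the handler must remember state between callbacks.
    static List<String> titlesByPush(String xml) throws Exception {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>A</title></book><book><title>B</title></book></books>";
        System.out.println(titlesByPull(xml));
        System.out.println(titlesByPush(xml));
    }
}
```

The pull version keeps its state in ordinary local control flow; the push version has to park it in a field (here, `text`) that survives across callbacks.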

Comments on StAX

The StAX specification was recently released for public review, and I had time to kill on a plane ride from the west coast, so I had a look. I'm generally in favor of having an XML pull-parsing API for Java (and XMLPull needed some work and broader support), but I also found a few items worth commenting on:

One of the requirements is:
R03: When parsing an XML document that uses namespace prefixes, the API must be able to create a new XML document that uses the same namespace prefixes as the original document.
This is broken in a subtle way, because the only reason for having this requirement is that someone down the chain is depending on the prefix instead of the URI. Some of the unsupported items in section 3.1, e.g., the ability to maintain attribute order or the whitespace between attributes, are downright heretical, so while I'm happy to see that they are optional, I'm concerned that they were even considered. There is no reason to clutter a pull API with functionality for round-tripping markup, i.e., functionality that is only useful for a text editor implementation. None of the documented use cases justifies the requirement.
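As an illustration of why the prefix should not matter downstream, here is a small sketch of a consumer that matches elements on namespace URI and local name only. It assumes the XMLStreamReader interface as it later shipped with Java SE; the namespace URI `urn:example:orders`, the class name, and the sample documents are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: counts <item> elements in a made-up namespace.
final class MatchByUri {
    static final String ORDERS_NS = "urn:example:orders"; // assumption for the example

    static int countItems(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            // Compare the (URI, local name) pair; the prefix is never consulted.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && ORDERS_NS.equals(reader.getNamespaceURI())
                    && "item".equals(reader.getLocalName())) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String prefixed =
            "<o:order xmlns:o=\"urn:example:orders\"><o:item/><o:item/></o:order>";
        String defaulted =
            "<order xmlns=\"urn:example:orders\"><item/><item/></order>";
        System.out.println(countItems(prefixed));
        System.out.println(countItems(defaulted));
    }
}
```

Both documents are equivalent to this consumer, so a producer that rewrites or drops the original prefixes would not break it.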

I'm on the fence with R11:
R11: The API must provide the ability to configure the processor to stop a subset of information items from being delivered (for example ProcessingInstructions, Comments & Ignorable Whitespace).
APIs should generally steer clear of implementing non-canonical functionality (SAX does a great job of this), and R11 violates that rule by specifying functionality that could (i.e., should) be supplied by a trivial filter implementation. (On this subject, the XMLInputFactory class has several superfluous methods, e.g., three createXMLEventReader() methods that all wrap trivial functionality for dealing with InputStreams and Readers.) Ignorable whitespace doesn't make sense here, either: whitespace can only be identified as ignorable in the presence of a grammar (i.e., a schema or DTD), and grammars are out of scope for StAX. R20, which requires a bi-directional mapping to SAX, is a much more reasonable requirement because there is a canonical implementation.
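To show how little code such a filter needs, here is a sketch that drops comments and processing instructions outside the parser. It uses StreamFilter and XMLInputFactory.createFilteredReader() as they exist in the javax.xml.stream package shipped with Java SE, which may not match the public-review draft exactly; the class name and sample document are made up.

```java
import java.io.StringReader;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: filtering as a wrapper rather than parser configuration.
final class NoiseFilter {

    // Accept every event except comments and processing instructions.
    static final StreamFilter DROP_COMMENTS_AND_PIS = reader ->
        reader.getEventType() != XMLStreamConstants.COMMENT
            && reader.getEventType() != XMLStreamConstants.PROCESSING_INSTRUCTION;

    // Counts how many events of the given type the consumer gets to see.
    static int count(String xml, int eventType, boolean filtered) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        if (filtered) {
            reader = factory.createFilteredReader(reader, DROP_COMMENTS_AND_PIS);
        }
        int seen = 0;
        while (reader.hasNext()) {
            if (reader.next() == eventType) {
                seen++;
            }
        }
        reader.close();
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><!-- note --><?target data?><a/></root>";
        System.out.println("comments, unfiltered: "
            + count(xml, XMLStreamConstants.COMMENT, false));
        System.out.println("comments, filtered:   "
            + count(xml, XMLStreamConstants.COMMENT, true));
    }
}
```

The filter is a single predicate over the current event, which is the sense in which the same behavior does not need to be a configurable property of the processor itself.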

For those who don't read the permathreads on xml-dev, XML namespaces are contentious, and two distinct versions of the W3C recommendation exist in practice: the original recommendation from 1999, which states that the xmlns prefix is not bound to any URI, and the errata, which state that the xmlns prefix is bound to the URI http://www.w3.org/2000/xmlns/. SAX obeys the original recommendation, and DOM (as of Level 3) follows the errata. (Namespace processing with DOM is a royal mess, and it's easier but still no picnic with SAX; see Simon St. Laurent's overview for some levelheaded discussion.) Thus, I'm initially confused by:
R14: The API must support XML 1.0 and XML Namespaces.
While the references only cite the recommendation, Section 4.8.2 suggests that StAX takes the perspective of the errata.

The handling of entity replacements as optional (see the list of features) is ill-posed because the XMLEventReader interface doesn't provide a way to deal with entity replacements in attribute values.

The javax.xml.stream.Location interface needs to specify the numbering of the columns and lines, i.e., starting at one or starting at zero.

Do we really need another interface named XMLFilter? Having a couple of ContentHandler interfaces around (one in org.xml.sax and one in java.net) is bad enough.

The XMLResolver behavior seems awkward. I may just be used to the SAX InputSource behavior, but a single encapsulation with one behavior (instead of an arbitrary series of fallbacks) seems like a better fulfillment of the simplicity requirement from the specification. For general applications, such as streaming XML events from a database query, non-XML markup, or other source, implementers will have to supply their own factory implementations.

The NamespaceContext interface assumes that there is a bijective mapping between prefixes and URIs, but that isn't necessarily the case: several prefixes can be bound to the same URI. Thus, getPrefix() should return an Iterator instead of a String. (Or getPrefixes() and getURIs() methods should be supplied.)
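A small sketch of the non-bijective case: two prefixes bound to one URI. It is written against javax.xml.namespace.NamespaceContext as it ships with Java SE today, which does declare getPrefixes(String); whether the reviewed draft matched that is not something this sketch can confirm. The class name, the bind() helper, and the URI are hypothetical, and the special handling that the interface requires for null arguments and for the xml and xmlns prefixes is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

// Hypothetical example class: a context in which prefix -> URI is many-to-one.
final class MultiPrefixContext implements NamespaceContext {
    private final Map<String, String> uriByPrefix = new LinkedHashMap<>();

    // Illustrative helper, not part of the NamespaceContext interface.
    MultiPrefixContext bind(String prefix, String uri) {
        uriByPrefix.put(prefix, uri);
        return this;
    }

    @Override
    public String getNamespaceURI(String prefix) {
        String uri = uriByPrefix.get(prefix);
        return uri != null ? uri : XMLConstants.NULL_NS_URI;
    }

    // A single String can only report one of the bound prefixes (here, the first bound).
    @Override
    public String getPrefix(String namespaceURI) {
        Iterator<String> prefixes = getPrefixes(namespaceURI);
        return prefixes.hasNext() ? prefixes.next() : null;
    }

    // An Iterator can report all of them.
    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, String> entry : uriByPrefix.entrySet()) {
            if (entry.getValue().equals(namespaceURI)) {
                prefixes.add(entry.getKey());
            }
        }
        return prefixes.iterator();
    }

    public static void main(String[] args) {
        MultiPrefixContext ctx = new MultiPrefixContext()
            .bind("a", "urn:example:x")
            .bind("b", "urn:example:x");
        System.out.println(ctx.getPrefix("urn:example:x"));
        ctx.getPrefixes("urn:example:x").forEachRemaining(System.out::println);
    }
}
```

With both `a` and `b` bound to the same URI, any String-returning lookup has to pick one arbitrarily, which is the information loss described above.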

There is no per-instance way to set an XMLReporter instance on an XMLEventReader or XMLStreamReader; this has to be done via a method on the XMLInputFactory. The central role for XMLInputFactory with no generic encapsulation for input source (i.e., like org.xml.sax.InputSource) makes the StAX API unattractive or awkward for implementing sources based on input other than XML markup on a byte stream.
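The factory-centric wiring looks roughly like the sketch below. It assumes XMLInputFactory.setXMLReporter() and the XMLReporter callback signature as they appear in the javax.xml.stream package shipped with Java SE; the class name, the helper method, and the message format are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLReporter;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: the reporter is attached to the factory, not to a reader.
final class FactoryLevelReporter {

    // Returns a factory whose readers send non-fatal reports to the given sink.
    static XMLInputFactory factoryWithReporter(List<String> sink) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLReporter reporter = (message, errorType, relatedInformation, location) ->
            sink.add(errorType + ": " + message);
        factory.setXMLReporter(reporter);
        return factory;
    }

    public static void main(String[] args) throws XMLStreamException {
        List<String> reports = new ArrayList<>();
        XMLInputFactory factory = factoryWithReporter(reports);

        // Readers created from this factory are configured through it;
        // XMLStreamReader itself declares no setter for a reporter.
        XMLStreamReader first = factory.createXMLStreamReader(new StringReader("<a/>"));
        XMLStreamReader second = factory.createXMLStreamReader(new StringReader("<b/>"));
        first.close();
        second.close();

        System.out.println("reports collected: " + reports.size());
    }
}
```

An application that wants a different reporter per document therefore ends up reconfiguring the factory between reads or creating a separate factory for each one.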

And now I'm out of time for this plane trip. I feel (only a little) guilty taking cheap shots from the sidelines without making any constructive suggestions, but in addition to seeing at least one more revision of the specification before release, my wishlist would be:

Less reinvention. Among other things, use the existing SAX encapsulations, e.g., InputSource and EntityResolver. These encapsulations are already part of the Java world, e.g., in the TrAX APIs.

Less markup-centric. Fine-grained control over markup (e.g., entity replacement) is inappropriate for a streaming API; that's the domain of internal parser constructs.

Less implementation-think. The API feels like it was extracted from an implementation as opposed to created by an architect with a goal in mind.

I'll be watching for the next version.

Pull Alternatives

The three pull-oriented API alternatives that I'm aware of are:

XMLPull, the original pull-parsing API for Java, whose creators sit on the JSR working group;

xmliter from Mark Hayes that builds on SAX; and

NekoPull from Andy Clark that builds on XNI.

9:31:25 PM