Mark Pilgrim writes (and rants) of the pain that is parsing RSS.
For some reason, even people well aware of XML, HTML and the gap between them don't bother to check that their RSS feeds are actually well-formed XML. People leave stray '&' characters around; people include HTML entities (such as ") although XML has only five built-in entities; they make other mistakes.
This is, in fact, one of the first changes I made in Aggie (and perhaps the largest) -- I added a "massage" stage before loading each RSS feed into the .NET XML parser to perform partial HTML entity decoding. I can tell you that was a pain to debug.
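For the curious, here is roughly what such a "massage" stage might look like. This is a minimal Python sketch, not Aggie's actual .NET code; the names `massage`, `STRAY_AMP` and `NAMED_REF` are made up for illustration. It assumes the feed has already been decoded to text, and it ignores CDATA sections (where a bare '&' is perfectly legal), so treat it as a starting point only.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The only named entities XML defines without a DTD.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}

# An ampersand that does not begin a named, decimal, or hex reference.
STRAY_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")
# Any named reference such as &nbsp; or &amp;.
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def massage(feed_text):
    """Hypothetical pre-parse cleanup: make sloppy feed text more XML-friendly."""
    # Step 1: escape ampersands that are not part of any reference.
    text = STRAY_AMP.sub("&amp;", feed_text)

    # Step 2: rewrite HTML-only named entities as numeric references,
    # which every XML parser understands.
    def replace_named(match):
        name = match.group(1)
        if name in XML_BUILTINS:
            return match.group(0)          # already valid XML, keep as-is
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return "&amp;" + name + ";"        # unknown name: keep it as literal text

    return NAMED_REF.sub(replace_named, text)


if __name__ == "__main__":
    broken = "<title>AT&T &nbsp; caf&eacute;</title>"
    fixed = massage(broken)
    print(fixed)
    print(ET.fromstring(fixed).text)
```

The order matters: stray ampersands are escaped first, so the `&amp;` they turn into is then recognized as a built-in and left alone by the second pass.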
Then again, perhaps that's not people's fault? A lot of this pain would have been eliminated had XML supported HTML entities and tolerated stray ampersands. Not to mention our favorite subject of encoding. If a person like Dave Winer has an RSS feed that MSXML/IE refuses to display (apparently, his current feed lacks an encoding declaration, which means that the parser assumes UTF-8; some characters in the feed itself are not valid UTF-8), perhaps the tools need to be modified, not people.
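To make the encoding point concrete, here is one way a more forgiving tool could behave. Again a hedged Python sketch: `decode_feed` is a hypothetical helper, and falling back to Windows-1252 is purely my assumption about what a mislabeled feed is likely to contain, not something any spec mandates. It is meant for feeds that carry no encoding declaration at all.

```python
import xml.etree.ElementTree as ET


def decode_feed(raw_bytes, fallback_encoding="cp1252"):
    """Hypothetical lenient decoder for a feed with no encoding declaration."""
    try:
        # What a strict XML parser effectively assumes by default.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the bytes are really a legacy single-byte encoding.
        # Undecodable bytes become U+FFFD instead of aborting the parse.
        return raw_bytes.decode(fallback_encoding, errors="replace")


if __name__ == "__main__":
    # 0xE9 is "é" in Windows-1252 but is not valid UTF-8 on its own.
    raw = b"<rss><title>caf\xe9</title></rss>"
    root = ET.fromstring(decode_feed(raw))
    print(root.find("title").text)
```

Guessing the encoding can of course guess wrong, which is exactly the trade-off between fixing the tools and fixing the feeds.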