Extensions to RSS 0.94 for Content and Encoding Formats

Chuck Shotton's Logic Faults
Things that make sense to me (and maybe only me).

RSS 0.94 Content Type and Encoding - Chuck Shotton - revised 8/31/02, 12:00 GMT

Disclaimer

These are comments and suggestions for the draft RSS 0.94 standard as presented by Dave Winer of UserLand Software. They are intended to supplement the draft specification and do not in any way represent an official extension to the standard unless they are incorporated into the draft being maintained by Dave. The syntax and semantic choices expressed below are only suggestions and are subject to community comment and approval. Please send any suggestions, corrections, or additions to me at:
chuck 'at' shotton 'dot' com.

Introduction

The RSS standard has proven extremely useful for distributing Web-based content in a syndication format that is machine readable and also approachable by human authors. With an increased emphasis on machine-generated RSS, it is important that automated clients consuming this RSS data be able to unambiguously interpret the payload contained in various XML elements within an RSS file.

The initial versions of RSS fail to capture information about the text-based content contained within RSS files. Some RSS generators assume a plain text format. Others assume the standard implies HTML mark-up. The current standards also provide no way to specify encoding formats such as entity encoding, URL encoding, UUEncoding, Base64, BinHex, or other text-based binary encoding standards.

Consequently, it is difficult for clients of RSS data to consistently render RSS feeds from multiple sources containing a wide variety of text formats. This document proposes two extensions to the RSS 0.94 standard that would disambiguate the text payload formats so that clients can correctly render the payload or fail gracefully for unsupported formats. It is hoped that these proposed extensions will also enhance the parameter passing mechanisms and other text manipulations that are part of the RSS 0.94 standard. These extensions are intended to apply primarily to <description> elements within the RSS file, but are generally applicable to all XML tags that can contain text payloads, including <title> and others.

Default Behaviors

In order to remain backward compatible with earlier RSS versions, it is important that all of the proposed additions to the RSS 0.94 standard be optional. To that end, the 0.94 standard needs to specify default behaviors in the absence of any explicit definition of content format or encoding syntax. This proposal suggests that the default content format be considered equivalent to the MIME type "text/html" and that the default encoding be that specified for plain text appearing in XML documents (i.e., entity encoding). The implication for clients is that the XML parsing process of an RSS file's text payload should replace encoded XML entities with the appropriate character(s) and then treat the resulting text string as HTML text which can potentially contain HTML mark-up syntax.
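As a rough illustration of this default (a sketch only, not part of the proposal), the following Python fragment reads a <description> that carries no "type" attribute. It assumes the standard library xml.etree.ElementTree parser; the helper name describe_item and the sample item text are hypothetical.

```python
import xml.etree.ElementTree as ET

DEFAULT_TYPE = "text/html"  # default content type suggested by this proposal

# Hypothetical sample item; the HTML mark-up inside it is entity encoded.
SAMPLE = "<item><description>Some &lt;b&gt;bold&lt;/b&gt; text</description></item>"


def describe_item(item_xml):
    """Return (content type, entity-decoded text) for an item's description."""
    desc = ET.fromstring(item_xml).find("description")
    if desc is None:
        return DEFAULT_TYPE, ""
    # The XML parser has already replaced entities such as &lt; and &gt;.
    text = desc.text or ""
    # With no "type" attribute present, fall back to the proposed default.
    return desc.get("type", DEFAULT_TYPE), text


print(describe_item(SAMPLE))
```

The decoded string still contains HTML mark-up, so a client that cannot render HTML would need to strip or ignore the tags at this point.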

It is my opinion that defaulting to "text/html" places an additional processing burden on clients that cannot render HTML (e.g., WAP phones, PDAs, simple Web agents), and that burden may prove unacceptable. However, this default will encompass the vast majority of existing RSS feeds without requiring any change to the RSS generators producing them. When a feed specifies an explicit content type and encoding format, clients are free to pick and choose how and whether to render specific RSS elements.

Content Type and Encoding Specification Syntax

Type
Some elements in the RSS 0.92 standard (e.g., <enclosure>) already include a "type" attribute that is used to specify a MIME type for the associated content. This proposal extends the use of the "type" attribute to the <description> element as well. It is debatable whether this behavior should also be extended to the <title> element, since <title> is more properly considered to always be plain text. When "type" is not specified, its default value is "text/html". The MIME type in the "type" attribute should conform to the specification of a media type in the "Content-Type:" header field in the HTTP standard (section 3.7 of RFC 2616). As with the Content-Type definition for HTTP/1.1, the "type" attribute should allow additional parameters, separated from the media type by a semicolon. Clients should tolerate the presence of additional parameters when parsing, even if they simply ignore them. Don't assume a simple MIME type of the format "type/subtype".
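One way a client might tolerate parameters in the "type" attribute is sketched below in Python. This is an assumption-laden illustration, not a full RFC 2616 parser: the function name parse_type is hypothetical, and the simple split on semicolons would mishandle a semicolon inside a quoted parameter value.

```python
DEFAULT_TYPE = "text/html"  # default suggested by this proposal


def parse_type(value):
    """Split a "type" attribute value into (media type, parameters)."""
    if value is None or not value.strip():
        return DEFAULT_TYPE, {}
    media_type, *raw_params = value.split(";")
    params = {}
    for raw in raw_params:
        name, sep, val = raw.partition("=")
        if sep:  # skip fragments that are not of the form name=value
            params[name.strip().lower()] = val.strip().strip('"')
    # Media type names are case-insensitive, so normalize to lower case.
    return media_type.strip().lower(), params


print(parse_type("text/html; charset=ISO-8859-1"))
print(parse_type(None))
```

A client that does not care about parameters can simply discard the second element of the result.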

Encoding
In addition to specifying the MIME type for a given payload, it may be desirable to indicate how the text content is encoded. Applications of RSS may include the distribution of non-text data (e.g., GIFs or application-level data objects). To support this, it is proposed that all RSS tags that accept the "type" attribute also allow an "encoding" attribute. The value of an encoding attribute should conform to the content-coding portion of the Content-Encoding specification in section 14.11 of RFC 2616. Primary values for the encoding attribute should be restricted to a subset of encoding formats suitable for text-only distribution, including, but not limited to, "base64", "uuencode", "binhex", "url-encode", and "xml-entity". These values are presented for illustrative purposes only; the actual tokens should be taken from the list of registered encoding types for HTTP/1.1, the XML standard, or other sources.

If multiple content encodings have been applied, the encoding attribute should list the transforms in the order in which a client should apply them to extract the payload. For example, encoding="base64,gzip" implies that a client should decode the base64-encoded payload, then treat the resulting data as gzipped content to be decompressed.
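The left-to-right decoding order described above could be sketched in Python as follows. The DECODERS table and the decode_payload name are hypothetical, and only the two tokens from the example are handled, since the canonical token list is still open.

```python
import base64
import gzip

# Hypothetical token-to-decoder table; the proposal leaves the final list open.
DECODERS = {
    "base64": base64.b64decode,
    "gzip": gzip.decompress,
}


def decode_payload(text, encoding):
    """Apply each listed transform, left to right, to the text payload."""
    data = text.encode("ascii")
    for token in encoding.split(","):
        token = token.strip().lower()
        if token not in DECODERS:
            # An unknown token means the client cannot extract the payload.
            raise ValueError("unsupported encoding: " + token)
        data = DECODERS[token](data)
    return data


# Build a payload the way a generator would: gzip first, then base64.
payload = base64.b64encode(gzip.compress(b"hello")).decode("ascii")
print(decode_payload(payload, "base64,gzip"))
```

Note that the generator applies the transforms in the reverse of the listed order, so that the client can undo them in the order given.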

In all cases, in the absence of an encoding specification, the encoding format is assumed to be normal XML entity encoding. It is never allowable for unencoded XML tokens to appear in the text payload of an RSS file, because conforming XML parsers will refuse the RSS as malformed (i.e., bare < or > characters are a bad thing!). Also, the assumption is that all RSS clients will perform entity decoding on the text payload before attempting to apply any encoding transformations or type interpretations.
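To see why entity encoding matters, here is a small Python check using the standard library xml.etree.ElementTree parser. The helper name is_well_formed and both sample strings are hypothetical; the sketch only demonstrates the bare "<" case.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


BAD = "<description>1 < 2</description>"      # bare "<" in the text payload
GOOD = "<description>1 &lt; 2</description>"  # same text, entity encoded

print(is_well_formed(BAD), is_well_formed(GOOD))
print(ET.fromstring(GOOD).text)
```

The second print shows that the parser hands the client the decoded text, which is the point at which any further encoding transforms or type interpretation would begin.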

Examples of use:

Example 1
<item>
<description type="text/plain">This is some &quot;plain&quot; text</description>
</item>

For this example, text/plain does not imply the absence of encoded entities. Rather, it says that once normal XML entity translation has been performed, the resulting text string is assumed to have no additional mark-up syntax to be interpreted and can be treated as plain text.

Example 2
<item>
<description type="text/html">This is some HTML text</description>
</item>

This example shows a simple HTML fragment that will be interpreted as:
This is some HTML text
after entity decoding is completed. An RSS client is then free to interpret the resulting HTML as it sees fit.

Example 3
<item>
<description>This is some more HTML text</description>
</item>

Example 3 merely illustrates that in the absence of a "type" attribute on the <description> tag, the assumption is "text/html" as in example 2 above.

Example 4
<item>
<description type="image/gif" encoding="base64">PRETENDTHISISBASE64DATA</description>
</item>

This example combines content type and content encoding specifications to enable the delivery of a binary payload through RSS.
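A client handling an item like Example 4 might look roughly like the Python sketch below, which returns nothing when it meets a type or encoding it does not support rather than guessing. The names read_description and SUPPORTED_TYPES are hypothetical, "xml-entity" is used only because it is one of the illustrative tokens above, and the sample payload is just the six-byte GIF89a signature, not a complete image.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical capabilities of this particular client.
SUPPORTED_TYPES = {"text/plain", "text/html", "image/gif"}


def read_description(item_xml):
    """Return (media type, payload), or None if the client cannot handle it."""
    desc = ET.fromstring(item_xml).find("description")
    if desc is None:
        return None
    # Ignore any parameters after the media type; apply the proposed defaults.
    media_type = desc.get("type", "text/html").split(";")[0].strip().lower()
    encoding = desc.get("encoding", "xml-entity").strip().lower()
    text = desc.text or ""  # entities were already decoded by the parser
    if media_type not in SUPPORTED_TYPES:
        return None  # fail gracefully on unknown content types
    if encoding == "xml-entity":
        return media_type, text
    if encoding == "base64":
        try:
            return media_type, base64.b64decode(text, validate=True)
        except ValueError:
            return None  # payload was not valid base64
    return None  # fail gracefully on unknown encodings


GIF_SIGNATURE = base64.b64encode(b"GIF89a").decode("ascii")
ITEM = ('<item><description type="image/gif" encoding="base64">'
        + GIF_SIGNATURE + "</description></item>")
print(read_description(ITEM))
```

Returning None for anything unrecognized is one possible reading of "fail gracefully"; a real client might instead show a placeholder or fall back to the item's <title>.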

Unresolved Issues

It's not clear which RSS tags "type" and "encoding" should apply to. <description> is clearly a candidate. Whether <title> and others should also allow these attributes is open to debate.
The canonical list of encoding types needs to be specified.
It's not clear whether the complete syntax for Content-Type and Content-Encoding as specified in the HTTP/1.1 standard should apply to RSS, or only a simplified subset of basic MIME types and single tokens for encoding.