I was writing a character encoding FAQ for our company's Wiki when I came across something in the JAXP javadocs that has always puzzled me:
StreamResult Javadocs: Normally, a stream should be used rather than a reader, so that the transformer may use instructions contained in the transformation instructions to control the encoding.
JAXP has a similar warning for StreamSource as well.
Why is using character streams so ill-advised? The character encoding behaviour that XSL parsers should use seems pretty straightforward to me; everything should be keyed off of the output format specified in the XSLT template. If a character produced by the transformation is allowed by the desired encoding, output it (based on the user's preference) either as a single Java character or as a numeric entity. If the character is unsupported by the encoding, output it as a numeric entity. Isn't this is reasonable way to implement encoding?
I guess the point of the warning in the JAXP docs is that if you are given an arbitrary Writer, you can't be sure if it is backed by a StringBuffer or an underlying OutputStream. If it's a stream, it's certainly possible that the Writer will receive characters that the character-to-byte conversion of the OutputStream does not support.
However, it seems just as likely that, if you give the transformation an arbitrary OutputStream, the bytes produced will be subsequently misinterpreted by any code that converts them back into characters. In the simple case of performing an XSL transformation to a file, an OutputStream would definitely be safer. However if you were, for example, chaining the result of one transformation to another, using an OutputStream seems like an unncessary point of potential misconfiguration. I think JAXP's documentation is very poorly worded in this regard; the implication that OutputStreams are somehow safer seems false.
And on the topic of encoding, there is nothing convenient about the "convenience" classes java.io.FileReader and java.io.FileWriter. If a principal goal of Java's is platform-independence, then those two classes are failures. They really should have made those classes use UTF-8 encoding rather than the platform's default encoding. At work I see lots of encoding problems that are caused by developers who use those classes.