Monday, September 09, 2002

Found a link, via Brett Morgan, to a Python version of Paul Graham's Bayesian spam filter. Tim Peters, the author of the linked email, pretty well debunks the use of word bigrams for filtering, and he's got some interesting results. Paul's article now contains links to some other resources; interestingly, Microsoft apparently has a patent on a related idea. Paul also links to an article on /. where a couple of posters proposed that spammers could start converting their content to images and use the <img> tag in HTML email. I think that this would fail for a couple of reasons; for one, as people used the "Delete as Spam" button, the <img> tag would quickly become a spam indicator.

Someone told me the other day that they read a statistic that spammers send 12 million emails to get 3 responses. You'd have to cut that to 0 in 12 million to get them to stop. Given that most people aren't motivated enough to install any special filtering software (come to think of it, the people who click through spams probably like getting spammed), it seems like you'd need industry adoption on a grand scale to make this work. Hotmail, Yahoo! Mail, and AOL would all have to start doing this, and Exchange and Outlook would have to support it out of the box. So while it might be an effective technique, it probably won't do much to stop spammers.

In the original email, Tim Peters also makes a pithy comment that we should all take to heart concerning programming in general:

...the way to make a scheme like this excellent is to keep your ego out of it and let the data *tell* you what works...
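
For the curious, here is a rough Python sketch of the kind of token-based Bayesian scoring discussed above, including how an <img> token would pick up a high spam probability once it shows up in enough mail marked as spam. It is a simplification in the spirit of Paul's article, not Tim's code or Paul's exact method: the tokenizer, the per-message counting, and all function names are my own assumptions, and it skips refinements such as weighting good mail more heavily.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, and a few symbols,
    # so that "<img>" survives as a single token.
    return re.findall(r"[a-z0-9$'<>-]+", text.lower())


def train(messages):
    # Count how many messages contain each token (a simplification).
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts, len(messages)


def token_prob(token, spam, n_spam, ham, n_ham):
    # Estimated probability that a message containing this token is spam.
    s = spam[token] / n_spam
    h = ham[token] / n_ham
    if s + h == 0:
        return 0.4  # neutral-ish guess for tokens never seen before
    return min(0.99, max(0.01, s / (s + h)))


def spam_score(text, spam, n_spam, ham, n_ham, top=15):
    # Combine the most "interesting" token probabilities (furthest from 0.5).
    probs = [token_prob(t, spam, n_spam, ham, n_ham) for t in set(tokenize(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top]
    prod_p = math.prod(probs)
    prod_q = math.prod(1 - p for p in probs)
    return prod_p / (prod_p + prod_q)
```

Train it on two piles of mail (spam and everything else), then call spam_score on a new message; a score near 1 suggests spam. Which tokens end up mattering is decided by the counts, which is rather the point of Tim's comment.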

5:31:03 PM

Clemens Vasters, Ingo Rammer, and Brad Wilson are all debating binary XML.  [News from the Forest] I agree with Justin's comment that anything destined to last longer than a transitory message should be in XML 1.0, complete with angle brackets. I've been back and forth as to whether a binary XML format should be tied to a schema, but in the end I don't think it should, however you should be able to take the schema and produce an optimized reader if you want to. You could intern all the strings, and make the end element marker not require the element name to compact down the representation. [Simon Fell]
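
To make Simon's two compaction ideas concrete, here is a toy Python sketch: element names are interned into a string table, and the end-element marker carries no name. This is purely illustrative and not any real or proposed binary XML format; the token layout and the names encode and decode are made up for the example, and it ignores attributes, namespaces, mixed content, and character escaping.

```python
import xml.etree.ElementTree as ET

# Hypothetical token kinds for the toy encoding.
START, TEXT, END = 0, 1, 2


def encode(xml_text):
    names = {}   # element name -> index in the string table
    tokens = []

    def intern_name(name):
        # Each distinct element name is stored once and referenced by index.
        return names.setdefault(name, len(names))

    def walk(elem):
        tokens.append((START, intern_name(elem.tag)))
        if elem.text and elem.text.strip():
            tokens.append((TEXT, elem.text.strip()))
        for child in elem:
            walk(child)
        tokens.append((END,))  # no element name needed on the end marker

    walk(ET.fromstring(xml_text))
    table = sorted(names, key=names.get)
    return table, tokens


def decode(table, tokens):
    # Rebuild angle-bracket XML; a stack recovers the name for each END.
    out, stack = [], []
    for token in tokens:
        if token[0] == START:
            name = table[token[1]]
            stack.append(name)
            out.append(f"<{name}>")
        elif token[0] == TEXT:
            out.append(token[1])
        else:
            out.append(f"</{stack.pop()}>")
    return "".join(out)
```

Note that nothing here depends on a schema: the string table is built from the document itself, which matches the view that the format should not be tied to one.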

The big win of XML is interoperability. XML compresses very nicely with gzip compression, so if you're worrying about XML size, just use gzip over HTTP which as Simon says, most HTTP / SOAP engines support (or should support). [James Strachan's Radio Weblog] [via Brett Morgan's Insanity Weblog Zilla]

My feeling is that we already have binary XML. Just open the file in a hex editor ;-) Seriously, I think that the best argument for binary XML is message size, and everyone's already pointed out that compression works well. Incidentally, the current nightly build for Apache SOAP 2.3.1 includes GZIP encoding support. Ingo mentioned that SOAP would be the ideal use for binary XML, but I find this paradoxical: SOAP is where interop matters most.
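
A quick way to see the compression argument for yourself is to gzip a repetitive XML document and compare sizes. The sketch below uses Python's standard gzip module; the sample document and the helper names are invented for the illustration, and real messages will compress more or less well depending on how repetitive they are.

```python
import gzip


def build_sample_xml(n_items=200):
    # Made-up, deliberately repetitive document, like many XML payloads.
    rows = "".join(
        f'<item id="{i}"><name>widget {i}</name>'
        f'<price currency="USD">{i}.99</price></item>'
        for i in range(n_items)
    )
    return f'<?xml version="1.0"?><catalog>{rows}</catalog>'.encode("utf-8")


def gzip_ratio(data):
    # Compressed size as a fraction of the original size.
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    data = build_sample_xml()
    print(f"original: {len(data)} bytes, gzip ratio: {gzip_ratio(data):.2f}")
```

The repeated tag and attribute names are exactly what a general-purpose compressor handles well, and the receiver still gets plain XML 1.0 after decompression.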

Anyway, I have no doubt that someday, somebody will come up with a binary encoding, get somebody big (MS, probably) to support it, and we'll see how the market accepts it.  My guess is that plain old XML will win out.

4:36:39 PM

