Monday, September 09, 2002
Found a link, via Brett Morgan, to a Python version of Paul Graham's Bayesian spam filter. Tim Peters, the author of the linked email, pretty well debunks the use of word bigrams for filtering, and he's got some interesting results. Paul's article now contains links to some other resources; interestingly, Microsoft apparently has a patent on a related idea.

Paul also links to an article on /. where a couple of posters proposed that spammers could start converting their content to images and use the <img> tag in HTML email. I think this would fail: once people start hitting the "Delete as Spam" button on those messages, the <img> tag would quickly become a spam indicator.

Someone told me the other day that they read a statistic that spammers send 12 million emails to get 3 responses. You'd have to cut that to 0 in 12 million to get them to stop. Given that most people aren't motivated enough to load any special filtering software (come to think of it, the people who click through spams probably like getting spammed), it seems like you'd have to get industry adoption on a grand scale to make this work. Hotmail, Yahoo! Mail, and AOL would all have to start doing this, and Exchange and Outlook would have to support it out of the box. So while it might be an effective technique, it probably won't do much to stop spammers.

In the original email, Tim Peters also makes a pithy comment that we should all take to heart concerning programming in general:
5:31:03 PM permalink
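To make the <img> point in the post above concrete, here's a rough Python sketch of how a per-token spam score could drift as people mark image-only mail as spam. It's loosely modeled on the per-word formula in Paul's article (good counts doubled, result clamped between .01 and .99, rarely seen tokens left at a neutral .4); the `TokenStats` class, the token lists, and the message counts are all made up for illustration, and this is not the code from Tim Peters' email.

```python
from collections import Counter


class TokenStats:
    """Hypothetical token counter for a Graham-style filter (illustration only)."""

    def __init__(self):
        self.good = Counter()  # token occurrences in legitimate mail
        self.bad = Counter()   # token occurrences in mail marked as spam
        self.ngood = 0         # number of legitimate messages seen
        self.nbad = 0          # number of spam messages seen

    def train(self, tokens, is_spam):
        """Record one message's tokens under the good or bad corpus."""
        if is_spam:
            self.bad.update(tokens)
            self.nbad += 1
        else:
            self.good.update(tokens)
            self.ngood += 1

    def probability(self, token):
        """Estimate how strongly a single token indicates spam."""
        g = 2 * self.good[token]  # doubled to bias against false positives
        b = self.bad[token]
        if g + b < 5:
            return 0.4  # too little evidence: treat as a neutral, unknown token
        bad_freq = min(1.0, b / max(1, self.nbad))
        good_freq = min(1.0, g / max(1, self.ngood))
        return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))


stats = TokenStats()

# Ordinary mail, plus a couple of legitimate HTML messages that contain images.
for _ in range(20):
    stats.train(["meeting", "tomorrow", "agenda"], is_spam=False)
for _ in range(2):
    stats.train(["<img", "newsletter"], is_spam=False)

before = stats.probability("<img")

# Users hit "Delete as Spam" on ten image-only spams.
for _ in range(10):
    stats.train(["<img"], is_spam=True)

after = stats.probability("<img")
print(f"<img before: {before:.2f}, after: {after:.2f}")
```

With these invented numbers the token starts out neutral and ends up strongly spammy after only ten reports. Real mail would be noisier, but the direction is the point.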
|
My feeling is that we already have Binary XML. Just open the file in a hex editor ;-) Seriously, I think that the best argument for Binary XML is message size, and everyone's already pointed out that compression works well. Incidentally, the current nightly build for Apache SOAP 2.3.1 includes GZIP encoding support. Ingo mentioned that SOAP would be the ideal use for Binary XML, but I find this paradoxical, since this is where interop matters most. Anyway, I have no doubt that someday, somebody will come up with a binary encoding, get somebody big (MS, probably) to support it, and we'll see how the market accepts it. My guess is that plain old XML will win out.
4:36:39 PM permalink
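For a feel of the "compression works well" argument in the post above, here's a small Python sketch that gzips a repetitive XML message and compares sizes. The envelope is a made-up, simplified stand-in (no namespaces, invented element names), not a real SOAP message, and it says nothing about how Apache SOAP implements its GZIP support.

```python
import gzip


def build_envelope(n_items):
    """Build a hypothetical, repetitive XML payload as UTF-8 bytes."""
    rows = "".join(
        f"<item><id>{i}</id><name>widget-{i}</name></item>"
        for i in range(n_items)
    )
    xml = (
        "<?xml version='1.0' encoding='UTF-8'?>"
        "<Envelope><Body><items>" + rows + "</items></Body></Envelope>"
    )
    return xml.encode("utf-8")


raw = build_envelope(200)
packed = gzip.compress(raw)

# The round trip is lossless, so the receiver still gets plain old XML.
assert gzip.decompress(packed) == raw

print(f"plain: {len(raw)} bytes, gzipped: {len(packed)} bytes")
```

Tag-heavy, repetitive documents like this one are about the best case for gzip; short messages with little repetition will shrink far less.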
|