Friday, August 16, 2002

In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term. [Paul Graham (via 0xDECAFBAD and Bitworking)]
Anybody who sends me an email that contains <font color="ff0000"> deserves to get thrown in the spam bucket, false positives be damned!

All kidding aside, this is a really interesting article. One heuristic that I've been using (manually) is to discard any email with garbage in the subject line. Spammers have been injecting random characters into the subject to defeat trivial subject filtering, but a nonsense word in the subject line is itself a nearly dead giveaway that the email is spam (barring a fat-fingered or spelling-impaired sender).

Paul mentions whitelists, but there have been some recent incidents of spammers hijacking other people's email addresses. One of my friends recently gave up his longtime address because a spammer was using it as the reply-to. Given that, whitelist filtering is probably out unless digital signatures are used to verify the sender. But of course, verifying digital signatures is probably as computationally intensive as Paul's statistical method, which negates the advantage that whitelist filtering is computationally cheaper.

At the end of his article, Paul mentions extending his filtering to word pairs or triples. This is conceptually similar to how a good encoding scheme works: the best schemes recognize that in English, "th" is far more common than many single characters, and "Q" almost never appears without a "u" following. Actually, I see quite a few parallels between Paul's article and some of the approaches to encoding and compression that I've heard of.
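To make the word-pairs idea a bit more concrete, here is a rough Python sketch of how a message might be turned into single-word and adjacent-word-pair features before any statistics are computed. This is not Paul's code: the function names (tokens, ngram_features), the max_n parameter, and the tokenizing pattern are all my own assumptions for illustration.

```python
import re

def tokens(text):
    """Lowercase the text and split it into word-like tokens.

    The character set here (letters, digits, $, ', -) is an assumption,
    not the tokenizer described in the article.
    """
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def ngram_features(text, max_n=2):
    """Return single tokens plus runs of adjacent tokens up to max_n long."""
    words = tokens(text)
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the message.
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features

print(ngram_features("Click here to unsubscribe"))
```

Each feature could then be counted in the spam and non-spam piles exactly as a single word would be; passing max_n=3 adds triples as well, at the cost of a much larger table.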

1:35:32 PM

