Sunday, August 18, 2002

What I'm really pondering here is whether this is just going to be a personal hack for the fun of it, or whether this is a commercially viable idea. Obviously Microsoft & Co. are going to be in this area soon. If all that is required is single-word statistical analysis, I suspect there will not be enough oxygen in this space.

However, I suspect that once you start doing more than spam/not spam, a simple Bayesian model will not be enough. I have a raft of ideas for extensions, e.g. switching to fuzzy logic, using sentence decomposition, using word proximity pairs, etc., that could make this space more livable for the fast-moving small software co. [Brett Morgan]
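To make the difference between single-word statistics and word proximity pairs concrete, here is a minimal sketch. It is a plain naive Bayes log-odds score with add-one smoothing, not Paul's exact formula and not anything Brett has described; the names `tokens`, `train`, and `spam_score` are made up for illustration, and the "pairs" are simply adjacent words.

```python
import math
import re
from collections import Counter


def tokens(text, pairs=False):
    """Split text into lowercase words; optionally add adjacent word pairs."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    if pairs:
        # Assumption: a "proximity pair" here is just two neighbouring words.
        words = words + [a + " " + b for a, b in zip(words, words[1:])]
    return words


def train(messages, pairs=False):
    """Count how many messages each token appears in."""
    counts = Counter()
    for msg in messages:
        counts.update(set(tokens(msg, pairs)))
    return counts


def spam_score(text, spam_counts, ham_counts, n_spam, n_ham, pairs=False):
    """Log-odds that text is spam (naive Bayes, add-one smoothing)."""
    score = math.log(n_spam / n_ham)
    for tok in set(tokens(text, pairs)):
        p_spam = (spam_counts[tok] + 1) / (n_spam + 2)
        p_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score
```

A score above zero leans spam, below zero leans legitimate. Passing `pairs=True` to both `train` and `spam_score` lets a phrase like "free offer" count as evidence on its own, separately from "free" and "offer".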

A couple of thoughts. The first is that spam doesn't just exist in English; for example, I've been receiving regular spam in Spanish for a while. So a multi-language solution might be attractive, though you'd have to pick the languages wisely. Probably not a lot of Finnish spam out there. The other thought is that if you started making any money off this technique, spammers might get wise and start putting innocuous-sounding text in a hidden element, for instance <span style="display:none">Hey, buddy what's up?  When are we going to get together for lunch?</span>SPAMSPAMSPAM. So I think that Paul's technique might need to become more context-aware. However, I was impressed with how much the "out of band" data gave away the content in his examples, so maybe I'm making too much of this.
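One way a filter could defend against the hidden-text trick is to throw away anything the reader would never see before counting words. The following is only a rough sketch using Python's standard `html.parser`; the `VisibleText` class and `visible_text` function are hypothetical names, it only looks for an inline `display:none` style, and it assumes reasonably well-formed markup (stylesheets, tiny fonts, and white-on-white text would all slip past it).

```python
import re
from html.parser import HTMLParser

HIDDEN = re.compile(r"display\s*:\s*none", re.IGNORECASE)
# Elements that never have a closing tag, so they must not be tracked.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class VisibleText(HTMLParser):
    """Collect only the text that sits outside display:none elements."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one bool per open element: is it hidden?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self.stack) and self.stack[-1]
        self.stack.append(parent_hidden or bool(HIDDEN.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Run on the example above, `visible_text` drops the friendly lunch invitation and leaves only the SPAMSPAMSPAM part for the filter to score. The presence of a hidden element could also be treated as a token in its own right, much like the "out of band" data in Paul's examples.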

5:50:36 PM

