I was reading Paul Graham's writings on spam filtering this morning, where he applies statistics to word usage in emails to sort out spam. Paul reports that this gives amazingly good (to me anyway) results: something like missing 5 spams in a thousand, with 0 false positives.
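To make the idea concrete for myself, here is a minimal sketch of word-level statistical spam scoring. It is loosely in the spirit of what Paul describes, not his actual code: the tokeniser, the 0.4 score for unseen words, the 0.01/0.99 clamps and the `top_n=15` cut-off are all my assumptions, and the function names are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(spam_msgs, ham_msgs):
    """Count, per word, how many spam and non-spam messages contain it."""
    spam_counts = Counter(w for m in spam_msgs for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_msgs for w in set(tokenize(m)))
    return spam_counts, ham_counts, len(spam_msgs), len(ham_msgs)

def word_spam_prob(word, model):
    """Estimate P(spam | word), clamped away from 0 and 1."""
    spam_counts, ham_counts, n_spam, n_ham = model
    s = spam_counts[word] / max(n_spam, 1)
    h = ham_counts[word] / max(n_ham, 1)
    if s + h == 0:
        return 0.4  # assumed score for a word never seen in training
    return min(0.99, max(0.01, s / (s + h)))

def spam_score(text, model, top_n=15):
    """Combine the most extreme per-word probabilities into one score."""
    probs = [word_spam_prob(w, model) for w in set(tokenize(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_like = ham_like = 1.0
    for p in probs[:top_n]:
        spam_like *= p
        ham_like *= 1.0 - p
    return spam_like / (spam_like + ham_like)
```

Train it with `model = train(spam_msgs, ham_msgs)` on mail the user has already sorted, then treat anything with a `spam_score` close to 1 as spam. Where to put the threshold is a tuning question I have not looked at.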
The real change here is that he is using statistics on user-sorted email. In my mind this harks back to my calls for using statistics to watch what the user does. I personally have major email sorting problems (think forty-odd email lists spread across three, soon to be two, email accounts).
I am interested to see if single-word statistics are enough to sort all my email. I think I have dreamed up a user interface that would work, and it involves abusing IMAP to within an inch of its spec.
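Going from spam/not-spam to "which of my folders does this belong in" could be sketched as a naive Bayes classifier with one class per folder, using nothing but single-word counts. This is only a sketch under my own assumptions: the `FolderClassifier` name, the folder names and the add-one smoothing are illustrative choices, and whether this is accurate enough across forty-odd lists is exactly the open question.

```python
import math
import re
from collections import Counter, defaultdict

class FolderClassifier:
    """Multinomial naive Bayes over single words, one class per folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages filed
        self.vocab = set()

    @staticmethod
    def words(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def learn(self, folder, text):
        """Record that the user filed this message under `folder`."""
        self.msg_counts[folder] += 1
        for w in self.words(text):
            self.word_counts[folder][w] += 1
            self.vocab.add(w)

    def guess(self, text):
        """Return the most probable folder, or None if nothing was learned."""
        total_msgs = sum(self.msg_counts.values())
        best, best_score = None, -math.inf
        for folder, n_msgs in self.msg_counts.items():
            counts = self.word_counts[folder]
            # Add-one smoothing, with one extra slot for unseen words.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n_msgs / total_msgs)
            for w in self.words(text):
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = folder, score
        return best
```

Every time the user drags a message into a folder, call `learn(folder, text)`; for new mail, `guess(text)` proposes a folder. The user's own sorting is the only training signal.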
The advantage of abusing IMAP is that all the current IMAP clients can be used with my initial prototype, thus eliminating the need to re-invent that wheel. I suspect I should be able to use Apache's Java email server as my experimental base. James is what I was thinking of, but it appears to only do POP3. I wonder how much pain would be involved in making it speak IMAP. Hmmm.
My real ponderance here is whether this is just going to be a personal hack for the fun of it, or whether this is a commercially viable idea. Obviously Microsoft & Co are going to be in this area soon. If all that is required is single-word statistical analysis, I suspect there will not be enough oxygen in this space.
However, I suspect that when you start doing more than spam/not spam, a simple Bayesian model will not be enough. I have a raft of ideas for extensions, e.g. switching to fuzzy logic, using sentence decomposition, using word proximity pairs, etc., that could make this space more livable for the fast-moving small software co.
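As one illustration of what "word proximity pairs" might look like as features, here is a small sketch. It assumes the idea means "pairs of words that occur within a few words of each other"; the `proximity_pairs` name and the default `window=3` are my own placeholders, nothing more settled than that.

```python
import re

def proximity_pairs(text, window=3):
    """Return unordered pairs of distinct words at most `window` words apart."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = set()
    for i, first in enumerate(words):
        # Look only at the next `window` words after position i.
        for second in words[i + 1 : i + 1 + window]:
            if first != second:
                pairs.add(tuple(sorted((first, second))))
    return pairs
```

Each pair could then be counted in place of, or alongside, single words in the statistics, at the cost of a much larger feature table.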
But could this be a saleable product? Would people pay for this? Obviously in places that use IMAP currently (say Linux/BSD/Solaris/*) this is reasonably easy to slot in.
For places that use Exchange, I can't see how to do it. I don't want to have to integrate with Exchange. It's a pain tolerance thing. Of course there is the approach of writing an Outlook plugin. Hmmm.
Thoughts?
[later...] Here's a handful of thoughts culled from the blogosphere:
[Even later...] Started hacking through the James source base, and there appears to be the start of IMAP support. That is easier to start with than nothing at all... Hey, there even appears to be a reasonably active developer community. Kewl.
1:05:30 PM