The following are Tim Peters' comments on naive Bayesian classifiers (NBCs), explaining the motivation for using Paul Graham's scheme rather than an NBC as the basis for the spambayes project. Via personal email.
I read several of the papers on NBCs, and knew a little about them before
from a previous life, and my impression was they weren't *this* good at this
specific task, at least not without major implementation effort involving
lots of obscure technical tricks. There was also the testimony of ifile(*)
users that it didn't do very well when asked just to distinguish spam from
non-spam; indeed, reports were that it did better if told to distinguish
get-rich-quick spam from human-growth-hormone spam from Nigerian-scam-spam,
etc.
Recent papers on NBCs for the spam-versus-non-spam task were pulling tricks
like stopword lists and mutual information calculations, finding that these
improved results by keeping "junk words" out of the calculation. Graham's
scheme carries that to an extreme in an intuitively appealing way, and for
the specific spam-vs-not-spam task it may well be that the problem *is*
extremely easy if approached in a suitably extreme way. On a bang for the
buck invested measure, I haven't written code for any other messy real-life
task that performed this well with so little total effort. (So, yes, the
bottom line is just that this has been fun.)
(*) ifile is a good implementation of "a classic" N-way NBC:
http://www.ai.mit.edu/~jrennie/ifile/
I doubt that Graham's approach would work well for a subtle
classification problem -- but it's not trying to, and it works
exceedingly well for what it is trying to do. That seems a
stroke of genius to me.
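For readers who haven't seen Graham's "A Plan for Spam," here is a minimal sketch of the kind of "extreme" combining Tim is describing: every token gets a spam probability estimated from training counts, only the handful of tokens whose probabilities sit farthest from 0.5 are kept, and those few are combined with a naive-Bayes-style product. The function name, the clamping constants, and the 15-token window below are illustrative assumptions, not the actual spambayes code.

```python
# Illustrative sketch of Graham-style scoring (not the actual spambayes code).
# Assumes per-token spam probabilities have already been estimated from
# counts of how often each token appears in spam vs. non-spam training mail.

def graham_score(tokens, spam_prob, unknown=0.4, keep=15):
    """Combine the `keep` most extreme token probabilities into one score.

    tokens    -- iterable of token strings from the message
    spam_prob -- dict mapping token -> estimated P(spam | token)
    unknown   -- probability assigned to tokens never seen in training
    keep      -- how many of the most "interesting" tokens to use
    """
    # Look up each distinct token, clamping to avoid certainties of 0 or 1.
    probs = []
    for tok in set(tokens):
        p = spam_prob.get(tok, unknown)
        p = min(max(p, 0.01), 0.99)
        probs.append(p)

    # Keep only the tokens whose probabilities are farthest from 0.5 --
    # the "extreme" filtering that keeps junk words out of the calculation.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:keep]
    if not probs:
        return unknown

    # Naive-Bayes-style combination of the surviving probabilities.
    prod_p = 1.0
    prod_not_p = 1.0
    for p in probs:
        prod_p *= p
        prod_not_p *= 1.0 - p
    return prod_p / (prod_p + prod_not_p)


if __name__ == "__main__":
    # Toy example with made-up token probabilities.
    table = {"viagra": 0.99, "meeting": 0.05, "free": 0.85, "python": 0.02}
    print(graham_score("free viagra free offer".split(), table))
```

The point of the sketch is the filtering step: rather than weighting every word, the score is driven entirely by the few most decisive tokens, which is why junk words simply never enter the calculation.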