The following are Tim Peters' comments on naive Bayesian classifiers (NBCs), explaining the motivation for using Paul Graham's scheme rather than an NBC as the basis for the spambayes project. Via personal email.
I read several of the papers on NBCs, and knew a little about them before
from a previous life, and my impression was they weren't *this* good at this
specific task, at least not without major implementation effort involving
lots of obscure technical tricks. There was also the testimony of ifile(*)
users that it didn't do very well when asked just to distinguish spam from
non-spam; indeed, reports were that it did better if told to distinguish
get-rich-quick spam from human-growth-hormone spam from Nigerian-scam-spam,
etc.
Recent papers on NBCs for the spam-versus-non-spam task were pulling tricks
like stopword lists and mutual information calculations, finding that these
improved results by keeping "junk words" out of the calculation. Graham's
scheme carries that to an extreme in an intuitively appealing way, and for
the specific spam-vs-not-spam task it may well be that the problem *is*
extremely easy if approached in a suitably extreme way. On a bang for the
buck invested measure, I haven't written code for any other messy real-life
task that performed this well with so little total effort. (So, yes, the
bottom line is just that this has been fun.)
(*) ifile is a good implementation of "a classic" N-way NBC:
http://www.ai.mit.edu/~jrennie/ifile/
I doubt that Graham's approach would work well for a subtle
classification problem -- but it's not trying to, and it works
exceedingly well for what it is trying to do. That seems a
stroke of genius to me.
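For readers who haven't seen Graham's "A Plan for Spam," here is a minimal sketch of the kind of "extreme" combining Tim is describing: every token gets a spam probability estimated from training counts, only the handful of tokens whose probabilities sit farthest from 0.5 are kept, and those few are combined with a naive-Bayes-style product. The function name, the clamping constants, and the 15-token window below are illustrative assumptions, not the actual spambayes code.

```python
# Illustrative sketch of Graham-style scoring (not the actual spambayes code).
# Assumes per-token spam probabilities have already been estimated from
# counts of how often each token appears in spam vs. non-spam training mail.

def graham_score(tokens, spam_prob, unknown=0.4, keep=15):
    """Combine the `keep` most extreme token probabilities into one score.

    tokens    -- iterable of token strings from the message
    spam_prob -- dict mapping token -> estimated P(spam | token)
    unknown   -- probability assigned to tokens never seen in training
    keep      -- how many of the most "interesting" tokens to use
    """
    # Look up each distinct token, clamping to avoid certainties of 0 or 1.
    probs = []
    for tok in set(tokens):
        p = spam_prob.get(tok, unknown)
        p = min(max(p, 0.01), 0.99)
        probs.append(p)

    # Keep only the tokens whose probabilities are farthest from 0.5 --
    # the "extreme" filtering that keeps junk words out of the calculation.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:keep]
    if not probs:
        return unknown

    # Naive-Bayes-style combination of the surviving probabilities.
    prod_p = 1.0
    prod_not_p = 1.0
    for p in probs:
        prod_p *= p
        prod_not_p *= 1.0 - p
    return prod_p / (prod_p + prod_not_p)


if __name__ == "__main__":
    # Toy example with made-up token probabilities.
    table = {"viagra": 0.99, "meeting": 0.05, "free": 0.85, "python": 0.02}
    print(graham_score("free viagra free offer".split(), table))
```

The point of the sketch is the filtering step: rather than weighting every word, the score is driven entirely by the few most decisive tokens, which is why junk words simply never enter the calculation.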