Update: I have removed "Further Improvement 2" because it looks like real-world data is messy enough that it is doubtful it would offer much benefit without a lot more work.
A fair amount of testing has been done since this essay was posted on Sept. 16, 2002, and the results look promising. The first test that was done combining the original idea and "Further Improvement 1" is discussed here, and it is very positive. In general the spambayes mailing list has emerged as the center of testing, so if you're interested, it would make sense to go there. Also, this essay is undergoing extensive revisions as feedback comes in. Where someone has made an important contribution, I'll point it out.
It's been exciting to see a lot of progress on the antispam front, much
of it focused on Paul Graham's
essay in which he describes a Bayesian technique for detecting spam.
A couple of problems with it, though:
1) While originally I thought that Paul's algorithm was in no way Bayesian, I have since noticed that if (and only if) a particular less-than-optimal assumption is made, part of it could be viewed as Bayesian. But it's a pretty abstract case for Bayesianness. In any case, there's no need to dwell on Bayesianness or non-Bayesianness; we have bigger fish to fry. (Note: Tim Peters of spambayes fame has posted another way of looking at Paul's approach as Bayesian, although to do so he needs to make the unrealistic assumption that spams and non-spams are equally likely.)
2) I believe that Paul's algorithm could perform better by changing it
to address a couple of problems.
2a) The real justification for Paul's technique comes from this anonymous article.
But if you read it, you'll see that the approach there assumes that the probabilities
are independent (which they are not when you're talking about words in emails).
Since they aren't, the output is not really a probability.
2b) The same anonymous article also makes assumptions that only have a clear relation to a completely different use of the calculations.
In this different use, there are individuals with different degrees of accuracy
in making yesorno predictions, and the assumption is that the probability
of the prediction being correct is the same whether the answer is actually
yes or no. It's hard to relate that to the way the calculation is being used
in Paul's essay, but even if we can, the assumption about the same probability
of correctness in either direction has no particular justification in this
context.
2c) Because of those problems, the number coming out of the calculation
Paul is using is not really a probability. This has some unpredictable effects.
For instance, the .9 cutoff Paul uses means different things depending on
how many words are in the email.
2d) Also, Paul's approach is subtly asymmetric in how it handles words that indicate "spamminess" compared to how it handles words indicating "non-spamminess": it is less sensitive in one case than the other. Some people feel that this is good for reducing false positives. But it is a hypothesis of this essay that there is evidence both in favor of, and against, a spam classification for any email, and that we get the best performance by handling both kinds of evidence equally well.
Because I am interested in the spam problem (for one thing, I get 5 or
10 spams per hour) and have a background in working on these kinds of problems,
I decided to see if I could improve the algorithm. Also, I wanted to have
a solution that could be tested within the code people have already written
to use Paul's technique, without a lot of additional effort. So I came up
with a very simple alternative approach.
It is untested as of now; it is based purely on theoretical reasoning. If anyone wants to try it and test it in comparison to other techniques, I'd be very interested in hearing the outcome.
Here's what you should do.
I am assuming you have used simple counting to create a set of numbers that, while they may not be actual reliable probabilities, are at least guesstimates of probabilities. They won't be statistically independent, but that's fine. This is what Paul does; just follow the model from his essay.
Let's call these numbers p1, p2,..., pn for a given email. (Globally,
we will refer to the probability guesstimate associated with a given word,
based on counts from the database, as p(w).) So for that email there are
n words and n associated probability guesstimates p(w), each indicating
the approximate likelihood that the containing email is a spam.
Furthermore, I am assuming that a value near 1 indicates spamminess and a value near 0 indicates non-spamminess.
So far we aren't doing anything different than Paul described.
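To make the starting point concrete, here is a sketch of the kind of per-word counting guesstimate Paul describes (the function name, the doubling of good counts, and the .01/.99 clamping values reflect my reading of his essay; consult it for the authoritative details):

```python
def graham_prob(bad_count, good_count, nbad, ngood):
    """One reading of Graham-style per-word spam guesstimate p(w).

    bad_count / good_count: occurrences of the word in spam / good emails.
    nbad / ngood: total number of spam / good emails in the corpus.
    Good counts are doubled to bias against false positives, and the
    result is clamped away from 0 and 1 (details hedged).
    """
    g = 2 * good_count
    b = bad_count
    if g + b < 1:
        return None  # no data at all; see "Further Improvement 1"
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    total = bad_ratio + good_ratio
    p = bad_ratio / total if total else 0.0
    return max(0.01, min(0.99, p))
```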
Now, calculate
P = 1 - ((1-p1)*(1-p2)*...*(1-pn))^(1/n)
[spamminess]
Q = 1 - (p1*p2*...*pn)^(1/n)
[non-spamminess]
S = (P - Q) / (P + Q)
[combined indicator]
["^(1/n)" should be read as "take the nth root"]
S is a number between -1 and 1. High numbers mean it's spam; low numbers mean it's not. 0 means there's equal evidence both ways.
If you prefer a number between 0 and 1, just use (1 + S) / 2. (That puts
it on the same scale as Paul's approach.)
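The combination step above is only a few lines of code. A minimal Python sketch (assuming each p_i is strictly between 0 and 1, as clamping in the counting step guarantees):

```python
import math

def spam_indicator(probs):
    """Combine per-word spam guesstimates p1..pn into S in [-1, 1].

    P = 1 - (prod(1 - p_i))^(1/n)   (spamminess)
    Q = 1 - (prod(p_i))^(1/n)       (non-spamminess)
    S = (P - Q) / (P + Q)           (combined indicator)
    """
    n = len(probs)
    # The nth roots of the products are geometric means; computing them
    # in log space avoids floating-point underflow on long emails.
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (P - Q) / (P + Q)
```

To get the 0-to-1 scale mentioned above, apply (1 + S) / 2 to the result.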
When you pick your cutoff point for S, it will have consistent meaning for larger or smaller emails, and it will be symmetric in its treatment of spam-indicating and non-spam-indicating words.
And it will treat them in a way that is consistent with a theorem [1] that proves a most-sensitive-possible technique for determining whether there is an underlying tendency for probabilities to go in a particular direction. The connection would take too much time to explain here, but it is a great justification for why this should work really well. (Note to statisticians: the product of the probabilities is monotonic with the Fisher inverse chi-square combined probability technique from meta-analysis. The null hypothesis is that the probabilities are independent and uniformly distributed. I believe that the former hypothesis is violated in ways that have a desirable impact on the output according to whether an email is spam or non-spam, and that the latter is violated in ways that aren't prejudicial to either outcome. We don't need to consider the degrees of freedom in doing a comparison because both the spam and non-spam cases have the same degrees of freedom. Note that this comparison does not actually rely on the assumption that we can invoke the Fisher technique; Fisher is merely tied to the possibility that this is the most sensitive possible way to do this test.)
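For readers curious about the Fisher technique mentioned in the note, here is a textbook implementation sketch (standard Fisher combining, not code from this essay): under the null, -2 Σ ln p_i is chi-square with 2n degrees of freedom, and for even degrees of freedom the chi-square survival function has a closed form, so no stats library is needed.

```python
import math

def fisher_combine(probs):
    """Fisher's method for combining independent p-values p1..pn.

    The chi-square statistic is -2 * sum(ln p_i) with 2n degrees of
    freedom; for even degrees of freedom its survival function is
    exp(-m) * sum_{k=0}^{n-1} m^k / k!, where m is half the statistic.
    """
    m = -sum(math.log(p) for p in probs)  # half the chi-square statistic
    term = math.exp(-m)
    total = term
    for k in range(1, len(probs)):
        term *= m / k
        total += term
    return min(total, 1.0)  # combined p-value; small means reject the null
```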
I think this technique is worth trying. It's simple and should work well.
To give it a fair shake though, you will have to take the time to try various
cutoff points for S. The most logical first thing to try is 0, unless you've
adapted the output to be between 0 and 1 for consistency with Paul's approach.
In that case you might start with .5.
Further Improvement 1
Note: this section has been very extensively revised based on feedback from Tim Peters, who pointed out a problem with my original approach when used in a world where people may classify all their spams but not bother to classify most of their good emails.
When there is no existing classification data for a word, Paul lets his probability be .4 (in the current version of his essay). He finds that that's a reasonable guesstimate when there's no data. But the same problem exists, to a lesser extent, in low-data situations such as 1 or 2 data points as well. In such cases it's true that we have some data, but we don't really have enough to be confident. Obviously we're forced to do something when n is 0 because there is no data; but in reality we should do something when n is any low number, 0 being just the most extreme case. Handling the case of 0 while ignoring other low-data cases doesn't really make sense.
Bayesian statistics gives us a way to do that... such that all the low-data situations (including the no-data situation) are treated consistently and smoothly.
The first step is to calculate the probability of a word using Paul's existing techniques. Call this p(w) for word w. Let n be the number of data points. (If n=0, Paul would substitute .4, but we wouldn't need to do that.)
Let x = the number used when n is 0 (I'll discuss how to compute it below). Let a be a constant that is tuned for performance, with a reasonable default value of 1.
Now calculate
f(w) = (a + (n * p(w))) / ((a / x) + n)
f(w) is a probability guesstimate that smoothly handles the case of n=0 and other low-n cases in a consistent and meaningful way.
This calculation can be justified by a Bayesian argument that I won't include
here in order to keep things simple. (Let me know if you want to see the
justification.)
x can be computed by trial and error as Paul did. However, another approach
would be to calculate p(w) for all words that have, say, 10 or more data
points, and then use the average.
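The smoothing formula is a one-liner in practice. A sketch (function and argument names are mine; the default x = .4 merely echoes Paul's no-data value, and both x and a should be tuned as described above):

```python
def smoothed_prob(p_w, n, x=0.4, a=1.0):
    """Shrink a raw per-word guesstimate p(w) toward the no-data value x.

    f(w) = (a + n * p(w)) / (a / x + n)

    With n = 0 this returns exactly x; as n grows, f(w) smoothly
    approaches the raw guesstimate p(w).
    """
    return (a + n * p_w) / (a / x + n)
```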
Full Bayesian Approach
Finally, it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above. See also "An evaluation of Naive Bayesian anti-spam filtering" by Androutsopoulos et al. And this comment from Tim about the relative merits of the naive Bayesian approach vs. the Graham approach may also be of interest.
If anyone does a performance comparison of the naive Bayesian approach vs.
the approach discussed here, I would really love to see it.
In Closing
Please email me, grobinson@transpose.com, with the results of any experiments,
or if you have any questions or suggestions.
If you do use some or all of the techniques here for spam detection,
all I ask is that you include a link to this page on your site.
[1] Littell, Ramon C., and J. Leroy Folks (1971). "Asymptotic Optimality of Fisher's Method of Combining Independent Tests." Journal of the American Statistical Association, 66(336), pp. 802-805.


© Copyright 2006 Gary Robinson.
Last update: 1/30/06; 2:48:12 PM.