


Spam DetectionUpdate: A lot of editing has been done to this piece. I eventually decided that there should be some kind of version history so I posted the version that was current on Sept. 24, 2002 here. Update (Jan 30, 2006): I've moved my blog, where I discuss spam technology and other stuff, to http://www.garyrobinson.net. My Linux Journal article is now online at http://www.linuxjournal.com/article.php?sid=6467. You should definitely check that out if you're interested in the concepts expressed on this page, which are really only a simplified approximation to the version in the article. If you want to implement the algorithm, you should definitely tune the parameters according to Greg Louis' advice. A book, Ending Spam, by Jonathan Zdziarski has been published. This book gives further discussion of the techniques outlined here (and of many other antispam techniques as well). Update (Feb 6, 2004): On my new blog, I've added an entry about the "training to exhaustion" technique that may be of some interest, although it's completely separate from the work described on the page you're reading and in the Linux Journal article. I only mention it here because it seems that some who are interested in the technique described here may want to take a look at the other as well. Update (Mar. 29, 2003): Greg Louis has done more testing, and offers important advice on tuning the parameters for the procedure described here. In another piece, he also describes an improvement based on the chisquare distribution which I suggested in direct communication with the SpamBayes group (and which benefits from a further enhancement suggested by Tim Peters of that group), and which is also described in my Linux Journal article. Update (Feb. 26, 2003): My Linux Journal article is out (March 2003 issue), which goes beyond the article presented here by bringing in a further improvement based on the chisquare distribution. Unfortunately I can't supply a link since it's only in the printed magazine for now. In other developments, Hexamail says their Hexamail Guard filter is based on this work. Update (Dec. 9, 2002): I've written an article on the techniques described here plus a further one involving Fisher's approach to metaanalysis, which will be published in an upcoming issue of Linux Journal. These techniques, together, have beaten naive Bayesian classification and classification based on the Bayesian chain rule in headtohead testing. Due to the constraints of the publishing process, I can't make the article available here until it's published by Linux Journal, but Greg Louis has written up the idea here. Note that there are further enhancements waiting to be tested. One last item for today, I became aware today that SpamAssassin is using some of these ideas. Another update (Nov. 5, 2002): Bogofilter now has the approach described here as a builtin option. It is testing very well against the original approach. A fair amount of testing has been done since the original version of this essay was posted on Sept. 16, 2002 with the results that as of Sept. 26, 2002 the spambayes project has decided to use the algorithms below. The first test that was done that combines the original idea and "Further Improvement 1" is discussed here and it is very positive. In general the spambayes mail list has emerged as the center of testing, so if you're interested, it would make sense to go there. Also, this essay is undergoing extensive revisions as feedback comes in. Where someone has made an important contribution, I'll point it out. It's been exciting to see a lot of progress on the antispam front, much of it focused on Paul Graham's essay in which he describes a Bayesian technique for detecting spam. A couple problems with it though: 1) Paul's approach has become fairly famous for filtering spam in a Bayesian way. That's only true if we make fairly large leaps of the imagination. Originally after reading his issay I thought that it was in no way Bayesian, but I have since noticed that if and only if a particular lessthanoptimal assumption is made, part of it could be viewed as Bayesian through a very obscure argument. But it's a pretty remote case for Bayesianness. In any case, there's no need to dwell on Bayesianness or nonBayesianness; we have bigger fish to fry. (Note: Tim Peters of spambayes fame has posted another way of looking at Paul's approach as Bayesian, although to do so he needs to make the unrealistic assumption that spams and nonspams are equally likely.) 2) I believe that Paul's algorithm could perform better by changing it to address a couple of problems. 2a) The real justification for Paul's technique comes from this anonymous article. But if you read it, you'll see that the approach there assumes that the probabilities are independent (which they are not when you're talking about words in emails). 2b) The same anonymous article also makes assumptions that only have a clear relations to to a completely different use of the calculations. In this different use, there are individuals with different degrees of accuracy in making yesorno predictions, and the assumption is that the probability of the prediction being correct is the same whether the answer is actually yes or no. It's hard to relate that to the way the calculation is being used in Paul's essay, but even if we can, the assumption about the same probability of correctness in either direction has no particular justification in this context. 2c) Because of those problems, the number coming out of the calculation Paul is using is not really a probability. This has some unpredictable effects. For instance, the .9 cutoff Paul uses means different things depending on how many words are in the email. 2d) Also, Paul's approach is subtly asymmetric with respect to how it handles words that indicate "spamminess" compared to how it handles words indicating "nonspamminess". It is less sensitive in one case than the other. Some people feel that this is good for reducing false positives. But it is a hypothesis of this essay that there is evidence both in favor, and against, a spam classification for any email, and we that get best performance by not lessening our handling both kinds of evidence equally well. Testing so far seems to corroberate that assumption. 2e) Paul's approach tends to produce numbers that are so extreme for either classification that there isn't room to adjust a setting for a little more or a little less sensitivity to spamminess. This is a feature many people would like to have. Here's an alternative approach. I am assuming you have used simple counting to create a set of numbers that, while they may not be actual reliable proabilities, are numbers that are at least guesstimates of probabilities. They won't be statistically independent, but that's fine. This is what Paul does; just follow the model from his essay. Let's call these numbers p1, p2,..., pn for a given email. (Globally, we will refer to the probability guesstimate associated with a given word, based on counts from the database, as p(w).) So for that email there are n words and n associated probability guesstimates p(w), each indicating the approximate likelihood that the containing email is a spam. Furthermore I am assuming that a value near 1 indicates spamminess and a value near 0 indicates nonspamminess. So far we aren't doing anything different than Paul described. Now, calculate P = 1  ((1p1)*(1p2)*...*(1pn))^(1/n) [spamminess] Q = 1  (p1*p2*...*pn)^(1/n) [nonspamminess] S = (P  Q) / (P + Q) [combined indicator] ["^(1/n)" should be read as "take the nth root"] S is a number between 1 and 1. High numbers mean it's spam. Low numbers mean it's not. 0 means there's equal evidence both ways. If you prefer a number between 0 and 1, just use (1 + S) / 2. (That puts it on the same scale as Paul's approach.) When you pick your cutoff point for S, it will have consistent meaning for larger or smaller emails, and it will be symmetric in its treatment of spamindicating and nonspamindicating words. Also, there is a middle ground; the user can adjust S to vary the degree of sensitivity to spamminess. Also, this calculation will treat the probability guesstimates in a way that is consistent with a theorem [1] that proves a mostsensitivepossible technique for determining whether there is an underlying tendency for probabilities to go in a particular direction. The connection would take too much time to explain here, but it is a great justification for why this should work really well. (Note to statisticians: the product of the probabilities is monotonic with the Fisher inverse chisquare combined probability technique from metaanalysis. The null hypothesis is that the probabilities are independent and uniformly distributed. I believe that the former hypothesis is violated in ways that have a desirable impact on the output according to whether an email is spam or nonspam, and that the latter is violated in ways that aren't prejudicial to either outcome. We don't need to consider the degrees of freedom in doing a comparison because both the spam and nonspam cases have the same degrees of freedom. Note that this comparison does not actually rely on the assumption that we can invoke the Fisher technique; it is just interesting that we are not too far removed from conditions which would make it arguably the most sensitive possible way to do this test. Our technique does not depend on the assumption that the p's are "really" probabilities; rather they are quantities that correspond to spamminess or nonspamminess.) Further Improvement When there is no existing classification data for a word, Paul lets his probability be .4 (in the current version of his essay). He finds that that's a reasonable guesstimate when there's no data. But the same problem exists to a lesser extent in lowdata situations such as 1 or 2 data points as well. In such cases, it's true that we have some data, but we don't really have enough to be confident. Obviously we're forced to do something when n is 0 because there is no data; but in reality we should do something when n is any low number  0 is just the most extreme case. Handling the case of 0 and ignoring other lowdata cases doesn't really make sense. Bayesian statistics gives us a way to do that... such that all the lowdata situations (including the nodata situation) are treated consistently and smoothly. The first step is to calculate the probability of a word using Paul's existing techniques. Call this p(w) for word w. Let n be the number of data points. (If n=0, paul would substitute .4, but we wouldn't need to do that.) Let x = the number used when n is 0 (I'll discuss how to compute it below). Let s be a constant that is tuned for peformance, with a reasonable default value of 1. Now calculate (s * x) + (n * p(w)) f(w) =  s + n f(w) is a probability guesstimate that smoothly handles the case of n=0 and other lown cases in a consistent and meaningful way. x can be computed by trial and error, with a starting estimate of .5. However, another approach would be to calculate p(w) for all words that have, say, 10 or more data points, and then use the average. If you want to understand this in detail and haven't studied Bayesian stats please refer to this paper: http://citeseer.nj.nec.com/135897.html, which is a tuturial by David Heckerman on Bayesian networks. You only need to read through page 7. From this point on I take it for granted that the reader understands as much Bayesian theory as is communicated by Heckerman in those 7 pages. But even without that understanding the reader will probably be able to get a very rough highlevel view out of this. Let us start in an imaginary world where there are exactly as many spams as nonspams in the training corpus. Then I'll bring it back to the real world at the end. For now, there are the same number of spams and hams. We are going to do a test that is the equivalent of the "experiment" of tossing a coin multiple times to see whether it appears to be biased. There are n trials. If we were tossing a coin, each coin toss would be a "trial" and we'd be counting the number of heads. But in our case, the trial is looking at the next occurrence of "Nigeria" in the training corpus and seeing whether the email it contains is a spam. Clearly, it's a binomial experiment: there are two values, yes or no. And they are independent, at least if you only consider the first appearance of Nigeria in a given email and don't use the other occurrences in that email as trials. (So we'll assume we're doing that and bring it back to Grahamian reality at the end.) Now, it happens that if you have a binomial experiment and assume a beta distribution for the prior, you can express the expectation that the nth + 1 trial will come out "yes" as (s * x) + y . (See Heckerman.) s + n A beta distribution is determined by 2 parameters. I'll call them u and v. The relationship between s, x, u, and v is as follows: s = u + v s * x = u If you want to see the affects of changing s on the beta distribution, just hold x constant and recalculate u and v, then graph it. But actually looking at the beta distribution is just esoterica for our application. The beauty of using s and x rather than u and v in our calcs is that intuitively, we can see s as the "strength" (hence the use of "s") of an original assumed expectation for y relative to how strongly we want to consider our actual collected data. And the x represents that expectation. And when n is 0, the expression directly gives our assumed expectation. So, while u and v more directly describe the particular beta distribution, s and x are much more helpful for our actual use of the calculation. In particular s can be "tuned" according to whatever strength of the prior assumption leads to the best performance, and x can be tuned for whatever assumed Grahamian probability leads to the best performance. If desired, a reasonable value for x can be calculated from the training data by averaging Graham's probability, which we shall call p(w), for all words w. So, suppose we have a corpus with the same number of spams as hams, and we want to calculate an expectation for the word "Nigeria." We start looking through emails, one by one, incrementing n for each email that contains "Nigeria," and incrementing y if that email is a spam. Before we look at our first email, our Bayesian calc gives us our assumed expectation, x: (s * x) + 0 x = . s + 0 As we get more data, the expression moves gradually farther away from x, asymptotically approaching the actual probability that an email containing "Nigeria" is a spam. This is all well and good, except for one thing: we don't have the same number of spams and hams in our training data. The question is, do we want to give more weight to evidence in favour of spamminess or hamminess because of the fact that the particular individual using a system built on this technique might happen to receive more ham than spam (or vice versa)? Graham implicitly assumes not. And I tend to agree. And most importantly, in testing, this technique, with Graham's assumption, seems to "work." Here I will quote from Tim Peters' description of the Grahamian method (by personal email): > suppose Nigeria appears in 50 ham and 50 spam... In [Graham's] > scheme, it also depends on the total number of hams and of spams. Say there > are 100 spams and 200 hams. Then Graham effectively says "OK, it appears in > half of the spams, but only a quarter of the hams, so if I see Nigeria it's > twice as likely to be a spam, so the spamprob is really 2/3" (If you're familiar with Graham's technique, note that we're ignoring the fact that he multiplies the ham counts by 2, which testing is showing is not necessary or helpful when doing things as described in this essay.) Now, when we're using this Grahamian calculation, that's different from the simple count described above for the binomial experiment in order to derive an expectation. However: Let p(w) be the result of the Grahamian calculation described by Tim Peters. It is easy to see that the expression (n * p(w)), where n is the number of "Nigeria" instances looked at, approximates the value the counter I describe above WOULD have had if we actually had the same number of spams and hams (of course this becomes more true the larger the number of spams and hams we have). For instance, in Tim's example, let's assume that instead of 100 spams and 200 hams, there were 150 spams and 150 hams, leaving the total number of emails at the same 300. Then "Nigeria" would have appeared in approximately 75 spams and 37.5 hams. So, our counters would have arrived at approximately the values: n= 112.5 y = 75 (And of course 75/112.5 = 2/3 = .6666.) Let p(w) be the Grahamian probability, .666666. Then our formula would be: (s * x) + 75 f(w) =  s + 112.5 (s * x) + (112.5*.666666) =  s + 112.5 (s * x) + (n * p(w)) =  s + n And we have arrived at the desired expression. Note 1: We could be more rigorous about the derivation of our expression from Graham's. The expectation of ham or spam for Nigeria could both be separately calculated using a binomial random variable with a beta prior. But for corpuses of any size at all, the s and x terms would be completely overwhelmed by the actual data, so there is really no point in going through the exercise of doing that calculation. However, I believe this procedure could be used to derive a fairly rigorous justification of the Grahamiam assumption, which I am not taking the time to do here. Note 2: In calculating p(w) Graham counts every instance of word w in every email in which it appears. Obviously, if a word appears once in an email, there is a greater probability that it will appear again in that same email than if it hadn't already appeared at all. So, the random variable is not really independent under Graham's technique which is one reason why, in the description above, we only count the first occurrence. However, we are pragmatic and whatever works best in practice is what we should do. There is some evidence at this point that using the Graham counting technique leads to slightly better results than using the "pure" technique above. This may because it is not ignoring any of the data. So, p(w) and n should simply be computed the way that gives the best results. Note 3: One could invoke the "naive Bayesian assumption" of independence even where the variables are known to not be independent. That has been proven to be an acceptible assumption in certain contexts usually having to do with classification. I don't think this application meets the requirements for invoking those proofs but I haven't studied that question in detail. Full Bayesian Approach Finally it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above. See also An evaluation of Naive Bayesian antispam filtering by Androutsopoulos et. al. If anyone does a direct performance comparison of the naive Bayesian approach (particularly a boosted one) vs. the approach discussed here, I would really love to see it. However see this comment from Tim about the relative merits of the naive Bayesian approach vs. the Graham approach. Update: such testing has been done against unboosted naive Bayes in the context of Bogofilter, and the technique presented here, with the enhancements from my Linux Journal article and described by Greg Louis were preferred, although I'm still hoping to see a test against boosted naive Bayes. In Closing Please email me, grobinson@goombah.com, with the results of any experiments, or if you have any questions or suggestions. In most cases, if you do use some or all of the techniques here for spam detection, all I ask is that you include a prominent acknowledgment and a link to this page on your site. If the product is a proprietary (nonopensource) software product running on the user's machine I request that the link and acknowledgment be prominent without necessarily being always visible... for instance they could figure prominently on one of the most frequently viewed Help screens or on the About screen. [1] Little, Rarmond C., J. Leroy Folks (1971) Asymptotic Optimality of Fisher's Method of Combining Independent Tests. Journal of the American Statistical Association, 336(66), Pp. 802805 Gary Robinson is CTO of Emergent Music, LLC.

