Update: I have removed "Further Improvement 2" because it looks like real-world data is messy enough that it is doubtful it would offer much benefit without a lot more work.
A fair amount of testing has been done since this essay was posted on Sept. 16, 2002, and the results look promising. The first test that was done combining the original idea and "Further Improvement 1" is discussed here, and it is very positive. In general the spambayes mailing list has emerged as the center of testing, so if you're interested, it would make sense to go there. Also, this essay is undergoing extensive revisions as feedback comes in. Where someone has made an important contribution, I'll point it out.
It's been exciting to see a lot of progress on the antispam front, much
of it focused on Paul Graham's
essay in which he describes a Bayesian technique for detecting spam.
A couple of problems with it, though:
1) While originally I thought that Paul's algorithm was in no way Bayesian, I have since noticed that if (and only if) a particular less-than-optimal assumption is made, part of it could be viewed as Bayesian. But it's a pretty abstract case for Bayesianness. In any case, there's no need to dwell on Bayesianness or non-Bayesianness; we have bigger fish to fry. (Note: Tim Peters of spambayes fame has posted another way of looking at Paul's approach as Bayesian, although to do so he needs to make the unrealistic assumption that spams and non-spams are equally likely.)
2) I believe that Paul's algorithm could perform better by changing it
to address a couple of problems.
2a) The real justification for Paul's technique comes from this anonymous article.
But if you read it, you'll see that the approach there assumes that the probabilities
are independent (which they are not when you're talking about words in emails).
Since they aren't, the output is not really a probability.
2b) The same anonymous article also makes assumptions that only have a clear relation to a completely different use of the calculations.
In this different use, there are individuals with different degrees of accuracy
in making yesorno predictions, and the assumption is that the probability
of the prediction being correct is the same whether the answer is actually
yes or no. It's hard to relate that to the way the calculation is being used
in Paul's essay, but even if we can, the assumption about the same probability
of correctness in either direction has no particular justification in this
context.
2c) Because of those problems, the number coming out of the calculation
Paul is using is not really a probability. This has some unpredictable effects.
For instance, the .9 cutoff Paul uses means different things depending on
how many words are in the email.
2d) Also, Paul's approach is subtly asymmetric in how it handles words that indicate "spamminess" compared to how it handles words indicating "non-spamminess": it is less sensitive in one case than the other. Some people feel that this is good for reducing false positives. But it is a hypothesis of this essay that there is evidence both in favor of, and against, a spam classification for any email, and that we get the best performance by handling both kinds of evidence equally well.
Because I am interested in the spam problem (for one thing, I get 5 or
10 spams per hour) and have a background in working on these kinds of problems,
I decided to see if I could improve the algorithm. Also, I wanted to have
a solution that could be tested within the code people have already written
to use Paul's technique, without a lot of additional effort. So I came up
with a very simple alternative approach.
It is untested as of now; it is based purely on theoretical reasoning. If anyone wants to try it and test it in comparison to other techniques, I'd be very interested in hearing the outcome.
Here's what you should do.
I am assuming you have used simple counting to create a set of numbers that, while they may not be actual reliable probabilities, are at least guesstimates of probabilities. They won't be statistically independent, but that's fine. This is what Paul does; just follow the model from his essay.
Let's call these numbers p1, p2,..., pn for a given email. (Globally,
we will refer to the probability guesstimate associated with a given word,
based on counts from the database, as p(w).) So for that email there are
n words and n associated probability guesstimates p(w), each indicating
the approximate likelihood that the containing email is a spam.
Furthermore, I am assuming that a value near 1 indicates spamminess and a value near 0 indicates non-spamminess.
So far we aren't doing anything different than Paul described.
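To make the starting point concrete, here is a sketch of the kind of per-word counting guesstimate Paul describes (the function name, the doubling of good counts, and the .01/.99 clamping values reflect my reading of his essay; consult it for the authoritative details):

```python
def graham_prob(bad_count, good_count, nbad, ngood):
    """One reading of Graham-style per-word spam guesstimate p(w).

    bad_count / good_count: occurrences of the word in spam / good emails.
    nbad / ngood: total number of spam / good emails in the corpus.
    Good counts are doubled to bias against false positives, and the
    result is clamped away from 0 and 1 (details hedged).
    """
    g = 2 * good_count
    b = bad_count
    if g + b < 1:
        return None  # no data at all; see "Further Improvement 1"
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    total = bad_ratio + good_ratio
    p = bad_ratio / total if total else 0.0
    return max(0.01, min(0.99, p))
```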
Now, calculate
P = 1 - ((1-p1)*(1-p2)*...*(1-pn))^(1/n)
[spamminess]
Q = 1 - (p1*p2*...*pn)^(1/n)
[non-spamminess]
S = (P - Q) / (P + Q)
[combined indicator]
["^(1/n)" should be read as "take the nth root"]
S is a number between -1 and 1. High numbers mean it's spam; low numbers mean it's not. 0 means there's equal evidence both ways.
If you prefer a number between 0 and 1, just use (1 + S) / 2. (That puts
it on the same scale as Paul's approach.)
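The combination step above is only a few lines of code. A minimal Python sketch (assuming each p_i is strictly between 0 and 1, as clamping in the counting step guarantees):

```python
import math

def spam_indicator(probs):
    """Combine per-word spam guesstimates p1..pn into S in [-1, 1].

    P = 1 - (prod(1 - p_i))^(1/n)   (spamminess)
    Q = 1 - (prod(p_i))^(1/n)       (non-spamminess)
    S = (P - Q) / (P + Q)           (combined indicator)
    """
    n = len(probs)
    # The nth roots of the products are geometric means; computing them
    # in log space avoids floating-point underflow on long emails.
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (P - Q) / (P + Q)
```

To get the 0-to-1 scale mentioned above, apply (1 + S) / 2 to the result.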
When you pick your cutoff point for S, it will have consistent meaning for larger or smaller emails, and it will be symmetric in its treatment of spam-indicating and non-spam-indicating words.
And it will treat them in a way that is consistent with a theorem [1] that proves a most-sensitive-possible technique for determining whether there is an underlying tendency for probabilities to go in a particular direction. The connection would take too much time to explain here, but it is a great justification for why this should work really well. (Note to statisticians: the product of the probabilities is monotonic with the Fisher inverse chi-square combined probability technique from meta-analysis. The null hypothesis is that the probabilities are independent and uniformly distributed. I believe that the former hypothesis is violated in ways that have a desirable impact on the output according to whether an email is spam or non-spam, and that the latter is violated in ways that aren't prejudicial to either outcome. We don't need to consider the degrees of freedom in doing a comparison because both the spam and non-spam cases have the same degrees of freedom. Note that this comparison does not actually rely on the assumption that we can invoke the Fisher technique; Fisher is merely tied to the possibility that this is the most sensitive possible way to do this test.)
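For readers curious about the Fisher technique mentioned in the note, here is a textbook implementation sketch (standard Fisher combining, not code from this essay): under the null, -2 Σ ln p_i is chi-square with 2n degrees of freedom, and for even degrees of freedom the chi-square survival function has a closed form, so no stats library is needed.

```python
import math

def fisher_combine(probs):
    """Fisher's method for combining independent p-values p1..pn.

    The chi-square statistic is -2 * sum(ln p_i) with 2n degrees of
    freedom; for even degrees of freedom its survival function is
    exp(-m) * sum_{k=0}^{n-1} m^k / k!, where m is half the statistic.
    """
    m = -sum(math.log(p) for p in probs)  # half the chi-square statistic
    term = math.exp(-m)
    total = term
    for k in range(1, len(probs)):
        term *= m / k
        total += term
    return min(total, 1.0)  # combined p-value; small means reject the null
```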
I think this technique is worth trying. It's simple and should work well.
To give it a fair shake though, you will have to take the time to try various
cutoff points for S. The most logical first thing to try is 0, unless you've
adapted the output to be between 0 and 1 for consistency with Paul's approach.
In that case you might start with .5.
Further Improvement 1
Note: this section has been very extensively revised based on feedback from Tim Peters, who pointed out a problem with my original approach when used in a world where people may classify all their spams but not bother to classify most of their good emails.
When there is no existing classification data for a word, Paul lets his probability be .4 (in the current version of his essay). He finds that that's a reasonable guesstimate when there's no data. But the same problem exists, to a lesser extent, in low-data situations such as 1 or 2 data points as well. In such cases it's true that we have some data, but we don't really have enough to be confident. Obviously we're forced to do something when n is 0 because there is no data; but in reality we should do something when n is any low number, 0 being just the most extreme case. Handling the case of 0 while ignoring other low-data cases doesn't really make sense.
Bayesian statistics gives us a way to do that... such that all the low-data situations (including the no-data situation) are treated consistently and smoothly.
The first step is to calculate the probability of a word using Paul's existing techniques. Call this p(w) for word w. Let n be the number of data points. (If n=0, Paul would substitute .4, but we wouldn't need to do that.)
Let x = the number used when n is 0 (I'll discuss how to compute it below). Let a be a constant that is tuned for performance, with a reasonable default value of 1.
Now calculate
f(w) = (a + (n * p(w))) / ((a / x) + n)
f(w) is a probability guesstimate that smoothly handles the case of n=0 and other low-n cases in a consistent and meaningful way.
This calculation can be justified by a Bayesian argument that I won't include
here in order to keep things simple. (Let me know if you want to see the
justification.)
x can be computed by trial and error as Paul did. However, another approach
would be to calculate p(w) for all words that have, say, 10 or more data
points, and then use the average.
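The smoothing formula is a one-liner in practice. A sketch (function and argument names are mine; the default x = .4 merely echoes Paul's no-data value, and both x and a should be tuned as described above):

```python
def smoothed_prob(p_w, n, x=0.4, a=1.0):
    """Shrink a raw per-word guesstimate p(w) toward the no-data value x.

    f(w) = (a + n * p(w)) / (a / x + n)

    With n = 0 this returns exactly x; as n grows, f(w) smoothly
    approaches the raw guesstimate p(w).
    """
    return (a + n * p_w) / (a / x + n)
```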
Full Bayesian Approach
Finally, it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above. See also "An evaluation of Naive Bayesian anti-spam filtering" by Androutsopoulos et al. And this comment from Tim about the relative merits of the naive Bayesian approach vs. the Graham approach may also be of interest.
If anyone does a performance comparison of the naive Bayesian approach vs.
the approach discussed here, I would really love to see it.
In Closing
Please email me, grobinson@transpose.com, with the results of any experiments,
or if you have any questions or suggestions.
If you do use some or all of the techniques here for spam detection,
all I ask is that you include a link to this page on your site.
[1] Littell, Ramon C., and J. Leroy Folks (1971). "Asymptotic Optimality of Fisher's Method of Combining Independent Tests." Journal of the American Statistical Association, 66(336), pp. 802-805.


© Copyright 2006 Gary Robinson.
Last update: 1/30/06; 2:48:12 PM.