 Friday, August 15, 2003

We need to be in New York City (my wife has a conference to go to there). We're in Bangor, Maine now. We were going to fly, but may drive to avoid potential problems at airports due to the blackouts. I made a deal with my wife whereby she'll do a lot of the driving while I try to get some work done on my laptop in the passenger seat. We have a nanny for our kids, and she'll be coming too and will keep the kids as entertained as possible during the drive. Somehow this is not a plan I have confidence will be executed smoothly. Maybe the idea of a two-year-old and a five-year-old sitting in a car for a nine-hour drive has something to do with it. (Just kidding. The fact, as most parents know, is that it will be insane!)

I've been working on my chi-square-based statistical approach for text classification lately, which is one reason I haven't been posting as frequently. Up to now, I have done none of the comparative testing on the approach. I've just made suggestions for using the calculations for spam filtering, mostly to Greg Louis of Bogofilter and Tim Peters of Spambayes, and they've been kind enough to try them out and tune them (and Tim made a critical suggestion about the advantage of using the middle range of values for knowing when the classifier is unsure).
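
For readers who want a concrete picture of chi-square combining with an "unsure" middle range, here is a minimal sketch in Python. It is an illustration only, not the code from any of the projects mentioned: it assumes each word already has a spam probability strictly between 0 and 1, combines them with Fisher's inverse chi-square method, and the function names and the 0.2 and 0.9 cutoffs are hypothetical choices made for the example.

```python
import math

def chi2_survival(chi2, dof):
    """P(X >= chi2) for a chi-square variable; dof must be even."""
    assert dof % 2 == 0
    m = chi2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)

def combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    # Clamp away from 0 and 1 so the logarithms stay finite.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]
    n = len(probs)
    if n == 0:
        return 0.5
    # Fisher's method: -2 * sum(ln p) is chi-square with 2n degrees of freedom.
    spam_evidence = 1.0 - chi2_survival(
        -2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2_survival(
        -2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Near 1 means spam, near 0 means ham, near 0.5 means conflicting
    # or weak evidence.
    return (1.0 + spam_evidence - ham_evidence) / 2.0

def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a score to a label; the cutoffs are illustrative assumptions."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

A message whose words point strongly both ways lands in the middle: `classify(combine([0.99] * 5 + [0.01] * 5))` gives "unsure", while ten words at 0.99 give "spam".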

But Greg sent me a substantial spam/ham database a while back, and I coded the algorithm, so for the first time I have a test bed. (It's written in Python.) I've been able to try out some ideas I've been toying with in my mind for months. In this testing, it appears that they do indeed improve the algorithm. When I get a chance, I'll post the specifics.
