I've been dabbling in baseball statistics to amuse myself over the summer. As I mentioned in this past post, I find quantitative statistical analysis of baseball engrossing. Win/loss discrimination is one of the hardest, most interesting statistical problems I've ever come across (and I've come across some doozies in particle physics, so it isn't like I am a neophyte to the world of data mining and advanced statistical data analysis methods).
I've been meaning to post a few things here about baseball data analysis, mostly to collect them in one easy-to-find place where I can come back to them. Also, there are likely other people out there who might find this collected information useful.
The first bit of useful info is an undergraduate thesis by Kenneth Massey that discusses systems for ranking the relative performance of a group of teams (any kind of team, not just baseball). You can find the thesis here.
Some of what is in Massey's paper is crap. Keep the crap-factor in mind when reading it. What is sound is his description of maximum likelihood based rating systems. (I'm going to assume that anyone interested enough to have read down to here knows enough about statistics to know what "maximum likelihood" and "Bayesian priors" are. If you are interested in rating systems for sports teams but don't know what "maximum likelihood" means, read a statistics text or go to Wikipedia.)
Massey's discourse on ML rating systems got me busy designing some likelihood rating systems of my own. For instance, the scores in baseball are Weibull distributed to a good approximation (see http://www.math.brown.edu/~sjmiller/math/papers/PythagWonLoss_Paper.pdf). Thus, to rank teams for a particular matchup, I can construct a likelihood function for the month of games preceding that matchup. Each fit lets the mean of the Weibull score distribution for each team float, along with a global additive "home team advantage" factor. I also let the shape parameter of the Weibull distribution, gamma, float in the fit.
Then, for a particular matchup between teams A and B, I calculate the rating as the cumulative standard normal probability of Z=(mean_A-mean_B)/sqrt(var_A+var_B), where mean_A and var_A are the mean and variance of the Weibull score distribution for team A (and so on for team B). If A and B were exactly matched teams and played many games, the ratings would be uniformly distributed between 0 and 1. If A were the somewhat superior team, we would expect some pileup near a rating of 1 in the rating distribution for A. I call this the Absinthe Weibull Rating System.
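To make the Weibull likelihood and the rating formula concrete, here is a minimal Python sketch. It is not the Fortran code used for the actual fits, and several details are assumptions of the sketch: the function names are invented for illustration, each score is shifted by 0.5 before the Weibull density is evaluated (so that a shutout has nonzero density), and the home team advantage is simply added to the home team's mean score. The function negative_log_likelihood is what a minimizer would be handed; matchup_rating takes the fitted means and shape and returns the rating.

```python
import math

SHIFT = 0.5  # assumption of this sketch: (score + 0.5) is Weibull distributed


def weibull_scale(mean_score, shape):
    """Scale parameter that gives the requested mean score for this shape."""
    return (mean_score + SHIFT) / math.gamma(1.0 + 1.0 / shape)


def weibull_variance(mean_score, shape):
    """Variance of the score distribution (the shift does not change it)."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    scale = weibull_scale(mean_score, shape)
    return scale * scale * (g2 - g1 * g1)


def weibull_log_pdf(score, mean_score, shape):
    """Log of the Weibull density evaluated at the shifted score."""
    scale = weibull_scale(mean_score, shape)
    x = (score + SHIFT) / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(x) - x ** shape


def negative_log_likelihood(games, means, shape, home_advantage):
    """Sum over games of -log density for both teams' scores.

    games: iterable of (home_team, away_team, home_score, away_score)
    means: dict mapping team name to its mean score (the floated parameters)
    """
    total = 0.0
    for home, away, home_score, away_score in games:
        total -= weibull_log_pdf(home_score, means[home] + home_advantage, shape)
        total -= weibull_log_pdf(away_score, means[away], shape)
    return total


def normal_cdf(z):
    """Cumulative standard normal probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def matchup_rating(mean_a, mean_b, shape):
    """Rating for team A against team B: normal probability of Z."""
    var_a = weibull_variance(mean_a, shape)
    var_b = weibull_variance(mean_b, shape)
    z = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    return normal_cdf(z)
```

With equal means the rating is exactly 0.5, and swapping the two teams gives one minus the rating, which is a handy sanity check on any implementation.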
Massey's paper also describes a binomial likelihood ranking system. What I am working on now is a second set of ratings based on the binomial likelihood, but using a Bayesian prior built from the Weibull ratings for the previous games of the two teams in the matchup, along with the win/loss information for those past games. I have to think some more about how exactly to optimize the prior.
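As a rough illustration of what a binomial likelihood with a prior looks like, here is a small Python sketch. Everything specific in it is an assumption: the win probability uses the Bradley-Terry form rating_A/(rating_A+rating_B), which may not match the exact form in Massey's paper, and the prior is a hypothetical Gaussian penalty on the log of each rating, standing in for a prior that has not actually been worked out yet. The names win_probability, negative_log_posterior, prior_ratings and prior_width are invented for the example.

```python
import math


def win_probability(rating_a, rating_b):
    """P(A beats B) for positive ratings (Bradley-Terry form, assumed here)."""
    return rating_a / (rating_a + rating_b)


def negative_log_posterior(results, ratings, prior_ratings=None, prior_width=1.0):
    """Binomial negative log-likelihood plus an optional prior penalty.

    results: iterable of (winner, loser) team names, one entry per game
    ratings: dict mapping team name to a positive rating (the fit parameters)
    prior_ratings: optional dict of prior central values for the ratings
    prior_width: width of the hypothetical Gaussian prior on log(rating)
    """
    total = 0.0
    for winner, loser in results:
        total -= math.log(win_probability(ratings[winner], ratings[loser]))
    if prior_ratings is not None:
        for team, prior in prior_ratings.items():
            d = (math.log(ratings[team]) - math.log(prior)) / prior_width
            total += 0.5 * d * d
    return total
```

Note that every game term couples two ratings, and without the prior the likelihood is unchanged if all ratings are multiplied by the same constant, so in practice one rating has to be fixed or a prior has to pin down the overall scale.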
I am doing these fits using a particle physics numerical minimization package called Minuit (which is the best minimization program out there, bar none...I've tried quite a few), but the fits are still really slow. I am trying to figure out a way to cut the computation time for the binomial likelihood, since the inter-relationship between the ratings (which isn't an issue in the Weibull fit) really slows things down. At some point soon I'll post the Fortran programs that perform the necessary maximum likelihood fits to rank the teams.