Absinthe

Living my life as an exclamation, not an explanation...

It should be noted by readers that Absinthe is not a lawyer, and anything posted in this blog should not be used as a substitute for professional advice from a lawyer

Sunday, July 06, 2008

Perl scripts to parse baseball data from the internet, and a useful program

I've developed a Perl script that spiders internet sites and parses them to obtain baseball data such as scores, team ERA, RBI, batting average, etc., for past matchups from the present day back to 1999. It outputs the data to a file that you can read in with other programs (see below). The script also parses moneyline odds information. The script can be found here. It can be freely shared as long as the introductory lines containing the author name (i.e., my name), creation date, and copyright information are included. The script runs through roughly one year of data in a 24-hour period. Do not reduce the spider sleep time below what is set in the script unless you want to have your IP address barred from spidering the sites this script spiders (you have been forewarned).

A faster way to get the past game data is to go to http://www.retrosheet.org and download their game logs for past years. Then, if you still need the moneyline odds information for each game, you can use this script to spider that information and merge it with the Retrosheet data. This odds-parsing Perl script runs much faster than the Perl script that parses the game data.
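A minimal sketch of that merge step, in Python rather than the Perl or C++ used by the actual tools. The record layout is an assumption made for illustration: each game and each odds record is a dict, and the field names (`date`, `game_num`, `home`, `away`, `home_moneyline`, `away_moneyline`) are hypothetical, not the formats the scripts or Retrosheet really use. The `game_num` field is there so that the two games of a doubleheader do not collide on the same key.

```python
def merge_odds(games, odds):
    """Attach moneyline odds to game records.

    Both inputs are lists of dicts with hypothetical field names.
    Games with no matching odds record get None for both odds fields.
    """
    def key(rec):
        # Identify a game by date, doubleheader game number, and the two teams.
        return (rec["date"], rec["game_num"], rec["home"], rec["away"])

    odds_by_key = {key(o): o for o in odds}

    merged = []
    for g in games:
        o = odds_by_key.get(key(g))
        row = dict(g)  # copy so the input records are left untouched
        row["home_moneyline"] = o["home_moneyline"] if o else None
        row["away_moneyline"] = o["away_moneyline"] if o else None
        merged.append(row)
    return merged
```

Games without an odds match are kept rather than dropped, so it is easy to see afterwards how many games the odds data failed to cover.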

Don't write me to ask how to run Perl scripts if you don't know how.

I have a C++ program that reads in the files output by the first Perl script, or reads in Retrosheet game logs and interleaves them with the output from the Perl script that parses only the odds data. The main program can be found here, and the files containing the various utility methods can be found here and here. This program can be freely shared as long as the file containing the copyright information and name of the author (i.e., me again) is included. The main program has a logical switch that lets you choose between reading Retrosheet data and reading the data from the Perl scripts linked above.

Poke around in the various methods in the utilities file to look at the ways the data can be manipulated. Right now, averages of various statistics are calculated for each team over the past N games and then output to an external file that I read into R for further manipulation.
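The past-N-games averaging can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the C++ utility code: games are assumed to arrive in chronological order as dicts with hypothetical fields `date`, `home`, `away`, `home_score`, and `away_score`, and only runs scored is averaged here, standing in for whichever statistics the real program tracks.

```python
from collections import defaultdict, deque


def trailing_averages(games, n):
    """For each game, report each team's mean runs scored over its previous
    n games (fewer if the team has played fewer; None if it has played none).

    `games` must be in chronological order. Field names are hypothetical.
    """
    # One fixed-length window of recent scores per team.
    history = defaultdict(lambda: deque(maxlen=n))
    rows = []
    for g in games:
        row = {"date": g["date"]}
        for side in ("home", "away"):
            past = history[g[side]]
            row[side] = g[side]
            row[side + "_avg_runs"] = sum(past) / len(past) if past else None
        rows.append(row)
        # Update the windows only after the row is built, so the current
        # game's score never leaks into its own averages.
        history[g["home"]].append(g["home_score"])
        history[g["away"]].append(g["away_score"])
    return rows
```

Each row could then be written out as one line of a delimited text file for reading into R.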

Don't write me asking how to compile and run C++ programs if you don't know how.

Don't write me asking for documentation for either the Perl scripts or the C++ program; you're reading it right now.

Do write me if you find bugs.

None of the files I've just shared with you are guaranteed to be bug free.

8:16:07 PM

Rating systems for baseball teams

I've been dabbling in baseball statistics to amuse myself over the summer. As I mentioned in this past post, I find quantitative statistical analysis of baseball engrossing.  Win/loss discrimination is one of the hardest, most interesting statistical problems I've ever come across (and I've come across some doozies in particle physics, so it isn't like I am a neophyte to the world of data mining and advanced statistical data analysis methods).

I've been meaning to post a few things here about baseball data analysis, mostly to document them so that I can come back and find them again in one place. Also, there are likely other people out there who might find this collected information useful.

The first bit of useful info is an undergraduate thesis by Kenneth Massey that discusses systems for ranking the relative performance of a group of teams (any kind of team, not just baseball). You can find the thesis here.

Some of what is in Massey's paper is crap. Keep the crap-factor in mind when reading it. What is sound is his description of maximum likelihood based rating systems. (I'm going to assume that anyone interested enough to have read down to here knows enough about statistics to know what "maximum likelihood" and "Bayesian priors" are. If you are interested in rating systems for sports teams but don't know what "maximum likelihood" means, read a statistics text or go to Wikipedia.)

Massey's discourse on ML rating systems got me busy designing some likelihood rating systems of my own. For instance, the scores in baseball are Weibull distributed to a good approximation (see http://www.math.brown.edu/~sjmiller/math/papers/PythagWonLoss_Paper.pdf). Thus, to rank teams, I can construct a likelihood function for the month of games preceding a particular matchup. Each fit lets the mean of the Weibull distribution for each team float, along with a global additive "home team advantage" factor. I also let the shape parameter of the Weibull distribution, gamma, float in the fit.

Then, for a particular matchup between teams A and B, I calculate the rating as the normal probability of Z=(mean_A-mean_B)/sqrt(var_A+var_B), where mean_A and var_A are the mean and variance of the Weibull score distribution for team A (and likewise for team B). If A and B were exactly matched teams and played many games, the ratings would be uniformly distributed between 0 and 1. If A were the somewhat superior team, we would expect some pileup near a rating of 1 in the rating distribution for A. I call this the Absinthe Weibull Rating System.
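The rating formula can be sketched in a few lines of Python. This is an illustration under several assumptions, not the fitting code itself: it takes already-fitted parameters as input, uses a two-parameter Weibull (location fixed at zero) described by a per-team scale and a shared shape gamma, reads "normal probability of Z" as the standard normal cumulative probability, and applies the home advantage as an optional additive shift to team A's mean. The function and argument names are hypothetical.

```python
import math


def weibull_mean_var(scale, shape):
    """Mean and variance of a two-parameter Weibull(scale, shape)."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return scale * g1, scale ** 2 * (g2 - g1 ** 2)


def normal_cdf(z):
    """Standard normal cumulative probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def weibull_rating(scale_a, scale_b, shape, home_adv=0.0):
    """Rating of team A against team B, between 0 and 1.

    home_adv is an assumed additive shift to A's mean score (A at home);
    pass 0.0 to ignore it.
    """
    mean_a, var_a = weibull_mean_var(scale_a, shape)
    mean_b, var_b = weibull_mean_var(scale_b, shape)
    z = (mean_a + home_adv - mean_b) / math.sqrt(var_a + var_b)
    return normal_cdf(z)
```

With identical parameters for both teams and no home advantage, Z is zero and the rating is exactly 0.5; a larger scale (and hence mean) for A pushes the rating toward 1.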

In Massey's paper he also describes a binomial likelihood ranking system. What I am working on now is a second set of ratings based on the binomial likelihood, but using a Bayesian prior based on the Weibull ratings from the previous games for the two teams in the matchup along with the win/loss information for those past games. I have to think some more about how exactly to optimize the prior.
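To make the idea of a binomial likelihood with a prior concrete, here is a rough Python sketch of one common form. It is not necessarily the form in Massey's thesis, and it is not the prior still being worked out above: the win probability is assumed to be a logistic function of the rating difference (a Bradley-Terry-style model), and the prior is assumed to be an independent Gaussian on each team's rating, centred on some earlier estimate. All names are hypothetical.

```python
import math


def neg_log_posterior(ratings, games, prior_mean, prior_sd):
    """Negative log posterior (up to a constant) for a set of team ratings.

    ratings    : dict mapping team -> rating being evaluated
    games      : list of (winner, loser) pairs
    prior_mean : dict mapping team -> prior centre for that team's rating
    prior_sd   : common standard deviation of the assumed Gaussian prior
    """
    total = 0.0
    # Likelihood part: P(winner beats loser) = 1 / (1 + exp(-d)),
    # so each game contributes -log P = log(1 + exp(-d)).
    for winner, loser in games:
        d = ratings[winner] - ratings[loser]
        total += math.log1p(math.exp(-d))
    # Prior part: quadratic penalty pulling each rating toward its centre.
    for team, r in ratings.items():
        total += 0.5 * ((r - prior_mean[team]) / prior_sd) ** 2
    return total
```

A minimizer would vary the values in `ratings` to make this quantity as small as possible. Every game term couples two teams' ratings, which is the kind of inter-relationship that makes such a fit more expensive than fitting each team's score distribution.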

I am doing these fits using a particle physics numerical minimization package called Minuit (which is the best minimization program out there, bar none...I've tried quite a few), but the fits are still really slow. I am trying to figure out a way to reduce the computation time for the binomial likelihood, since the inter-relationship between the ratings (which isn't an issue in the Weibull fit) really slows things down. At some point soon I'll post the Fortran programs that perform the necessary maximum likelihood fits to rank the teams.

5:26:22 PM