deeje.com

Tuesday, February 11, 2003

The Code Journal

Here I'll post items concerning writing software

6:07:45 PM

Bayesian and Latent Semantic analyses demystified

Everyone is talking about Bayesian filters being applied to spam, but I continue to believe that they can be used in interesting ways around weblogs and the blogosphere!
Very good, cogent explanation of Bayesian and Latent Semantic analysis techniques, which are means whereby a computer is asked to "understand" a document so that it can be automatically classified. Both techniques are being widely hailed as the great code hope of spam-filtering.
Latent semantic analysis (or indexing) is an application of what's called principal components analysis (PCA), or factor analysis, to the domain of information organization. In the basic version, you form a big 2-D matrix with documents (e-mails for instance) along one axis and terms (words, phrases) along the other, and fill in the entries with a 0 when the term doesn't occur in the document, and with a 1 (or count) when it does. Then you take the resulting monstrous matrix and grind it up with an algorithm that finds covariance patterns. That is to say, the associations of words "latent" in the document base you feed in are going to be found. Shovel in several weeks' worth of news stories and it's going to be obvious that 'Saddam' and 'Iraq' are highly correlated, or 'Tiger' and 'golf'. The method actually kicks out a transformation matrix into which you can feed the terms observed in a particular document, and get out a score for that document in terms of "warness" or "golfness" - those are principal components, or factors. You compute and save as many factors as you want - presumably fewer than the number of original terms. (Apologies to any wandering mathematicians for the gross simplifications.)
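To make the matrix-grinding a little more concrete, here is a minimal Python sketch of the basic version described above. It is not taken from the quoted explanation: the four-document corpus, the variable names, and the choice of plain power iteration to find the first principal component are all assumptions made for illustration. Real latent semantic analysis systems typically use a truncated singular value decomposition from a numerical library rather than anything this naive.

```python
# Toy corpus; the documents are made up for illustration.
docs = [
    "saddam iraq war",
    "iraq saddam troops",
    "tiger golf masters",
    "golf tiger putt",
]

# 1. Build the document-by-term matrix: one row per document,
#    one column per term, entries are term counts.
vocab = sorted({word for doc in docs for word in doc.split()})
counts = [[doc.split().count(term) for term in vocab] for doc in docs]

# 2. Center each term column on its mean so covariances can be computed.
n_docs, n_terms = len(docs), len(vocab)
means = [sum(row[j] for row in counts) / n_docs for j in range(n_terms)]
centered = [[row[j] - means[j] for j in range(n_terms)] for row in counts]

# 3. Term-by-term covariance matrix: which words rise and fall together.
cov = [
    [sum(row[i] * row[j] for row in centered) / (n_docs - 1)
     for j in range(n_terms)]
    for i in range(n_terms)
]


def leading_component(matrix, steps=200):
    """Approximate the top eigenvector of a symmetric matrix by power iteration."""
    # Arbitrary non-zero starting vector (an assumption of this sketch).
    vec = [float(i + 1) for i in range(len(matrix))]
    for _ in range(steps):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec


# 4. The first principal component: one loading (weight) per term.
loadings = leading_component(cov)

# 5. Score each document by projecting its centered term counts
#    onto that component.
scores = [sum(c * w for c, w in zip(row, loadings)) for row in centered]

for term, weight in zip(vocab, loadings):
    print(f"{term:8s} {weight:+.3f}")
for doc, score in zip(docs, scores):
    print(f"{score:+.3f}  {doc}")
```

On this toy corpus the first component separates the two topics: 'saddam' and 'iraq' get loadings of one sign, 'tiger' and 'golf' the other, and the document scores split the same way. Which side comes out positive is arbitrary, so only the relative signs mean anything.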

(via JOHO the Blog) [Boing Boing]

10:38:49 AM