Fresh Tracks

Sunday, December 28, 2003

I was enjoying the beginning of Crowded with Genius, but I kept getting confused by some references to Edinburgh's complex topography, especially in the discussion of the battle of Prestonpans. I partly attributed this to my fuzzy memory of the city, but the matter was puzzling enough that I dug out our 1970s maps of the city and environs. I concluded that in at least two places the author writes "west" when in reality it must be "east". In particular, the site of the battle is east of where the Jacobite troops were quartered, not west. So what, one might wonder. Mistakes are made. But it does make the reader worry about the accuracy of other descriptions that are less easy to check.

I've been noticing more and more mistakes in otherwise interesting books, and also in the New York Times. They range from the trivial "principle" for "principal", to describing an individual as if for the first time when the individual was introduced on the preceding page, to the opposite (common now in the Times) of referring to an individual as if they had been previously introduced when they had not, to inaccuracies like the one above in direction (or number, date, name, ...). Unintended consequences of how easy it is now to do partial revision and cut and paste without ever reviewing the whole text?
12:54:39 PM

At first sniff, the actual scents discussed in Hanna's posting could not be more concrete. But Mark reminds us that knowledge of smells, and of other kinds of things, may be more mediated by language than we admit. Alright, so Mark doesn't say that, but I think it is implicit. I may know "mildew" smell first hand. Or maybe not; I might only know the Portuguese "mofo," from my mother and grandmother commenting on the results of winter wetness, incomplete drying of clothes, or ironing that was not hot enough to evaporate the water in a fabric. And I learned, by means I don't recall, that "mildew" means the same. Or maybe just something similar. Who knows? Maybe the fungi that create those smells are different in Portugal and in... In where? England, Scotland, Boston? And how do I know the taste of a madeleine dunked in tea? The Portuguese counterpart of the madeleine is likely to be somewhat different from whatever Proust had in mind; as for the blends, concentrations, and serving temperatures of tea, the possibilities are mind-boggling.

Surprising as it might seem to outsiders, this question is central to modern computational linguistics. One side will argue that without perceptual grounding, anything we glean from texts is a poor, fake proxy. The other feels that the grounding of much of the language we use, especially that pertaining to social and technical topics, is other language. While the discussion is fun for conference coffee breaks and especially the bar after the meeting, it might be more profitable to explore it experimentally as Mark suggests:
Anyhow, Demeter's list of currently available fragrances suggests a problem in computational linguistics: devise an automatic algorithm that analyzes a very large text corpus to derive a comparable list of "names of things with evocative smells". (In fact one should be able to do better, since Demeter's list is not really very long, systematically omits highly offensive smells like cat piss and rotten eggs, and includes some odorless oddities like holy water ... ) This problem in itself is not important, but it's an instance of an interesting class. It would be nice, for instance, to be able to process biomedical text so as to derive a list of names of structural proteins, or diseases of domestic pets, or insects implicated as disease vectors, or whatever. [Language Log]
Current information-extraction techniques, based on labeling a bunch of documents and learning pattern matchers from the examples, take less advantage of co-occurrence statistics than we'd like. Some research [Riloff; Cucerzan and Yarowsky; Collins and Singer; Cohen and Fan; ...] suggests that one can do much better using lots of unlabeled data, but at present those techniques are a black art: sometimes they work, sometimes they don't, and it's not yet clear why. I think that part of the problem is that existing techniques focus on just one kind of entity and very superficial features, while the way we learn that CPEB may denote a kind of protein involves seeing the term used in relation to several other terms, themselves belonging to rich terminological networks of which we have some knowledge.
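To make the flavor of the unlabeled-data idea concrete, here is a toy sketch of seed-based bootstrapping over co-occurrence patterns. It is not the method of any of the papers cited above; the function name `bootstrap`, its parameters, and the six-sentence corpus are all made up for illustration, and the only "feature" is the pair of words immediately to the left of a term, which is exactly the kind of superficial feature I complain about.

```python
import re
from collections import Counter

def bootstrap(sentences, seeds, rounds=2, top_patterns=2, top_terms=2):
    """Grow a set of terms from seeds (hypothetical helper, toy scale only).

    Each round: (1) count the two-word left contexts in which known terms
    occur, (2) keep the most frequent contexts as patterns, (3) add the
    unknown words that most often follow those patterns.
    """
    known = set(seeds)
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    for _ in range(rounds):
        # Step 1: left contexts of the terms we already trust.
        pattern_counts = Counter()
        for toks in tokenized:
            for i in range(2, len(toks)):
                if toks[i] in known:
                    pattern_counts[(toks[i - 2], toks[i - 1])] += 1
        # Step 2: keep only the most frequent contexts.
        patterns = {p for p, _ in pattern_counts.most_common(top_patterns)}
        # Step 3: unknown words that follow a kept context are candidates.
        candidates = Counter()
        for toks in tokenized:
            for i in range(2, len(toks)):
                if (toks[i - 2], toks[i - 1]) in patterns and toks[i] not in known:
                    candidates[toks[i]] += 1
        known.update(t for t, _ in candidates.most_common(top_terms))
    return known

# Invented toy corpus; a real experiment would need a very large one.
sentences = [
    "The smell of mildew filled the room",
    "I love the smell of lavender",
    "A scent of lavender drifted in",
    "A scent of cedar drifted in",
    "The price of cedar went up",
    "The price of oil went up",
]

found = bootstrap(sentences, {"mildew"})
print(sorted(found))  # ['cedar', 'lavender', 'mildew']
```

Starting from "mildew" alone, the first round picks up "lavender" through "smell of", and the second picks up "cedar" through "scent of". Run it for a third round with `top_patterns=3` and "price of" gets admitted as a pattern, so "oil" joins the smelly things: a small-scale version of the sometimes-it-works, sometimes-it-doesn't behavior.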
12:14:38 PM