Sunday, October 15, 2006


Search Doesn't Work: Story 2: NLP means many things. To me it means Natural Language Processing. To others it means neurolinguistic programming. When I search for the bare term 'nlp' in Google, I just get results with the second sense - same for other search engines. If I search for 'William Cohen', the first result on Google is for my friend Prof. William Cohen and the second for the other chap. [...] So why don't I get this for NLP? Why no mixture of results? [...] Word sense disambiguation is a core requirement for a search engine. The problem - the same text having more than one meaning - can certainly be reduced by the user. However, it seems that there is a great amount of scope that could be explored on the interface side. Google is definitely aware of the problem, which is why results for ambiguous names produce multiple sense results pages, but they (and the other major engines) are way behind systems like Vivisimo's Clusty which produces appropriate results for the NLP problem. (Via Data Mining.)

The problem is that word-sense disambiguation is hard. The Clusty results for "nlp" are a tangle: one "natural language" cluster sits in the middle of a bunch of "neuro-linguistic" clusters, and it's not easy to tease them apart. Overall, Clusty's interface is way too busy, and likely to confuse on all but the most easily disambiguated queries. For example, for my favorite query, "transducer", none of the clusters on the first screen is for transducer in the sense of automata theory, even though the second search result is a Wikipedia page for that sense of transducer, while the first search result is the Wikipedia page for the electrical engineering sense.

One might expect a sense-aware search engine to exploit Wikipedia to recognize alternative senses. Clusty doesn't seem to. I don't know how Clusty works in detail, but the general difficulty is this: alternative senses seem obvious in retrospect, yet recognizing them from scratch is hugely difficult, because we don't know which information sources and similarity measures will work in general, rather than in hindsight for a particular case.
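
To make the Wikipedia idea concrete, here is a minimal sketch of what grouping results against a known sense inventory could look like. Everything in it is an assumption for illustration: the SENSES table stands in for the entries of a disambiguation page, the glosses are invented, and the bag-of-words overlap score is just the simplest similarity measure I could write down. It is not how Clusty or any other engine works.

```python
import re

# Hypothetical sense inventory, standing in for the entries of a
# disambiguation page. The glosses are invented for illustration.
SENSES = {
    "natural language processing":
        "computational linguistics text parsing language models computer",
    "neuro-linguistic programming":
        "psychotherapy communication personal development hypnosis therapy",
}


def tokens(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def assign_sense(snippet, senses=SENSES):
    """Return the sense sharing the most words with the snippet.

    Returns None when no sense shares any word with the snippet.
    Ties go to whichever sense is listed first.
    """
    words = tokens(snippet)
    scores = {
        name: len(words & tokens(name + " " + gloss))
        for name, gloss in senses.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def group_by_sense(snippets, senses=SENSES):
    """Bucket result snippets under the sense each one matches best."""
    groups = {}
    for snippet in snippets:
        groups.setdefault(assign_sense(snippet, senses), []).append(snippet)
    return groups
```

The sketch only works because the sense inventory and the overlap measure were picked with the "nlp" case already in mind, which is exactly the hindsight problem: nothing here says which inventory or which similarity measure would hold up for "transducer" or for an arbitrary query.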

Proper name disambiguation is much easier than general disambiguation. It is defensible for a search engine to focus on a limited set of classes that can be disambiguated reliably instead of trying to do the whole job, badly.


9:55:29 PM