There's an interesting thread on Slashdot this morning about Google's machine translation efforts, based upon some comments at a Google open house last week.
My favorite comment: "Anyone care to make a bet that Microsoft will announce a new revolutionary language translation service sometime in the next two weeks or so?"
Um... while we don't have a translation web service, we do have what we think is some of the best MT technology out there, and that's already public information. We've had a group working on MT here in Redmond for several years, based upon some technology that came out of our Natural Language research group. We also have a second team in our Beijing lab, working on translation between Asian and Western languages. The MSR MT research teams have published tons of papers on their work, which I would assume Google's MT folks have all read (shame they don't publish papers on their "research" to give back to the community). Some of those papers describe a successful tech transfer project that used the MT system to translate articles from Microsoft's Product Support knowledgebase into other languages.
Which raises an interesting issue, known well to the MT community and hinted at in the Slashdot thread: MT quality is directly tied to the quality of the training corpus, and is very domain-dependent. Google apparently is using United Nations transcripts and documents, which means that they will create a system that is potentially very good at translating United Nations speeches and documents. Since reporters, corporate marketing writers, and bloggers rarely write in that style, the system is going to have real issues with general web site translation.
You can take that limitation and embrace it, however, as we did with the Product Support Knowledgebase project: we trained the system on knowledgebase articles that had been hand-translated, and then used it to translate more. It worked very well.
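To make the domain point a bit more concrete, here's a toy Python sketch. It is not how our system (or Google's) works; it just measures how many words in a test sentence never appeared in the training text, which is one rough reason a system trained on formal documents stumbles elsewhere. The function names and the sample sentences are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(training_texts, test_text):
    """Return the fraction of test tokens never seen in the training texts."""
    vocab = {tok for text in training_texts for tok in tokenize(text)}
    tokens = tokenize(test_text)
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in vocab)
    return unseen / len(tokens)

# Made-up toy sentences standing in for a formal training corpus.
formal_training = [
    "The delegation welcomes the resolution adopted by the assembly.",
    "The committee urges member states to support the resolution.",
]

# Hypothetical test sentences: one in the same style, one from tech support.
in_domain = "The assembly urges the delegation to support the resolution."
out_of_domain = "Reboot the printer and reinstall the driver."

print("in-domain unseen-word rate:", oov_rate(formal_training, in_domain))
print("out-of-domain unseen-word rate:", oov_rate(formal_training, out_of_domain))
```

The in-domain sentence uses only words the "training corpus" has already seen, while most of the tech support sentence is new to it. Real MT systems depend on far more than vocabulary overlap, but the mismatch shows up in the same way.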
It's unfortunate that the thing most likely to hold up faster progress in machine translation is the scarcity of corpora of translated materials. They're hard to come by, and expensive to create from scratch.
1:28:38 PM