Elizabeth Elwyn Davis and Simon D. Levy
Computer Science Department, Washington and Lee University
This undergraduate senior honors thesis focuses on the problem that machine translators face in choosing the correct translation of a polysemous English noun when the target language uses different words for the noun’s several meanings. For instance, general-purpose internet translators such as BabelFish cannot distinguish between the meanings of the English noun bat, generating the French word for the baseball implement rather than that for the flying mammal, even when provided with such contextual hints as “wings” and “cave.”
Statistical machine translation offers a solution for this issue. By constructing probabilistic models or otherwise associating words with each other, statistical methods can simulate a contextual understanding for the translator. Latent Semantic Analysis (LSA; [Foltz & Laham 1998]) is one of the tools that fall into this category.
LSA is a well-developed technique and theory for relating words and meanings by analyzing text corpora. A multidimensional vector represents each word and each sentence or other contextual block, and similarity or disparity of meaning can then be calculated from the angles between these vectors. LSA applies singular value decomposition [Golub & Van Loan 1996] to matrices constructed from the words and passages in the learning texts in order to reduce the number of dimensions in these vectors. Such dimensional reduction has been shown to simulate human contextual associations significantly better than simple proximity frequencies do.
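The steps just described (word-by-passage matrix, singular value decomposition, reduced dimensions, comparison by angle) can be illustrated with a minimal sketch. It is not the thesis implementation: the four training passages, the choice of k = 2 dimensions, the helper names `fold_in`, `cosine`, and `choose_sense`, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np

# Toy training passages, invented for illustration (not the thesis data).
# Each passage is labelled with the French translation of "bat" it supports.
PASSAGES = [
    ("chauve-souris", "bat cave wings night"),
    ("chauve-souris", "bat cave wings fly"),
    ("batte", "bat umpire ball pitch"),
    ("batte", "bat umpire ball glove"),
]

vocab = sorted({w for _, text in PASSAGES for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-passage count matrix: rows are words, columns are passages.
A = np.zeros((len(vocab), len(PASSAGES)))
for j, (_, text) in enumerate(PASSAGES):
    for w in text.split():
        A[index[w], j] += 1

# Singular value decomposition, truncated to k dimensions (k = 2 is assumed).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
passage_vecs = Vt[:k, :].T  # one k-dimensional row per training passage


def fold_in(text):
    """Project a new passage into the reduced space (q^T U_k S_k^-1)."""
    q = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in index:  # words never seen in training are ignored
            q[index[w]] += 1
    return (q @ U_k) / s_k


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def choose_sense(context):
    """Return the sense of the training passage closest in angle, or None."""
    q = fold_in(context)
    if not np.any(q):
        return None  # no known context words, so no evidence either way
    sims = [cosine(q, p) for p in passage_vecs]
    return PASSAGES[int(np.argmax(sims))][0]
```

Under these assumptions, a context mentioning "cave" or "wings" lands nearer the animal passages, and one mentioning "umpire" lands nearer the baseball passages. A context containing only "bat" is equally close to both groups, which is why the surrounding words matter.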
Although one of LSA’s greatest potential uses is in dealing with polysemy and synonymy, most studies to date have focused upon its applications in searching, including cross-language document retrieval. In the “Intelligent Essay Assessor” (http://www.knowledge-technologies.com/), a commercial product which evaluates students’ knowledge and writing skill, LSA has also been applied as a vocabulary analyzer. Although some research has been done on LSA’s use in translating eastern languages [Kim, Chang, & Zhang 2002], the relative scarcity of papers directly concerned with LSA-implemented machine translation suggests that this is still an open field.
This thesis examines LSA’s use in lexical disambiguation, under the hypothesis that a standard automated translation method, such as that used by SYSTRAN’s BabelFish and Google translator, when integrated with an LSA model should produce highly accurate word choice for polysemous nouns through analysis of context. Testing of the LSA model on a small, controlled sample space has produced results that affirm LSA’s ability to differentiate between noun meanings. For example, given any usage of “cave” or “umpire” in the target passage, LSA can determine which meaning of “bat” is appropriate, and translate accordingly.
Because a Bayesian probability calculation on a simple co-occurrence frequency table created from the same data has similar disambiguation capabilities, the paper also compares LSA with the Bayesian model. The comparative ease of use and small size of the Bayesian method are weighed against LSA’s broader range, as LSA associates more words with the polysemous nouns through the manipulation of its matrix. Other aspects of the project which are still in progress include automating the acquisition of training data for the LSA matrix and handling plurals.
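For comparison, a Bayesian calculation over a co-occurrence frequency table might look like the following sketch. The abstract does not specify the exact calculation, so the naive Bayes formulation, the add-one smoothing, the toy training data, and the names `log_posterior` and `bayes_sense` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labelled passages, invented for illustration (not the thesis data).
TRAINING = [
    ("chauve-souris", "bat cave wings night"),
    ("chauve-souris", "bat cave wings fly"),
    ("batte", "bat umpire ball pitch"),
    ("batte", "bat umpire ball glove"),
]

# Co-occurrence frequency table: sense -> counts of words seen alongside "bat".
cooc = defaultdict(Counter)
sense_counts = Counter()
for sense, text in TRAINING:
    sense_counts[sense] += 1
    cooc[sense].update(w for w in text.split() if w != "bat")

context_vocab = {w for counts in cooc.values() for w in counts}


def log_posterior(sense, words):
    """Unnormalised log P(sense | words) under naive Bayes, add-one smoothed."""
    total = sum(cooc[sense].values())
    score = math.log(sense_counts[sense] / sum(sense_counts.values()))
    for w in words:
        if w in context_vocab:  # words absent from the table carry no evidence
            score += math.log((cooc[sense][w] + 1) / (total + len(context_vocab)))
    return score


def bayes_sense(context):
    """Pick the sense with the highest posterior for the given context."""
    words = context.lower().split()
    return max(sense_counts, key=lambda sense: log_posterior(sense, words))
```

The table only counts words that actually appeared next to "bat" in training, which reflects the trade-off noted above: the model is small and simple, but a context word never seen with the noun contributes nothing, whereas LSA's reduced matrix can relate such words indirectly.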
In addition, our translation program can be connected to an online translator such as BabelFish for a practical demonstration of its capabilities. Given a sentence, the online translator generates a translation in the target language, which is then examined by the LSA or Bayesian model to determine if any polysemous noun present was interpreted accurately, and to correct the translation if necessary.
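The post-correction step can be sketched as below. This is a hypothetical outline only: the `SENSES` lexicon, the `correct_translation` function, and the `toy_chooser` stand-in (which takes the place of the LSA or Bayesian model) are invented for illustration, and no call to a real online translator is made.

```python
import re

# Hypothetical lexicon: polysemous English noun -> candidate French translations.
SENSES = {"bat": {"batte", "chauve-souris"}}


def correct_translation(source, translated, choose_sense):
    """Replace a wrongly chosen sense in `translated`.

    `choose_sense(noun, context)` stands in for the LSA or Bayesian model and
    returns the preferred translation of `noun`, or None if it cannot decide.
    """
    src_words = re.findall(r"[a-z]+", source.lower())
    for noun, candidates in SENSES.items():
        if noun not in src_words:
            continue
        context = " ".join(w for w in src_words if w != noun)
        wanted = choose_sense(noun, context)
        if wanted is None:
            continue  # leave the translator's choice untouched
        for wrong in candidates - {wanted}:
            translated = re.sub(rf"\b{re.escape(wrong)}\b", wanted, translated)
    return translated


def toy_chooser(noun, context):
    """Stand-in model: prefers the animal sense when 'cave' is in the context."""
    return "chauve-souris" if "cave" in context.split() else "batte"


fixed = correct_translation(
    "The bat flew out of the cave",
    "La batte a volé hors de la caverne",  # assumed faulty translator output
    toy_chooser,
)
```

Plain word substitution works in this example because both French nouns are feminine; a fuller version would also need to handle article and adjective agreement.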