A few months ago on a LinkedIn discussion group for translators, someone cited an article about testing machine translation (MT). The article, “An Analysis of Google Translate Accuracy” by Milam Aiken and Shilpa Balan, reported on the results of their study of Google Translate (GT). They used GT to translate several phrases and then analyzed the quality of the results using different methods. One of the phrases they chose for testing was “My hovercraft is full of eels,” causing one of the translators in the discussion group to wonder how in the world they came up with their sample phrases.

Monty Python fans may recognize the line from the troupe’s Hungarian phrasebook sketch, which satirized the many bad phrasebooks then available. In the sketch’s last scene, a man is on trial for fraud for selling a phrasebook that translated “Which way is the train station?” as “May I fondle your buttocks?”

The sketch is very funny, but it is also a reminder of some of the hilarious results that MT has produced over the years. Take this translation, which some Israeli journalists obtained from an automated Hebrew-English translation engine and actually sent to the Dutch foreign minister:

Helloh bud, enclosed five of the questions in honor of the foreign minister: The mother your visit in Israel is a sleep to the favor or to the bed your mind on the conflict are Israeli Palestinian.

Testing the “Invisible Idiot”

MT has come a long way even in the few years since that story broke. Evaluating just how far it has come, however, is not an easy task. The best and most obvious approach would seem to be having bilingual humans evaluate the results, but that still leaves the question of how to quantify their subjective judgments. An even bigger problem is the difficulty and expense of such an effort. Google Translate currently supports 65 languages. If I remembered my high school math well enough, I could tell you exactly how many language pairs that equals, but I know that it’s huge. How do you even find and coordinate the number of qualified people necessary to perform parallel tests on so many language pairs?
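For the curious, the arithmetic itself is simple: with 65 languages, each one can be paired with the other 64, which gives 65 × 64 = 4,160 directed language pairs, or 2,080 if you ignore direction. A quick sketch in Python, assuming only the 65-language figure mentioned above:

```python
from math import comb

languages = 65  # number of languages Google Translate supported at the time

# Each language can be translated into any of the other 64, so direction matters.
directed_pairs = languages * (languages - 1)   # 4,160 source-to-target combinations
undirected_pairs = comb(languages, 2)          # 2,080 if A->B and B->A count as one pair

print(directed_pairs, undirected_pairs)
```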

Because of this problem, automated methods have been developed to evaluate the results of automated translation. Metrics such as BLEU, NIST, and METEOR start from texts that have already been translated by humans and score how closely the machine translation matches those reference translations according to particular criteria.
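To give a flavor of how such a metric works, here is a deliberately simplified, BLEU-style sketch: count how many n-grams (short word sequences) of the machine output also appear in the human reference translation, combine those precisions, and penalize output that is too short. This is only an illustration of the principle, not the official BLEU implementation, which also handles multiple references, corpus-level statistics, and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Toy BLEU-style score: clipped n-gram precisions plus a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:
        return 0.0  # real BLEU smooths this case instead of returning zero

    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(simple_bleu("my hovercraft is full of eels",
                  "my hovercraft is full of eels"))             # 1.0 -- exact match
print(simple_bleu("my hovercraft is filled with eels",
                  "my hovercraft is full of eels", max_n=2))    # about 0.52 -- partial overlap
```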

Then there is the question of how good these evaluation methods themselves are, which in turn leads to studies comparing the automated scores with the results of human evaluation.
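In such studies, the bottom line is usually how strongly the metric’s scores correlate with human judgments of the same translations. A minimal sketch, with scores invented purely for illustration:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical scores for five machine-translated sentences.
metric_scores = [0.82, 0.45, 0.67, 0.30, 0.91]   # automated metric scores
human_ratings = [4.5, 2.0, 3.5, 2.5, 5.0]        # human adequacy judgments on a 1-5 scale

# A correlation near 1.0 suggests the automated metric ranks translations
# roughly the way human evaluators do.
print(round(correlation(metric_scores, human_ratings), 2))  # about 0.94 for these made-up numbers
```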

Whatever the test, the outcome is a score that attempts to quantify how faithful an automated translation is to the original text and how comprehensible it will be to a speaker of the target language. The one thing none of these numbers can tell you, unless they show a 100% match between the machine output and the reference translation, is what was wrong with the translation. A little difference can change a lot of meaning.

This is why, as Common Sense Advisory puts it, “There is a place and a time for machine translation.” It is useful, but by itself it should not be trusted in any situation in which accurate and fluent translation is important.