In 2005, the Linguistic Data Consortium (LDC), a linguistic research institute housed at the University of Pennsylvania, was awarded a contract with the Defense Advanced Research Projects Agency (DARPA) of the U.S. Department of Defense to conduct research under its GALE (Global Autonomous Language Exploitation) program. LDC’s task was to provide language data to support the development and evaluation of automated translation software in Arabic and other languages. To do this, LDC needed large quantities of parallel texts in each language pair to use in training the translation software. The texts had to be professionally translated to a high standard of quality and had to comply with specific rules for consistency and formatting to make them suitable for their purpose. Following an evaluation, MTM LinguaSoft received a one-year subcontract to produce high-quality Arabic-to-English translations to be used in the research.

The Approach:

The translations had high word volumes and usually needed to be completed in short time frames. The guidelines specified that each text had to be translated into English by a native speaker of the source language and then proofread by a native speaker of English. Linguists also had to carefully follow an extensive list of rules, specified by the client, for translating and handling the files. The volume of the work meant that the translation could not be handled by a single pair of linguists. Instead, the project manager (PM) had to recruit several teams of translators and proofreaders to spread out the work and to allow for back-up. Furthermore, the linguists chosen had to be thoroughly evaluated not only on their language skills, but also on their attention to detail and ability to follow the guidelines.

LDC provided the texts to be translated as numerous text files of varying lengths, taken from a variety of sources such as the news media, discussion forums, blog entries, transcribed news, and transcribed phone conversations. The PM parcelled out the files among the translation teams and carefully tracked their progress to meet deadlines. We put in place QA procedures and scripts to help us ensure consistency and compliance with the guidelines. LDC reviewed the translations and gave periodic feedback and ratings. Based on these ratings and the PM’s own observations, the PM regularly provided the linguists with detailed feedback. The suitability of linguists for the project was continually reevaluated, and new recruits were sought both to replace linguists who did not live up to expectations and to provide for an ever-increasing volume of work.

This feedback from the client also validated our expertise and quality level.

The Result:

The client’s satisfaction is evident in the results: MTM LinguaSoft’s one-year contract was subsequently renewed for four more years, covering the entire span of the five-year GALE project; annual volume grew to over half a million words by the end of the period; and MTM LinguaSoft’s translation work for LDC expanded into Chinese and Urdu. Since the end of GALE, we have continued to perform substantial translation work for LDC under other government projects such as BOLT (Broad Operational Language Translation) and RATS (Robust Automatic Transcription of Speech), most recently some prototype projects into and from Uzbek and Turkish. Data included newswire, Twitter feeds, and online forums. We have also provided transcription of Arabic-, Urdu-, and Farsi-language audio files for use in voice-recognition and translation software research. In fact, we’ve managed a large portion of LDC’s translation and transcription needs over the past eight years.