Meta AI develops model to translate across 200 different languages, including 55 African languages
Driving inclusion through the power of AI translation
AI Research For Real-World Application
Applying AI techniques to translate low-resource languages on Facebook and Instagram
We’re committed to bringing people together. That’s why we’re using modeling techniques and learnings from our NLLB research to improve translations of low-resource languages on Facebook and Instagram.
By applying these techniques and learnings to our production translation systems, we’re helping people make more authentic, more meaningful connections in their preferred or native languages. In the future, we hope to extend our learnings from NLLB to more Meta apps.
Real-World Application
Building for an inclusive metaverse: bringing people together on a global scale
As we build for the metaverse, integrating real-time AR/VR text translation in hundreds of languages is a priority. Our aim is to set a new standard of inclusion: a world where everyone can someday access virtual-world content, devices and experiences, and communicate with anyone, in any language, in the metaverse. Over time, this will bring people together on a global scale.
Real-World Application
Translating Wikipedia for everyone: Helping volunteer editors make information available in more languages
The technology behind the NLLB-200 model, now available through the Wikimedia Foundation’s Content Translation tool, is supporting Wikipedia editors as they translate information into their native and preferred languages. Editors are using the technology to translate and edit articles originating in underrepresented languages, such as Luganda and Icelandic, more efficiently. This helps make more knowledge available in more languages for Wikipedia readers around the world. The open-source NLLB-200 model will also help researchers and interested Wikipedia editor communities build on our work.
Experience The Tech
Stories Told Through Translation: books from around the world translated into hundreds of languages
Experience the power of AI translation with Stories Told Through Translation, our demo that uses the latest AI advancements from the No Language Left Behind project. The demo translates books from their languages of origin, such as Indonesian, Somali and Burmese, into more languages for readers, with hundreds available in the coming months. Through this initiative, NLLB-200 will be the first-ever AI model able to translate literature at this scale.
The Tech
Machine translation explained: How does the open-source NLLB model directly translate 200 languages?
Stage 1: Automatic dataset construction
Training data is collected containing sentences in the input language and desired output language.
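To make this stage concrete, here is a minimal sketch of embedding-based sentence pairing in Python. The `embed` function is a stand-in for a real multilingual sentence encoder (our pipeline uses LASER encoders and scores candidate pairs with a margin criterion rather than the raw threshold shown here):

```python
import numpy as np

def embed(sentences):
    # Stand-in for a multilingual sentence encoder such as LASER,
    # which maps sentences in any language into one shared vector
    # space. Random vectors are used here purely for illustration.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 1024))

def mine_pairs(src_sents, tgt_sents, threshold=0.8):
    # Pair each source sentence with its nearest target sentence by
    # cosine similarity, keeping only confident matches.
    src = embed(src_sents)
    tgt = embed(tgt_sents)
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                  # cosine similarity matrix
    best = sims.argmax(axis=1)          # nearest target per source
    return [(src_sents[i], tgt_sents[j])
            for i, j in enumerate(best) if sims[i, j] >= threshold]

pairs = mine_pairs(["The cat sleeps.", "It rains."],
                   ["Il pleut.", "Le chat dort."])
```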
Stage 2: Training
After creating aligned training data for thousands of training directions, this data is fed into our model training pipeline. These models are made up of two parts: the encoder, which converts the input sentence into an internal vector representation; and the decoder, which takes this internal vector representation and accurately generates the output sentence. By training on millions of example translations, models learn to generate more accurate translations.
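The trained encoder-decoder can be tried directly: the open-sourced NLLB-200 checkpoints are published on Hugging Face. A minimal inference sketch, assuming the transformers library is installed:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Distilled 600M-parameter NLLB-200 checkpoint. src_lang tells the
# tokenizer which language tag to prepend to the input sentence.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M")

inputs = tokenizer("No language left behind.", return_tensors="pt")

# The encoder converts the input into its internal representation;
# the decoder generates the output sentence, steered toward French
# by forcing the target-language token as the first decoded token.
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=50,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```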
Stage 3: Evaluation
Finally, we evaluate our model against a human-translated set of sentence translations to confirm that we are satisfied with the translation quality. This includes detecting and filtering out profanity and other offensive content through the use of toxicity lists we build for all supported languages. The result is a well-trained model that can translate directly between languages.
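As a small illustration of the filtering step, here is a sketch of wordlist-based toxicity filtering. The wordlist and helper names are hypothetical stand-ins; the real per-language toxicity lists are released with the project, and the production pipeline handles tokenization more carefully:

```python
def load_toxicity_list(path):
    # One term per line, as a simple per-language wordlist.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_clean(sentence, toxic_terms):
    # Reject a candidate translation containing any listed term.
    return not any(tok in toxic_terms for tok in sentence.lower().split())

toxic_terms = {"badword"}  # stand-in for load_toxicity_list(...)
candidates = ["A clean translation.", "A badword translation."]
kept = [s for s in candidates if is_clean(s, toxic_terms)]  # keeps only the first
```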
The Innovations
The science behind the breakthrough
Most of today’s machine translation (MT) models work for mid- to high-resource languages—leaving most low-resource languages behind. Meta AI researchers are addressing this issue with three significant AI innovations.
Automatic dataset construction for low-resource languages
The context
MT is a supervised learning task, which means the model needs data to learn from. Example translations from open-source data collections are often used. Our solution is to automatically construct translation pairs by pairing sentences in different collections of monolingual documents.
The challenge
The LASER models used for this dataset creation process primarily support mid- to high-resource languages, making it impossible to produce accurate translation pairs for low-resource languages.
The innovation
We solved this by investing in a teacher-student training procedure, making it possible to 1) extend LASER’s language coverage to 200 languages, and 2) produce a massive amount of data, even for low-resource languages.
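A schematic of the teacher-student idea, assuming PyTorch, with all modules and dimensions illustrative: a frozen teacher encoder (LASER, in our pipeline) embeds sentences in languages it already covers, and a student encoder for a new language is trained so that a sentence and its translation land at the same point in the shared embedding space:

```python
import torch
import torch.nn as nn

VOCAB, DIM = 30000, 1024

# Frozen teacher: stand-in for a pretrained multilingual encoder.
teacher = nn.EmbeddingBag(VOCAB, DIM)
for p in teacher.parameters():
    p.requires_grad = False

# Student: a new encoder learning to map a low-resource language
# into the teacher's embedding space.
student = nn.EmbeddingBag(VOCAB, DIM)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One training step on a translation pair, each side a tensor of
# token ids: the teacher embeds the side it understands, and the
# student is pulled toward that fixed vector.
src_ids = torch.randint(0, VOCAB, (1, 12))  # teacher-language sentence
tgt_ids = torch.randint(0, VOCAB, (1, 10))  # its translation

opt.zero_grad()
loss = loss_fn(student(tgt_ids), teacher(src_ids))
loss.backward()
opt.step()
```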
Modeling 200 languages
The context
Multilingual MT systems outperform bilingual systems, thanks to their ability to enable "transfer" from language pairs with plenty of training data to languages with fewer training resources.
The challenge
Jointly training hundreds of language pairs has its disadvantages: the same model must represent an increasingly large number of languages with the same number of parameters. When dataset sizes are imbalanced, this can cause overfitting.
The innovation
We’ve developed a Sparse Mixture-of-Experts model with shared and specialized capacity, so low-resource languages without much data are automatically routed to the shared capacity. Combined with better regularization, this avoids overfitting. We also used self-supervised learning and large-scale data augmentation through multiple types of back-translation.
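A minimal sketch of the routing idea in PyTorch. The dimensions, expert count and top-1 gate are illustrative; in NLLB-200 the experts replace feed-forward sublayers inside a full Transformer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # A router sends each token to one expert, so capacity can
    # specialize, while tokens from languages with little data can
    # be routed to the same shared experts.
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                    # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        top_gate, top_idx = gates.max(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i              # tokens routed to expert i
            if mask.any():
                out[mask] = top_gate[mask, None] * expert(x[mask])
        return out

layer = SparseMoELayer()
y = layer(torch.randn(6, 512))               # output shape (6, 512)
```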
Evaluating translation quality
The context
To know if a translation produced by our model meets our quality standards, we must evaluate it.
The challenge
Machine translation models are typically evaluated by comparing machine-translated sentences with human translations. For many languages, however, reliable reference translations are not available, so accurate evaluation is not possible.
The innovation
We doubled the coverage of FLORES, a human-translated evaluation benchmark, which now covers 200 languages. With automatic metrics and support for human evaluation, we can extensively quantify the quality of our translations.
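Scoring against such a benchmark reduces to comparing system output with the human references. A minimal sketch using the sacrebleu library (file paths are placeholders; the benchmark ships as aligned plain-text files, one sentence per line):

```python
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# Placeholder paths: hypotheses and references aligned by line number.
hypotheses = read_lines("system_output.fra_Latn")
references = read_lines("flores200/devtest/fra_Latn.devtest")

# sacrebleu expects a list of reference streams, hence the nesting.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```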
Learn more about the science behind NLLB by reading our whitepaper and blog, and by downloading the model to help us take this project further.
The Journey
Research milestones
Meta AI has been advancing machine translation technology while overcoming numerous industry challenges along the way, from the unavailability of data for low-resource languages to translation quality and accuracy. Our journey continues, as we drive inclusion through the power of AI translation.
About No Language Left Behind
No Language Left Behind (NLLB) is a first-of-its-kind AI breakthrough project that open-sources models capable of delivering evaluated, high-quality translations directly between 200 languages, including low-resource languages like Asturian, Luganda, Urdu and more. It aims to give people the opportunity to access and share web content in their native language, and to communicate with anyone, anywhere, regardless of their language preferences.