Opendata, web and dolomites

Report

Teaser, summary, work performed and final results

Periodic Reporting for period 2 - MMT (MMT will deliver a language independent commercial online translation service based on a new open-source machine translation distributed architecture)


Summary

Artificial intelligence is set to be the next big thing in the near future and will bring humans into a new era of access to, and organisation of, information. Language translation is probably the most complex human task for a machine to learn, but it is also the one with the greatest potential to make a positive difference in the world.

MMT aims to contribute to the evolution of machine translation. Our goal is to consolidate the current state-of-the-art technology into a single easy-to-use product, evolving it and keeping it open so that it can integrate the next major opportunities in machine intelligence, such as deep learning.

To achieve our goals, we need a better machine translation technology that can extract more information from data, adapt to context and be deployed easily. Like any other artificial intelligence technology, MMT needs data, so we are also focussing our efforts on making all the world’s translated information available to all.

More specifically, MMT will overcome four technology barriers that still hinder the wide adoption of currently available machine translation software by end-users and language service providers:

- MMT will be a ready-to-install application that will not require any initial training phase and will continuously learn from user data and corrections. Model training and system building are lengthy procedures: currently, building a new system from scratch takes a significant amount of time before the system is ready to translate. Once fed with training data, an MMT system will be ready to translate. MMT will merge translation memory and machine translation technology into a single product. Translation quality will improve as soon as new training data are added by users or simply generated by their post-editing activity. This will affect the translation industry by making machine translation applicable to a larger number of translation projects, whereas today it only meets the needs of longer-term projects.
- The MMT system will manage context automatically so that it will not require building domain-specific systems. It is well known that machine translation quality improves if the system is tuned to the domain of the document to translate, but this operation leads to data sparsity. Moreover, adaptation is currently a separate step in system building that requires considerable time and expertise to be effective. The MMT system will provide the best translation quality achievable for any topic or domain by storing training segments together with context linking information.
- MMT will enable scalability of data and users so that expensive ad-hoc hardware installations are no longer needed. Machine translation requires considerable hardware resources. The MMT architecture will support high performance and linear scalability up to thousands of nodes. The same software will serve to set up a personal translation system or to create a web-based service on a cluster of commodity nodes able to handle terabytes of data and millions of users.
- MMT will create a data collection infrastructure that accelerates the process of filling the data gap between large web companies and the machine translation industry. Machine translation requires very large amounts of data to work well. MMT will leverage web-crawled data, starting from CommonCrawl, the TAUS Data Cloud, Translated’s MyMemory and MateCat data and facilities, to set up a processing pipeline that will create unpre

Work performed

In Y1 the main focus of the MMT consortium was the integration of all the parts of a standard phrase-based translation engine into a single, easy-to-use product and, of course, the enhancement of that product with the ability to adapt translation output to the context of the source document.
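
As a rough illustration of what adapting to the context of the source document can mean, the sketch below scores a piece of context text against a few domain samples and turns the scores into weights. It is a toy under stated assumptions, not the MMT implementation: the function names, the bag-of-words representation and the cosine similarity are all choices made here for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-tokenised word counts (a deliberately crude model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def domain_weights(context, domain_samples):
    """Return one normalised weight per domain for the given context text.

    `domain_samples` maps a hypothetical domain name to sample text from it.
    If nothing overlaps, every domain gets the same weight.
    """
    ctx = bag_of_words(context)
    scores = {name: cosine(ctx, bag_of_words(text))
              for name, text in domain_samples.items()}
    total = sum(scores.values())
    if total == 0:
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


samples = {
    "finance": "payment invoice bank transfer account",
    "travel": "hotel flight booking airport luggage",
}
weights = domain_weights("please confirm the bank transfer for this invoice", samples)
print(weights)
```

A translation engine could use such weights to favour training segments from the closest domain, without building one system per domain.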

With this technology in place, the main goal for Y2 has been the identification of a market for the MMT product. For this reason we put considerable effort into field testing with big web companies such as PayPal and LinkedIn, as well as with small LSPs and professional translators. The results of this activity led us to define the new features of MMT and to deliver the first product on the market that integrates our technology: the MyMemory Plugin for SDL Studio 2015/2017.

From the technical point of view, the main goal of the development team has been the implementation of the online-learning feature, that is, the ability of MMT to incorporate new data in real time, whether single contributions or entire TMX files, without the need to retrain the whole engine.
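
The idea of learning online can be sketched with a toy translation memory whose lookups reflect new segments as soon as they are added, with no separate rebuild step. This is only an illustration under stated assumptions: the `OnlineMemory` class and its methods are hypothetical names, fuzzy matching is done with Python's standard `difflib`, and bulk import takes plain (source, target) pairs rather than parsing a real TMX file.

```python
from difflib import SequenceMatcher


class OnlineMemory:
    """Toy segment store: every added pair is usable by the next lookup."""

    def __init__(self):
        self._segments = []  # list of (source, target) pairs

    def add(self, source, target):
        """Add a single contribution, e.g. one post-edited sentence."""
        self._segments.append((source, target))

    def add_many(self, pairs):
        """Add a batch of pairs, standing in for the import of a whole file."""
        for source, target in pairs:
            self.add(source, target)

    def best_match(self, query):
        """Return (target, similarity) for the closest stored source, or None."""
        best = None
        for source, target in self._segments:
            score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
            if best is None or score > best[1]:
                best = (target, score)
        return best


memory = OnlineMemory()
memory.add_many([("Hello world", "Ciao mondo"), ("Good morning", "Buongiorno")])
print(memory.best_match("Good morning"))

# A new contribution is available immediately, without retraining anything.
memory.add("Thank you", "Grazie")
print(memory.best_match("Thank you"))
```

A real engine updates statistical models rather than a flat list, but the contract is the same: data added now influences the very next translation request.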

At the end of March and the beginning of June 2016, the consortium released MMT v0.12 and v0.13 respectively, with a more solid infrastructure than the previous version, 0.11.1. We introduced TMX support, monolingual data for a better language model, faster tuning, more accurate adaptive models and a brand new functionality called Tag Projection: this new API allows MMT to transfer XML tags from source to target even if the translation has been post-edited.
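
To give a feel for what transferring tags from source to target involves, here is a minimal sketch that re-inserts tags into a target sentence using a word alignment. It is not the Tag Projection API itself: the function name, the token-list input and the alignment dictionary (source word index to target word index) are assumptions made for this example. Tags whose neighbouring word has no alignment are simply dropped in this toy version, and tags spanning several reordered words are not handled.

```python
import re

TAG = re.compile(r"</?[^>]+>")


def project_tags(source_tokens, target_words, alignment):
    """Copy tags from a tokenised source onto target words.

    source_tokens: source words and tags, each tag as its own token
    target_words:  target words without any tags
    alignment:     dict mapping source word index -> target word index,
                   where indices count words only (tags are skipped)
    """
    before = {}  # target index -> tags to emit before that word
    after = {}   # target index -> tags to emit after that word
    word_index = 0
    for token in source_tokens:
        if not TAG.fullmatch(token):
            word_index += 1
            continue
        if token.startswith("</"):
            # A closing tag follows the previous source word.
            anchor = word_index - 1
            if anchor in alignment:
                after.setdefault(alignment[anchor], []).append(token)
        else:
            # An opening (or self-closing) tag precedes the next source word.
            anchor = word_index
            if anchor in alignment:
                before.setdefault(alignment[anchor], []).append(token)

    output = []
    for i, word in enumerate(target_words):
        output.extend(before.get(i, []))
        output.append(word)
        output.extend(after.get(i, []))
    return output


source = ["the", "<b>", "red", "</b>", "car"]
target = ["la", "macchina", "rossa"]
alignment = {0: 0, 1: 2, 2: 1}  # "red" aligns to "rossa", "car" to "macchina"
print(" ".join(project_tags(source, target, alignment)))
```

Because the tags follow the alignment rather than the word positions, the bold markup stays on "rossa" even though the adjective moves after the noun in the target.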

On October 25th, we released MMT v0.14-alpha: an alpha release that showcased the brand new online-learning feature of our translation engine. This version was used in a presentation at AMTA 2016. By the beginning of 2017, we finally released MMT v0.14 and v0.14.1, a faster and more robust version of MMT with online-learning capability.
The MMT product and its source code are available online at:
https://github.com/ModernMT/MMT.

Final results

The MMT consortium has taken a product-driven approach to the development of the technology. This approach focuses on quickly creating early releases of the product so that industry stakeholders and potential users can work with the MMT product as early as possible in the development phase.

This rapid development cycle allows the MMT consortium to put early versions of the product to the test with the goal of collecting information on the efficiency of the technology developed and of understanding how to actually solve real business problems.

Version 0.11 of the MMT product shows great potential when it comes to overcoming some of the limitations of current machine translation technology. This version already features a scalable infrastructure that can handle large numbers of users and large volumes of data, takes context into consideration when generating automatic translations, and supports a larger number of languages than current state-of-the-art software.

With version 0.14 of the MMT product, the consortium has introduced one of the most important and innovative features in the Statistical Machine Translation field: the ability of the translation engine to process new data in real time, merging the new information into its existing models. We call this feature online-learning. The realization of this feature unlocked the possibility of offering MMT as an end-user product that addresses professional translators' needs.

Website & more info

More info: http://www.modernmt.eu/.