Report

Teaser, summary, work performed and final results

Periodic Reporting for period 1 - QT21 (QT21: Quality Translation 21)

Teaser

A European Digital Single Market free of barriers, including language barriers, is a stated EU objective to be achieved by 2020. The findings of the META-NET Language White Papers show that currently only 3 of the EU-27 languages enjoy moderate to good support by our machine...

Summary

A European Digital Single Market free of barriers, including language barriers, is a stated EU objective to be achieved by 2020. The findings of the META-NET Language White Papers show that currently only 3 of the EU-27 languages enjoy moderate to good support by our machine translation technologies, with either weak (at best fragmentary) or no support for the vast majority of the EU-27 languages. This lack is a key obstacle impeding the free flow of people, information and trade in the European Digital Single Market. Many of the languages not supported by our current technologies show common traits: they are morphologically complex, with free and diverse word order. Often there are not enough training resources and/or processing tools. Together this results in drastic drops in translation quality. The combined challenges of linguistic phenomena and resource scenarios have created a large and under-explored grey area in the language technology map of European languages. Combining support from key stakeholders, QT21 addresses this grey area by developing
(1) substantially improved statistical and machine-learning based translation models for challenging languages and resource scenarios,
(2) improved evaluation and continuous learning from mistakes, guided by a systematic analysis of quality barriers, informed by human translators,
(3) all with a strong focus on scalability, to ensure that learning and decoding with these models is efficient and that reliance on data (annotated or not) is minimised.
To continuously measure progress, and to provide a platform for sharing and collaboration (QT21 internally and beyond), the project revolves around a series of Shared Tasks, for maximum impact co-organised with WMT.

Work performed

During this period, the project produced 11 deliverables and published 93 scientific papers, of which 27 are system papers.
The scientific work performed by QT21 has been carried out along three axes: a) semantics (WP1), b) morphology and low-resource scenarios (WP2), and c) continuous learning from humans (WP3).
In order to measure progress and compare with the state of the art, QT21 co-organises and sponsors WMT (the Workshop on Machine Translation, http://statmt.org/wmt16/), whose goal is to benchmark and measure Machine Translation (MT) on different tasks (WP4).


In WP1, we develop models that overcome the problems occurring in syntactically and semantically divergent languages that cannot be adequately addressed by purely statistical models by:
o Structuring the translation differently, not along shallow phrases but rather along the semantics of the sentence (Task 1.1).
o Improving handling of semantics (the expressed relations between events, participants and other elements in the sentence) in existing shallow models (Task 1.2).
o Better learning in MT, both for existing models as well as by introducing novel models that learn full structural prediction (Task 1.3).
WP1 produced 3 deliverables in this reporting period: D1.1, D1.3 and D1.5.
The work in WP1 led to the publication of 31 papers.

Work package 2 (WP2) addresses the problem of translating under-resourced and morphologically rich languages. Even though tested only on the 6 QT21 language pairs, the methodology developed in this work package will make it possible to use MT for more languages and will make new MT-based applications deployable.
Many of the methods addressing the challenges of this WP have already been evaluated. Some of these techniques have been presented at major conferences in Natural Language Processing (NLP) and MT, and were also part of submissions to open evaluation campaigns such as the shared tasks of the Conference on Statistical Machine Translation (WMT), partly organised by QT21.
The work package is organised around four tasks:
• T2.1: Morphology-aware word representations. We addressed the problem of large vocabularies in morphologically rich languages by investigating new word representations. Instead of using words as atomic units, morphemes and factored representations were used to represent words during word alignment in phrase-based and syntax-based machine translation, as well as in neural network models.
• T2.2: Modelling morphological agreement. In this task, we addressed the problem of modelling morphological agreement in statistical machine translation. In this period, new approaches addressing intra-sentence agreement between words as well as inter-sentence agreement phenomena such as anaphora were investigated.
• T2.3: Improved usage of parallel data through machine learning. In order to make better use of parallel data for under-resourced languages, a tighter integration of neural network models into the overall machine translation system was investigated, as were discriminative training methods addressing specific translation needs.
• T2.4: Exploiting other data sources for under-resourced languages. To improve the translation of under-resourced languages, new data sources have to be exploited. In this period we achieved this by using cross-lingual transfer of dependency annotations. Furthermore, looking at language-independent approaches, we explored the use of a pivot language to learn sentence representations in a language-neutral way, as well as the use of deciphering algorithms.
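The subword representations investigated in T2.1 can be illustrated with a simplified byte-pair-encoding (BPE) merge learner of the kind later used in the project's WMT systems. This is a toy sketch of the general technique, not the project's implementation; the vocabulary below is an invented example:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn the most frequent symbol-pair merges."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy vocabulary: words pre-split into characters, with corpus frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, vocab = learn_bpe(vocab, 4)
```

After a few merges, frequent character sequences such as "es" and "est" become single vocabulary units, which is how BPE keeps the vocabulary of a morphologically rich language bounded.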
WP2 produced one deliverable: D2.1.
The work in WP2 led to the publication of 25 articles.

The objective of WP3 is to access and use the information provided by human feedback in various forms (human translation, MT error annotation and post-edits) for various MT systems and language pairs. This will make it possible to profile issues arising from machine translation and to apply this knowledge to automatically improve the output of the systems.

Final results

QT21 leads MT technology development. The results from WMT16, the reference benchmark event for Machine Translation, show that QT21 won 2/3 of all tasks, performing significantly better even than the well-known online MT systems on En↔De, En↔Cz and En→Ro, the core language pairs of QT21. This is a first.
The traction that the QT21-harmonised MQM-DQF error annotation paradigm is gaining in industry likewise demonstrates the impact of QT21's work on the translation industry.

New State-Of-The-Art (SOTA) results have been obtained with “back-translation”, Byte Pair Encoding (BPE) and, to some extent, System Combination (ensembling).
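Back-translation augments scarce parallel data by translating monolingual target-language text back into the source language with a reverse model, then training on the synthetic pairs. The following is a toy sketch: the word-for-word dictionary standing in for the reverse model is a hypothetical placeholder for a real target→source MT system:

```python
# Toy illustration of back-translation as data augmentation.
# REVERSE_MODEL is a hypothetical word-for-word dictionary; in practice
# it is a full target->source machine translation system.
REVERSE_MODEL = {"haus": "house", "das": "the", "ist": "is", "alt": "old"}

def back_translate(target_sentence):
    """Produce a synthetic source sentence from a target sentence."""
    return " ".join(REVERSE_MODEL.get(tok, tok) for tok in target_sentence.split())

def augment(parallel_data, monolingual_target):
    """Extend genuine (source, target) pairs with synthetic ones."""
    synthetic = [(back_translate(t), t) for t in monolingual_target]
    return parallel_data + synthetic

parallel = [("the house", "das haus")]
mono = ["das haus ist alt"]
training_data = augment(parallel, mono)
# training_data now also contains ("the house is old", "das haus ist alt")
```

The forward model is then trained on the enlarged corpus; because the target side of the synthetic pairs is genuine text, the technique is particularly effective for under-resourced target languages.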
Automatic Post-Editing (APE) has for the first time proven to be a very promising approach, improving BLEU scores by more than 2.5 points. Furthermore, QT21's APE system improves translations by 1 to 2 BLEU points on data that is not annotated (i.e. without using annotation information).
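Automatic Post-Editing treats the raw MT output as a text to be corrected by a second system trained on (MT output, human post-edit) pairs. A minimal rule-based sketch of the idea follows; the correction rules here are hypothetical placeholders for what a trained APE model learns from data:

```python
# Toy APE: apply correction rules mined from (MT output, post-edit) pairs.
# The rules below are invented examples; a real APE system learns its
# corrections automatically from post-edited training data.
RULES = [("informations", "information"), ("is consist of", "consists of")]

def post_edit(mt_output):
    """Apply each learned correction rule to the raw MT output."""
    for wrong, right in RULES:
        mt_output = mt_output.replace(wrong, right)
    return mt_output

hyp = "the report is consist of three informations"
fixed = post_edit(hyp)  # "the report consists of three information"
```

Because the APE step only needs MT output and reference post-edits, it can improve a black-box MT system without retraining it.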

Automatic Post-Editing has been improved by 2 BLEU points through class-based language models and factored models.
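A class-based language model backs off from sparse word identities to broader word classes, which helps in morphologically rich, data-scarce settings. The following is a minimal bigram sketch, not the project's models; the hand-assigned classes stand in for automatically induced ones:

```python
from collections import Counter

# Hand-assigned word classes; real systems induce classes automatically.
CLASSES = {"he": "PRON", "she": "PRON", "runs": "VERB", "walks": "VERB",
           "fast": "ADV", "slowly": "ADV"}

def train_class_bigrams(sentences):
    """Count class bigrams and class unigrams from tokenised sentences."""
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        classes = [CLASSES[w] for w in sent.split()]
        unigrams.update(classes)
        bigrams.update(zip(classes, classes[1:]))
    return bigrams, unigrams

def class_bigram_prob(bigrams, unigrams, c1, c2):
    """Maximum-likelihood estimate of P(c2 | c1) over classes."""
    if unigrams[c1] == 0:
        return 0.0
    return bigrams[(c1, c2)] / unigrams[c1]

corpus = ["he runs fast", "she walks slowly", "he walks fast"]
bigrams, unigrams = train_class_bigrams(corpus)
p = class_bigram_prob(bigrams, unigrams, "PRON", "VERB")
```

Scoring over a few hundred classes instead of a full vocabulary yields far denser counts, which is why such models can rescore or guide APE output even with little training data.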

Website & more info

More info: http://www.qt21.eu.