Report

Teaser, summary, work performed and final results

Periodic Reporting for period 1 - HimL (Health in my Language)

Teaser

The goal of the HimL project is to increase timely access to important public health information by making it available to consumers in their own language. We do this using high quality machine translation optimised for semantic fidelity. This addresses two distinct but related...

Summary

The goal of the HimL project is to increase timely access to important public health information by making
it available to consumers in their own language. We do this using high quality machine translation
optimised for semantic fidelity. This addresses two distinct but related problems:

1) A large proportion of public health information is local: While general information about diabetes may
be available in Polish, Romanian or Urdu, what a Polish speaker in Aberdeen, or a Romanian speaker in
Edinburgh or an Urdu speaker in Glasgow needs is local information about diabetes services and care,
but in their own language.
2) Best practice in public health care can change, as a result of new research studies and/or meta-analyses.
Recommendations in line with best practice need to be made available to consumers, as soon as possible,
through timely, accurate translation into their own language.

The first problem reflects the translation needs of national and regional public health services such as our partner
NHS 24, while the second reflects the translation needs of trans-national health information providers such as
our partner COCHRANE. We will address these needs by applying recent advances in machine translation (MT)
to make their texts available in a timely fashion to a much wider range of language communities.
Fully automatic translation systems will be created to translate public health information from English into
Czech, German, Polish and Romanian. This particular set of languages has been chosen because of the needs
of our user partners, and because they cover the three major families of European languages (Slavic, Germanic
and Romance). What is more, all four of these languages are classified as having “weak or no support” or
“fragmentary support” according to the META-NET language white papers.

The innovations in MT that we adapt, integrate and apply in the project include:
1) Domain adaptation to build MT systems tailored to the public health domain in terms of terminology,
register and reading level;
2) Semantically-aware MT to improve translation fidelity, along with semantic evaluation to
tune the systems for fidelity and serve as specific automated progress metrics;
3) Morphology prediction to translate from morphologically impoverished English to the morphologically
richer target languages considered in this project.

Through the life of the project we will transfer the above improvements in MT from lab-based systems, to live
on-line health care services which are highly trusted and have large numbers of users. This project will allow
these multi-lingual services to expand coverage to more content and to new languages, becoming more widely
useful to European healthcare consumers. The HimL project places a great deal of emphasis on the deployment,
evaluation and dissemination of current MT research and the user partners are committed to proving both the
usefulness and the impact of the project, by running extensive user acceptance testing, and collecting web traffic
and web user feedback.

In summary, the implementational objectives of the project are:
1) Collate the latest research on high accuracy machine translation, to develop systems which are
measurably more reliable, for our particular domain, than baseline state-of-the-art models.
2) Deploy translation engines as services with a simple interface and scalable performance.
3) Integrate translation functionality seamlessly into the content management workflow of two high-profile
on-line healthcare information providers.
4) Add translation functionality to their websites, carefully managing user expectation and evaluating user
satisfaction.
5) Comprehensively measure the impact of this new functionality on the services provided by NHS 24 and
COCHRANE.

By achieving these more concrete objectives we will be taking steps towards achieving our global
objectives:
1) Increase the accuracy of machine translation, making it more reliable and more wi

Work performed

The workplan for HimL envisaged a system release in each year of the project, which would be integrated into each of the partners' websites by the end of the third quarter,
then evaluated in the final quarter. We planned a phased incorporation of new technologies into the releases, evaluating the innovations as they are integrated.

At the time of the report, we have already integrated and evaluated the Year 1 (Y1) systems and are currently working on
the integration of the Year 2 (Y2) systems. The Y2 systems incorporate domain adaptation in each of the language pairs,
morphological processing for Czech and German, and the "core fidelity" technique to reduce the occurrence of clear
semantic errors. Our aim for this year's system integration is to have all these components working together correctly
in the translation server, so that they can be used to translate the user partners' websites. The translation is
incorporated into the publication process employed by each of these partners, although at this stage of the project
the translations only appear on non-public, development versions of the sites.

Our evaluation for Y1 trialled a new human-assisted method for semantic evaluation of machine translation, aiming to measure how much meaning is
preserved in the translation, and pinpointing the errors in the translation.
Building on our experiences with this trial, we will develop the method further in Y2, using it to compare different types of
system. We will also apply other types of evaluation, both human and automatic, to measure our progress between Y1 and Y2, and
inform system development for the final year.

Behind the system building and integration efforts, the HimL partners have been developing and selecting technologies that could
be used in the Y3 systems. In the "data and adaptation" work package, our main focus so far has been on harvesting suitable training data,
and choosing the best techniques for making sure that the translations produced are appropriate for the domain (i.e., adapted). In
the second half of the project we will also turn our attention to how best to use the large amounts of monolingual medical text
to enhance our translation systems. In the "semantics" work package we have been working on various techniques to discourage the systems
from producing semantically incorrect translations, for example by automatically removing the source of such translations from the models
altogether, exploiting automatic semantic analysers, and applying techniques for dealing correctly with negation. In the "morphology"
work package we have been improving our models for prediction of correct morphology in German and Czech, and extending them to
Romanian and Polish.

Since the HimL project began there has been an important development in machine translation research which we are tracking
carefully. We refer to the emergence of "neural network" or "deep learning" models for machine translation, known as
"neural machine translation" (NMT). In evaluation campaigns (where researchers compare their systems against others on standard
data sets) in 2015 and 2016, NMT systems have in many cases out-performed earlier types of MT systems. It is still early days
for NMT, and there are many practical problems in deploying such systems, as well as questions about how well they will perform
on specific domains, and their potential for biasing towards fluency at the expense of adequacy. However, the field is moving
very rapidly and HimL cannot afford to ignore this development. We are already building NMT systems with HimL data sets to evaluate
against existing MT systems.

Final results

Within HimL, we aim to make progress beyond state-of-the-art in five areas, which we describe below.

Data and Adaptation. We aim to create translation systems which are tailored for public health
information text. To do this we gather all possible resources (starting with those collected for the previous EU project, Khresmoi)
that could be useful in this domain. We supplement these resources with general purpose texts commonly
used for translation systems, such as the European parliamentary proceedings (Europarl).
Since these general purpose texts are much larger
than the in-domain texts we need to apply domain-adaptation techniques to prevent incorrect senses from
being preferred over the correct in-domain senses in the translation output. The biggest problem with lack of
in-domain data, however, is generally out-of-vocabulary words, particularly with regard to technical terms. To
address this problem we employ additional terminology resources wherever possible and supplement these
by mining terms from non-parallel texts.
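
The out-of-vocabulary problem described above can be made concrete with a small sketch. It is not HimL tooling: the function names (`tokenize`, `oov_terms`, `oov_rate`) and the toy sentences are assumptions made for illustration, and the tokenizer is deliberately crude. It lists the in-domain words that the training data never covers, which are the candidates to look up in a terminology resource.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude, ASCII only)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def vocabulary(sentences):
    """Set of all tokens seen in a collection of sentences."""
    return {tok for sentence in sentences for tok in tokenize(sentence)}


def oov_terms(training_sentences, in_domain_sentences):
    """In-domain tokens never seen in the training data, in first-seen order."""
    known = vocabulary(training_sentences)
    seen, missing = set(), []
    for sentence in in_domain_sentences:
        for tok in tokenize(sentence):
            if tok not in known and tok not in seen:
                seen.add(tok)
                missing.append(tok)
    return missing


def oov_rate(training_sentences, in_domain_sentences):
    """Fraction of in-domain token occurrences not covered by the training data."""
    known = vocabulary(training_sentences)
    tokens = [tok for s in in_domain_sentences for tok in tokenize(s)]
    if not tokens:
        return 0.0
    return sum(tok not in known for tok in tokens) / len(tokens)


# Invented toy data: general-purpose training text vs. public health text.
general = [
    "the committee discussed the health budget",
    "patients in scotland need information about care",
]
in_domain = [
    "patients need information about metformin",
    "information about hypoglycaemia",
]

print(oov_terms(general, in_domain))  # ['metformin', 'hypoglycaemia']
print(oov_rate(general, in_domain))   # 0.25
```

In this toy case the two uncovered words are exactly the technical terms, which is the situation the paragraph describes.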

Semantically Motivated MT. Central to our approach is the idea that translation of public health
information needs to preserve the meaning of the source sentence, even if this may mean sacrificing some
fluency. To this end, we apply recent research from QTLeap and other EU and non-EU projects, and use
robust semantic evaluation metrics to validate our approaches. We incorporate shallow semantic
parsing and semantic role labelling (SRL) into the translation systems, in order to ensure that translations which
do not preserve the "who did what to whom" and appropriate polarity are penalised by the model. We also
incorporate fidelity checks into the system using shallow syntactic information to reduce predictable errors.
Finally, we improve lexical semantics through existing large-scale high-quality dictionaries.
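
As a rough illustration of what a polarity check might look like, the sketch below flags English-German sentence pairs in which one side contains a negation cue and the other does not. This is a surface-level heuristic written for this example, not the semantic role labelling or fidelity checks used in the project; the cue lists and function names are assumptions and are far from complete.

```python
import re

# Small, incomplete cue lists (assumptions for illustration only).
EN_NEGATION = {"not", "no", "never", "cannot"}
DE_NEGATION = {"nicht", "kein", "keine", "keinen", "keinem", "keiner",
               "keines", "nie", "niemals"}


def words(text):
    """Lowercased word tokens; contractions such as "don't" stay in one piece."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def english_is_negated(sentence):
    """True if the sentence contains an English negation cue or an n't contraction."""
    return any(w in EN_NEGATION or w.endswith("n't") for w in words(sentence))


def german_is_negated(sentence):
    """True if the sentence contains one of the listed German negation cues."""
    return any(w in DE_NEGATION for w in words(sentence))


def polarity_mismatch(source_en, target_de):
    """True if exactly one side of the pair carries a negation cue."""
    return english_is_negated(source_en) != german_is_negated(target_de)


source = "Do not take this medicine with alcohol."
faithful = "Nehmen Sie dieses Arzneimittel nicht mit Alkohol ein."
dropped = "Nehmen Sie dieses Arzneimittel mit Alkohol ein."

print(polarity_mismatch(source, faithful))  # False
print(polarity_mismatch(source, dropped))   # True
```

A check of this kind cannot handle double negation, scope, or negation expressed lexically, which is one reason deeper semantic analysis is needed in practice.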

Morphology. The target languages in HimL (German, Czech, Polish and Romanian) all exhibit a
degree of morphological complexity not found in the source language (English) and all possess case systems.
In order to generate accurate translations in these languages, therefore, we need to have mechanisms in place to
ensure that the correct morphological variants are chosen. To achieve this, we refine and apply techniques
mainly developed by LMU Munich and Charles University for use in German and Czech, to the four HimL target languages.
These techniques include both the corrective approach (depfix) and the two-step approach, where we first
translate to a simplified target language representation, then use a prediction model to generate the correct
morphology.
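
The two-step approach can be sketched on a toy German example. Here the first step is assumed to have already produced a simplified, lemma-level output in which the definite article is the placeholder `d`; the second step predicts gender and case and generates the inflected form. The lookup tables stand in for a learned prediction model, and the placeholder convention and all names are assumptions made for this example, not the project's representation.

```python
# Case governed by a few prepositions (toy stand-in for a learned prediction model).
PREPOSITION_CASE = {"mit": "dat", "für": "acc"}

# Grammatical gender of the noun lemmas known to this toy example.
NOUN_GENDER = {"Arzt": "masc", "Ärztin": "fem"}

# Inflected forms of the German definite article, keyed by (gender, case).
ARTICLE_FORMS = {
    ("masc", "nom"): "der", ("masc", "acc"): "den", ("masc", "dat"): "dem",
    ("fem", "nom"): "die", ("fem", "acc"): "die", ("fem", "dat"): "der",
}


def predict_features(lemmas):
    """Predict (gender, case) for each article placeholder "d"; None for other tokens."""
    features = []
    case = "nom"  # default case when no preposition has been seen
    for i, lemma in enumerate(lemmas):
        if lemma in PREPOSITION_CASE:
            case = PREPOSITION_CASE[lemma]
            features.append(None)
        elif lemma == "d":
            # The article agrees with the next known noun in the sequence.
            gender = next((NOUN_GENDER[l] for l in lemmas[i + 1:] if l in NOUN_GENDER), None)
            if gender is None:
                raise ValueError("no known noun after article placeholder")
            features.append((gender, case))
        else:
            features.append(None)
    return features


def generate(lemmas, features):
    """Replace each placeholder with the article form matching its features."""
    return " ".join(ARTICLE_FORMS[f] if f else lemma for lemma, f in zip(lemmas, features))


def inflect(lemmas):
    """Run prediction and generation on a lemma-level translation."""
    return generate(lemmas, predict_features(lemmas))


print(inflect(["mit", "d", "Arzt"]))    # mit dem Arzt
print(inflect(["für", "d", "Ärztin"]))  # für die Ärztin
```

Separating the steps means the translation model does not have to choose between "der", "den" and "dem" itself; that decision is deferred to a component that can look at the target-side context.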

Deployment. Staged deployment of our lab-based models is led by Lingea. They are implementing a simple API,
similar to Google's translation API, which is directly used by the content management
systems of NHS 24 and Cochrane, to ensure tight and seamless integration of translation functionality into the live
websites. On the client side, NHS 24 and Cochrane ensure that the multi-lingual translation functionality
is easy to use and that user expectations about the quality of machine translation are appropriately managed.
Once new functionality is deployed, we will run a marketing campaign which will create enough traffic to
gather feedback for evaluation. Engaging with the public in this way will also help us to gain insight into the
reactions and attitudes to MT content in this domain.
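
The report does not specify the API itself, so the following is only a hypothetical sketch of how a content management system might wrap such a service. `TranslationClient`, `TranslatedPage`, `stub_backend` and the caching behaviour are all assumptions; the stub backend stands in for the real network call. The `machine_translated` flag illustrates one way a site could label MT output to manage user expectations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# ISO 639-1 codes for the four HimL target languages.
SUPPORTED_TARGETS = {"cs", "de", "pl", "ro"}


@dataclass
class TranslatedPage:
    text: str
    target: str
    machine_translated: bool = True  # lets the site display an MT notice


class TranslationClient:
    """Hypothetical CMS-side wrapper around a translation backend."""

    def __init__(self, backend: Callable[[str, str], str]):
        self._backend = backend
        self._cache: Dict[Tuple[str, str], str] = {}

    def translate(self, text: str, target: str) -> TranslatedPage:
        if target not in SUPPORTED_TARGETS:
            raise ValueError(f"unsupported target language: {target}")
        key = (text, target)
        if key not in self._cache:
            # Only call the backend for content that has not been translated yet.
            self._cache[key] = self._backend(text, target)
        return TranslatedPage(text=self._cache[key], target=target)


# Stub backend for demonstration; a real one would call the translation service.
calls: List[Tuple[str, str]] = []


def stub_backend(text: str, target: str) -> str:
    calls.append((text, target))
    return f"[{target}] {text}"


client = TranslationClient(stub_backend)
page = client.translate("Drink plenty of water.", "pl")
client.translate("Drink plenty of water.", "pl")  # served from the cache

print(page.text)   # [pl] Drink plenty of water.
print(len(calls))  # 1
```

Caching by source text and target language is a design choice suited to published pages that change rarely; a real deployment would also need cache invalidation when a new system release changes the translations.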

Evaluation. In the evaluation package, our aims are to measure the effectiveness of our improvements
in MT, as well as to feed back diagnostic information on the MT systems we create. Evaluation
ranges from fully automatic, to full-blown trials of MT "in the wild". We are developing new automatic and
human-assisted semantic metrics to assist us in tuning our systems towards system fidelity, based on previous
work from Edinburgh and Charles University in this area. We also participate in relevant open evaluation campaigns
to benchmark our research in MT in the publi

Website & more info

More info: http://www.himl.eu.