Report

Teaser, summary, work performed and final results

Periodic Reporting for period 1 - DeepSPIN (Deep Learning for Structured Prediction in Natural Language Processing)

Teaser

Deep learning is revolutionizing the field of Natural Language Processing (NLP), with breakthroughs in machine translation, speech recognition, and question answering. New language interfaces (digital assistants, messenger apps, customer service bots) are emerging as the next...

Summary

Deep learning is revolutionizing the field of Natural Language Processing (NLP), with breakthroughs in machine translation, speech recognition, and question answering. New language interfaces (digital assistants, messenger apps, customer service bots) are emerging as the next technologies for seamless, multilingual communication among humans and machines.

From a machine learning perspective, many problems in NLP can be characterized as structured prediction: they involve predicting structurally rich and interdependent output variables. In spite of this, current neural NLP systems ignore the structural complexity of human language, relying on simplistic and error-prone greedy search procedures. This leads to serious mistakes in machine translation, such as words being dropped or named entities being mistranslated. More broadly, neural networks are missing the key structural mechanisms for solving complex real-world tasks requiring deep reasoning.

This project attacks these fundamental problems by bringing together deep learning and structured prediction, with a highly disruptive and cross-disciplinary approach. First, we will endow neural networks with a planning mechanism to guide structural search, letting decoders learn the optimal order in which they should operate. This builds a bridge to reinforcement learning and combinatorial optimization. Second, we will develop new ways of automatically inducing latent structure inside the network, making it more expressive, scalable and interpretable. Synergies with probabilistic inference and sparse modeling techniques will be exploited. To complement these two innovations, we will investigate new ways of incorporating weak supervision to reduce the need for labeled data.

Three highly challenging applications will serve as testbeds: machine translation, quality estimation, and dependency parsing. To maximize technological impact, this work is carried out in collaboration with Unbabel, a start-up company in the crowd-sourced translation industry.

Work performed

I present below a summary of activities as of August 2019, including the main results, released code and datasets, and dissemination and training activities.

I structure this summary by work package.


WP1: A planning mechanism for structural search
==========================================

We proposed and investigated the theoretical properties of a class of loss functions called "Fenchel-Young losses", establishing a connection between sparsity, generalized entropies, and margins. This was published as a paper at AISTATS (Blondel et al. 2019), a collaboration between DeepSPIN members (Vlad Niculae and myself) and Mathieu Blondel, from NTT in Japan. Code is publicly released.
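
In its general form, the Fenchel-Young loss generated by a convex regularizer Omega can be written as follows, where Omega* denotes the Fenchel conjugate:

    L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle

Choosing Omega as the negative Shannon entropy recovers the cross-entropy loss, while the squared l2-norm (Gini) regularizer yields the sparsemax loss, which has sparse outputs and a separation margin; cross-entropy has neither, which is the connection between sparsity, entropies, and margins mentioned above.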

Inspired by this work, we developed a sparse sequence-to-sequence model that outputs sparse probabilities, in addition to using sparse attention, and applied it to neural machine translation and morphological inflection. This was published as a paper at ACL by the PhD student Ben Peters, the post-doc Vlad Niculae, and myself (Peters et al. 2019). Code is publicly released under the DeepSPIN GitHub repository.
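
As a rough illustration of the core building block behind these sparse outputs, here is a minimal NumPy sketch of sparsemax, the Euclidean projection onto the probability simplex, which can assign exactly zero probability to some outputs (the actual released code targets PyTorch and the more general entmax family):

    import numpy as np

    def sparsemax(z):
        # Euclidean projection of a score vector z onto the probability simplex.
        # Unlike softmax, the result can contain exact zeros.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]               # scores in decreasing order
        cumsum = np.cumsum(z_sorted)
        k = np.arange(1, z.size + 1)
        in_support = z_sorted - (cumsum - 1) / k > 0
        k_star = k[in_support][-1]                # size of the support
        tau = (cumsum[k_star - 1] - 1) / k_star   # threshold subtracted from all scores
        return np.maximum(z - tau, 0.0)

    # Example: one score dominates, so all probability mass collapses onto it.
    print(sparsemax([1.5, 0.3, 0.1]))  # -> [1. 0. 0.]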

We also investigated strategies to overcome exposure bias, together with the PhD student Tsvetomila Mihaylova; a paper on a new variant of scheduled sampling for Transformers was presented at the Student Research Workshop at ACL (Mihaylova and Martins, 2019).
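
To illustrate the idea, below is a rough PyTorch-style sketch of two-pass scheduled sampling for a parallel (Transformer-style) decoder; the decoder, embed and memory arguments and the token-level mixing are illustrative assumptions, not the exact recipe of the paper:

    import torch
    import torch.nn.functional as F

    def scheduled_sampling_step(decoder, embed, memory, gold_tokens, mix_prob):
        # First pass: standard teacher forcing with the gold prefix (all positions in parallel).
        logits = decoder(embed(gold_tokens), memory)
        predicted = logits.argmax(dim=-1)

        # Mix gold and model-predicted tokens, position by position.
        use_model = torch.rand(gold_tokens.shape, device=gold_tokens.device) < mix_prob
        mixed = torch.where(use_model, predicted, gold_tokens)

        # Second pass: condition on the mixed sequence, but compute the loss against gold.
        logits = decoder(embed(mixed), memory)
        return F.cross_entropy(logits.view(-1, logits.size(-1)), gold_tokens.view(-1))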

We are currently conducting experiments towards easy-first sequence-to-sequence models.


WP2: Induction of structure inside the network
=========================================

We proposed a new form of attention that is both sparse and constrained, using it to model fertility in neural machine translation. This was published as a paper at ACL 2018, in collaboration with Chaitanya Malaviya (then a Master's student at Carnegie Mellon University) and my Master's student Pedro Ferreira.
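
The underlying transformation, constrained sparsemax, can be sketched as a projection onto the probability simplex Delta with per-word upper bounds u derived from predicted fertilities:

    \mathrm{csparsemax}(z; u) = \operatorname*{argmin}_{p \in \Delta,\ p \le u} \; \lVert p - z \rVert^2

so that the cumulative attention a source word receives over the course of decoding cannot exceed its fertility budget.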

We also proposed a structured form of sparse attention called SparseMAP, presented in a paper at ICML (Niculae et al., ICML 2018). This work was developed in collaboration with teams at Cornell University and NTT in Japan. Vlad Niculae, then a PhD student at Cornell, joined the DeepSPIN project as a post-doc. A follow-up work, focused on building dynamic computation graphs on top of SparseMAP, was published at EMNLP (Niculae et al., EMNLP 2018). Code is publicly released.
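
For reference, SparseMAP replaces the argmax of MAP inference with a quadratically regularized problem over the marginal polytope M of the structured output:

    \mathrm{SparseMAP}(\theta) = \operatorname*{argmax}_{\mu \in \mathcal{M}} \; \langle \theta, \mu \rangle - \tfrac{1}{2} \lVert \mu \rVert^2

Its solution is a convex combination of only a few structures, sitting between MAP inference (a single structure) and marginal inference (a dense distribution over all structures), and it is differentiable, which is what makes it usable as a hidden layer.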

We presented a paper at the EMNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP'18) reviewing and summarizing this line of research, carried out by the DeepSPIN PhD student Ben Peters and post-doc Vlad Niculae (Peters et al., 2018).

Recently, inspired by our aforementioned work (Blondel et al., 2019; Peters et al., 2019), we participated in the SIGMORPHON shared task on morphological inflection (Peters and Martins, 2019), in which we ranked second and won the SIGMORPHON Interpretability Award. Code is publicly released under the DeepSPIN GitHub repository.

We investigated a hierarchical sparse attention model for document-level machine translation that is able to select relevant sentences from the context, and relevant words from those sentences. This work was conducted in collaboration with Sameen Maruf, a student at Monash University (Australia) whom I am co-advising with Gholamreza Haffari, and led to a paper at NAACL (Maruf et al., 2019). We also experimented with a context-aware machine translation model for conversational data, published earlier at the Conference on Machine Translation (WMT) (Maruf et al., 2018).
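
A rough NumPy sketch of the two-level mechanism (reusing the sparsemax function sketched under WP1; shapes and scoring functions are illustrative, not the paper's exact architecture):

    import numpy as np  # assumes the sparsemax() helper defined in the WP1 sketch above

    def hierarchical_attention(query, sent_keys, word_keys):
        # query: (d,)   sent_keys: (S, d)   word_keys: (S, W, d)
        sent_weights = sparsemax(sent_keys @ query)          # sparse weights over S context sentences
        word_scores = word_keys @ query                      # (S, W) word scores per sentence
        e = np.exp(word_scores - word_scores.max(axis=-1, keepdims=True))
        word_weights = e / e.sum(axis=-1, keepdims=True)     # softmax over words within each sentence
        return sent_weights[:, None] * word_weights          # words in unselected sentences get weight 0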

More recently, we developed a new variant of the Transformer architecture that learns how sparse each of its attention heads should be. This is a forthcoming EMNLP paper (Correia et al., 2019). Code is publicly released under the DeepSPIN GitHub repository.
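
A minimal sketch of the key ingredient, assuming PyTorch and the open-source entmax package (released under the DeepSPIN GitHub organization): attention weights are computed with alpha-entmax, where each head has its own learnable alpha (alpha close to 1 approaches softmax, alpha = 2 is sparsemax):

    import torch
    from entmax import entmax_bisect  # pip install entmax

    n_heads, seq_len = 8, 10
    scores = torch.randn(n_heads, seq_len, seq_len)                # raw attention scores per head
    alpha = torch.nn.Parameter(torch.full((n_heads, 1, 1), 1.5))   # one learnable alpha per head
    attn = entmax_bisect(scores, alpha=alpha, dim=-1)              # sparse attention distributions
    # Gradients flow to alpha as well, so each head learns how sparse it should be.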


WP3: Weak supervision and data-driven regularization
====================================================

We developed a simple and effective approach to automatic post-editing, based on transfer learning from a pre-trained BERT model, which achieved state-of-the-art results on the English-German WMT 2018 dataset.

Final results

This is covered in the box above. To sum up:
- New methods for sparse, structured, and constrained attention with gains in accuracy and interpretability.
- New method using hierarchical attention for context-aware machine translation.
- Best system demo paper at ACL 2019 for open-source system OpenKiwi.
- Winning system in shared task for quality estimation at WMT 2019 (for all tracks: word-level, sentence-level and document-level, and all language pairs).
- Winning system in shared task for automatic post-editing at WMT 2019 for English-German.
- SIGMORPHON Interpretability Award in the morphological inflection shared task.
- Practical method based on transfer learning (using a BERT pre-trained model) which led to state-of-the-art results for automatic post-editing on the English-German WMT 2018 dataset.

Website & more info

More info: https://deep-spin.github.io.