
Periodic Reporting for period 1 - LEXICAL (Lexical Acquisition Across Languages)

Teaser

Due to the growing volume of textual information available in multiple languages, there is great demand for Natural Language Processing (NLP) techniques that can automatically process and manage multi-lingual texts, supporting information access and communication in core areas of society (e.g. healthcare, business, science).

Summary

Due to the growing volume of textual information available in multiple languages, there is great demand for Natural Language Processing (NLP) techniques that can automatically process and manage multi-lingual texts, supporting information access and communication in core areas of society (e.g. healthcare, business, science). Many NLP tasks and applications rely on task-specific lexicons (e.g. dictionaries, word classifications) for optimal performance. Recently, automatic acquisition of lexicons from relevant texts has proved a promising, cost-effective alternative to manual lexicography. It has the potential to considerably enhance the viability and portability of NLP technology both within and across languages. However, this approach has been explored for only a small number of resource-rich languages, leaving the vast majority of the world's languages without useful technology. The ambitious goal of this project is to take research in lexical acquisition to the level where it can support multi-lingual NLP, including languages for which no parallel language resources (e.g. corpora, knowledge resources) are available. Building on an emerging line of research that uses mainly naturally occurring supervision (connections between languages) to guide cross-lingual NLP, we will develop a radically novel approach to lexical acquisition. This approach will transfer lexical knowledge from one language to another, and will also learn it simultaneously for a diverse set of languages, using new methodology that guides joint learning and inference with rich knowledge about cross-lingual connections. We aim not only to create next-generation lexical acquisition technology but also to take cross-lingual NLP a significant step toward independence from parallel resources. We will use our approach to support fundamental tasks and applications aimed at broadening the global reach of NLP to areas where it is now critically needed.

Work performed

During the first 18 months, we have focused on developing the methodological basis for the project. Our main focus has been on WP1 (the development of an improved basic model for monolingual lexical acquisition), but since there are inter-dependencies between the different WPs, we have also conducted groundwork on WP2 (transfer of lexical information from resource-rich to resource-poor languages) and on WP3 (joint multilingual lexical acquisition).

Much of our research has centred on representation learning – in particular, learning distributed word representations (i.e. word embeddings, termed WEs henceforth) from text data. Such representations are now widely used across Natural Language Processing (NLP), and we expect them to play a key role in the project. In particular, we have worked on improving the quality of both monolingual and cross-lingual WE learning, in the directions that can best serve the project's needs.
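To make the idea concrete, the sketch below learns WEs from tokenised text with gensim's Word2Vec and queries the resulting vector space. The toy corpus and hyperparameters are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch: learning word embeddings (WEs) from tokenised text with
# gensim's Word2Vec. Corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

# Toy corpus: one tokenised sentence per list.
sentences = [
    ["the", "doctor", "examined", "the", "patient"],
    ["the", "nurse", "treated", "the", "patient"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep all words in this toy example
    sg=1,             # skip-gram architecture
)

# Words occurring in similar contexts receive similar vectors.
print(model.wv.most_similar("doctor", topn=3))
```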

As a starting point, we have created novel, improved resources that provide the means for more accurate, detailed and cognitively plausible evaluation of WE learning. We have developed 1) SimVerb-3500, a resource that provides human ratings for the similarity of 3,500 verb pairs, 2) HyperLex, a resource that quantifies the degree of semantic category membership and the lexical entailment (LE) relation between 2,616 noun concept pairs, and 3) a novel framework that enables large-scale evaluation of representation learning architectures on the free word association task. We are currently in the process of creating similar resources for evaluation in multiple languages. We also co-organised RepEval 2016, the first Workshop on Evaluating Vector Space Representations for NLP, at ACL 2016 – an event specifically aimed at encouraging the development of improved evaluations for representation learning.
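Resources such as SimVerb-3500 are typically used for intrinsic evaluation by rank-correlating a model's similarity scores with the human ratings. The sketch below shows this standard protocol; the `model` object (a word-to-vector mapping) and the pair format are hypothetical stand-ins.

```python
# Minimal sketch: intrinsic evaluation of WEs against human similarity
# ratings (e.g. SimVerb-3500-style (word1, word2, score) triples).
from scipy.stats import spearmanr
from numpy import dot
from numpy.linalg import norm

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def evaluate(model, rated_pairs):
    """rated_pairs: iterable of (word1, word2, human_score) triples."""
    model_scores, human_scores = [], []
    for w1, w2, score in rated_pairs:
        if w1 in model and w2 in model:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(model[w1], model[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho  # higher Spearman's rho = closer agreement with humans
```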

In terms of methodology, the project has developed improved techniques for learning both mono- and cross-lingual WEs. For example, we have investigated the role of seed lexicons in inducing a shared bilingual word embedding space. Our results have demonstrated that a shared bilingual WE space can be induced by leveraging only a very weak bilingual signal along with monolingual data – an approach which is more realistic for real-life applications and can better contribute to meeting the aims of this project. We have also investigated the use of visual representations in supporting bilingual lexicon learning and have proposed a simple and effective multi-modal approach that learns bilingual semantic representations by fusing linguistic and visual input. Our results show that such a multi-modal approach can yield clear performance gains. Additionally, we have studied the problem of bilingual lexicon induction in a setting where some translation resources are available, but unknown translations are sought for certain, possibly domain-specific, terminology. We have shown that word- and character-level representations can each (independently and in combination) improve state-of-the-art results for this task.
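For illustration, one widely used way to induce a shared bilingual WE space from a seed lexicon is to learn an orthogonal linear map between two monolingual spaces (the Procrustes solution). The sketch below shows this general technique, not necessarily the project's exact method; `X` and `Y` are assumed matrices holding the embeddings of the seed translation pairs.

```python
# Minimal sketch: inducing a shared bilingual WE space from a seed lexicon
# via an orthogonal linear mapping (Procrustes), then using it for a simple
# bilingual-lexicon-induction step. Illustrative, not the project's method.
import numpy as np

def learn_mapping(X, Y):
    """X: (n, d) source-language vectors; Y: (n, d) target-language vectors
    for n seed translation pairs. Returns an orthogonal W with XW ~= Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(x, W, target_vocab, target_vecs):
    """Map a source vector into the target space and return its nearest
    target word by cosine similarity."""
    mapped = x @ W
    sims = target_vecs @ mapped / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(mapped)
    )
    return target_vocab[int(np.argmax(sims))]
```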

Furthermore, we have investigated the importance of syntactic information in WE learning. Such investigations are particularly relevant for learning verb WEs, which are central to the project. For English, dependency-based contexts had been shown to perform better than the more common but less informed lexical (window-based) contexts. In our cross-linguistic comparison of different context types, such contexts proved useful for detecting functional similarity (e.g. verb similarity, solving syntactic analogies), but not as clearly as previously reported for English. We also developed a novel cross-lingual word representation model which injects syntactic information into a shared cross-lingual word vector space. Our experiments with several language pairs on word similarity and bilingual lexicon induction demonstrate the usefulness of the proposed syntactically informed cross-lingual word vector spaces.
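As a concrete contrast with window-based contexts, the sketch below extracts simplified dependency-based contexts in the style of Levy and Goldberg (2014), pairing each word with its syntactic head typed by the dependency relation. The example sentence and the use of spaCy's English model are illustrative assumptions.

```python
# Minimal sketch: extracting dependency-based (word, context) pairs rather
# than raw window neighbours. Simplified: inverse relations are omitted.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("The committee approved the controversial proposal.")

pairs = []
for token in doc:
    if token.dep_ == "ROOT":
        continue
    # e.g. ("approved", "nsubj_committee"): the head word paired with a
    # relation-typed context, instead of an untyped window neighbour.
    pairs.append((token.head.text, f"{token.dep_}_{token.text}"))

print(pairs)
```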

Final results

This project will push the frontiers of our understanding of language processing and extend our ability to apply NLP across languages. We will develop a novel, expressive modelling approach in which joint learning and inference across languages is guided by rich knowledge about cross-lingual connections. This is a groundbreaking contribution, since multilingualism is one of the biggest current challenges in NLP. In addition to handling multiple languages, our model will also handle complex syntactic-semantic linguistic knowledge, advancing the capabilities of joint learning and inference models of language. We focus on a core component at the heart of many NLP systems: the lexicon. Rich verb lexicons that link together the syntax and semantics of verbs provide effective means to deal with many challenges in NLP (e.g. ambiguity, noise, data sparsity). They are important for the many applications that benefit from information about predicate-argument structure. We will provide improved means to tune and create such resources automatically (i.e. in a cost-effective manner) and will extend this approach to resource-poor languages and domains. Our techniques will support more accurate prediction of the appropriate interpretation of text within languages and improve our ability to match syntactic and semantic variations across languages for applications such as machine translation. Ultimately, improved automatic information processing benefits communication and can support key areas of society (e.g. science, healthcare, trade). We aim to extend these benefits to a global level. Our project will also provide rich material for theoretical investigation, because it brings together insights about connections between languages and probabilistic knowledge about verbs derived from data. This can benefit the linguistic and cognitive sciences and can also lead to improvements in language education (e.g. second language learning).

Website & more info

More info: http://ltl.mml.cam.ac.uk/projects/lexical/.