Periodic Reporting for period 1 - WFL (Morphology beyond inflection. Building a wordformation based dictionary for Latin)

Summary

The past two decades have seen a considerable increase in the creation of computational linguistic resources for the investigation of classical languages, bringing the state of the art almost to the same level as that of the resources currently available for modern languages.
However, the existing linguistic resources for Latin (as for Ancient Greek, and indeed for the majority of modern languages) still lack a derivational morphological dictionary that connects lexical elements on the basis of Word Formation Rules (WFRs). In linguistics, there are two kinds of morphological rules:
1) inflectional rules, which relate different forms of the same lexeme (e.g. singular vs. plural, present vs. past tense);
2) word formation rules, which relate different lexemes (e.g. love vs. lover).
There are two types of word formation processes in Latin: derivation and compounding. Derivation can be further split into:
1) affixation, where one or more morphemes, called affixes, are attached to the base of a word. Affixation may or may not involve a change of part of speech, and is of two types:
a) prefixation, where the affix is attached before the base;
b) suffixation, where the affix is attached after the base;
2) conversion, where the derived word undergoes only a change of part of speech, without the addition of any affix.
Compounding is the formation of a new lexeme from two or more lexemes.

The WFL project compiled a derivational morphological lexicon of the Latin language, Word Formation Latin (WFL), which uses computational linguistic methods to connect lexical elements on the basis of word formation rules (WFRs). The resulting lexicon has been integrated into the most recent version of Lemlat, the morphological analyser and lemmatiser for Latin (www.lemlat3.eu), and can be browsed on its own dedicated website at http://wfl.marginalia.it.

Enriching textual data with derivational morphology tagging promises two main benefits. First, it can organise the lexicon at a higher level than that of single words, by building word formation based sets of lexemes that share a common ancestor. Second, information on word formation can act as a bridge between morphology and semantics, since words built by a common word formation process share core semantic properties at different levels.

The aim of WFL is to assign a WFR to each morphologically complex lexeme (i.e. a word morphologically derived from another word) and to link each complex lexeme to its ancestor. All the lexemes that share a common (non-derived) ancestor belong to the same “word formation family”. For instance, the noun bellatrix ‘she who wages war’, the verb rebello ‘to revolt, rebel’, and the adjective bellicosus ‘fond of war’ all belong to the word formation family whose ancestor is the noun bellum ‘war’. Lemmas are inserted into the WFL database semi-automatically: for each WFR, input-output relations are established for the set of lemmas matching the features that characterise that rule.
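
The notion of a word formation family can be illustrated with a minimal Python sketch. It is not the WFL database schema: the `Lexeme` fields, the rule labels and the intermediate derivation steps (bello, bellicus) are assumptions made for illustration only; the text above states only that bellatrix, rebello and bellicosus go back to bellum.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Lexeme:
    lemma: str
    pos: str                      # part of speech: "N", "V" or "A"
    base: Optional[str] = None    # lemma of the direct ancestor; None if not derived
    rule: Optional[str] = None    # hypothetical WFR label, not an actual WFL identifier


# Toy lexicon. The intermediate links are illustrative assumptions,
# not data taken from the WFL database.
LEXICON: Dict[str, Lexeme] = {
    lex.lemma: lex
    for lex in [
        Lexeme("bellum", "N"),
        Lexeme("bello", "V", base="bellum", rule="conversion N-to-V"),
        Lexeme("rebello", "V", base="bello", rule="prefixation re-"),
        Lexeme("bellatrix", "N", base="bello", rule="suffixation -trix"),
        Lexeme("bellicus", "A", base="bellum", rule="suffixation -icus"),
        Lexeme("bellicosus", "A", base="bellicus", rule="suffixation -osus"),
        Lexeme("amo", "V"),
        Lexeme("amator", "N", base="amo", rule="suffixation -tor"),
    ]
}


def root_of(lemma: str, lexicon: Dict[str, Lexeme]) -> str:
    """Follow ancestor links until a non-derived lexeme is reached."""
    seen = set()
    current = lexicon[lemma]
    while current.base is not None:
        if current.lemma in seen:
            raise ValueError(f"cyclic derivation involving {current.lemma!r}")
        seen.add(current.lemma)
        current = lexicon[current.base]
    return current.lemma


def family_of(lemma: str, lexicon: Dict[str, Lexeme]) -> List[str]:
    """Return, sorted, every lemma that shares its non-derived ancestor with `lemma`."""
    root = root_of(lemma, lexicon)
    return sorted(l for l in lexicon if root_of(l, lexicon) == root)
```

In this sketch, `root_of("bellatrix", LEXICON)` returns `"bellum"`, and `family_of("rebello", LEXICON)` lists the six bell- lemmas while leaving amo and amator in a separate family.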

Work performed

The project set out to:
1) enrich an existing morphological analyser for the Latin language, Lemlat, with word formation information. The newest version of Lemlat can be downloaded at https://github.com/CIRCSE/LEMLAT3;
2) integrate the data within an interface (Word Manager2) that has already been applied to modern languages (English, German, Italian). This task was carried out by building a web application for querying WFL that mirrors the structure of the data as recorded in Word Manager. Since Word Manager is not freely available, and producing a web version of it would have been time-consuming and inefficient, the fellow and her supervisor agreed that it was more useful to offer the results of their work on a free, easy-to-access web platform built from scratch;
3) integrate the information extracted from the resulting derivational morphological dictionary into the morphological layer of annotation of the Index Thomisticus Treebank (IT-TB);
4) offer the results of the project work via a user-friendly project website (http://wfl.marginalia.it), which displays the derivational morphological dictionary through a web-based search interface.

Final results

The construction of the lexicon was highly interdisciplinary. It drew on expertise from the research environments of Classical languages, theoretical and computational linguistics, NLP, informatics and digital humanities. The innovative potential of this kind of research lies in this interdisciplinarity, because it favours collaboration between the information sciences and the humanities, which are still too often separated in current research. The resource has wide potential for application to textual data, of which its application to the IT-TB is only one example.

During the compilation of the WFL database, the resource was used for linguistic research that resulted in conference papers and presentations by the fellow and other scholars (see for example Litta, Eleonora, Marco Passarotti, and Paolo Ruffolo. 2017. Node Formation: Using Networks to Inspect Productivity in Affixal Derivation in Classical Latin. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 103-108. New York, NY, USA: ACM. doi:10.1145/3078081.3078092; and Marco Budassi and Marco Passarotti. 2017. -sc- Latin Verbs and Derivation. A Large-scale Exploration and Formal Analysis, presented at the International Colloquium on Latin Linguistics 2017, see the conference Book of Abstracts http://icll2017.badw.de/fileadmin/user_upload/Files/ICLL/Book_of_Abstracts_ICLL_2017.pdf, 18-19).
The potential impact of the resource lies mainly in the fact that a lexicon of this kind collects, for the first time, empirical data on word formation that could previously only be hypothesised. Word formation relations between lemmas were not previously described in any lexical resource, digital or otherwise. Such a resource is of the utmost importance for advancing textual analysis, because it makes it possible to access textual data not only by single word (raw text) or by lemma (lemmatised text), but also by morphological family and word formation process. This has important repercussions that reach beyond morphology into semantics, because words formed by the same word formation process share a common semantic core.
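
The three levels of access (word form, lemma, word formation family) can be sketched in Python as follows. The token list and the lemma-to-ancestor table are invented for illustration; they are not drawn from the IT-TB or from the WFL database, and `index_by_family` is a hypothetical helper, not part of Lemlat or of the WFL web application.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical mapping from each lemma to the ancestor of its word formation family.
FAMILY_ROOT: Dict[str, str] = {
    "bellum": "bellum",
    "rebello": "bellum",
    "bellicosus": "bellum",
    "pax": "pax",
}

# Made-up lemmatised text: (word form, lemma) pairs in running order.
TOKENS: List[Tuple[str, str]] = [
    ("bella", "bellum"),
    ("rebellant", "rebello"),
    ("pacem", "pax"),
    ("bellicosi", "bellicosus"),
]


def index_by_family(
    tokens: List[Tuple[str, str]], family_root: Dict[str, str]
) -> Dict[str, List[Tuple[int, str]]]:
    """Group token positions and word forms under their family ancestor.

    A lemma missing from `family_root` is treated as its own ancestor.
    """
    index: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for position, (form, lemma) in enumerate(tokens):
        root = family_root.get(lemma, lemma)
        index[root].append((position, form))
    return dict(index)
```

A search by lemma for bellum would match only the first token, whereas `index_by_family(TOKENS, FAMILY_ROOT)["bellum"]` also retrieves rebellant and bellicosi.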

The final workshop gave researchers working on similar projects for other languages (French, German, Croatian, Czech) the opportunity to meet and establish a working relationship. They are now looking for funding opportunities to create a link, and potentially a standard, between the projects, so that the resources can be connected and made to interoperate. Because Latin was the most widely used language across Europe until the modern era, we believe WFL can serve as a basis to which all the other resources can be connected.

Website & more info

More info: http://wfl.marginalia.it.