Thesis (Selection of subject) (version: 385)
Thesis details
Consistency of Linguistic Annotation
Thesis title in Czech: Konzistence lingvistických anotací
Thesis title in English: Consistency of Linguistic Annotation
Key words (in Czech): konzistence anotace, nekonzistence anotace, dobývání chyb, jazykově nezávislé, Universal Dependencies, projekt UD, syntax, morfologie
English key words: Annotation Consistency, Annotation Inconsistency, Error Mining, Language Independent, Universal Dependencies, UD Project, Syntax, Morphology
Academic year of topic announcement: 2018/2019
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Daniel Zeman, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 05.12.2018
Date of assignment: 04.02.2019
Confirmed by Study dept. on: 31.07.2020
Date and time of defence: 10.09.2020 09:00
Date of electronic submission: 31.07.2020
Date of submission of printed version: 07.01.2020
Date of proceeded defence: 10.09.2020
Opponents: doc. RNDr. Markéta Lopatková, Ph.D.
Guidelines
Natural language texts manually annotated with linguistic information are an indispensable resource for machine learning algorithms. However, even human annotators make occasional errors, or fail to take a consistent approach to borderline cases. The topic of the thesis is to explore methods that automatically identify potential inconsistencies in the annotation and, if possible, suggest corrections.
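The kind of detection method the guidelines describe could be sketched in the spirit of the variation-nucleus idea of de Marneffe et al. (2017), cited below: the same word form occurring in the same local context but with different labels is a candidate inconsistency. This is only an illustrative sketch; the function name and the (form, tag) data layout are assumptions, not part of the thesis.

```python
from collections import defaultdict

def find_pos_inconsistencies(sentences):
    """Flag (context, form) nuclei that receive more than one POS tag.

    `sentences` is a list of sentences, each a list of (form, tag) pairs.
    The context here is just the neighbouring word forms; an identical
    context with differing tags for the same form is flagged as a
    candidate annotation inconsistency (it may also be a legitimate
    ambiguity, which is why manual evaluation is still needed).
    """
    seen = defaultdict(set)  # (prev_form, form, next_form) -> observed tags
    for sent in sentences:
        forms = [f for f, _ in sent]
        for i, (form, tag) in enumerate(sent):
            prev = forms[i - 1] if i > 0 else "<s>"
            nxt = forms[i + 1] if i + 1 < len(forms) else "</s>"
            seen[(prev, form, nxt)].add(tag)
    return {ctx: tags for ctx, tags in seen.items() if len(tags) > 1}
```

Widening the context window makes the flags more precise but sparser; a one-word window as above trades precision for recall.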

The methods should be as language-neutral as possible. If language-specific or treebank-specific rules prove useful, they should be clearly separated from the portable components and evaluated separately.

The methods will be tested on one or more treebanks in the Universal Dependencies collection.
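UD treebanks are distributed in the CoNLL-U format: ten tab-separated columns per token, comment lines starting with "#", and blank lines separating sentences. A minimal reader sketch follows; the function name and the (form, UPOS) output layout are illustrative assumptions:

```python
def read_conllu(path):
    """Read a CoNLL-U file into a list of sentences of (form, upos) pairs.

    Comment lines (starting with '#') are skipped, as are multiword-token
    and empty-node lines, whose IDs contain '-' or '.' respectively.
    Columns used: 2 (FORM) and 4 (UPOS).
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if "-" not in cols[0] and "." not in cols[0]:
                    current.append((cols[1], cols[3]))
    if current:  # file may end without a trailing blank line
        sentences.append(current)
    return sentences
```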

Inconsistencies identified by the methods will be manually evaluated on a sample of the data. They will be categorized as annotation errors, unclear cases, problems in the underlying text, and false alarms.

Besides identifying errors, it will also be investigated whether, to what extent, and how reliably the errors can be corrected automatically. In particular, it will be examined whether the correction can be proposed automatically by the tool, as opposed to correction rules written by a human who inspects the errors the tool has identified.
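One hypothetical baseline for automatic correction is to propose the majority label observed for the same configuration elsewhere in the corpus, abstaining when no label clearly dominates. The function name and the 2/3 threshold below are assumptions chosen for illustration, not a method prescribed by the thesis:

```python
from collections import Counter

def suggest_correction(tags):
    """Suggest the majority tag among observed variants, if one dominates.

    `tags` lists the tags observed for one recurring configuration.
    Returns the dominant tag when it covers more than 2/3 of the
    occurrences, otherwise None (too balanced to decide automatically).
    """
    counts = Counter(tags)
    best, n = counts.most_common(1)[0]
    if n / sum(counts.values()) > 2 / 3:
        return best
    return None
```

A higher threshold trades coverage for reliability; how to set it, and how often the proposal is actually right, is exactly the kind of question the manual evaluation would answer.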
References
Marie-Catherine de Marneffe, Matias Grioni, Jenna Kanerva, Filip Ginter (2017): Assessing the Annotation Consistency of the Universal Dependencies Corpora. In: Proceedings of Depling 2017, Pisa, Italy.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, Giulia Venturi (2018): Dangerous Relations in Dependency Treebanks. In: Proceedings of TLT 16, Praha, Czechia.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, Maria Simi, Giulia Venturi (2018): Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank. In: Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), Bruxelles, Belgium.

Héctor Martínez Alonso, Daniel Zeman (2016): Universal Dependencies for the AnCora treebanks. In: Procesamiento del Lenguaje Natural, ISSN 1135-5948, 57, pp. 91-98.

Kira Droganova, Olga Lyashevskaya, Daniel Zeman (2018): Data Conversion and Consistency of Monolingual Corpora: Russian UD Treebanks. In: Proceedings of TLT 17, Oslo, Norway.
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html