Consistency of Linguistic Annotation
| Field | Value |
|---|---|
| Thesis title in Czech | Konzistence lingvistických anotací |
| Thesis title in English | Consistency of Linguistic Annotation |
| Key words (Czech) | konzistence anotace, nekonzistence anotace, dobývání chyb, jazykově nezávislé, Universal Dependencies, projekt UD, syntax, morfologie |
| English key words | Annotation Consistency, Annotation Inconsistency, Error Mining, Language Independent, Universal Dependencies, UD Project, Syntax, Morphology |
| Academic year of topic announcement | 2018/2019 |
| Thesis type | diploma thesis |
| Thesis language | English |
| Department | Institute of Formal and Applied Linguistics (32-UFAL) |
| Supervisor | doc. RNDr. Daniel Zeman, Ph.D. |
| Author | hidden |
| Date of registration | 05.12.2018 |
| Date of assignment | 04.02.2019 |
| Confirmed by Study dept. on | 31.07.2020 |
| Date and time of defence | 10.09.2020 09:00 |
| Date of electronic submission | 31.07.2020 |
| Date of submission of printed version | 07.01.2020 |
| Date of defence held | 10.09.2020 |
| Opponents | doc. RNDr. Markéta Lopatková, Ph.D. |
Guidelines
Natural language texts manually annotated with linguistic information are an indispensable resource for machine learning algorithms. However, even human annotators make occasional errors or fail to maintain a consistent approach to borderline cases. The topic of the thesis is to explore methods that automatically identify potential inconsistencies in annotation and, where possible, suggest corrections.
The methods should be as language-neutral as possible. If language-specific or treebank-specific rules prove useful, they should be clearly separated from the portable components and evaluated separately as well. The methods will be tested on one or more treebanks from the Universal Dependencies collection. Inconsistencies identified by the methods will be manually evaluated on a sample of the data and categorized as annotation errors, unclear cases, problems in the underlying text, or false alarms. Besides identifying errors, it will also be investigated whether, to what extent, and how reliably the errors can be corrected automatically, in particular whether the correction can be proposed by the tool itself (as opposed to rules written by a human who inspects the errors the tool identifies).
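To make the task concrete, one well-known language-independent approach (cf. de Marneffe et al. 2017 in the references below) looks for "variation nuclei": the same word form attached to the same head but labeled with different dependency relations across the treebank, which suggests one of the annotations may be inconsistent. The following is a minimal, hypothetical sketch of that idea; the function name, the simplified triple representation of a parsed sentence, and the toy data are illustrative assumptions, not part of the thesis assignment.

```python
# Hedged sketch: flag dependent/head word pairs that receive more than one
# dependency label across the corpus ("variation nucleus" style error mining).
from collections import defaultdict

def mine_inconsistencies(sentences):
    """sentences: list of sentences; each sentence is a list of
    (form, head_form, deprel) triples extracted from CoNLL-U data."""
    variation = defaultdict(set)
    for sent in sentences:
        for form, head_form, deprel in sent:
            # The "nucleus" is the dependent together with its head; richer
            # contexts (POS tags, surrounding words) would reduce false alarms.
            nucleus = (form.lower(), head_form.lower())
            variation[nucleus].add(deprel)
    # Report only nuclei annotated with more than one relation.
    return {n: sorted(labels) for n, labels in variation.items()
            if len(labels) > 1}

# Toy example: "back" attached to "gave" as both 'compound:prt' and 'advmod'.
data = [
    [("gave", "ROOT", "root"), ("back", "gave", "compound:prt")],
    [("gave", "ROOT", "root"), ("back", "gave", "advmod")],
]
print(mine_inconsistencies(data))
# {('back', 'gave'): ['advmod', 'compound:prt']}
```

Each flagged nucleus would then go to manual evaluation, where it is classified as a genuine annotation error, an unclear case, a problem in the underlying text, or a false alarm (e.g. a word that legitimately takes different relations in different contexts).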
References
- Marie-Catherine de Marneffe, Matias Grioni, Jenna Kanerva, Filip Ginter (2017): Assessing the Annotation Consistency of the Universal Dependencies Corpora. In: Proceedings of Depling 2017, Pisa, Italy.
- Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, Giulia Venturi (2018): Dangerous Relations in Dependency Treebanks. In: Proceedings of TLT 16, Praha, Czechia.
- Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, Maria Simi, Giulia Venturi (2018): Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank. In: Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), Bruxelles, Belgium.
- Héctor Martínez Alonso, Daniel Zeman (2016): Universal Dependencies for the AnCora treebanks. In: Procesamiento del Lenguaje Natural, ISSN 1135-5948, 57, pp. 91-98.
- Kira Droganova, Olga Lyashevskaya, Daniel Zeman (2018): Data Conversion and Consistency of Monolingual Corpora: Russian UD Treebanks. In: Proceedings of TLT 17, Oslo, Norway.