Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat
Thesis title in Czech: | Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat |
---|---|
Thesis title in English: | Creation of a Dependency Treebank for Yoruba using Parallel Data |
Key words: | závislostní syntax, universal dependencies, jazyky s nedostatečnými zdroji |
English key words: | dependency parsing, annotation, parallel data, projection, UDPipe, part-of-speech tagging, low-resource |
Academic year of topic announcement: | 2017/2018 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Daniel Zeman, Ph.D. |
Author: | hidden |
Date of registration: | 08.03.2018 |
Date of assignment: | 11.03.2018 |
Confirmed by Study dept. on: | 07.08.2018 |
Date and time of defence: | 11.09.2018 09:00 |
Date of electronic submission: | 21.07.2018 |
Date of submission of printed version: | 20.07.2018 |
Date of proceeded defence: | 11.09.2018 |
Opponents: | Mgr. Rudolf Rosa, Ph.D. |
Guidelines |
The goal of the thesis is to create a small dependency treebank for Yoruba, a language with very few pre-existing machine-readable resources. The treebank will follow the Universal Dependencies annotation standard; however, certain language-specific guidelines for Yoruba will have to be specified. Known techniques for porting resources from resource-rich languages will be tested, in particular the projection of annotation across parallel bilingual data. Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data will be verified manually in order to evaluate the annotation quality. |
References |
* Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325.
* Daniel Zeman and Philip Resnik. 2008. Cross-Language Parser Adaptation between Related Languages. In IJCNLP 2008 Workshop on NLP for Less Privileged Languages, pp. 35–42, International Institute of Information Technology, Hyderabad, India.
* Željko Agić, Dirk Hovy, and Anders Søgaard. 2015. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015).
* Universal Dependencies v2 guidelines (2014–2018): http://universaldependencies.org/ |
Preliminary scope of work in English |
The goal of this thesis is to create a dependency treebank for Yorùbá, a language with very few pre-existing machine-readable resources. The treebank follows the Universal Dependencies (UD) annotation standard, with certain language-specific guidelines specified for Yorùbá. Known techniques for porting resources from resource-rich languages were tested, in particular the projection of annotation across parallel bilingual data.
Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data was verified manually in order to evaluate the annotation quality. In addition, a UDPipe model was trained on the manually annotated data. |
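
The projection of annotation mentioned above can be illustrated with a minimal sketch. It is not the procedure used in the thesis: it assumes one-to-one word alignments only, whereas published approaches such as Hwa et al. (2005) also handle unaligned and many-to-many links. The function name `project_annotation`, the `Token` tuple layout, and the toy sentence pair are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (UPOS tag, head index, dependency relation); head index 0 marks the root.
Token = Tuple[str, int, str]


def project_annotation(
    source: List[Token],
    alignment: Dict[int, int],
    target_length: int,
) -> List[Optional[Token]]:
    """Copy UPOS, head and deprel from source tokens to aligned target tokens.

    Token indices are 1-based, as in CoNLL-U. `alignment` maps a source
    index to a target index (one-to-one links only, an assumption of this
    sketch). A target token stays None when it has no aligned source
    token or when the head of its source token is itself unaligned.
    """
    projected: List[Optional[Token]] = [None] * target_length
    for src_idx, (upos, src_head, deprel) in enumerate(source, start=1):
        tgt_idx = alignment.get(src_idx)
        if tgt_idx is None:
            continue  # source token has no counterpart in the target
        if src_head == 0:
            tgt_head = 0  # the root remains the root
        elif src_head in alignment:
            tgt_head = alignment[src_head]
        else:
            continue  # head is unaligned, so leave the token unannotated
        projected[tgt_idx - 1] = (upos, tgt_head, deprel)
    return projected


# Toy pair (illustrative, not from the thesis data):
# English "She eats rice" aligned word by word to Yoruba "Ó jẹ ìrẹsì".
english = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "obj")]
links = {1: 1, 2: 2, 3: 3}
yoruba = project_annotation(english, links, target_length=3)
```

Tokens left as `None` are the ones that would need a fallback heuristic or manual annotation, which is one reason a manually verified sample is useful for estimating the quality of projected trees.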