Thesis (Selection of subject) (version: 385)
Thesis details
Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat
Thesis title in Czech: Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat
Thesis title in English: Creation of a Dependency Treebank for Yoruba using Parallel Data
Key words (Czech): závislostní syntax, universal dependencies, jazyky s nedostatečnými zdroji (dependency syntax, Universal Dependencies, under-resourced languages)
English key words: dependency parsing, annotation, parallel data, projection, UDPipe, part-of-speech tagging, low-resource
Academic year of topic announcement: 2017/2018
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Daniel Zeman, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 08.03.2018
Date of assignment: 11.03.2018
Confirmed by Study dept. on: 07.08.2018
Date and time of defence: 11.09.2018 09:00
Date of electronic submission: 21.07.2018
Date of submission of printed version: 20.07.2018
Date the defence took place: 11.09.2018
Opponents: Mgr. Rudolf Rosa, Ph.D.
 
 
 
Guidelines
The goal of the thesis is to create a small dependency treebank for Yoruba, a language with very few pre-existing machine-readable resources. The treebank will follow the Universal Dependencies annotation standard; however, certain language-specific guidelines for Yoruba will have to be specified. Known techniques for porting resources from resource-rich languages will be tested, in particular projection of annotation across parallel bilingual data. Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data will be verified manually in order to evaluate the annotation quality.
References
* Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325.

* Daniel Zeman, Philip Resnik (2008): Cross-Language Parser Adaptation between Related Languages. In: IJCNLP 2008 Workshop on NLP for Less Privileged Languages, pp. 35-42, International Institute of Information Technology, Hyderabad, India

* Željko Agić, Dirk Hovy, and Anders Søgaard (2015). If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015).

* Universal Dependencies v2 guidelines (2014-2018): http://universaldependencies.org/
Preliminary scope of work in English
The goal of this thesis is to create a dependency treebank for Yorùbá, a language with very few pre-existing machine-readable resources. The treebank follows the Universal Dependencies (UD) annotation standard; certain language-specific guidelines for Yorùbá were specified. Known techniques for porting resources from resource-rich languages were tested, in particular projection of annotation across parallel bilingual data.
Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data was verified manually in order to evaluate the annotation quality. In addition, a UDPipe model was trained on the manually annotated data.
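The idea of projecting annotation across parallel data can be illustrated with a minimal sketch. It is not the pipeline used in the thesis. It assumes that word alignments are already available as "source-target" index pairs (the Pharaoh-style format many aligners emit) and that only part-of-speech tags are projected. The function names parse_alignment and project_tags, the "_" placeholder for unaligned tokens, and the toy tags are all hypothetical.

```python
from collections import Counter


def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index), e.g. '0-0 1-2'."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_tags(source_tags, target_len, alignment, unknown="_"):
    """Copy each source tag onto the target tokens it is aligned to.

    A target token aligned to several source tokens receives the most
    frequent tag among them; unaligned target tokens keep `unknown`.
    """
    votes = [Counter() for _ in range(target_len)]
    for src, tgt in alignment:
        # Ignore alignment points that fall outside either sentence.
        if 0 <= src < len(source_tags) and 0 <= tgt < target_len:
            votes[tgt][source_tags[src]] += 1
    return [v.most_common(1)[0][0] if v else unknown for v in votes]


# Toy example: four tagged source tokens, four target tokens, and
# three alignment points (the last target token is unaligned).
source_tags = ["PRON", "VERB", "DET", "NOUN"]
alignment = parse_alignment("0-0 1-1 3-2")
projected = project_tags(source_tags, 4, alignment)
print(projected)  # ['PRON', 'VERB', 'NOUN', '_']
```

Tokens left with the placeholder, and tokens whose aligned source words disagree, are natural candidates for the kind of manual verification the abstract mentions.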
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html