Multilingual Natural Language Processing - NPFL120

Title:	Mnohojazyčné počítačové zpracování jazyka
Guaranteed by:	Institute of Formal and Applied Linguistics (32-UFAL)
Faculty:	Faculty of Mathematics and Physics
Valid:	from 2019
Semester:	summer
E-Credits:	3
Hours per week, examination:	summer s.:1/1, MC [HT]
Capacity:	unlimited
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
State of the course:	taught
Language:	English, Czech
Teaching methods:	full-time
Additional information:	http://ufal.mff.cuni.cz/courses/npfl120

Guarantor:	RNDr. Daniel Zeman, Ph.D.; Mgr. Rudolf Rosa, Ph.D.; doc. RNDr. Ondřej Bojar, Ph.D.

Annotation

Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (03.05.2019)

The course focuses on multilingual aspects of natural language processing. It explains both the challenges and the benefits of doing NLP in a multilingual setting and presents possible approaches. We will cover two settings: monolingual methods applied to multiple languages, where the variety among languages has to be handled, and truly multilingual and cross-lingual approaches, which use resources in multiple languages at once. We will review and work with a range of freely available multilingual resources, both plain-text and annotated. The course takes the form of a practical seminar in the computer lab.

Course completion requirements

Last update: Mgr. Rudolf Rosa, Ph.D. (13.10.2023)

Active participation in the classes and submission of homework assignments in the form of source code and text reports are required to complete the course. The quality of the homework solutions will be reflected in the final grade.

Scoring of assignments:

each assignment is graded with 0 to 5 points

3 points correspond to a basic solution, fulfilling the assignment but not doing much more

4 points correspond to a solution that is in some way better than the basic completion of the assignment (e.g. the solution is particularly clever, goes beyond the assignment, contains a deeper analysis, provides interesting insights, etc.); a solution that is exceptionally good in this sense may receive 5 points

2 points correspond to a solution that has some shortcomings (e.g. it only solves part of the assignment, it solves a different problem than what was assigned, the approach is not correct, the interpretation of the results is not correct, etc.); a solution that has very serious or extensive shortcomings may receive 1 point

0 points correspond to an unsolved assignment

The final grade is given by the average of all the assignments:

1 for an average of at least 3.0

2 for an average of at least 2.5

3 for an average of at least 2.0

4 for an average below 2.0
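
The thresholds above can be sketched as a small function. This is only an illustration, not an official grading tool: it assumes the average is a plain arithmetic mean over all assignments with no rounding, and that an unsolved assignment simply enters the mean as 0 points. The name `final_grade` is hypothetical.

```python
def final_grade(points):
    """Map per-assignment points (0 to 5 each) to a final grade 1..4.

    Assumption: the average is the unrounded arithmetic mean
    of all assignments, with unsolved assignments counted as 0.
    """
    if not points:
        raise ValueError("at least one assignment is required")
    if any(p < 0 or p > 5 for p in points):
        raise ValueError("each assignment is graded with 0 to 5 points")

    average = sum(points) / len(points)
    if average >= 3.0:
        return 1
    if average >= 2.5:
        return 2
    if average >= 2.0:
        return 3
    return 4


if __name__ == "__main__":
    # Four assignments averaging exactly 3.0 points.
    print(final_grade([3, 4, 2, 3]))  # prints 1
```

Under these assumptions, consistently handing in basic solutions (3 points each) is enough for grade 1, while a single unsolved assignment has to be offset by above-basic solutions elsewhere.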

Literature

Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (29.01.2019)

HASPELMATH, Martin. The world atlas of language structures. Oxford University Press, 2005.

PETROV, Slav; DAS, Dipanjan; MCDONALD, Ryan. A universal part-of-speech tagset. In: LREC. 2012. p. 2089-2096.

ZEMAN, Daniel. Reusable Tagset Conversion Using Tagset Drivers. In: LREC. 2008. p. 213-218.

ZEMAN, Daniel, et al. HamleDT: To Parse or Not to Parse? In: LREC. 2012. p. 2735-2741.

MCDONALD, Ryan; LERMAN, Kevin; PEREIRA, Fernando. Multilingual dependency analysis with a two-stage discriminative parser. In: CoNLL. 2006. p. 216-220.

NIVRE, Joakim, et al. Universal dependencies v1: A multilingual treebank collection. In: LREC. 2016. p. 1659-1666.

DAS, Dipanjan; PETROV, Slav. Unsupervised part-of-speech tagging with bilingual graph-based projections. In: ACL-HLT. 2011. p. 600-609.

ZEMAN, Daniel; RESNIK, Philip. Cross-Language Parser Adaptation between Related Languages. In: IJCNLP. 2008. p. 35-42.

TIEDEMANN, Jörg. Parallel Data, Tools and Interfaces in OPUS. In: LREC. 2012. p. 2214-2218.

AGIĆ, Željko; HOVY, Dirk; SØGAARD, Anders. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In: ACL-IJCNLP. 2015. p. 268-272.

AGIĆ, Željko, et al. Multilingual projection for parsing truly low-resource languages. In: TACL. 2016. p. 301-312.

SØGAARD, Anders. Data point selection for cross-language adaptation of dependency parsers. In: ACL-HLT. 2011. p. 682-686.

TIEDEMANN, Jörg; AGIĆ, Željko; NIVRE, Joakim. Treebank translation for cross-lingual parser induction. In: CoNLL. 2014. p. 130-140.

FORCADA, Mikel L., et al. Apertium: a free/open-source platform for rule-based machine translation. In: Machine translation. 2011. p. 127-144.

JOHNSON, Melvin, et al. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. In: arXiv preprint arXiv:1611.04558. 2016.

ZEMAN, Daniel. The World of Tokens, Tags and Trees. Studies in Computational and Theoretical Linguistics, vol. 19. ÚFAL, Praha, 2018. ISBN 978-80-88132-09-7.

Syllabus

Last update: RNDr. Daniel Zeman, Ph.D. (05.05.2022)

Introduction to multilinguality (what it is, why it is hard to deal with, what it is good for, WALS)

Plain text (alphabets, transliteration, tokenization, language identification, language similarity)

Machine translation for multilingual processing (Apertium, OPUS, Bible, Watchtower, alignment algorithms, multilingual machine translation)

Morphology (morphological variability of languages, morphological annotation, Universal POS tags, Universal features, tagset conversions, cross-lingual tagging)

Syntax (syntactic variability of languages, harmonization of treebank annotations, Universal Dependencies; multilingual parsing, cross-lingual parsing)

Word embeddings, multilingual embeddings, contextual vector representations