Thesis (Selection of subject)

Velký mnohojazyčný korpus

Thesis title in Czech:	Velký mnohojazyčný korpus
Thesis title in English:	Large Multilingual Corpus
Key words:	jazykový korpus, distribuované zpracování
English key words:	language corpus, distributed processing
Academic year of topic announcement:	2010/2011
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	prof. Ing. Zdeněk Žabokrtský, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	22.03.2011
Date of assignment:	22.03.2011
Date and time of defence:	06.09.2011 00:00
Date of electronic submission:	05.08.2011
Date of submission of printed version:	05.08.2011
Date defence took place:	06.09.2011
Opponents:	RNDr. Miroslav Spousta

Guidelines

The aim of the thesis is to design and implement a software system that builds a multilingual unannotated corpus from texts available on the WWW.
The corpus will cover several hundred languages. The minimum total size of the texts for each individual language will be ten million words. Given the
large volume of data, the corpus will be built in a distributed manner. In addition to a quantitative and qualitative analysis of the collected material,
the thesis will also document solutions to the partial problems that arise for individual languages or language groups.

References

Wynne, M. (editor): Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. 2005

Lin, J.; Dyer, C.: Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. 2010

Český národní korpus: Úvod a příručka uživatele. FF UK. 2000

Conway, D.: Perl Best Practices. Standards and Styles for Developing Maintainable Code. O'Reilly Media. 2005