Robust Parsing of Noisy Content

Thesis title in Czech:	Robustní parsing zašuměného obsahu
Thesis title in English:	Robust Parsing of Noisy Content
Key words:	závislostní syntax, syntaktická analýza, parsing, doménová adaptace
English key words:	dependency syntax, parsing, domain adaptation
Academic year of topic announcement:	2012/2013
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Daniel Zeman, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	06.11.2012
Date of assignment:	08.11.2012
Confirmed by Study dept. on:	27.11.2012
Date and time of defence:	02.09.2013 00:00
Date of electronic submission:	02.08.2013
Date of submission of printed version:	02.08.2013
Date of proceeded defence:	02.09.2013
Opponents:	RNDr. David Mareček, Ph.D.

Guidelines

While parsing accuracy on in-domain text has improved steadily in recent years, out-of-domain and grammatically noisy text remains an obstacle and often leads to a significant drop in parsing accuracy. In this thesis, we focus on parsing noisy content, such as user-generated content on services like Twitter.

We will compare various strategies for adapting to noise and explore whether a text-normalization step based on machine translation (MT) techniques and parallel data, which has been applied successfully to other tasks such as machine translation and part-of-speech tagging, can also be used for parsing.

We will further explore semi-supervised and unsupervised methods that do not require parallel data and investigate how a pre-processing step can be integrated with a dependency parser model (MST parser). We will test our approach by comparing various parser configurations on existing datasets for dependency parsing of noisy content (e.g. Twitter messages).

References

McDonald, Ryan, et al. "Non-projective dependency parsing using spanning tree algorithms." Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.

Foster, Jennifer, et al. "#hardtoparse: POS Tagging and Parsing the Twitterverse." Proceedings of the Workshop on Analyzing Microtext (AAAI 2011). 2011.

Gadde, Phani, L. V. Subramaniam, and Tanveer A. Faruquie. "Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results." Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data. ACM, 2011.

Kaufmann, Max, and Jugal Kalita. "Syntactic normalization of Twitter messages." International Conference on Natural Language Processing, Kharagpur, India. 2010.

Petrov, Slav, and Ryan McDonald. "Overview of the 2012 shared task on parsing the web." Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL). 2012.