Robust Parsing of Noisy Content
Thesis title in Czech: | Robustní parsing zašuměného obsahu |
---|---|
Thesis title in English: | Robust Parsing of Noisy Content |
Key words: | závislostní syntax, syntaktická analýza, parsing, doménová adaptace |
English key words: | dependency syntax, parsing, domain adaptation |
Academic year of topic announcement: | 2012/2013 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Daniel Zeman, Ph.D. |
Author: | hidden |
Date of registration: | 06.11.2012 |
Date of assignment: | 08.11.2012 |
Confirmed by Study dept. on: | 27.11.2012 |
Date and time of defence: | 02.09.2013 00:00 |
Date of electronic submission: | 02.08.2013 |
Date of submission of printed version: | 02.08.2013 |
Date of proceeded defence: | 02.09.2013 |
Opponents: | RNDr. David Mareček, Ph.D. |
Guidelines |
While parsing performance on in-domain text has improved steadily in recent years, out-of-domain and grammatically noisy texts remain an obstacle and often lead to significant decreases in parsing accuracy. In this thesis, we focus on parsing noisy content, such as user-generated content on services like Twitter.
We will compare various strategies for adapting to noise and explore whether a text-normalization step based on MT techniques and parallel data, an approach that has been applied successfully to other tasks such as machine translation and part-of-speech tagging, can also be used for parsing. We will further explore semi-supervised and unsupervised methods that do not require parallel data, and investigate how a pre-processing step can be integrated with a dependency parser model (MST parser). We will test our approach by comparing various parser configurations on existing datasets for dependency parsing of noisy content (e.g. Twitter messages). |
References |
McDonald, Ryan, et al. "Non-projective dependency parsing using spanning tree algorithms." Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.
Foster, Jennifer, et al. "#hardtoparse: POS Tagging and Parsing the Twitterverse." Proceedings of the Workshop on Analyzing Microtext (AAAI 2011). 2011.
Gadde, Phani, L. V. Subramaniam, and Tanveer A. Faruquie. "Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results." Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data. ACM, 2011.
Kaufmann, Max, and Jugal Kalita. "Syntactic normalization of Twitter messages." International Conference on Natural Language Processing, Kharagpur, India. 2010.
Petrov, Slav, and Ryan McDonald. "Overview of the 2012 shared task on parsing the web." Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL). 2012. |