Structured Data Extraction from Unstructured Text
Thesis title in Czech: | Structured Data Extraction from Unstructured Text |
---|---|
Thesis title in English: | Structured Data Extraction from Unstructured Text |
Key words: | extrakcia štrukturovaných dát, extrakčné pravidlá, (semi)automatická indukcia wrapperov |
English key words: | structured data extraction, extraction rules, (semi)automatic wrapper induction |
Academic year of topic announcement: | 2011/2012 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Software Engineering (32-KSI) |
Supervisor: | doc. Mgr. Martin Nečaský, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 04.11.2011 |
Date of assignment: | 07.11.2011 |
Confirmed by Study dept. on: | 01.12.2011 |
Date and time of defence: | 27.05.2013 11:00 |
Date of electronic submission: | 11.04.2013 |
Date of submission of printed version: | 12.04.2013 |
Date of defence held: | 27.05.2013 |
Opponents: | RNDr. Michal Kopecký, Ph.D. |
Guidelines |
The author of this thesis will address the problem of automatic structured data extraction from semi-formatted plain text [1,2,3]. The input consists of a collection of text documents, an ontology describing the data domain for which the data should be extracted, and a configuration file with extraction rules. Basic methods are currently being implemented as part of a student software project. In the thesis, the author will extend these basic methods with new ones to improve performance. He will also compare the success of his data extraction method with methods published in the current literature. |
References |
[1] Dayne Freitag, Andrew McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 584-589, July 30-August 3, 2000.
[2] AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, and Ba-Quy Vuong. Information extraction challenges in managing unstructured data. SIGMOD Rec. 37, 4 (March 2009), pp. 14-20.
[3] Ronen Feldman, James Sanger. The text mining handbook: advanced approaches in analyzing unstructured data. 2006. ISBN 978-0521836579. |