Thesis (Selection of subject) (version: 368)
Thesis details
Structured Data Extraction from Unstructured Text
Thesis title in Czech: Structured Data Extraction from Unstructured Text
Thesis title in English: Structured Data Extraction from Unstructured Text
Key words: extrakcia štrukturovaných dát, extrakčné pravidlá, (semi)automatická indukcia wrapperov
English key words: structured data extraction, extraction rules, (semi)automatic wrapper induction
Academic year of topic announcement: 2011/2012
Thesis type: diploma thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: doc. Mgr. Martin Nečaský, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 04.11.2011
Date of assignment: 07.11.2011
Confirmed by Study dept. on: 01.12.2011
Date and time of defence: 27.05.2013 11:00
Date of electronic submission: 11.04.2013
Date of submission of printed version: 12.04.2013
Date of defence held: 27.05.2013
Opponents: RNDr. Michal Kopecký, Ph.D.
 
 
 
Guidelines
The author of this thesis will address the problem of automatic structured data extraction from semi-formatted plain text [1,2,3]. The input consists of a collection of text documents, an ontology describing the domain for which the data should be extracted, and a configuration file with extraction rules. Basic methods are currently being implemented as part of a student software project. In the thesis, the author will extend these basic methods with new ones to improve extraction performance. The author will also compare the success rate of the resulting extraction method with methods published in the current literature.
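The following is a minimal sketch of the setup described in the guidelines, not the method developed in the thesis. It assumes that the ontology can be reduced to a set of property names and that each extraction rule is a regular expression whose first group captures one property value; the names `ExtractionRule`, `extract`, and `precision_recall`, and the sample document, are hypothetical. Precision and recall are shown as one possible way to compare extraction success against a gold standard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: a regex whose first group fills one ontology property."""
    property_name: str
    pattern: str


def extract(documents, ontology_properties, rules):
    """Apply every rule to every document; return one record (dict) per document."""
    records = []
    for text in documents:
        record = {}
        for rule in rules:
            if rule.property_name not in ontology_properties:
                continue  # skip rules that target properties outside the ontology
            match = re.search(rule.pattern, text, flags=re.MULTILINE)
            if match:
                record[rule.property_name] = match.group(1).strip()
        records.append(record)
    return records


def precision_recall(extracted, gold):
    """Compare extracted (property, value) pairs with a gold standard for one document."""
    extracted_pairs = set(extracted.items())
    gold_pairs = set(gold.items())
    correct = len(extracted_pairs & gold_pairs)
    precision = correct / len(extracted_pairs) if extracted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Toy input (assumed for illustration only).
documents = ["Title: Sample report\nDate: 2013-05-27\n"]
ontology_properties = {"title", "date"}
rules = [
    ExtractionRule("title", r"^Title:\s*(.+)$"),
    ExtractionRule("date", r"^Date:\s*(\d{4}-\d{2}-\d{2})$"),
]

records = extract(documents, ontology_properties, rules)
print(records)
```

In a real system the rules would be loaded from the configuration file and the ontology from its own format; both are kept in memory here to keep the sketch self-contained.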
References
[1] Dayne Freitag, Andrew McCallum. 2000. Information Extraction with HMM Structures Learned by Stochastic Optimization. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, July 30-August 03, 2000, 584-589.

[2] AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, and Ba-Quy Vuong. 2009. Information extraction challenges in managing unstructured data. SIGMOD Rec. 37, 4 (March 2009), 14-20.

[3] Ronen Feldman, James Sanger. 2006. The text mining handbook: advanced approaches in analyzing unstructured data. ISBN 978-0521836579.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html