Course, academic year 2023/2024
Language Technologies for Research in Humanities - NPFL131
Title: Jazykové technologie pro výzkum v humanitních oborech
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2022
Semester: summer
E-Credits: 2
Hours per week, examination: summer s.:0/2, C [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Additional information: http://ufal.mff.cuni.cz/courses/npfl131
Guarantor: Mgr. Pavel Straňák, Ph.D.
Class: Informatics, Master's - elective
Classification: Informatics > Computer and Formal Linguistics
Incompatibility: NPFL098
Interchangeability: NPFL098
Is incompatible with: NPFL098
Is interchangeable with: NPFL098
Annotation -
Last update: RNDr. Jiří Mírovský, Ph.D. (16.05.2022)
You will learn to use tools and procedures efficiently for the automatic processing of large-scale texts in different languages. The skills acquired will facilitate independent scientific work with language data in any area of the humanities.
Course completion requirements -
Last update: RNDr. Jiří Mírovský, Ph.D. (16.05.2022)

Credit is conditional on active participation in class, submitting all homework, and earning more than 70% of the points from these assignments.

Literature -
Last update: RNDr. Jiří Mírovský, Ph.D. (16.05.2022)

presentations from previous years: http://ufal.mff.cuni.cz/courses/NPFL131

Learning Perl, 8th Edition (use at least 5th Edition)

Pro Git

Learning the bash Shell

Linux Pocket Guide

Syllabus -
Last update: RNDr. Jiří Mírovský, Ph.D. (16.05.2022)

Using large texts, we will learn the basic methods of text processing needed to obtain non-trivial information. For Czech we will use texts of works by Karel Čapek, for Classical Chinese selected texts from https://github.com/kanripo, for other languages works according to the focus of the students.

importance and statistical properties of Big Data

unix shell; most basic commands

more unix commands and basic Perl to manipulate texts

text editors

quantitative analysis of text

comparing texts and visualizing differences

search using regular expressions

using regular expressions to batch edit text

diacritic removal, sentence segmentation, tokenization

getting information on Chinese characters from Unihan database

rule-based automatic part of speech identification

creating your own corpus

"NLP workflow engines" - GATE, OpenNLP, Treex

calling REST APIs

using UDPipe and selecting the appropriate model when more than one is available for the language

visualization of analysis and results
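
As an illustration of three of the topics listed above (diacritic removal, tokenization, and quantitative analysis of text), here is a minimal sketch. It is written in Python for brevity, whereas the course itself works with the unix shell and Perl. The function names and the sample sentence are hypothetical and are not taken from the course materials, and the regex tokenizer is a deliberate simplification, not the method taught in class.

```python
import re
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of word characters.

    A simplification: \\w+ matches Unicode letters, digits and the
    underscore, so punctuation is dropped and hyphenated words are split.
    """
    return re.findall(r"\w+", text.lower())


def word_frequencies(text: str) -> Counter:
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    # Made-up sample sentence, used only to exercise the functions.
    sample = "Žluťoučký kůň úpěl. Kůň úpěl ďábelské ódy."
    print(strip_diacritics(sample))
    print(word_frequencies(sample).most_common(2))
```

For a real corpus, such as a text file with a work by Karel Čapek, the same functions could be applied to the contents read with open(path, encoding="utf-8").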

 