Thesis (Selection of subject)

Measuring lexical surprisal in legal texts

Thesis title in Czech:	Měření lexikálního překvapení v právních textech
Thesis title in English:	Measuring lexical surprisal in legal texts
Key words:	Zpracování přirozeného jazyka|Transformery|GPT|Lexikální překvapení|Informační entropie|Porozumitelnost textu|Výpočetní lingvistika
English key words:	Natural language processing|Transformers|GPT|Lexical surprise|Information entropy|Text comprehension|Computational linguistics
Academic year of topic announcement:	2023/2024
Thesis type:	Bachelor's thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	Mgr. Ivana Kvapilíková, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	30.04.2024
Date of assignment:	30.04.2024
Confirmed by Study dept. on:	18.07.2024
Date and time of defence:	20.06.2025 09:00
Date of electronic submission:	07.05.2025
Date of submission of printed version:	09.01.2025
Opponents:	Mgr. Jan Hajič, Ph.D.

Guidelines

Legal texts are often incomprehensible and hard for the public to read. According to the available literature, there is a relationship between the comprehensibility of a text and its predictability (so-called lexical surprisal). The aim of this thesis is to create a tool for measuring the predictability of a text, with which a user (the author of the text) can see which passages may be incomprehensible to readers and so has the opportunity to revise them.

1. A tool for measuring the lexical surprisal of written text will be created. A pretrained language model will be used to evaluate the probabilities of individual words.
2. The application will include an API and a frontend client that displays the probability of each word.
3. The tool will be applied to original and simplified Czech legal texts from the KUK 0.0 corpus.
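
To illustrate the measurement in point 1, the following is a minimal sketch, not the implementation used in the thesis. It assumes surprisal is defined in the usual way as the negative base-2 logarithm of the probability a language model assigns to a token given its context, and that a word split into several subword tokens receives the sum of its tokens' surprisals. The function names, the input format, and the probabilities in the example are hypothetical; in a real tool the probabilities would come from the pretrained language model.

```python
import math
from typing import List, Tuple


def token_surprisal(prob: float) -> float:
    """Return the surprisal in bits of a token with model probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prob)


def word_surprisals(
    tokens: List[Tuple[str, float, bool]]
) -> List[Tuple[str, float]]:
    """Merge subword tokens into words and sum their surprisals.

    Each input item is (text, probability, starts_new_word). This input
    format is an assumption made for this sketch, not a model's real output.
    """
    words: List[Tuple[str, float]] = []
    for text, prob, starts_new_word in tokens:
        surprisal = token_surprisal(prob)
        if starts_new_word or not words:
            # First token of a word: open a new entry.
            words.append((text, surprisal))
        else:
            # Continuation token: extend the last word and add its surprisal.
            prev_text, prev_surprisal = words[-1]
            words[-1] = (prev_text + text, prev_surprisal + surprisal)
    return words


# Hypothetical probabilities, invented purely for illustration.
example = [
    ("The", 0.5, True),
    ("lessee", 0.125, True),
    ("under", 0.25, True),
    ("takes", 0.5, False),
]
result = word_surprisals(example)
for word, bits in result:
    print(f"{word}\t{bits:.2f} bits")
```

Under these assumptions, a frontend client such as the one in point 2 could highlight words whose surprisal exceeds a chosen threshold. Summing over subword tokens follows from the chain rule of probability, since the probability of a word is the product of the conditional probabilities of its tokens.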

References

Silvie Cinková et al., 2023. LiFR-Law. Corpus of Paraphrased Czech Administrative Texts with Reading Comprehension for Readability Studies (2023-10-08), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-5225.

Silvie Cinková et al., 2024. Linguistic Factors in the Readability of Czech Administrative and Legal Texts. In Review.

Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. 2019. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, Florence, Italy. Association for Computational Linguistics.

Irene Fernandez Monsalve, Stefan L. Frank, and Gabriella Vigliocco. 2012. Lexical surprisal as a general predictor of reading time. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 398–408, Avignon, France. Association for Computational Linguistics.

Milan Straka, Jakub Náplava, Jana Straková, David Samuel. 2021. RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model. In: Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham.