Identifikace kognátů v mluvených datech
Thesis title in Czech: | Identifikace kognátů v mluvených datech |
---|---|
Thesis title in English: | Cognate Identification in Spoken Language Data |
Key words: | kognát|zpracování řeči |
English key words: | cognate|speech processing |
Academic year of topic announcement: | 2025/2026 |
Thesis type: | diploma thesis |
Thesis language: | |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | prof. Ing. Zdeněk Žabokrtský, Ph.D. |
Author: | hidden |
Date of registration: | 02.05.2025 |
Date of assignment: | 02.05.2025 |
Confirmed by Study dept. on: | 02.05.2025 |
Guidelines |
Cognates, in a broad sense, are words in different languages with a common etymological origin. Cognates play an important role in understanding historical language development [1] and improving multilingual applications like speech translation and cross-lingual information retrieval [2].
While most existing work focuses on cognate identification in written texts [3], identifying cognates in spoken data (such as audio recordings or transcriptions) presents additional challenges due to phonetic variation, pronunciation differences, and potential transcription errors. On the other hand, leveraging spoken input can help bypass issues associated with diverse writing systems and orthographic conventions across languages.
The aim of the thesis is to experimentally explore the possibilities of automatic cognate discovery across languages based on audio recordings. The student will collect suitable data, in particular audio recordings in several languages and existing databases of cognates in written form, and will become familiar with existing software tools for automatic transcription in these languages. Furthermore, the student will propose, design and build a system (using machine learning frameworks) for the automatic extraction of cognates from spoken data, either directly from the audio signal or from its transcription. The student will also design an appropriate evaluation metric to measure the accuracy of the identified cognates and will experimentally assess the performance of different methods. |
References |
[1] Wu, Mei-Shin, and Johann-Mattis List. "Annotating cognates in phylogenetic studies of Southeast Asian languages." Language Dynamics and Change 13.2 (2023): 161-197.
[2] Fourrier, Clémentine, Rachel Bawden, and Benoît Sagot. "Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?" Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021.
[3] Batsuren, Khuyagbaatar, Gábor Bella, and Fausto Giunchiglia. "A large and evolving cognate database." Language Resources and Evaluation (2022): 1-25. |