Thesis (Selection of subject)


Cognate Identification in Spoken Language Data

Thesis title in Czech:	Identifikace kognátů v mluvených datech
Thesis title in English:	Cognate Identification in Spoken Language Data
Key words:	kognát | zpracování řeči
English key words:	cognate | speech processing
Academic year of topic announcement:	2025/2026
Thesis type:	diploma thesis
Thesis language:
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	prof. Ing. Zdeněk Žabokrtský, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	02.05.2025
Date of assignment:	02.05.2025
Confirmed by Study dept. on:	02.05.2025

Guidelines

Cognates, in a broad sense, are words in different languages that share a common etymological origin. They play an important role in understanding historical language development [1] and in improving multilingual applications such as speech translation and cross-lingual information retrieval [2].

While most existing work focuses on cognate identification in written texts [3], identifying cognates in spoken data (such as audio recordings or transcriptions) presents additional challenges due to phonetic variation, pronunciation differences, and potential transcription errors. On the other hand, leveraging spoken input can help bypass issues associated with diverse writing systems and orthographic conventions across languages.

The aim of the thesis is to experimentally explore the possibilities of automatic cognate discovery across languages based on audio recordings. The student will collect suitable data, in particular audio recordings in several languages and existing databases of cognates in written form, and will become familiar with existing software tools for automatic transcription in these languages. Furthermore, the student will propose, design and build a system (using machine learning frameworks) for the automatic extraction of cognates from spoken data, either directly from the audio signal or from its transcription. The student will also design an appropriate evaluation metric to measure the accuracy of the identified cognates and will experimentally assess the performance of different methods.
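
As a rough illustration of the transcription-based route and of one possible evaluation, the sketch below scores word pairs from two languages by normalized edit distance over their transcriptions and compares the resulting candidate pairs with a gold set using precision, recall and F1. It is only a baseline under stated assumptions, not the method prescribed by the assignment: the function names, the threshold of 0.6 and the toy transcriptions are all hypothetical, and an actual system might instead work on the audio signal or use learned sound correspondences.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def find_cognate_candidates(words_a, words_b, threshold=0.6):
    """Return all cross-language pairs whose similarity reaches the threshold.

    The threshold value is an arbitrary assumption for this sketch.
    """
    return {
        (a, b)
        for a, b in product(words_a, words_b)
        if similarity(a, b) >= threshold
    }


def precision_recall_f1(predicted: set, gold: set):
    """Pair-level precision, recall and F1 against a gold cognate set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy, hand-made transcriptions (hypothetical data, not real ASR output).
    language_a = ["naxt", "hunt"]
    language_b = ["nait", "dog"]
    gold_pairs = {("naxt", "nait")}

    predicted_pairs = find_cognate_candidates(language_a, language_b)
    print(predicted_pairs)
    print(precision_recall_f1(predicted_pairs, gold_pairs))
```

Plain edit distance treats every symbol substitution as equally costly, so it cannot distinguish regular sound correspondences from chance resemblance; that limitation is one reason the assignment leaves the choice of method and of evaluation metric open.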

References

[1] Wu, Mei-Shin, and Johann-Mattis List. "Annotating cognates in phylogenetic studies of Southeast Asian languages." Language Dynamics and Change 13.2 (2023): 161-197.

[2] Fourrier, Clémentine, Rachel Bawden, and Benoît Sagot. "Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?" Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021.

[3] Batsuren, Khuyagbaatar, Gábor Bella, and Fausto Giunchiglia. "A large and evolving cognate database." Language Resources and Evaluation (2022): 1-25.