Thesis topics (Thesis selection)

Automatic Identification of Creativity for use in Machine Translation

Thesis title in Czech:	Automatická identifikace kreativity pro použití ve strojovém překladu
Thesis title in English:	Automatic Identification of Creativity for use in Machine Translation
Keywords (Czech):	kreativita|NLP|strojový překlad
Keywords (English):	creativity|NLP|machine translation
Academic year of topic announcement:	2024/2025
Thesis type:	master's thesis
Thesis language:
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	Mgr. Rudolf Rosa, Ph.D.
Author:	hidden - assigned and confirmed by the Student Affairs Office
Date of registration:	06.03.2025
Date of assignment:	06.03.2025
Date of confirmation by the Student Affairs Office:	06.03.2025

Guidelines

PRESENT STUDY

The concept of creativity is difficult to define in a concrete and objective manner. Its subjective nature makes it one of the most challenging dimensions of human intelligence to encode into AI models [3], which in turn makes it difficult to pinpoint spans of text that exhibit creative features.

Research on the identification of creativity in translation has been carried out at the University of Groningen, focusing on the impact of Machine Translation on the experience of professional translators, as well as that of readers [2,1]. The authors find that using MT to any extent during the translation process significantly restricts creativity, and they define two key aspects of a creative product in translation: novelty and usefulness. The former corresponds to the conception of new solutions in the target text for problems posed by the original text, whilst the latter measures whether the translation is acceptable in the intended communication context. This research also finds that the amount of creativity in translations decreases as MT involvement in the translation process increases. Other research suggests that this observation holds for all AI-generated texts [7].

Previous master’s theses following in the steps of Guerberof-Arenas and Toral explored the effectiveness of using parallel multilingual corpora to detect creative shifts [5,6,4]. Creative shifts occur in parallel texts where the translation of a certain span does not correspond word for word to its original text, suggesting that creativity was used to derive the final translation. By measuring the distance between the aligned embeddings of texts and their corresponding professional translations, these works detect where the translator has chosen to translate non-literally, which usually corresponds to a creative solution. These methods seem to work well, but have difficulty differentiating between creative shifts and reproductions. Under the assumption that reproductions do not occur within texts written in the same language, using a monolingual parallel corpus of differently localized versions of media could potentially provide some insight into what differentiates a creative shift from a reproduction.
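
The distance-based idea described above can be illustrated with a minimal sketch. It is not the implementation used in the cited theses: the vectors are hand-made toy values standing in for real aligned embeddings, and the function names and the 0.5 threshold are assumptions chosen only for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def flag_candidate_shifts(aligned_pairs, threshold=0.5):
    """Return indices of aligned spans whose embeddings lie far apart.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return [
        i for i, (source_vec, target_vec) in enumerate(aligned_pairs)
        if cosine_distance(source_vec, target_vec) > threshold
    ]

# Toy vectors standing in for embeddings of aligned source/translation spans.
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # close: near-literal rendering
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # far apart: candidate for review
]
candidates = flag_candidate_shifts(pairs)
```

A large distance alone only marks a span as non-literal; as noted above, it does not by itself separate creative shifts from reproductions, so flagged spans would still need further analysis.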

Measures such as Creativity Score [1] and Creativity Index [7] seem to be successful in capturing how creative a text is as a whole, and provide evidence that human- and AI-generated texts differ significantly in how creative they are. The Creativity Index method shows promise for identifying creative units in scenarios where no reference translation is available. This method relies on n-gram frequency in a large corpus to estimate novelty, assuming that less frequent n-grams indicate greater creativity. However, this could make the approach blind to the challenge that idiomatic phrases pose for Machine Translation: such figures of speech are likely to appear as n-grams in other works, yet still require creative skill to be translated correctly.
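
The concern about idioms can be made concrete with a small sketch. This is not the Creativity Index of [7], only a simplified stand-in that scores a text by the fraction of its n-grams unseen in a reference corpus; the function names, the tiny reference text, and the choice of trigrams are all assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text, reference_counts, n=3, max_count=0):
    """Fraction of n-grams in `text` seen at most `max_count` times in the reference."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    rare = sum(1 for gram in grams if reference_counts[gram] <= max_count)
    return rare / len(grams)

# Tiny stand-in for a large reference corpus (whitespace tokenization only).
reference = "it is raining cats and dogs outside . it is raining again today ."
reference_counts = Counter(ngrams(reference.lower().split(), 3))

idiom_score = ngram_novelty("it is raining cats and dogs", reference_counts)
novel_score = ngram_novelty("the sky is weeping silver needles", reference_counts)
```

Under these assumptions the idiom receives the lowest possible novelty because every one of its trigrams already occurs in the reference text, even though translating it well still demands a creative solution, while the unseen phrasing receives the highest score.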

This study aims to contribute towards the development of a system capable of detecting units of creative potential in scenarios where there is no professional translation available. Such a system might prove useful for MT evaluation as a way to determine the certainty of a candidate translation, and could potentially enhance the post-editing experience for professional translators.

More precisely, the goal of this investigation is to research different ways of identifying spans of text that exhibit higher levels of creativity.

Research Questions:
RQ1 - How can we automate the detection of units of creative potential when only the source text is available?
RQ2 - Can using a monolingual parallel corpus featuring different localized versions of a text help distinguish reproductions from creative shifts?

DISCUSSION OF METHODS

Data collection

For obtaining different localizations of translated works in the same language, audiovisual media in Spanish is likely the largest resource available. Most films and series are translated into both European Spanish and a neutralized version of Latin American Spanish, thus providing a comparable corpus.

Data annotation

The annotation process is to be completed manually by the student. A potential avenue for simplifying this process and maximizing the volume of annotated data is to automatically highlight candidate spans of text, which can then be manually assessed. This avenue must however be studied with caution, as using metrics too similar to those used in the subsequent steps could artificially produce good results. One possible way to tackle this is to apply the measures used in \cite{netflix_thesis} with the original version of the work as the reference during annotation, whilst using both translated versions into the same language in the actual procedure.

Detection of creative shifts using monolingual data

The methods used here will be similar to those used in the previous master's theses [5,6,4], with the main difference in this aspect being the nature of the data.
These methods include static, contextual and syntactic similarity metrics, which measure the distances between words, sentences, and syntactic structures between texts in a parallel corpus, respectively. The combination of these could potentially be explored during the development of this thesis.
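
One simple way such a combination could be explored is a weighted mean of the per-metric distances. The sketch below is only an assumption about how this might look, not a method taken from the cited theses: the function name, the metric values, and the weights are hypothetical, and each distance is assumed to be normalized to the range 0 to 1.

```python
def combined_distance(distances, weights=None):
    """Weighted mean of per-metric distances, each assumed to lie in [0, 1]."""
    if not distances:
        raise ValueError("at least one metric is required")
    if weights is None:
        # Equal weighting by default.
        weights = {name: 1.0 for name in distances}
    total_weight = sum(weights[name] for name in distances)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * value for name, value in distances.items())
    return weighted_sum / total_weight

# Hypothetical distances for one span compared across two localizations.
span = {"static": 0.2, "contextual": 0.4, "syntactic": 0.6}
equal = combined_distance(span)
no_static = combined_distance(
    span, {"static": 0.0, "contextual": 1.0, "syntactic": 1.0}
)
```

Whether a linear combination is appropriate at all, and how the weights should be chosen, would have to be determined empirically on the annotated data.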

Detection of creativity based on source text only

The methods used for this aspect of the investigation are not yet defined. Taking inspiration from the Creativity Index measure [7] is a possibility; however, other potential methods might be identified during the background reading step.

References

[1] Ana Guerberof-Arenas and Antonio Toral. “Creativity in translation: Machine translation as a constraint for literary texts”. In: Translation Spaces 11.2 (Mar. 2022), pp. 184–212. issn: 2211-372X. doi: 10.1075/ts.21025.gue. url: http://dx.doi.org/10.1075/ts.21025.gue.
[2] Ana Guerberof-Arenas and Antonio Toral. “The impact of post-editing and machine translation on creativity and reading experience”. In: Translation Spaces 9.2 (Nov. 2020), pp. 255–282. issn: 2211-372X. doi: 10.1075/ts.20035.gue. url: http://dx.doi.org/10.1075/ts.20035.gue.
[3] William Hasselberger and Micah Lott. “Where lies the grail? AI, common sense, and human practical intelligence”. In: Phenomenology and the Cognitive Sciences (Oct. 2023), pp. 1–22. doi: 10.1007/s11097-023-09942-x.
[4] Mayzie van der Heden. “An Automatic Analysis of Reproductions in English to Dutch Subtitles for Different Film Genres”. Master’s Thesis. University of Groningen, 2021.
[5] Berend Andreas Korporaal. “How Creative Are Translated Subtitles? Automating the Detection of Creative Shifts in English-to-Dutch Subtitles”. Master’s Thesis. University of Groningen, 2023.
[6] Jiahui Liang. “Exploring Automation in The Analysis of Creative Shifts Found in News Translations: Machine vs Human Translation”. Master’s Thesis. University of Groningen, 2021.
[7] Ximing Lu et al. “AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text”. 2025. arXiv: 2410.04265 [cs.CL]. url: https://arxiv.org/abs/2410.04265.