Bilingual lexicon induction for the Sinhala-English language pair
Date
2024
Authors
Wickramasinghe, K.
Abstract
Apart from a dwindling number of monolingual embedding studies, originating predominantly from low-resource domains, multilingual embeddings have become the de facto choice because they adapt to code-mixed language use and allow multilingual documents to be processed in a language-agnostic manner. Monolingual embedding alignment techniques are nevertheless still used for low-resource languages such as Sinhala. The main focus of this research is to improve Sinhala word embedding alignment. We experiment with the available monolingual embedding alignment techniques to obtain the best Sinhala-English embedding alignment achieved so far. To that end, we first introduce a large-scale Sinhala-English word-level parallel dictionary that facilitates word-level cross-lingual tasks. Next, we align the Sinhala and English embedding spaces using the available embedding alignment techniques. We find that Bilingual Lexicon Induction (BLI) does not measure the true degree of alignment in some cases, and we propose solutions for these cases. We propose a novel stem-based BLI technique for evaluating two aligned embedding spaces, which takes into account the inflected nature of languages, as opposed to the prevalent word-based BLI techniques. Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of alignment, especially when performing BLI on multilingual embedding models. We extend our experiments to 8 other languages and show the validity of our methods. In addition, using all of these languages, we conduct a comparative study of the quality of aligned monolingual embeddings, multilingual embeddings, and hybrid-aligned embeddings. We have published two conference papers during the research so far, and we have released all resources on GitHub for open-source use by the research community.
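The contrast between word-based and stem-based BLI scoring mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the scoring procedure, so everything here is an assumption made for illustration: the function names (`bli_precision_at_1`, `toy_stem`), the placeholder source words, the two-dimensional toy vectors, and the suffix-stripping stemmer are all hypothetical, and the source vectors are assumed to have already been mapped into the target space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def nearest_target(vec, target_emb):
    """Return the target word whose vector is most similar to vec."""
    return max(target_emb, key=lambda w: cosine(vec, target_emb[w]))


def bli_precision_at_1(source_emb, target_emb, gold, stem=None):
    """Precision@1 over a gold dictionary.

    With stem=None, the retrieved word must match a gold translation exactly.
    With a stemmer, a match on the stem counts as a hit (assumed scoring rule).
    """
    hits = 0
    for src, translations in gold.items():
        pred = nearest_target(source_emb[src], target_emb)
        if stem is None:
            hits += pred in translations
        else:
            hits += stem(pred) in {stem(t) for t in translations}
    return hits / len(gold)


# Toy data (illustrative only): source vectors are assumed to be already aligned.
source_emb = {"src_walk": [0.9, 0.1], "src_book": [0.0, 1.0]}
target_emb = {
    "walk": [1.0, 0.0],
    "walked": [0.9, 0.1],
    "book": [0.0, 1.0],
    "books": [0.1, 0.9],
}
gold = {"src_walk": {"walk"}, "src_book": {"book"}}

print(bli_precision_at_1(source_emb, target_emb, gold))                 # 0.5
print(bli_precision_at_1(source_emb, target_emb, gold, stem=toy_stem))  # 1.0
```

In this toy case the nearest neighbour of `src_walk` is the inflected form `walked`, which word-based scoring counts as a miss even though the spaces are well aligned; matching on stems counts it as a hit.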
Citation
Wickramasinghe, K. (2024). Bilingual lexicon induction for the Sinhala-English language pair [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. https://dl.lib.uom.lk/handle/123/23971
