Bilingual lexicon induction for the Sinhala-English language pair

dc.contributor.advisor: De Silva, N
dc.contributor.author: Wickramasinghe, K
dc.date.accept: 2024
dc.date.accessioned: 2025-08-18T04:39:19Z
dc.date.issued: 2024
dc.description.abstract: Apart from a dwindling number of monolingual embedding studies, originating predominantly from low-resource domains, multilingual embedding has become the de facto choice because it adapts to code-mixed language use and allows multilingual documents to be processed in a language-agnostic manner. Yet monolingual embedding alignment techniques are still used for low-resource languages like Sinhala. Our main focus is to improve Sinhala word embedding alignment. In this research, we experiment with the available monolingual embedding alignment techniques to obtain the best Sinhala-English embedding alignment achieved so far. To that end, we first introduce a large-scale Sinhala-English word-level parallel dictionary that facilitates any word-level cross-lingual task. Next, we align the Sinhala and English embedding spaces using the available embedding alignment techniques. We identify that Bilingual Lexicon Induction (BLI) does not measure the true degree of alignment in some cases, and we propose solutions for those cases. We propose a novel stem-based BLI technique for evaluating two aligned embedding spaces, which takes into account the inflected nature of languages, as opposed to the prevalent word-based BLI techniques. Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of alignment, especially when performing BLI on multilingual embedding models. We extend our experiments to 8 other languages and demonstrate the validity of our methods. In addition, using all the languages, we conduct a comparative study of the quality of aligned monolingual embeddings, multilingual embeddings, and hybrid-aligned embeddings. We have published two conference papers during the research so far, and we have released all the resources on GitHub for open-source use by the research community.
dc.identifier.accno: TH5664
dc.identifier.citation: Wickramasinghe, K. (2024). Bilingual lexicon induction for the Sinhala-English language pair [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. https://dl.lib.uom.lk/handle/123/23971
dc.identifier.degree: MSc in Computer Science
dc.identifier.department: Department of Computer Science & Engineering
dc.identifier.faculty: Engineering
dc.identifier.uri: https://dl.lib.uom.lk/handle/123/23971
dc.language.iso: en
dc.subject: WORD EMBEDDING ALIGNMENT
dc.subject: MULTILINGUALISM
dc.subject: BILINGUAL LEXICON INDUCTION
dc.subject: ALIGNMENT DICTIONARIES
dc.subject: LANGUAGE-AGNOSTIC PROCESSING
dc.subject: MULTILINGUAL EMBEDDING
dc.subject: INFLECTED LANGUAGES
dc.subject: MEASURE ALIGNMENT
dc.subject: SINHALA LANGUAGE
dc.subject: COMPUTER SCIENCE AND ENGINEERING-Dissertation
dc.subject: MSc in Computer Science
dc.title: Bilingual lexicon induction for the Sinhala-English language pair
dc.type: Thesis-Abstract
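
The contrast the abstract draws between word-based and stem-based BLI can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy embeddings, the `toy_stem` suffix stripper, the `src_*` placeholder source words, and every function name below are assumptions introduced only for illustration. The idea shown is that a retrieved translation counts as correct under stem-based scoring if it shares a stem with a gold translation, so an inflected form of the right word is not penalised.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_target(src_vec, tgt_emb):
    """Return the target word whose vector is closest to src_vec."""
    return max(tgt_emb, key=lambda w: cosine(src_vec, tgt_emb[w]))

def toy_stem(word):
    """Hypothetical stemmer: strips a few English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bli_precision_at_1(src_emb, tgt_emb, gold, normalise=lambda w: w):
    """Precision@1: fraction of source words whose nearest target word
    matches a gold translation after applying `normalise` to both sides."""
    hits = 0
    for src_word, translations in gold.items():
        predicted = nearest_target(src_emb[src_word], tgt_emb)
        if normalise(predicted) in {normalise(t) for t in translations}:
            hits += 1
    return hits / len(gold)

# Toy, already-aligned embedding spaces (made-up 2-d vectors).
src_emb = {"src_walk": [1.0, 0.1], "src_book": [0.0, 1.0]}
tgt_emb = {
    "walk": [0.9, 0.3],
    "walked": [1.0, 0.1],
    "book": [0.1, 1.0],
    "books": [0.3, 0.9],
}
gold = {"src_walk": {"walk"}, "src_book": {"book"}}

word_score = bli_precision_at_1(src_emb, tgt_emb, gold)
stem_score = bli_precision_at_1(src_emb, tgt_emb, gold, normalise=toy_stem)
print(word_score, stem_score)
```

In this toy setup the nearest neighbour of `src_walk` is the inflected form `walked`, which word-based scoring marks as wrong and stem-based scoring accepts. A real evaluation would need a proper stemmer or lemmatiser for each language rather than the suffix list assumed here.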

Files

Original bundle

Name: TH5664-1.pdf
Size: 123.96 KB
Format: Adobe Portable Document Format
Description: Pre-text

Name: TH5664-2.pdf
Size: 100.22 KB
Format: Adobe Portable Document Format
Description: Post-text

Name: TH5664.pdf
Size: 1.04 MB
Format: Adobe Portable Document Format
Description: Full-thesis

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission