Bilingual lexical induction for English-Sinhala

dc.contributor.advisor: Ranathunga S
dc.contributor.advisor: Jayasena S
dc.contributor.author: Liyanage A
dc.date.accept: 2022
dc.date.accessioned: 2022
dc.date.available: 2022
dc.date.issued: 2022
dc.description.abstract: Bilingual lexicons are important resources for Natural Language Processing (NLP) applications such as Neural Machine Translation and Named Entity Recognition (NER). However, Low Resource Languages (LRLs) such as Sinhala lack such resources. Manually producing millions of word translations between languages is exhausting and almost impossible. An increasingly popular approach to creating such resources automatically is Bilingual Lexical Induction (BLI). We created the first-ever BLI model for the English-Sinhala language pair using the existing popular model VecMap. No prior work has sufficiently evaluated how the nature of the dataset, the type of embedding model used, and the type of evaluation dictionary used affect the results of BLI. In this thesis, we fill this gap with an extensive set of experiments on these factors for Sinhala and English. Furthermore, we enhance the pre-trained embeddings for this application by applying sophisticated post-processing approaches: linear transformation and effective dimensionality reduction are applied to the pre-trained embeddings before cross-lingual word embeddings between Sinhala and English are obtained with VecMap. We have also introduced dimensionality reduction into the VecMap algorithm itself: the algorithm starts its first iteration in a low dimension to initialize a better solution, and the dimensionality of the embeddings is then increased in each iteration until it reaches the original dimension in the final iteration. By learning a better initial solution, and hence an improved final solution, we were able to improve the results considerably. Finally, we combined the post-processing step with the modified VecMap model to obtain an even better mapping for the Sinhala-English language pair, which in turn can be used in task-specific downstream systems to improve the results of the entire system. [en_US]
dc.identifier.accno: TH5143 [en_US]
dc.identifier.citation: Liyanage, A. (2022). Bilingual lexical induction for English-Sinhala [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22103
dc.identifier.degree: MSc In Computer Science and Engineering [en_US]
dc.identifier.department: Department of Computer Science and Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/22103
dc.language.iso: en [en_US]
dc.subject: EMBEDDING MODELS [en_US]
dc.subject: BILINGUAL LEXICON INDUCTION [en_US]
dc.subject: EMBEDDING SPACES [en_US]
dc.subject: COMPUTER SCIENCE-Dissertations [en_US]
dc.subject: COMPUTER SCIENCE AND ENGINEERING-Dissertations [en_US]
dc.title: Bilingual lexical induction for English-Sinhala [en_US]
dc.type: Thesis-Full-text [en_US]
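
The abstract describes obtaining cross-lingual embeddings by mapping one embedding space onto the other and then inducing word translations. As a rough illustration only, the sketch below shows the basic building block that mapping-based BLI methods typically rely on: an orthogonal (Procrustes) map learned from a seed dictionary, followed by nearest-neighbour retrieval. It is not VecMap's implementation and does not reproduce the thesis's post-processing or its increasing-dimensionality schedule. All names (learn_orthogonal_map, induce_lexicon, demo) and the synthetic data are assumptions made for this example.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimising ||src @ W - tgt||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def induce_lexicon(src_mapped, tgt):
    """For each mapped source vector, return the index of the target
    vector with the highest cosine similarity."""
    s = src_mapped / np.linalg.norm(src_mapped, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return (s @ t.T).argmax(axis=1)


def demo(n_words=200, dim=16, n_seed=50, noise=0.01, seed=0):
    """Toy setup (an assumption, not thesis data): the target space is a
    noisy rotation of the source space, and word i translates to word i."""
    rng = np.random.default_rng(seed)
    src = rng.normal(size=(n_words, dim))
    rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    tgt = src @ rotation + noise * rng.normal(size=(n_words, dim))

    # Learn the map from the first n_seed pairs only (the seed dictionary).
    w = learn_orthogonal_map(src[:n_seed], tgt[:n_seed])

    # Induce translations for every word and score against the known pairing.
    predicted = induce_lexicon(src @ w, tgt)
    accuracy = float((predicted == np.arange(n_words)).mean())
    return accuracy, w


if __name__ == "__main__":
    accuracy, _ = demo()
    print(f"toy induction accuracy: {accuracy:.2f}")
```

In this toy setting the seed dictionary is given; an unsupervised or self-learning method would instead alternate between learning the map and re-inducing the dictionary.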

Files

Original bundle

Name: TH5143-1.pdf
Size: 3.79 MB
Format: Adobe Portable Document Format
Description: Pre-Text

Name: TH5143-2.pdf
Size: 139.28 KB
Format: Adobe Portable Document Format
Description: Post-Text

Name: TH5143.pdf
Size: 3.52 MB
Format: Adobe Portable Document Format
Description: Full-theses