dc.contributor.advisor Ranathunga S
dc.contributor.advisor Jayasena S
dc.contributor.author Liyanage A
dc.date.accessioned 2022
dc.date.available 2022
dc.date.issued 2022
dc.identifier.citation Liyanage, A. (2022). Bilingual lexical induction for English-Sinhala [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22103
dc.identifier.uri http://dl.lib.uom.lk/handle/123/22103
dc.description.abstract Bilingual lexicons are important resources for Natural Language Processing (NLP) applications such as Neural Machine Translation and Named Entity Recognition (NER). However, Low Resource Languages (LRLs) such as Sinhala lack such resources. Manually producing millions of word translations between languages is exhausting and almost impossible. An increasingly popular approach to creating such resources automatically is Bilingual Lexical Induction (BLI). We created the first-ever BLI model for the English-Sinhala language pair using the existing popular model VecMap. No prior work has sufficiently evaluated how the nature of the dataset, the type of embedding model used, and the type of evaluation dictionary used affect the results of BLI. We fill this gap in this thesis by running an extensive set of experiments on these factors for Sinhala and English BLI. Furthermore, we enhance the pre-trained embeddings to suit the application by applying sophisticated post-processing approaches: linear transformation and effective dimensionality reduction are applied to the pre-trained embeddings before obtaining cross-lingual word embeddings between Sinhala and English with VecMap. We have also introduced dimensionality reduction into the VecMap algorithm itself: the algorithm starts the first iteration at a low dimension to initialize a better solution, and the dimensionality of the embeddings is then increased in each iteration until the embeddings reach the original dimension in the final iteration. By learning a better initial solution, and hence an improved final solution, we were able to improve the results considerably. Finally, we combined the post-processing step with the modified VecMap model to obtain an even better mapping for the Sinhala-English language pair, which in turn can be used in task-specific downstream systems to improve the results of the entire system. en_US
dc.language.iso en en_US
dc.subject EMBEDDING MODELS en_US
dc.subject BILINGUAL LEXICON INDUCTION en_US
dc.subject EMBEDDING SPACES en_US
dc.subject COMPUTER SCIENCE-Dissertations en_US
dc.subject COMPUTER SCIENCE AND ENGINEERING-Dissertations en_US
dc.title Bilingual lexical induction for English-Sinhala en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc In Computer Science and Engineering en_US
dc.identifier.department Department of Computer Science and Engineering en_US
dc.date.accept 2022
dc.identifier.accno TH5143 en_US

