Bilingual lexical induction for English-Sinhala

Liyanage A

dc.contributor.advisor	Ranathunga S
dc.contributor.advisor	Jayasena S
dc.contributor.author	Liyanage A
dc.date.accept	2022
dc.date.accessioned	2022
dc.date.available	2022
dc.date.issued	2022
dc.description.abstract	Bilingual lexicons are important resources for Natural Language Processing (NLP) applications such as Neural Machine Translation and Named Entity Recognition (NER). However, Low Resource Languages (LRLs) such as Sinhala lack such resources. Manually producing millions of word translations between languages is exhausting and almost impossible. An increasingly popular approach to creating such resources automatically is Bilingual Lexical Induction (BLI). We created the first-ever BLI model for the English-Sinhala language pair using the existing popular model VecMap. No prior work has sufficiently evaluated how the nature of the dataset, the type of embedding model used, and the type of evaluation dictionary used affect the results of BLI. In this thesis, we fill this gap by running an extensive set of experiments on these factors for BLI between Sinhala and English. Furthermore, we enhance the pre-trained embeddings to suit the application by applying sophisticated post-processing approaches. Linear transformation and effective dimensionality reduction are applied to the pre-trained embeddings before cross-lingual word embeddings between Sinhala and English are obtained by applying VecMap. We have also introduced dimensionality reduction into the VecMap algorithm itself: the algorithm starts the first iteration from a low dimension to initialize a better solution. Subsequently, the dimensionality of the embeddings is increased in each iteration until the embeddings reach the original dimension in the final iteration. We were able to improve the results considerably by learning a better initial solution and hence an improved final solution. Finally, we combined the post-processing step with the modified VecMap model to obtain an even better mapping for the Sinhala-English language pair, which in turn can be used in task-specific downstream systems to improve the results of the entire system.	en_US
dc.identifier.accno	TH5143	en_US
dc.identifier.citation	Liyanage, A. (2022). Bilingual lexical induction for English-Sinhala [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22103
dc.identifier.degree	MSc In Computer Science and Engineering	en_US
dc.identifier.department	Department of Computer Science and Engineering	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/22103
dc.language.iso	en	en_US
dc.subject	EMBEDDING MODELS	en_US
dc.subject	BILINGUAL LEXICON INDUCTION	en_US
dc.subject	EMBEDDING SPACES	en_US
dc.subject	COMPUTER SCIENCE-Dissertations	en_US
dc.subject	COMPUTER SCIENCE AND ENGINEERING-Dissertations	en_US
dc.title	Bilingual lexical induction for English-Sinhala	en_US
dc.type	Thesis-Full-text	en_US

Files

Original bundle

Name: TH5143-1.pdf
Size: 3.79 MB
Format: Adobe Portable Document Format
Description: Pre-Text

Name: TH5143-2.pdf
Size: 139.28 KB
Format: Adobe Portable Document Format
Description: Post-Text

Name: TH5143.pdf
Size: 3.52 MB
Format: Adobe Portable Document Format
Description: Full-theses

Collections

Master of Science in Computer science and Engineering