Bilingual lexical induction for English-Sinhala
dc.contributor.advisor | Ranathunga S | |
dc.contributor.advisor | Jayasena S | |
dc.contributor.author | Liyanage A | |
dc.date.accept | 2022 | |
dc.date.accessioned | 2022 | |
dc.date.available | 2022 | |
dc.date.issued | 2022 | |
dc.description.abstract | Bilingual lexicons are important resources for Natural Language Processing (NLP) applications such as Neural Machine Translation and Named Entity Recognition (NER). However, Low Resource Languages (LRLs) such as Sinhala lack such resources. Manually producing millions of word translations between languages is exhausting and almost impossible. An increasingly popular approach to creating such resources automatically is Bilingual Lexical Induction (BLI). We created the first-ever BLI model for the English-Sinhala language pair using the existing popular model VecMap. No prior work has sufficiently evaluated how the nature of the dataset, the type of embedding model used, and the type of evaluation dictionary used affect the results of BLI. In this thesis, we fill this gap with an extensive set of experiments on these factors for Sinhala and English. Furthermore, we enhance the pre-trained embeddings to suit the application by applying sophisticated post-processing approaches. Linear transformation and effective dimensionality reduction are applied to the pre-trained embeddings before cross-lingual word embeddings between Sinhala and English are obtained by applying VecMap. We have also introduced dimensionality reduction into the VecMap algorithm, so that the algorithm starts its first iteration from a low dimension to initialize a better solution. Subsequently, the dimensionality of the embeddings is increased in each iteration until the embeddings reach the original dimension in the final iteration. We were able to improve the results considerably by learning a better initial solution and hence an improved final solution.
Finally, we combined the post-processing step with the modified VecMap model to obtain an even better mapping for the Sinhala-English language pair, which in turn can be used in task-specific downstream systems to improve the results of the entire system. | en_US |
dc.identifier.accno | TH5143 | en_US |
dc.identifier.citation | Liyanage, A. (2022). Bilingual lexical induction for English-Sinhala [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22103 | |
dc.identifier.degree | MSc In Computer Science and Engineering | en_US |
dc.identifier.department | Department of Computer Science and Engineering | en_US |
dc.identifier.faculty | Engineering | en_US |
dc.identifier.uri | http://dl.lib.uom.lk/handle/123/22103 | |
dc.language.iso | en | en_US |
dc.subject | EMBEDDING MODELS | en_US |
dc.subject | BILINGUAL LEXICON INDUCTION | en_US |
dc.subject | EMBEDDING SPACES | en_US |
dc.subject | COMPUTER SCIENCE-Dissertations | en_US |
dc.subject | COMPUTER SCIENCE AND ENGINEERING-Dissertations | en_US |
dc.title | Bilingual lexical induction for English-Sinhala | en_US |
dc.type | Thesis-Full-text | en_US |
Files
Original bundle
- Name: TH5143-1.pdf
- Size: 3.79 MB
- Format: Adobe Portable Document Format
- Description: Pre-Text
- Name: TH5143-2.pdf
- Size: 139.28 KB
- Format: Adobe Portable Document Format
- Description: Post-Text
- Name: TH5143.pdf
- Size: 3.52 MB
- Format: Adobe Portable Document Format
- Description: Full-theses