Institutional Repository, University of Moratuwa.

Exploiting multilingual contextual embeddings for Sinhala text classification


dc.contributor.advisor Ranathunga, S
dc.contributor.advisor Jayasena, S
dc.contributor.author Dhananjaya, GV
dc.date.accessioned 2024-12-02T04:30:24Z
dc.date.available 2024-12-02T04:30:24Z
dc.date.issued 2022
dc.identifier.citation Dhananjaya, G.V. (2022). Exploiting multilingual contextual embeddings for Sinhala text classification [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22968
dc.identifier.uri http://dl.lib.uom.lk/handle/123/22968
dc.description.abstract Language models that produce contextual representations (or embeddings) for text are widely used in Natural Language Processing (NLP) applications. In particular, large pre-trained Transformer-based models are popular among NLP practitioners. Nevertheless, existing research on, and the inclusion of, low-resource languages (languages that largely lack publicly available datasets and curated corpora) in these modern NLP paradigms is meager. Their performance on downstream NLP tasks lags behind that of high-resource languages such as English. Training a monolingual language model for a particular language is a straightforward approach in modern NLP, but it is resource-consuming and can be unworkable for a low-resource language where even monolingual training data is insufficient. Multilingual models that support an array of languages are an alternative that circumvents this issue. Yet the representation of low-resource languages considerably lags in multilingual models as well. In this work, our first aim is to evaluate the performance of existing Multilingual Language Models (MMLMs) that support low-resource Sinhala, along with some available monolingual Sinhala models, on an array of different text classification tasks. We also train our own monolingual model for Sinhala. From these experiments, we identify that the multilingual XLM-R model yields better results in many instances. Based on those results, we propose a novel technique, based on an explicit cross-lingual alignment of sentiment words using an augmentation method, to improve the sentiment classification task. With it, we improve the results of the multilingual XLM-R model for sentiment classification in the Sinhala language. Along the way, we also test the aforementioned method on a few other Indic languages (Tamil, Bengali) to measure its robustness across languages.
Keywords: Multilingual language models, Multilingual embeddings, Text classification, Sentiment analysis, Low-resource languages, Sinhala language en_US
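The augmentation idea the abstract describes, explicitly aligning sentiment words across languages so a multilingual encoder such as XLM-R sees a cross-lingual anchor, can be illustrated with a minimal sketch. The lexicon, the word-level substitution strategy, and the `augment` helper below are illustrative assumptions for exposition only, not the thesis author's actual implementation.

```python
# Minimal sketch: lexicon-based cross-lingual sentiment-word augmentation.
# Each Sinhala sentiment word found in a (hypothetical) bilingual lexicon is
# followed by its English equivalent, giving a multilingual model an explicit
# cross-lingual alignment signal in the training text.

# Toy Sinhala -> English sentiment lexicon (illustrative entries only).
SENTIMENT_LEXICON = {
    "හොඳයි": "good",
    "නරකයි": "bad",
}

def augment(sentence: str, lexicon: dict) -> str:
    """Insert the aligned English word after each Sinhala sentiment word."""
    tokens = []
    for tok in sentence.split():
        tokens.append(tok)
        if tok in lexicon:
            tokens.append(lexicon[tok])  # aligned English sentiment word
    return " ".join(tokens)

# "The movie is good" -> sentiment word gets its English anchor appended.
print(augment("චිත්‍රපටය හොඳයි", SENTIMENT_LEXICON))
```

The augmented sentences would then be used as additional fine-tuning data for the sentiment classifier; words absent from the lexicon pass through unchanged.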
dc.language.iso en en_US
dc.subject MULTILINGUAL LANGUAGE MODELS
dc.subject MULTILINGUAL EMBEDDINGS
dc.subject LOW-RESOURCE LANGUAGES
dc.subject SINHALA LANGUAGE
dc.subject TEXT CLASSIFICATION
dc.subject SENTIMENT ANALYSIS
dc.subject INFORMATION TECHNOLOGY - Dissertation
dc.subject COMPUTER SCIENCE & ENGINEERING - Dissertation
dc.subject MSc (Major Component Research)
dc.title Exploiting multilingual contextual embeddings for Sinhala text classification en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree Master of Science (Major Component of Research) en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2022
dc.identifier.accno TH5185 en_US

