Exploiting multilingual contextual embeddings for Sinhala text classification

Dhananjaya, GV

dc.contributor.advisor	Ranathunga, S
dc.contributor.advisor	Jayasena, S
dc.contributor.author	Dhananjaya, GV
dc.date.accept	2022
dc.date.accessioned	2024-12-02T04:30:24Z
dc.date.available	2024-12-02T04:30:24Z
dc.date.issued	2022
dc.description.abstract	Language models that produce contextual representations (or embeddings) for text have been commonly used in Natural Language Processing (NLP) applications. Particularly, Transformer-based, large pre-trained models are popular among NLP practitioners. Nevertheless, the existing research and the inclusion of low-resource languages (languages that primarily lack publicly available datasets and curated corpora) in these modern NLP paradigms are meager. Their performance for downstream NLP tasks lags compared to that of high-resource languages such as English. Training a monolingual language model for a particular language is a straightforward approach in modern NLP but it is resource-consuming and could be unworkable for a low-resource language where even monolingual training data is insufficient. Multilingual models that can support an array of languages are an alternative to circumvent this issue. Yet, the representation of low-resource languages considerably lags in multilingual models as well. In this work, our first aim is on evaluating the performance of existing Multilingual Language Models (MMLM) that support low-resource Sinhala and some available monolingual Sinhala models for an array of different text classification tasks. We also train our own monolingual model for Sinhala. From those experiments, we identify that the multilingual XLM-R model yields better results in many instances. Based on those results we propose a novel technique based on an explicit cross-lingual alignment of sentiment words using an augmentation method to improve the sentiment classification task. There, we improve the results of a multilingual XLM-R model for sentiment classification in Sinhala language. Along the way, we also test the aforementioned method on a few other Indic languages (Tamil, Bengali) to measure its robustness across languages.
Keywords: Multilingual language models, Multilingual embeddings, Text classification, Sentiment analysis, Low-resource languages, Sinhala language	en_US
dc.identifier.accno	TH5185	en_US
dc.identifier.citation	Dhananjaya, G.V. (2022). Exploiting multilingual contextual embeddings for Sinhala text classification [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22968
dc.identifier.degree	Master of Science (Major Component of Research)	en_US
dc.identifier.department	Department of Computer Science & Engineering	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/22968
dc.language.iso	en	en_US
dc.subject	MULTILINGUAL LANGUAGE MODELS
dc.subject	MULTILINGUAL EMBEDDINGS
dc.subject	LOW-RESOURCE LANGUAGES
dc.subject	SINHALA LANGUAGE
dc.subject	TEXT CLASSIFICATION
dc.subject	SENTIMENT ANALYSIS
dc.subject	INFORMATION TECHNOLOGY - Dissertation
dc.subject	COMPUTER SCIENCE & ENGINEERING - Dissertation
dc.subject	MSc (Major Component Research)
dc.title	Exploiting multilingual contextual embeddings for Sinhala text classification	en_US
dc.type	Thesis-Full-text	en_US

Files

Original bundle

Name:: TH5185-1.pdf
Size:: 3 MB
Format:: Adobe Portable Document Format
Description:: Pre-text

Name:: TH5185-2.pdf
Size:: 14.82 MB
Format:: Adobe Portable Document Format
Description:: Post-text

Name:: TH5185.pdf
Size:: 53.06 MB
Format:: Adobe Portable Document Format
Description:: Full-thesis

License bundle

Name:: license.txt
Size:: 1.71 KB
Format:: Item-specific license agreed upon to submission
Description:

Collections

Master of Science By Research