Exploiting multilingual contextual embeddings for Sinhala text classification

dc.contributor.advisorRanathunga, S
dc.contributor.advisorJayasena, S
dc.contributor.authorDhananjaya, GV
dc.date.accept2022
dc.date.accessioned2024-12-02T04:30:24Z
dc.date.available2024-12-02T04:30:24Z
dc.date.issued2022
dc.description.abstractLanguage models that produce contextual representations (or embeddings) for text are commonly used in Natural Language Processing (NLP) applications. In particular, large pre-trained Transformer-based models are popular among NLP practitioners. Nevertheless, existing research on low-resource languages (languages that primarily lack publicly available datasets and curated corpora), and their inclusion in these modern NLP paradigms, is meager. Performance on downstream NLP tasks for these languages lags behind that of high-resource languages such as English. Training a monolingual language model for a particular language is a straightforward approach in modern NLP, but it is resource-consuming and could be unworkable for a low-resource language where even monolingual training data is insufficient. Multilingual models that support an array of languages are an alternative that circumvents this issue. Yet the representation of low-resource languages lags considerably in multilingual models as well. In this work, our first aim is to evaluate the performance of existing Multilingual Language Models (MMLM) that support low-resource Sinhala, along with some available monolingual Sinhala models, on an array of different text classification tasks. We also train our own monolingual model for Sinhala. From those experiments, we identify that the multilingual XLM-R model yields better results in many instances. Based on those results, we propose a novel technique based on an explicit cross-lingual alignment of sentiment words using an augmentation method to improve the sentiment classification task. There, we improve the results of a multilingual XLM-R model for sentiment classification in the Sinhala language. Along the way, we also test the aforementioned method on a few other Indic languages (Tamil, Bengali) to measure its robustness across languages.
Keywords: Multilingual language models, Multilingual embeddings, Text classification, Sentiment analysis, Low-resource languages, Sinhala language
dc.identifier.accnoTH5185
dc.identifier.citationDhananjaya, G.V. (2022). Exploiting multilingual contextual embeddings for Sinhala text classification [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22968
dc.identifier.degreeMaster of Science (Major Component of Research)
dc.identifier.departmentDepartment of Computer Science & Engineering
dc.identifier.facultyEngineering
dc.identifier.urihttp://dl.lib.uom.lk/handle/123/22968
dc.language.isoen
dc.subjectMULTILINGUAL LANGUAGE MODELS
dc.subjectMULTILINGUAL EMBEDDINGS
dc.subjectLOW-RESOURCE LANGUAGES
dc.subjectSINHALA LANGUAGE
dc.subjectTEXT CLASSIFICATION
dc.subjectSENTIMENT ANALYSIS
dc.subjectINFORMATION TECHNOLOGY - Dissertation
dc.subjectCOMPUTER SCIENCE & ENGINEERING - Dissertation
dc.subjectMSc (Major Component Research)
dc.titleExploiting multilingual contextual embeddings for Sinhala text classification
dc.typeThesis-Abstract

Files

Original bundle

Name:
TH5185-1.pdf
Size:
3 MB
Format:
Adobe Portable Document Format
Description:
Pre-text
Name:
TH5185-2.pdf
Size:
14.82 MB
Format:
Adobe Portable Document Format
Description:
Post-text
Name:
TH5185.pdf
Size:
53.06 MB
Format:
Adobe Portable Document Format
Description:
Full-thesis

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission