Exploiting multilingual contextual embeddings for Sinhala text classification

Date

2022

Abstract

Language models that produce contextual representations (or embeddings) for text are widely used in Natural Language Processing (NLP) applications. In particular, Transformer-based, large pre-trained models are popular among NLP practitioners. Nevertheless, the existing research on, and the inclusion of, low-resource languages (languages that largely lack publicly available datasets and curated corpora) in these modern NLP paradigms is meager. Their performance on downstream NLP tasks lags behind that of high-resource languages such as English. Training a monolingual language model for a particular language is a straightforward approach in modern NLP, but it is resource-consuming and can be unworkable for a low-resource language where even monolingual training data is insufficient. Multilingual models that support an array of languages are an alternative that circumvents this issue. Yet, the representation of low-resource languages considerably lags in multilingual models as well. In this work, our first aim is to evaluate the performance of existing Multilingual Language Models (MMLM) that support low-resource Sinhala, as well as some available monolingual Sinhala models, on an array of different text classification tasks. We also train our own monolingual model for Sinhala. From those experiments, we identify that the multilingual XLM-R model yields better results in many instances. Based on those results, we propose a novel technique based on an explicit cross-lingual alignment of sentiment words using an augmentation method to improve the sentiment classification task. There, we improve the results of a multilingual XLM-R model for sentiment classification in the Sinhala language. Along the way, we also test the aforementioned method on a few other Indic languages (Tamil, Bengali) to measure its robustness across languages.

Keywords: Multilingual language models, Multilingual embeddings, Text classification, Sentiment analysis, Low-resource languages, Sinhala language
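For readers unfamiliar with the baseline setup the abstract refers to, the following is a minimal sketch (not the thesis code) of fine-tuning the multilingual XLM-R model for Sinhala sentiment classification with Hugging Face Transformers. The data file names, label count, and hyperparameters are illustrative assumptions, and the proposed cross-lingual sentiment-word augmentation step is not reproduced here.

```python
# Minimal sketch: fine-tune multilingual XLM-R for Sinhala sentiment classification.
# Assumes hypothetical CSV files with "text" (Sinhala sentence) and "label" (0/1) columns.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

MODEL_NAME = "xlm-roberta-base"  # multilingual XLM-R checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

dataset = load_dataset("csv", data_files={"train": "sinhala_train.csv",
                                          "test": "sinhala_test.csv"})

def tokenize(batch):
    # XLM-R uses a shared SentencePiece vocabulary, so Sinhala text needs no
    # language-specific preprocessing here.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="xlmr-sinhala-sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()
```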

Keywords

MULTILINGUAL LANGUAGE MODELS, MULTILINGUAL EMBEDDINGS, LOW-RESOURCE LANGUAGES, SINHALA LANGUAGE, TEXT CLASSIFICATION, SENTIMENT ANALYSIS, INFORMATION TECHNOLOGY - Dissertation, COMPUTER SCIENCE & ENGINEERING - Dissertation, MSc (Major Component Research)

Citation

Dhananjaya, G.V. (2022). Exploiting multilingual contextual embeddings for Sinhala text classification [Master's thesis, University of Moratuwa]. Institutional Repository, University of Moratuwa. http://dl.lib.uom.lk/handle/123/22968
