Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models

Vithulan MV

dc.contributor.advisor	Ranathunga S
dc.contributor.author	Vithulan MV
dc.date.accept	2022
dc.date.accessioned	2022
dc.date.available	2022
dc.date.issued	2022
dc.description.abstract	Document clustering is the task of organising text articles into groups, or clusters, such that the documents in a cluster are about the same subject. For cross-lingual document clustering, the document embeddings of all languages must lie in the same embedding space, i.e., similar documents should have similar vectors regardless of language. Document embeddings for Tamil and Sinhala can be obtained with models such as Word2Vec or FastText; however, these embeddings are language-specific, i.e., they do not lie in the same vector space. Therefore, documents cannot be clustered across languages using language-specific models. Pre-trained multilingual language models such as mBERT and XLM-R were introduced to solve this problem by transferring knowledge from high-resource languages to low-resource languages. This research clusters Tamil, Sinhala, and English news articles using XLM-R models. An adequate number of collected documents was clustered, and the clustering techniques and their performance were evaluated. This research produces a new baseline for cross-lingual clustering of Tamil, Sinhala, and English documents.	en_US
dc.identifier.accno	TH4931	en_US
dc.identifier.citation	Vithulan, M.V. (2022). Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22381
dc.identifier.degree	MSc in Computer Science & Engineering	en_US
dc.identifier.department	Department of Computer Science & Engineering	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/22381
dc.language.iso	en	en_US
dc.subject	CROSS-LINGUAL DOCUMENT CLUSTERING	en_US
dc.subject	KNOWLEDGE DISTILLATION	en_US
dc.subject	MBERT	en_US
dc.subject	MULTILINGUAL LANGUAGE MODELS	en_US
dc.subject	XLM-R	en_US
dc.subject	LASER	en_US
dc.subject	COMPUTER SCIENCE & ENGINEERING - Dissertation	en_US
dc.subject	COMPUTER SCIENCE - Dissertation	en_US
dc.title	Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models	en_US
dc.type	Thesis-Full-text	en_US
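
The abstract's central requirement, that similar documents have similar vectors regardless of language, can be illustrated with a minimal sketch. It is not the method used in the thesis: the record does not state which clustering algorithm or XLM-R variant was used. The sketch assumes each document has already been mapped to a vector in one shared space (in practice, by a multilingual encoder such as XLM-R). The three-dimensional vectors, the topic names, and the helper functions below are hypothetical stand-ins chosen for illustration, and cosine-similarity k-means with farthest-point initialisation is simply one common choice.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def init_centroids(vectors, k):
    """Deterministic farthest-point initialisation (an illustrative choice)."""
    chosen = [0]
    while len(chosen) < k:
        # Similarity of each document to its closest already-chosen centroid.
        closest = [max(dot(v, vectors[i]) for i in chosen) for v in vectors]
        chosen.append(closest.index(min(closest)))
    return [list(vectors[i]) for i in chosen]


def cosine_kmeans(embeddings, k, n_iter=50):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    vectors = [normalize(v) for v in embeddings]
    centroids = init_centroids(vectors, k)
    labels = None
    for _ in range(n_iter):
        # Assign every document to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: dot(v, centroids[j])) for v in vectors
        ]
        # Move each centroid to the normalised mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Hypothetical toy embeddings: (language, topic, vector in a shared space).
docs = [
    ("en", "sports", [0.90, 0.10, 0.00]),
    ("si", "sports", [0.80, 0.20, 0.10]),
    ("ta", "sports", [0.85, 0.05, 0.10]),
    ("en", "politics", [0.10, 0.90, 0.00]),
    ("si", "politics", [0.20, 0.80, 0.10]),
    ("ta", "politics", [0.05, 0.85, 0.10]),
]

labels = cosine_kmeans([vec for _, _, vec in docs], k=2)
for (lang, topic, _), label in zip(docs, labels):
    print(f"{lang}\t{topic}\tcluster {label}")
```

Because the toy vectors share one space, the English, Sinhala, and Tamil documents on the same topic fall into the same cluster. With language-specific embeddings such as separately trained Word2Vec or FastText models, the vectors of different languages would not be comparable in this way.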

Files

Original bundle

Name: TH4931-1.pdf
Size: 105.11 KB
Format: Adobe Portable Document Format
Description: Pre-Text

Name: TH4931-2.pdf
Size: 94.81 KB
Format: Adobe Portable Document Format
Description: Post-Text

Name: TH4931.pdf
Size: 1.12 MB
Format: Adobe Portable Document Format
Description: Full thesis

Collections

Master of Science in Computer Science and Engineering