Institutional-Repository, University of Moratuwa.  

Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models

dc.contributor.advisor Ranathunga S
dc.contributor.author Vithulan MV
dc.date.accessioned 2022
dc.date.available 2022
dc.date.issued 2022
dc.identifier.citation Vithulan, M.V. (2022). Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22381
dc.identifier.uri http://dl.lib.uom.lk/handle/123/22381
dc.description.abstract Document clustering is the task of organising text articles into groups, or clusters, such that the documents in a cluster are about the same subject. For cross-lingual document clustering, document embeddings must lie in a shared embedding space, i.e., similar documents should have similar vectors regardless of language. Document embeddings for Tamil and Sinhala can be obtained with models such as Word2Vec or FastText; however, these embeddings are language-specific, i.e., they do not lie in the same vector space. Therefore, documents cannot be clustered across languages using language-specific models. Pre-trained multilingual language models such as mBERT and XLM-R were introduced to solve this problem by transferring knowledge from high-resource languages to low-resource languages. This research clusters Tamil, Sinhala, and English news articles using XLM-R models. An adequate number of collected documents were clustered, and the clustering techniques and their performance were evaluated. This research produces a new baseline for cross-lingual clustering of Tamil, Sinhala, and English documents. en_US
dc.language.iso en en_US
dc.subject CROSS-LINGUAL DOCUMENT CLUSTERING en_US
dc.subject KNOWLEDGE DISTILLATION en_US
dc.subject MBERT en_US
dc.subject MULTILINGUAL LANGUAGE MODELS en_US
dc.subject XLM-R en_US
dc.subject LASER en_US
dc.subject COMPUTER SCIENCE & ENGINEERING - Dissertation en_US
dc.subject COMPUTER SCIENCE- Dissertation en_US
dc.title Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc in Computer Science & Engineering en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2022
dc.identifier.accno TH4931 en_US
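
The abstract's central idea, that documents in different languages can be clustered together once their embeddings share one vector space, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are hand-written stand-ins for the embeddings a multilingual encoder such as XLM-R might produce, the language and topic tags are hypothetical, and k-means with cosine similarity is one common choice rather than the clustering technique the thesis necessarily used.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return v if norm == 0 else [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, k, iterations=10):
    """Spherical k-means with deterministic farthest-point initialisation."""
    points = [normalize(v) for v in vectors]
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point least similar to every centroid chosen so far.
        farthest = min(points, key=lambda p: max(cosine(p, c) for c in centroids))
        centroids.append(farthest)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(p, centroids[j])) for p in points]
        # Recompute each centroid as the normalised mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels

# Hypothetical toy embeddings: (language, topic, vector). In a shared space,
# documents on the same topic sit close together whatever their language.
docs = [
    ("en", "sports",   [0.90, 0.10, 0.00]),
    ("ta", "sports",   [0.80, 0.20, 0.10]),
    ("si", "sports",   [0.85, 0.05, 0.10]),
    ("en", "politics", [0.10, 0.90, 0.10]),
    ("ta", "politics", [0.00, 0.80, 0.20]),
    ("si", "politics", [0.10, 0.85, 0.00]),
]

labels = kmeans_cosine([vec for _, _, vec in docs], k=2)
for (lang, topic, _), label in zip(docs, labels):
    print(f"cluster {label}: {lang} / {topic}")
```

With language-specific embeddings (for example, separately trained Word2Vec models), the vectors for the three languages would not be comparable, so a similarity-based grouping like this one would not be meaningful across languages.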

