Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models

dc.contributor.advisor: Ranathunga S
dc.contributor.author: Vithulan MV
dc.date.accept: 2022
dc.date.accessioned: 2022
dc.date.available: 2022
dc.date.issued: 2022
dc.description.abstract: Document clustering is the task of organising text articles into groups, or clusters, such that the documents in a cluster are about the same subject. For cross-lingual document clustering, document embeddings must lie in the same embedding space, i.e., similar documents should have similar vectors regardless of their language. Document embeddings for Tamil and Sinhala can be obtained with models such as Word2Vec or FastText; however, these embeddings are language specific, i.e., they are not in the same vector space. Therefore, documents cannot be clustered across languages using language-specific models. Pre-trained multilingual language models such as mBERT and XLM-R were introduced to solve this problem by transferring knowledge from high-resource languages to low-resource languages. This research clusters Tamil, Sinhala, and English news articles using XLM-R models. A sufficiently large set of collected documents was clustered, and the clustering techniques and their performance were evaluated. This research produces a new baseline for cross-lingual clustering of Tamil, Sinhala, and English documents. [en_US]
dc.identifier.accno: TH4931 [en_US]
dc.identifier.citation: Vithulan, M.V. (2022). Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22381
dc.identifier.degree: MSc in Computer Science & Engineering [en_US]
dc.identifier.department: Department of Computer Science & Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/22381
dc.language.iso: en [en_US]
dc.subject: CROSS-LINGUAL DOCUMENT CLUSTERING [en_US]
dc.subject: KNOWLEDGE DISTILLATION [en_US]
dc.subject: MBERT [en_US]
dc.subject: MULTILINGUAL LANGUAGE MODELS [en_US]
dc.subject: XLM-R [en_US]
dc.subject: LASER [en_US]
dc.subject: COMPUTER SCIENCE & ENGINEERING - Dissertation [en_US]
dc.subject: COMPUTER SCIENCE - Dissertation [en_US]
dc.title: Cross-lingual document clustering for Sinhala, Tamil, and English using pre-trained multilingual language models [en_US]
dc.type: Thesis-Abstract [en_US]
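
The abstract's central requirement, that documents in different languages share one embedding space before they can be clustered together, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are hypothetical stand-ins for the embeddings a multilingual model such as XLM-R would produce, the topic names are invented, and cosine-based k-means is only one possible clustering technique (the abstract does not state which techniques the thesis evaluated).

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kmeans_cosine(vectors, k, iterations=10):
    """Simple k-means using cosine similarity on unit-length vectors."""
    points = [normalize(v) for v in vectors]
    # Deterministic farthest-point initialisation: start from the first
    # document, then repeatedly add the document least similar to the
    # centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = min(points, key=lambda p: max(dot(p, c) for c in centroids))
        centroids.append(farthest)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the normalised mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels

# Hypothetical embeddings in ONE shared space: (language, topic, vector).
# Real vectors would come from a multilingual encoder, not be hand-written.
docs = [
    ("en", "sports",   [0.90, 0.10, 0.00]),
    ("si", "sports",   [0.80, 0.20, 0.10]),
    ("ta", "sports",   [0.85, 0.15, 0.05]),
    ("en", "politics", [0.10, 0.90, 0.10]),
    ("si", "politics", [0.20, 0.80, 0.00]),
    ("ta", "politics", [0.15, 0.85, 0.10]),
]

labels = kmeans_cosine([vec for _, _, vec in docs], k=2)
for (lang, topic, _), label in zip(docs, labels):
    print(lang, topic, "-> cluster", label)
```

Because the toy vectors for the same topic point in similar directions whatever the language, the English, Sinhala, and Tamil documents on one topic end up in the same cluster. With language-specific embeddings (separate Word2Vec or FastText models per language) no such alignment is guaranteed, which is the problem the abstract describes.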

Files

Original bundle

Name: TH4931-1.pdf
Size: 105.11 KB
Format: Adobe Portable Document Format
Description: Pre-Text

Name: TH4931-2.pdf
Size: 94.81 KB
Format: Adobe Portable Document Format
Description: Post-Text

Name: TH4931.pdf
Size: 1.12 MB
Format: Adobe Portable Document Format
Description: Full thesis