Tamil news clustering using word embeddings

dc.contributor.author: Fayaza, MSF
dc.contributor.author: Ranathunga, S
dc.contributor.editor: Weeraddana, C
dc.contributor.editor: Edussooriya, CUS
dc.contributor.editor: Abeysooriya, RP
dc.date.accessioned: 2022-08-09T09:24:13Z
dc.date.available: 2022-08-09T09:24:13Z
dc.date.issued: 2020-07
dc.description.abstract: News aggregators let readers view news from multiple news providers at a single point. At present, the only news aggregator that supports Tamil news is Google News, which has some noticeable shortcomings. In this study, two document representation techniques, Term Frequency–Inverse Document Frequency (TF-IDF) and word embeddings (fastText), were evaluated with the one-pass and affinity propagation clustering algorithms, applied to news titles alone and to titles together with bodies, in order to implement a news aggregator for the Tamil language. Data were collected from nine different news providers. fastText with the one-pass algorithm, applied to news title and body, outperformed the other approaches, achieving an average pairwise F-score of 81% with respect to manual clustering. We also created a Tamil fastText word embedding model using more than 21 million words.
dc.identifier.citation: M. S. Faathima Fayaza and S. Ranathunga, "Tamil News Clustering Using Word Embeddings," 2020 Moratuwa Engineering Research Conference (MERCon), 2020, pp. 277-282, doi: 10.1109/MERCon50084.2020.9185282.
dc.identifier.conference: Moratuwa Engineering Research Conference 2020
dc.identifier.department: Engineering Research Unit, University of Moratuwa
dc.identifier.doi: 10.1109/MERCon50084.2020.9185282
dc.identifier.email: msf.fayaza89@gmail.com
dc.identifier.email: surangika@cse.mrt.ac.lk
dc.identifier.faculty: Engineering
dc.identifier.pgnos: pp. 277-282
dc.identifier.place: Moratuwa, Sri Lanka
dc.identifier.proceeding: Proceedings of Moratuwa Engineering Research Conference 2020
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/18580
dc.identifier.year: 2020
dc.language.iso: en
dc.publisher: IEEE
dc.relation.uri: https://ieeexplore.ieee.org/document/9185282
dc.subject: document clustering
dc.subject: Tamil
dc.subject: word embedding
dc.subject: Term Frequency–Inverse Document Frequency
dc.subject: affinity propagation clustering
dc.subject: one pass algorithm
dc.title: Tamil news clustering using word embeddings
dc.type: Conference-Full-text
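
The abstract names a one-pass clustering algorithm and a pairwise F-score measured against manual clustering, but this record gives no implementation details. The following is only a minimal sketch of how those two ideas are commonly realised, not the authors' method. The cosine similarity measure, the 0.8 threshold, the running-mean centroid update, and the function names `one_pass_cluster` and `pairwise_f_score` are all assumptions made for illustration. In practice the input vectors would be document representations such as TF-IDF or averaged fastText vectors.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_pass_cluster(vectors, threshold=0.8):
    """Single pass over the documents (threshold is an assumed value).

    Each vector joins the most similar existing cluster if the similarity
    to that cluster's centroid reaches the threshold; otherwise it starts
    a new cluster. Returns one cluster label per input vector.
    """
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = -1, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # No cluster is similar enough: open a new one.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the centroid as the running mean of its members.
            n = sizes[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)
            ]
            sizes[best] = n + 1
            labels.append(best)
    return labels


def pairwise_f_score(predicted, gold):
    """F-score over document pairs: a pair counts as positive when both
    documents share a cluster. `gold` holds the manual cluster labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the one-pass algorithm never revisits earlier assignments, its output depends on the order in which documents arrive and on the chosen threshold, so both would need tuning against manually clustered data.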
