Tamil news clustering using word embeddings

Fayaza, MSF; Ranathunga, S

dc.contributor.author	Fayaza, MSF
dc.contributor.author	Ranathunga, S
dc.contributor.editor	Weeraddana, C
dc.contributor.editor	Edussooriya, CUS
dc.contributor.editor	Abeysooriya, RP
dc.date.accessioned	2022-08-09T09:24:13Z
dc.date.available	2022-08-09T09:24:13Z
dc.date.issued	2020-07
dc.description.abstract	News aggregators let readers view news from multiple news providers at a single point. At the moment, the only news aggregator that supports Tamil news is Google News, which has some noticeable shortcomings. In this study, Term Frequency–Inverse Document Frequency and word embedding (fastText) document representation techniques were evaluated with the one-pass and affinity propagation clustering algorithms, applied to the news title alone and to the title and body together, in order to implement a news aggregator for the Tamil language. For this study, we collected data from nine different news providers. When fastText was applied with the one-pass algorithm to the news title and body, it outperformed the other approaches, achieving an average pairwise F-score of 81% with respect to manual clustering. We also created a Tamil fastText word embedding model using more than 21 million words.	en_US
dc.identifier.citation	M. S. Faathima Fayaza and S. Ranathunga, "Tamil News Clustering Using Word Embeddings," 2020 Moratuwa Engineering Research Conference (MERCon), 2020, pp. 277-282, doi: 10.1109/MERCon50084.2020.9185282.	en_US
dc.identifier.conference	Moratuwa Engineering Research Conference 2020	en_US
dc.identifier.department	Engineering Research Unit, University of Moratuwa	en_US
dc.identifier.doi	10.1109/MERCon50084.2020.9185282	en_US
dc.identifier.email	msf.fayaza89@gmail.com	en_US
dc.identifier.email	surangika@cse.mrt.ac.lk	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.pgnos	pp. 277-282	en_US
dc.identifier.place	Moratuwa, Sri Lanka	en_US
dc.identifier.proceeding	Proceedings of Moratuwa Engineering Research Conference 2020	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/18580
dc.identifier.year	2020	en_US
dc.language.iso	en	en_US
dc.publisher	IEEE	en_US
dc.relation.uri	https://ieeexplore.ieee.org/document/9185282	en_US
dc.subject	document clustering	en_US
dc.subject	Tamil	en_US
dc.subject	word embedding	en_US
dc.subject	Term Frequency–Inverse Document Frequency	en_US
dc.subject	affinity propagation clustering	en_US
dc.subject	one pass algorithm	en_US
dc.title	Tamil news clustering using word embeddings	en_US
dc.type	Conference-Full-text	en_US

Collections

MERCon - 2020
