Tamil news clustering using word embeddings
dc.contributor.author | Fayaza, MSF | |
dc.contributor.author | Ranathunga, S | |
dc.contributor.editor | Weeraddana, C | |
dc.contributor.editor | Edussooriya, CUS | |
dc.contributor.editor | Abeysooriya, RP | |
dc.date.accessioned | 2022-08-09T09:24:13Z | |
dc.date.available | 2022-08-09T09:24:13Z | |
dc.date.issued | 2020-07 | |
dc.description.abstract | News aggregators let readers view news from multiple news providers at a single point. At the moment, the only news aggregator that supports Tamil news is Google News, which has some noticeable shortcomings. In this study, two document representation techniques, Term Frequency–Inverse Document Frequency (TF-IDF) and word embeddings (fastText), were each combined with the one-pass and affinity propagation clustering algorithms and applied to news titles alone, as well as to titles together with bodies, in order to implement a news aggregator for the Tamil language. Data for the study were collected from nine different news providers. Applying fastText with the one-pass algorithm to news titles and bodies outperformed the other approaches, achieving an average pairwise F-score of 81% with respect to manual clustering. We also created a Tamil fastText word embedding model using more than 21 million words. | en_US |
dc.identifier.citation | M. S. Faathima Fayaza and S. Ranathunga, "Tamil News Clustering Using Word Embeddings," 2020 Moratuwa Engineering Research Conference (MERCon), 2020, pp. 277-282, doi: 10.1109/MERCon50084.2020.9185282. | en_US |
dc.identifier.conference | Moratuwa Engineering Research Conference 2020 | en_US |
dc.identifier.department | Engineering Research Unit, University of Moratuwa | en_US |
dc.identifier.doi | 10.1109/MERCon50084.2020.9185282 | en_US |
dc.identifier.email | msf.fayaza89@gmail.com | en_US |
dc.identifier.email | surangika@cse.mrt.ac.lk | en_US |
dc.identifier.faculty | Engineering | en_US |
dc.identifier.pgnos | pp. 277-282 | en_US |
dc.identifier.place | Moratuwa, Sri Lanka | en_US |
dc.identifier.proceeding | Proceedings of Moratuwa Engineering Research Conference 2020 | en_US |
dc.identifier.uri | http://dl.lib.uom.lk/handle/123/18580 | |
dc.identifier.year | 2020 | en_US |
dc.language.iso | en | en_US |
dc.publisher | IEEE | en_US |
dc.relation.uri | https://ieeexplore.ieee.org/document/9185282 | en_US |
dc.subject | document clustering | en_US |
dc.subject | Tamil | en_US |
dc.subject | word embedding | en_US |
dc.subject | Term Frequency–Inverse Document Frequency | en_US |
dc.subject | affinity propagation clustering | en_US |
dc.subject | one pass algorithm | en_US |
dc.title | Tamil news clustering using word embeddings | en_US |
dc.type | Conference-Full-text | en_US |
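
The abstract names one-pass clustering and the pairwise F-score but this record does not give their details, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each document is already represented as a numeric vector (for example a TF-IDF vector or an averaged fastText vector), that similarity is cosine similarity against a running-mean centroid, and that a new cluster is opened when no centroid reaches a threshold. The function names and the threshold value of 0.8 are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def one_pass_cluster(vectors, threshold=0.8):
    """Visit each vector once, in arrival order. Attach it to the most similar
    existing centroid, or open a new cluster if none reaches the threshold.
    The threshold is an assumed value, not one taken from the paper."""
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # No cluster is similar enough: start a new one.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the chosen centroid as a running mean (an assumption).
            n = sizes[best]
            centroids[best] = [(c * n + v) / (n + 1)
                               for c, v in zip(centroids[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
    return labels


def pairwise_f_score(predicted, gold):
    """Pairwise F-score: a document pair is positive when both share a cluster."""
    pairs = list(combinations(range(len(gold)), 2))
    pred_same = {p for p in pairs if predicted[p[0]] == predicted[p[1]]}
    gold_same = {p for p in pairs if gold[p[0]] == gold[p[1]]}
    true_pos = len(pred_same & gold_same)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_same)
    recall = true_pos / len(gold_same)
    return 2 * precision * recall / (precision + recall)
```

Calling `one_pass_cluster` on a list of document vectors returns one cluster label per document, and `pairwise_f_score` compares those labels with a manual clustering. Because one-pass clustering depends on arrival order, feeding the same documents in a different order may give different clusters.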