dc.contributor.advisor Ranathunga S
dc.contributor.author Fayaza MSF
dc.date.accessioned 2020
dc.date.available 2020
dc.date.issued 2020
dc.identifier.uri http://dl.lib.uom.lk/handle/123/16509
dc.description.abstract The web has an abundance of online news articles that are updated frequently. Readers have difficulty discovering content of interest among the overwhelming number of news sources and tire of browsing multiple websites. This situation applies to Tamil online news as well, and the number of online news articles published in the Tamil language is on the rise. News aggregators and clustering techniques address this issue. Although many news aggregators are available for languages such as English, the only news aggregator that supports Tamil is Google News, which is a noticeable shortage. Google News mainly covers Indian news and, when searching for news, gives higher weight to words that appear in the headline than to words that appear in the body [1]. This research focuses on clustering Tamil online news articles into related topics. The literature describes several clustering techniques and similarity measures for clustering documents in other languages. Tamil is an agglutinative language, so techniques developed for English documents might not readily work for Tamil. The purpose of this research is to study the techniques available for other languages and to develop a mechanism that clusters Tamil online news articles according to their content similarity. As the first step of this study, ten different datasets were created by collecting news from nine different news providers. Data was collected on nonadjacent days to obtain diversified data. TF-IDF and word embedding techniques were used to create vector representations of the data. The one-pass algorithm and the affinity propagation algorithm were used to cluster the news articles, because the number of clusters cannot be defined in advance and a large number of clusters contain only a single news article. The best results were achieved by applying word embedding with the one-pass algorithm. As another contribution of this research, we created a Tamil word embedding model with 21,077,843 words. en_US
dc.language.iso en en_US
dc.subject COMPUTER SCIENCE - Dissertation en_US
dc.subject COMPUTER SCIENCE & ENGINEERING - Dissertation en_US
dc.subject CLUSTERING en_US
dc.subject TF-IDF en_US
dc.subject WORD EMBEDDING en_US
dc.subject ONE PASS ALGORITHM en_US
dc.subject AFFINITY PROPAGATION en_US
dc.subject COSINE SIMILARITY en_US
dc.subject CRAWLER en_US
dc.title Tamil news clustering en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc in Computer Science and Engineering en_US
dc.identifier.department Department of Computer Science and Engineering en_US
dc.date.accept 2020
dc.identifier.accno TH4360 en_US

