Investigating the applicability of partition-based clustering for Sinhala documents

Meedeniya, DA

Files

93369-1.pdf (299.35 KB)

93369-2.pdf (339.25 KB)

93369.pdf (19.82 MB)

Authors

Meedeniya, DA

Abstract

The significant growth of electronic media for storing and exchanging text documents has led to the use of tools that analyze and categorize documents based on their content. The availability of full-text documents in electronic form emphasizes the need for intelligent information retrieval techniques. In Sri Lanka, most public services rely on text documents written in the Sinhala language. As a result, there is a need for systems that can semi-automatically analyze and process documents in Sinhala. The wide availability of electronic data has led to vast interest in text analysis, information retrieval and text categorization methods. There are many concepts, approaches and techniques associated with text mining. Most of the widely available text categorization tools work only with English text. Therefore, to provide a better service, there is a need for document analysis and categorization systems for non-English text, comparable to those currently available for English text documents. A tool that can automatically categorize a collection of Sinhala documents can be an asset to any service provider that deals with a large number of text documents in Sinhala. Data clustering can be used to categorize documents based on their content. The effectiveness of clustering depends on the feature extraction. The main techniques examined in this study include data pre-processing, feature extraction, and document clustering. The approach makes use of a transformation based on the term frequency and the inverse document frequency, which enhances the clustering performance. This approach is based on Latent Semantic Analysis. A text corpus categorized by human readers is used to test the validity of the suggested approach. The technique introduced in this work enables the processing of text documents written in Sinhala, and empowers citizens and organizations to do their daily work efficiently.
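
The pipeline outlined in the abstract (pre-processing, a term frequency and inverse document frequency weighting, then partition-based clustering) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names (`tf_idf`, `dot`, `normalize`, `kmeans`), the choice of k-means with cosine similarity as the partition-based method, the fixed seed documents, and the toy English tokens standing in for pre-processed Sinhala words are all assumptions made for illustration. The dimensionality-reduction step of Latent Semantic Analysis (a truncated singular value decomposition) is left out to keep the example dependency-free.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map tokenized documents to unit-length sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        size = len(doc) or 1
        vec = {t: (c / size) * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors

def normalize(vec):
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def kmeans(vectors, seeds, iters=20):
    """Partition vectors into len(seeds) clusters, starting from the seed documents."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(iters):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no document changed cluster
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for c in range(len(centroids)):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] += w
            if total:
                centroids[c] = normalize(total)
    return labels

# Toy corpus (assumption): tokens stand in for pre-processed Sinhala words.
docs = [
    ["cricket", "match", "team", "win"],
    ["bank", "loan", "interest", "rate"],
    ["cricket", "team", "score", "match"],
    ["bank", "interest", "loan", "account"],
]
labels = kmeans(tf_idf(docs), seeds=[0, 1])
print(labels)  # [0, 1, 0, 1]
```

Because the weighting works on whitespace-separated tokens and never inspects the characters themselves, the same sketch would accept Sinhala tokens unchanged; the language-specific effort lies in the pre-processing that produces those tokens.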

Description

A dissertation submitted to the Department of Computer Science and Engineering for the MSc in Computer Science specializing in Software Architecture; CD-ROM included

Keywords

COMPUTER SCIENCE AND ENGINEERING - Dissertation, COMPUTER SCIENCE - Dissertation, SINHALA LANGUAGE - Texts, SINHALA LANGUAGE - Information Retrieval

Citation

Meedeniya, D.A. (2008). Investigating the applicability of partition-based clustering for Sinhala documents [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/638

URI

http://dl.lib.mrt.ac.lk/handle/123/638

Collections

Master of Science in Computer science and Engineering
