Abstract:
The significant growth in the use of electronic media to store and exchange text documents has led to the use of tools that analyze and categorize documents based on their content. The availability of full-text documents in electronic form emphasizes the need for intelligent information retrieval techniques. In Sri Lanka, most public services rely on text documents written in the Sinhala language. As a result, there is a need for systems that can semi-automatically analyze and process documents in Sinhala. The wide availability of electronic data has generated considerable interest in text analysis, information retrieval, and text categorization methods, and many concepts, approaches, and techniques are associated with text mining. However, most widely available text categorization tools work only with English text. Therefore, to provide a better service, there is a need for document analysis and categorization systems for non-English text comparable to those currently available for English text documents.
A tool that can automatically categorize a collection of Sinhala documents can be an asset to any service provider that deals with a large number of text documents in Sinhala. Data clustering can be used to categorize documents based on their content, and the effectiveness of clustering depends on feature extraction. The main techniques examined in this study are data pre-processing, feature extraction, and document clustering. The approach applies a transformation based on term frequency and inverse document frequency, which enhances clustering performance, and is based on Latent Semantic Analysis. A text corpus categorized by human readers is used to test the validity of the suggested approach.
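As a rough illustration of the pipeline described above, the following minimal Python sketch weights tokenised documents by term frequency and inverse document frequency and then groups them with k-means, a common partition-based clustering method. It is not the implementation used in the thesis: the function names `tf_idf` and `k_means`, the toy English tokens standing in for pre-processed Sinhala text, the choice of k = 2, and the deterministic seeding are all assumptions made for illustration. The Latent Semantic Analysis step (a truncated singular value decomposition of the weighted matrix) is omitted to keep the example dependency-free.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Weight = raw term count * log(N / df); a term found in every document gets 0.
        vec = [counts[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        vectors.append([w / norm for w in vec])
    return vocab, vectors


def k_means(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (an assumption, for determinism)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus; real input would be pre-processed Sinhala tokens.
docs = [
    ["tax", "income", "form"],
    ["school", "exam", "student"],
    ["tax", "form", "payment"],
    ["student", "exam", "result"],
]
vocab, vectors = tf_idf(docs)
labels = k_means(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Because the vectors are normalised to unit length, squared Euclidean distance ranks neighbours the same way cosine similarity does, which is why plain k-means is a reasonable stand-in here. The resulting labels could then be compared against a human-categorized corpus, as the abstract describes.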
The technique introduced in this work enables the processing of text documents written in Sinhala and helps citizens and organizations carry out their daily work more efficiently.
Citation:
Meedeniya, D.A. (2008). Investigating the applicability of partition-based clustering for Sinhala documents [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/638