Developing a retrieval-based Tamil language chatbot for closed domain

Kugathasan, K

Files

TH5438-1.pdf (129.88 KB)

TH5438-2.pdf (93.92 KB)

TH5438.pdf (1.33 MB)

Date

2023

Authors

Kugathasan, K

Abstract

Chatbots are conversational systems that interact with humans via natural language. They are frequently used to respond to user queries and provide the information users need. Building a highly functional chatbot requires a good corpus and a variety of language-related resources. Because Tamil is a low-resource language, such resources are not available for it. Additionally, because Tamil is morphologically rich, its high inflection and free word order pose key challenges for Tamil chatbots. For these reasons, developing an effective end-to-end chat system is challenging even for a closed domain. This study introduces a novel method for building a chatbot in Tamil by leveraging a dataset extracted from Tamil banking website FAQ sections and extending it to encompass the language's morphological complexity and rich inflectional structure. An intent is assigned to each query, and a multiclass intent classifier is developed to classify user intent. The CNN-based classifier demonstrated the highest performance, achieving an accuracy of 98.72%. While previous work on short-text classification in Tamil focused on only a few classes and used very large datasets, our method achieved an accuracy of over 98% using a small number of per-class examples, even with 56 classes and additional challenges such as class imbalance in the data. This shows that our approach is better than any other approach for short-text classification in Tamil. The major contribution of this research is the generation of the first-ever chat dataset for Tamil. Our research is the first of its kind in Tamil to show how an efficient context-less chatbot can be built using short-text classification. Although this project was carried out for the Tamil language and the banking domain, the approach can be applied to other low-resource languages and domains as well.

Keywords

CHATBOTS, NATURAL LANGUAGE PROCESSING, CONVERSATIONAL SYSTEMS, LOW-RESOURCE LANGUAGE, COMPUTER SCIENCE - Dissertation, COMPUTER SCIENCE & ENGINEERING - Dissertation, MSc (Major Component Research)

Citation

Kugathasan, K. (2023). Developing a retrieval-based Tamil language chatbot for closed domain [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22970

URI

http://dl.lib.uom.lk/handle/123/22970

Collections

Master of Science By Research
