Domain specific question and answer generation in Tamil

Date

2022

Abstract

Automatic question-answer generation is a challenging task in natural language processing. This work presents a system that automatically generates questions and answers from history-related Tamil text supplied by the user. The system processes the input text using several NLP techniques and has four modules: a Preprocessing module, a Rule-based module, a Named Entity Recognition (NER) module, and a Question Answer Generator (QAG) module. The rule-based module uses regex patterns and gazetteers, while the NER module takes a machine learning approach: a Conditional Random Fields (CRF) classifier built with features suited to the domain and language. The dataset was collected from history textbooks, and 23k word tokens were tagged in IOB2 format with a novel entity tag set specific to the history domain. NLP steps such as sentence tokenization, POS tagging, stemming, and Unicode conversion use existing Python libraries. Multiple combinations of the selected features were tested; POS tag, stem word, gazetteer, and clue words contributed most to performance. The best feature combination produced micro-averaged precision, recall, and F1-score of 87.9%, 67.1%, and 76.1% respectively, and an accuracy of 89.6% on the test dataset. The NER module produced good results despite the domain- and language-related challenges. Questions are formed by applying grammatical and predefined rules to the named entities identified by the rule-based and NER modules. An affix-stripping algorithm was implemented to find the inflectional suffix. Questions generated from a Wikipedia history text were evaluated by 16 native Tamil speakers in three categories: undergraduates, graduates, and experts. According to the evaluation results, 62.22% of all generated questions are grammatically correct and meaningful. Questions generated from the rule-based module scored better than those from the NER module.
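The IOB2 tagging and micro-averaged scores mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the tag names (`PER`, `LOC`, `YEAR`), the helper names `iob2_spans`, `micro_prf`, and `harmonic_f1`, and the choice of entity-level (exact span) matching are all assumptions made for illustration, since the abstract does not state the tag set or whether scoring was per token or per entity.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); hypothetical representation
Span = Tuple[int, int, str]


def iob2_spans(tags: List[str]) -> Set[Span]:
    """Decode one sentence of IOB2 tags into a set of entity spans.

    An I- tag that does not continue a span of the same type is ignored,
    because valid IOB2 always opens an entity with a B- tag.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a final span
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_prf(gold: List[List[str]], pred: List[List[str]]):
    """Micro-averaged precision, recall, F1: pool counts over all sentences."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = iob2_spans(g_tags), iob2_spans(p_tags)
        tp += len(g & p)   # exact span and type match
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, harmonic_f1(precision, recall)


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy data with hypothetical tags, two sentences
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-YEAR", "O"]]
pred = [["B-PER", "I-PER", "O", "O"], ["B-YEAR", "B-LOC"]]
p, r, f = micro_prf(gold, pred)

# Consistency check on the figures reported in the abstract
reported_f1 = harmonic_f1(0.879, 0.671)
```

Applying `harmonic_f1` to the reported precision (87.9%) and recall (67.1%) gives roughly 0.761, which agrees with the reported F1-score of 76.1%.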

Citation

Murugathas, R. (2022). Domain specific question and answer generation in Tamil [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22389
