dc.contributor.advisor Thayasivam U
dc.contributor.author Murugathas R
dc.date.accessioned 2022
dc.date.available 2022
dc.date.issued 2022
dc.identifier.citation Murugathas, R. (2022). Domain specific question and answer generation in Tamil [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22389
dc.identifier.uri http://dl.lib.uom.lk/handle/123/22389
dc.description.abstract Automatic question-answer generation is a challenging task in natural language processing. The system developed in this work automatically generates questions and answers from history-related Tamil text supplied by the user. It processes the input text using various NLP techniques and has four modules: a Preprocessing module, a Rule-based module, a Named Entity Recognition (NER) module, and a Question Answer Generator (QAG) module. The rule-based module uses regex patterns and gazetteers, while the NER module uses a machine learning approach: a Conditional Random Fields (CRF) classifier built with features suited to the domain and language. The dataset was collected from history textbooks, and 23k word tokens were tagged in IOB2 format using a novel entity tag set specific to the history domain. NLP steps such as sentence tokenization, POS tagging, stemming, and Unicode conversion use existing Python libraries. Multiple combinations of the selected features were tested; POS tag, stem word, gazetteer, and clue words are the features that contribute most to performance. The best feature combination produced micro-averaged precision, recall, and F1-score of 87.9%, 67.1%, and 76.1% respectively, and an accuracy of 89.6% on the test dataset. The NER module performed well despite the domain- and language-related challenges. Questions are formed by applying grammatical and predefined rules to the named entities identified by the rule-based and NER modules. An affix-stripping algorithm is implemented to find the inflectional suffix. Questions generated from a history text from Wikipedia were evaluated by 16 native Tamil speakers in categories such as undergraduates, graduates, and experts. According to the evaluation results, 62.22% of the generated questions are grammatically correct and meaningful. Questions generated by the rule-based module produce better results than those generated by the NER module. en_US
dc.language.iso en en_US
dc.subject QUESTION AND ANSWER GENERATION en_US
dc.subject TAMIL en_US
dc.subject NER en_US
dc.subject CRF en_US
dc.subject HISTORY en_US
dc.subject DOMAIN SPECIFIC en_US
dc.subject COMPUTER SCIENCE & ENGINEERING - Dissertation en_US
dc.subject COMPUTER SCIENCE - Dissertation en_US
dc.title Domain specific question and answer generation in Tamil en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc in Computer Science & Engineering en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2022
dc.identifier.accno TH4932 en_US

