Domain specific question and answer generation in Tamil

Murugathas R

dc.contributor.advisor	Thayasivam U
dc.contributor.author	Murugathas R
dc.date.accessioned	2022
dc.date.available	2022
dc.date.issued	2022
dc.identifier.citation	Murugathas, R. (2022). Domain specific question and answer generation in Tamil [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22389
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/22389
dc.description.abstract	Automatic question-answer generation is a challenging task in natural language processing. This thesis presents a system that automatically generates questions and answers from history-related Tamil text supplied by the user, processing the input with several NLP techniques. The system has four modules: a preprocessing module, a rule-based module, a Named Entity Recognition (NER) module, and a Question Answer Generator (QAG) module. The rule-based module uses regex patterns and gazetteers, while the NER module takes a machine learning approach, using a Conditional Random Fields (CRF) classifier built with features suited to the domain and language. The dataset was collected from history textbooks, and 23k word tokens were tagged in IOB2 format with a novel entity tag set specific to the history domain. Sentence tokenization, POS tagging, stemming, and Unicode conversion are performed with existing Python libraries. Multiple combinations of candidate features were tested; POS tag, stem word, gazetteer, and clue words are the features that contribute most to performance. The best feature combination produced micro-averaged precision, recall, and F1-score of 87.9%, 67.1%, and 76.1% respectively, and an accuracy of 89.6% on the test dataset. The NER module produced good results despite the domain- and language-related challenges. Questions are formed from the named entities identified by the rule-based and NER modules, using grammatical and predefined rules. An affix-stripping algorithm is implemented to find the inflectional suffix. Questions generated from a history text taken from Wikipedia were evaluated by 16 native Tamil speakers in three categories: undergraduates, graduates, and experts. According to the evaluation results, 62.22% of all generated questions are grammatically correct and meaningful. Questions generated by the rule-based module are better than those generated by the NER module.	en_US
dc.language.iso	en	en_US
dc.subject	QUESTION AND ANSWER GENERATION	en_US
dc.subject	TAMIL	en_US
dc.subject	NER	en_US
dc.subject	CRF	en_US
dc.subject	HISTORY	en_US
dc.subject	DOMAIN SPECIFIC	en_US
dc.subject	COMPUTER SCIENCE & ENGINEERING - Dissertation	en_US
dc.subject	COMPUTER SCIENCE - Dissertation	en_US
dc.title	Domain specific question and answer generation in Tamil	en_US
dc.type	Thesis-Abstract	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.degree	MSc in Computer Science & Engineering	en_US
dc.identifier.department	Department of Computer Science & Engineering	en_US
dc.date.accept	2022
dc.identifier.accno	TH4932	en_US
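
The abstract reports micro-averaged precision, recall, and F1-score for the NER module and states that tokens were tagged in IOB2 format. The sketch below is not taken from the thesis; it is a minimal illustration of how F1 follows from precision and recall, and how IOB2 tags map to entity spans. The function names and the `PER` and `LOC` labels are hypothetical, since the record does not list the history-specific tag set.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the range 0..1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def iob2_to_spans(tags):
    """Collect (label, start, end) spans from IOB2 tags; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only when its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated leniently as a start.
        if tag.startswith("B-") or tag.startswith("I-"):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.879, 0.671)

# Hypothetical tag sequence; PER and LOC are placeholder labels.
example_spans = iob2_to_spans(["B-PER", "I-PER", "O", "B-LOC", "O"])
```

With the reported precision of 87.9% and recall of 67.1%, `reported_f1` rounds to 76.1%, which agrees with the F1-score given in the abstract.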