Domain specific question and answer generation in Tamil

dc.contributor.advisor: Thayasivam U
dc.contributor.author: Murugathas R
dc.date.accept: 2022
dc.date.accessioned: 2022
dc.date.available: 2022
dc.date.issued: 2022
dc.description.abstract: Automatic question-answer generation is a challenging task in natural language processing. This thesis presents a system that automatically generates questions and answers from history-related Tamil text supplied by the user, processing the input with several NLP techniques. The system has four modules: a preprocessing module, a rule-based module, a Named Entity Recognition (NER) module, and a Question Answer Generator (QAG) module. The rule-based module uses regex patterns and gazetteers, while the NER module takes a machine learning approach: a Conditional Random Fields (CRF) classifier built with features suited to the domain and language. The dataset was collected from history textbooks, and 23k word tokens were tagged in IOB2 format with a novel entity tag set specific to the history domain. Sentence tokenization, POS tagging, stemming, and Unicode conversion rely on existing Python libraries. Multiple combinations of features were tested; POS tag, stem word, gazetteer, and clue words contribute most to performance. The best feature combination produced micro-averaged precision, recall, and F1-score of 87.9%, 67.1%, and 76.1% respectively, and an accuracy of 89.6% on the test dataset. The NER module performed well despite the domain- and language-related challenges. Questions are formed by applying grammatical and predefined rules to the named entities identified by the rule-based and NER modules. An affix-stripping algorithm is implemented to find the inflectional suffix. Questions generated from a Wikipedia history text were evaluated by 16 native Tamil speakers grouped as undergraduates, graduates, and experts. According to the evaluation, 62.22% of the generated questions are grammatically correct and meaningful. Questions generated by the rule-based module score better than those from the NER module. [en_US]
dc.identifier.accno: TH4932 [en_US]
dc.identifier.citation: Murugathas, R. (2022). Domain specific question and answer generation in Tamil [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22389
dc.identifier.degree: MSc in Computer Science & Engineering [en_US]
dc.identifier.department: Department of Computer Science & Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/22389
dc.language.iso: en [en_US]
dc.subject: QUESTION AND ANSWER GENERATION [en_US]
dc.subject: TAMIL [en_US]
dc.subject: NER [en_US]
dc.subject: CRF [en_US]
dc.subject: HISTORY [en_US]
dc.subject: DOMAIN SPECIFIC [en_US]
dc.subject: COMPUTER SCIENCE & ENGINEERING - Dissertation [en_US]
dc.subject: COMPUTER SCIENCE - Dissertation [en_US]
dc.title: Domain specific question and answer generation in Tamil [en_US]
dc.type: Thesis-Abstract [en_US]
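
The abstract reports IOB2 tagging and micro-averaged precision, recall, and F1 for the NER module. The sketch below illustrates how such scores can be computed from IOB2 tag sequences. It is not the thesis's evaluation code: the labels PER and LOC are placeholders (the thesis's history-specific tag set is not listed in this record), the function names are hypothetical, and the abstract does not say whether scoring is per token or per entity, so this version assumes whole-entity spans.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def iob2_spans(tags: List[str]) -> Set[Span]:
    """Decode one IOB2 tag sequence into a set of entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            # "O" or a stray "I-" tag with no matching open span.
            start, label = None, None
    return spans


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-average: pool span counts over all sentences, then divide."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = iob2_spans(gold_tags), iob2_spans(pred_tags)
        tp += len(g & p)  # exact match on boundaries and label
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy data with placeholder labels (assumption, not the thesis's tag set).
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-LOC", "O"]]
precision, recall, f1 = micro_prf(gold, pred)
```

As a consistency check, the harmonic mean of the reported precision and recall, `f1_from(0.879, 0.671)`, rounds to 0.761, which agrees with the reported F1-score of 76.1%.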

Files

Original bundle

Name: TH4932-1.pdf
Size: 305.16 KB
Format: Adobe Portable Document Format
Description: Pre-Text
Name: TH4932-2.pdf
Size: 175.21 KB
Format: Adobe Portable Document Format
Description: Post-Text
Name: TH4932.pdf
Size: 3.25 MB
Format: Adobe Portable Document Format
Description: Full thesis