Translation of named entities between Sinhala and Tamil for official government documents

dc.contributor.advisor: Ranathunga, S
dc.contributor.advisor: Thayasivam, U
dc.contributor.author: Mokanarangan, T
dc.date.accept: 2018-08
dc.date.accessioned: 2019-07-19T09:46:24Z
dc.date.available: 2019-07-19T09:46:24Z
dc.description.abstract: Analysis of existing machine translation approaches for Sinhala-Tamil official government documents has revealed shortcomings in translating named entities. The diverse nature of the domain, coupled with the lack of resources and morphological complexity, are the key reasons for this problem. Our research focuses on identifying and translating named entities in official government documents between Tamil and Sinhala in order to improve translation performance. We present a novel tag set specific to official government documents and propose a graph-based semi-supervised approach that works better than state-of-the-art approaches in low-resource settings. We employed this approach to build a large annotated corpus cost-effectively from a smaller amount of seed data, and were able to build an annotated corpus of over 200K words each for Tamil and Sinhala. We also implemented a deep-learning approach for a Named Entity Recognizer that gave the best output for a completed corpus. Since the deep-learning approach is a generic solution for sequential tagging, we also employed it to build a Part-of-Speech tagger that outperforms existing systems. The University of Moratuwa already has a system for translating official government documents, called SiTa. Finally, we incorporated the aforementioned models into a module that translates named entities and integrated it into SiTa. We empirically show that our modules improve over the baseline for the Tamil → Sinhala and Sinhala → Tamil translation tasks by up to 0.5 and 1.4 BLEU points, respectively. [en_US]
dc.identifier.accno: TH3689 [en_US]
dc.identifier.degree: Master of Science (By Research) [en_US]
dc.identifier.department: Department of Computer Science & Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.mrt.ac.lk/handle/123/14620
dc.language.iso: en [en_US]
dc.subject: COMPUTER SCIENCE AND ENGINEERING – Thesis, Dissertations [en_US]
dc.subject: NAMED ENTITY RECOGNITION [en_US]
dc.subject: MACHINE TRANSLATION [en_US]
dc.subject: GRAPH-BASED SEMI-SUPERVISED LEARNING [en_US]
dc.subject: DEEP LEARNING [en_US]
dc.subject: NAMED ENTITY TRANSLATION
dc.title: Translation of named entities between Sinhala and Tamil for official government documents [en_US]
dc.type: Thesis-Full-text [en_US]

Files

Original bundle

Name: TH3689-1.pdf
Size: 72.66 KB
Format: Adobe Portable Document Format
Description: Pre-text

Name: TH3689-2.pdf
Size: 471.12 KB
Format: Adobe Portable Document Format
Description: Post-text

Name: TH3689.pdf
Size: 1.62 MB
Format: Adobe Portable Document Format
Description: Full-thesis