Using back-translation to improve domain-specific English-Sinhala neural machine translation

dc.contributor.advisor: Ranathunga S
dc.contributor.advisor: Jayasena S
dc.contributor.author: Epaliyana K
dc.date.accept: 2021
dc.date.accessioned: 2021
dc.date.available: 2021
dc.date.issued: 2021
dc.description.abstract: Machine Translation (MT) is the automatic conversion of text in one language to other languages. Neural Machine Translation (NMT) is the state-of-the-art MT technique that builds an end-to-end neural model that generates an output sentence in a target language given a sentence in the source language as the input. NMT requires abundant parallel data to achieve good results. For low-resource settings such as Sinhala-English, where parallel data is scarce, NMT tends to give sub-optimal results. The problem is more severe when the translation is domain-specific. One solution for the data scarcity problem is data augmentation. To augment the parallel data for low-resource language pairs, commonly available large monolingual corpora can be used. A popular data augmentation technique is Back-Translation (BT). Over the years, many techniques have been proposed to improve vanilla BT. Prominent ones are Iterative BT, Filtering, Data Selection, and Tagged BT. Since these techniques have rarely been used on an extremely low-resource language pair like Sinhala-English, we employ them on this language pair for domain-specific translation in order to improve the performance of Back-Translation. In particular, we move forward from previous research and show that combining these different techniques yields an even better result. In addition to the aforementioned approaches, we also conducted an empirical evaluation of sentence embedding techniques (LASER, LaBSE, and FastText+VecMap) for the Sinhala-English language pair. Our best model provided a +3.24 BLEU score gain over the Baseline NMT model and a +2.17 BLEU score gain over the vanilla BT model for Sinhala → English translation. Furthermore, a +1.26 BLEU score gain over the Baseline NMT model and a +2.93 BLEU score gain over the vanilla BT model were observed for the best model for English → Sinhala translation.
dc.identifier.accno: TH5033 [en_US]
dc.identifier.citation: Epaliyana, K. (2021). Using back-translation to improve domain-specific English-Sinhala neural machine translation [Master's theses, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21665
dc.identifier.degree: MSc In Computer Science and Engineering by Research [en_US]
dc.identifier.department: Department of Computer Science and Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/21665
dc.language.iso: en [en_US]
dc.subject: NEURAL MACHINE TRANSLATION-English-Sinhala
dc.subject: LOW-RESOURCE LANGUAGES
dc.subject: BACK-TRANSLATION
dc.subject: DATA SELECTION
dc.subject: ITERATIVE BACK-TRANSLATION
dc.subject: ITERATIVE FILTERING
dc.subject: INFORMATION TECHNOLOGY -Dissertation
dc.subject: COMPUTER SCIENCE -Dissertation
dc.subject: COMPUTER SCIENCE & ENGINEERING -Dissertation
dc.title: Using back-translation to improve domain-specific English-Sinhala neural machine translation [en_US]
dc.type: Thesis-Full-text

Files

Original bundle

Name:
TH5033-1.pdf
Size:
192.9 KB
Format:
Adobe Portable Document Format
Description:
Pre-Text
Name:
TH5033-2.pdf.txt
Size:
98.44 KB
Format:
Adobe Portable Document Format
Description:
Post-Text
Name:
TH5033.pdf
Size:
2.1 MB
Format:
Adobe Portable Document Format
Description:
Full-theses

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description: