Using back-translation to improve domain-specific English-Sinhala neural machine translation

Date

2021

Abstract

Machine Translation (MT) is the automatic conversion of text in one language into another language. Neural Machine Translation (NMT) is the state-of-the-art MT technique that builds an end-to-end neural model, which generates an output sentence in a target language given a sentence in the source language as input. NMT requires abundant parallel data to achieve good results. In low-resource settings such as Sinhala-English, where parallel data is scarce, NMT tends to give sub-optimal results. The problem is more severe when the translation is domain-specific. One solution to the data scarcity problem is data augmentation. To augment the parallel data for low-resource language pairs, commonly available large monolingual corpora can be used. A popular data augmentation technique is Back-Translation (BT). Over the years, many techniques have been proposed to improve vanilla BT. Prominent ones are Iterative BT, Filtering, Data Selection, and Tagged BT. Because these techniques have rarely been applied to an extremely low-resource language pair such as Sinhala-English, we apply them to this language pair for domain-specific translation in order to improve the performance of Back-Translation. In particular, we build on previous research and show that combining these techniques yields an even better result. In addition to the aforementioned approaches, we also conducted an empirical evaluation of sentence embedding techniques (LASER, LaBSE, and FastText+VecMap) for the Sinhala-English language pair. Our best model provided a +3.24 BLEU score gain over the Baseline NMT model and a +2.17 BLEU score gain over the vanilla BT model for Sinhala → English translation. Furthermore, the best model for English → Sinhala translation achieved a +1.26 BLEU score gain over the Baseline NMT model and a +2.93 BLEU score gain over the vanilla BT model.

Keywords

NEURAL MACHINE TRANSLATION-English-Sinhala, LOW-RESOURCE LANGUAGES, BACK-TRANSLATION, DATA SELECTION, ITERATIVE BACK-TRANSLATION, ITERATIVE FILTERING, INFORMATION TECHNOLOGY -Dissertation, COMPUTER SCIENCE -Dissertation, COMPUTER SCIENCE & ENGINEERING -Dissertation

Citation

Epaliyana, K. (2021). Using back-translation to improve domain-specific English-Sinhala neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21665