Abstract:
Neural Machine Translation (NMT) is the current
state-of-the-art machine translation technique. However,
the applicability of NMT to language pairs with high
morphological variation is still debatable. The lack of language
resources, especially a sufficiently large parallel corpus,
causes additional issues and leads to very poor translation
performance when NMT is applied to such morphologically
rich languages. In this paper, we present three
techniques to improve domain-specific NMT performance for the
under-resourced language pair Sinhala and Tamil, both of which
have high morphological variation. Out of these three techniques,
transliteration is a novel approach to improving domain-specific
NMT performance for language pairs such as Sinhala and Tamil
that share a common grammatical structure and have moderate
lexical similarity. We built the first transliteration system for
Sinhala to English and Tamil to English, which provided an
accuracy of 99.6% when tested on the parallel corpus we used
for NMT training. The other technique we employed is Byte Pair
Encoding (BPE), which has been used to achieve open-vocabulary
translation with a fixed vocabulary of subword symbols. Our
experiments show that while translation based on independent BPE
models and pure transliteration performs only moderately,
integrating transliteration to build a joint BPE model for this
language pair increases translation quality by 1.68 BLEU points.
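
The following is a minimal sketch of the BPE merge-learning step in the style of Sennrich et al. (2016), on which subword segmentation of this kind is based; the toy word-frequency vocabulary and the merge budget are illustrative assumptions, not values from the paper.

    import re
    import collections

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the chosen pair with its concatenation."""
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Toy vocabulary: words as space-separated symbols with an end-of-word marker.
    # Real training would use counts from the (transliterated) parallel corpus.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

    num_merges = 10  # illustrative budget; this fixes the size of the subword vocabulary
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair becomes a new symbol
        vocab = merge_pair(best, vocab)
        print(best)

In a joint BPE setting such as the one the abstract refers to, the merge operations are learned over the concatenation of source- and target-side training data so that both sides share one subword vocabulary; applying the learned merges to new text splits unknown words into known subword units.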
Citation:
P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan and S. Ranathunga, "Transliteration and Byte Pair Encoding to Improve Tamil to Sinhala Neural Machine Translation," 2018 Moratuwa Engineering Research Conference (MERCon), 2018, pp. 390-395, doi: 10.1109/MERCon.2018.8421939.