Transliteration and byte pair encoding to improve Tamil to Sinhala neural machine translation

dc.contributor.author: Tennage, P
dc.contributor.author: Herath, A
dc.contributor.author: Thilakarathne, M
dc.contributor.author: Sandaruwan, P
dc.contributor.author: Ranathunga, S
dc.contributor.editor: Chathuranga, D
dc.date.accessioned: 2022-08-24T04:50:31Z
dc.date.available: 2022-08-24T04:50:31Z
dc.date.issued: 2018-05
dc.description.abstract: Neural Machine Translation (NMT) is the current state-of-the-art machine translation technique. However, the applicability of NMT to language pairs with high morphological variation is still debatable. A lack of language resources, especially a sufficiently large parallel corpus, causes additional issues and leads to very poor translation performance when NMT is applied to such languages. In this paper, we present three techniques to improve domain-specific NMT performance for the under-resourced language pair Sinhala and Tamil, both of which have high morphological variation. Of these three techniques, transliteration is a novel approach to improving domain-specific NMT performance for language pairs such as Sinhala and Tamil that share a common grammatical structure and have moderate lexical similarity. We built the first transliteration system for Sinhala to English and Tamil to English, which achieved an accuracy of 99.6% when tested on the parallel corpus we used for NMT training. The other technique we employed is Byte Pair Encoding (BPE), which has been used to achieve open-vocabulary translation with a fixed vocabulary of subword symbols. Our experiments show that while translation based on independent BPE models and on pure transliteration performs moderately, integrating transliteration to build a joint BPE model for this language pair increases translation quality by 1.68 BLEU points. [en_US]
dc.identifier.citation: P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan and S. Ranathunga, "Transliteration and Byte Pair Encoding to Improve Tamil to Sinhala Neural Machine Translation," 2018 Moratuwa Engineering Research Conference (MERCon), 2018, pp. 390-395, doi: 10.1109/MERCon.2018.8421939. [en_US]
dc.identifier.conference: 2018 Moratuwa Engineering Research Conference (MERCon) [en_US]
dc.identifier.department: Engineering Research Unit, University of Moratuwa [en_US]
dc.identifier.doi: 10.1109/MERCon.2018.8421939 [en_US]
dc.identifier.email: pasindu.13@cse.mrt.ac.lk [en_US]
dc.identifier.email: narmada.ah.13@cse.mrt.ac.lk [en_US]
dc.identifier.email: malith.13@cse.mrt.ac.lk [en_US]
dc.identifier.email: prabath.sandaruwan.13@cse.mrt.ac.lk [en_US]
dc.identifier.email: surangika@cse.mrt.ac.lk [en_US]
dc.identifier.faculty: Engineering
dc.identifier.pgnos: pp. 390-395 [en_US]
dc.identifier.proceeding: Proceedings of 2018 Moratuwa Engineering Research Conference (MERCon) [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/18694
dc.identifier.year: 2018 [en_US]
dc.language.iso: en [en_US]
dc.publisher: IEEE [en_US]
dc.relation.uri: https://ieeexplore.ieee.org/document/8421939 [en_US]
dc.subject: neural machine translation [en_US]
dc.subject: transliteration [en_US]
dc.subject: byte pair encoding [en_US]
dc.subject: sinhala [en_US]
dc.subject: tamil [en_US]
dc.title: Transliteration and byte pair encoding to improve Tamil to Sinhala neural machine translation [en_US]
dc.type: Conference-Full-text [en_US]
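
Illustrative note on the BPE technique named in the abstract: the sketch below shows how byte pair encoding can learn a fixed set of subword merges from word frequencies and then segment unseen words. It is a minimal, generic illustration and not the authors' implementation. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the toy English word list are all assumptions made for this example. In a joint setting of the kind the abstract describes, the frequency table would presumably be built from the source and target text together after both have been transliterated into a common script, so that merges are shared across the two languages.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy placeholder data (hypothetical; not taken from the paper's corpus).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=10)
print(merges)
print(segment("lowest", merges))  # an unseen word is still covered by known subwords
```

Because segmentation falls back to single characters when no merge applies, any word written in the training alphabet can be represented, which is what allows open-vocabulary translation with a fixed subword inventory.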
