Neural machine translation for low-resourced languages: Sinhala & Tamil [abstract]

Date

2019

Abstract

Neural Machine Translation (NMT) has become the leading approach to machine translation, but its gains have been largest for resource-rich languages. This research addresses its limitations in low-resource settings, focusing on Sinhala, the primary language of Sri Lanka. Because the English competency of Sri Lankans is below average, translating English content into Sinhala is an essential requirement, particularly for official government documents. This study introduces an NMT system with Byte Pair Encoding (BPE), tailored to the English-Sinhala pair, with an emphasis on improved translation accuracy for Sri Lankan official documents. The research also examines the broader challenges of low-resource, morphologically rich languages such as Sinhala. While standard NMT surpasses Statistical Machine Translation when an ample parallel corpus is available, low-resource languages suffer from out-of-vocabulary (OOV) and rare-word problems. Using a state-of-the-art English-Sinhala translation system based on the Transformer architecture, we investigated various sub-word techniques and found empirically that they improve translation quality. Our experiments demonstrate how BPE can be incorporated to address the OOV problem in morphologically rich languages. Our models further show that sub-word segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
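
To illustrate why BPE helps with OOV words, the following is a minimal sketch of the general BPE idea: repeatedly merge the most frequent adjacent symbol pair, then segment unseen words with the learned merges. It is not the implementation used in this research. The function names (`learn_bpe`, `merge_pair`, `segment`) and the toy English word counts are assumptions made purely for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into sub-word units using the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; a real system would use counts from the training data.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=10)

# "lowest" never appears in the toy corpus, yet it is still representable
# as a sequence of known sub-word units rather than a single unknown token.
print(segment("lowest", toy_merges))
```

Because segmentation always falls back to single characters, any word written in the training alphabet can be encoded, which is the property that makes sub-word units attractive for morphologically rich languages.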

Description

The following papers were published based on the results of this research project.

[1] Naranpanawa, R., Perera, R., Fonseka, T., & Thayasivam, U. (2020). Analyzing subword techniques to improve English to Sinhala neural machine translation. International Journal of Asian Language Processing, 30(04), 2050017. https://doi.org/10.1142/S2717554520500174

[2] T. Fonseka, R. Naranpanawa, R. Perera and U. Thayasivam, "English to Sinhala Neural Machine Translation," 2020 International Conference on Asian Language Processing (IALP), Kuala Lumpur, Malaysia, 2020, pp. 305-309, doi: 10.1109/IALP51396.2020.9310462
