Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation

dc.contributor.advisor: Jayasena, S.
dc.contributor.advisor: Ranathunga, S.
dc.contributor.author: Thillainathan, S.
dc.date.accept: 2022
dc.date.accessioned: 2022
dc.date.available: 2022
dc.date.issued: 2022
dc.description.abstract: Limited parallel data is a major bottleneck for morphologically rich Low-Resource Languages (LRLs), resulting in Neural Machine Translation (NMT) systems of poor quality. Language representation learning in a self-supervised sequence-to-sequence fashion has become a new paradigm that exploits the largely available monolingual data and alleviates the parallel data scarcity issue in NMT. Any language pair supported by a Self-supervised Multilingual Sequence-to-sequence Pre-trained (SMSP) model can be fine-tuned on top of that model with a small amount of parallel data. This study shows the viability of fine-tuning such SMSP models for an extremely low-resource, domain-specific NMT setting. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning of SMSP models. To demonstrate this, we select the Sinhala, Tamil and English languages in an extremely low-resource setting in the domain of official government documents. This research explores ways to extend SMSP models to new domains and to improve the fine-tuning process of SMSP models so as to obtain high-quality translations in an extremely low-resource setting. We propose two novel approaches: (1) continual pre-training of the SMSP model in a self-supervised manner with domain-specific monolingual data to incorporate new domains, and (2) multistage fine-tuning of the SMSP model with in-domain and out-domain parallel data. Our experiments with Sinhala (Si), Tamil (Ta) and English (En) show that directly fine-tuning (single-step) the SMSP model mBART for LRLs significantly outperforms state-of-the-art Transformer-based NMT models for all language pairs in all six bilingual directions. We gain a +7.17 BLEU score on Si→En translation and a +6.74 BLEU score for the Ta→En direction. Most importantly, for non-English-centric Si-Ta fine-tuning, we surpass the state-of-the-art Transformer-based NMT model by +4.11 BLEU on Ta→Si and +2.78 BLEU on Si→Ta. Moreover, our proposed approaches improve performance by around +1 BLEU over the strong single-step direct mBART fine-tuning in all six directions. Finally, we propose a multi-model ensemble that further improves performance in all cases, yielding the overall best model with a +2 BLEU improvement.
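
For illustration, the sketch below shows what the single-step (direct) fine-tuning of mBART described in the abstract could look like with the Hugging Face Transformers library, using Sinhala→English as the example direction. It is a minimal sketch only: the checkpoint name (facebook/mbart-large-50), the toy sentence pairs and the hyperparameters are assumptions for demonstration and do not reproduce the thesis's data, configuration or results.

# Minimal illustrative sketch: single-step fine-tuning of mBART for Si->En.
# Assumptions: the "facebook/mbart-large-50" checkpoint, the toy sentence
# pairs and the hyperparameters below are placeholders, not the thesis setup.
import torch
from torch.optim import AdamW
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50"
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="si_LK", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Toy parallel data standing in for the official-document corpus.
src_texts = ["<Sinhala source sentence 1>", "<Sinhala source sentence 2>"]
tgt_texts = ["English target sentence 1.", "English target sentence 2."]

# text_target (recent transformers versions) tokenizes the targets as labels.
batch = tokenizer(src_texts, text_target=tgt_texts, padding=True,
                  truncation=True, max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):                      # a real run needs many more steps
    outputs = model(**batch)                # labels are included in the batch
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

# Translate with the fine-tuned model, forcing English as the target language.
model.eval()
inputs = tokenizer(["<Sinhala test sentence>"], return_tensors="pt")
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    max_length=128, num_beams=5)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

The proposed continual pre-training on domain-specific monolingual data and the multistage fine-tuning on out-domain followed by in-domain parallel data would wrap additional training phases around the same model and tokenizer objects before this final in-domain step.
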
dc.identifier.accno: TH5032
dc.identifier.citation: Thillainathan, S. (2022). Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository, University of Moratuwa. http://dl.lib.uom.lk/handle/123/21664
dc.identifier.degree: MSc in Computer Science and Engineering by Research
dc.identifier.department: Department of Computer Science and Engineering
dc.identifier.faculty: Engineering
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/21664
dc.language.iso: en
dc.subject: PRE-TRAINING
dc.subject: FINE-TUNING
dc.subject: LOW-RESOURCE LANGUAGES
dc.subject: MBART
dc.subject: PRE-TRAINED LANGUAGE MODELS
dc.subject: NEURAL MACHINE TRANSLATION
dc.subject: INFORMATION TECHNOLOGY - Dissertation
dc.subject: COMPUTER SCIENCE - Dissertation
dc.subject: COMPUTER SCIENCE & ENGINEERING - Dissertation
dc.title: Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation
dc.type: Thesis-Full-text

Files

Original bundle (3 files)

Name: TH5032-1.pdf
Size: 134.24 KB
Format: Adobe Portable Document Format
Description: Pre-Text

Name: TH5032-2.pdf
Size: 107.47 KB
Format: Adobe Portable Document Format
Description: Post-Text
Name: TH5032.pdf
Size: 1.4 MB
Format: Adobe Portable Document Format
Description: Full thesis

License bundle (1 file)

Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission