Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation

dc.contributor.advisor: Jayasena, S.
dc.contributor.advisor: Ranathunga, S.
dc.contributor.author: Thillainathan, S.
dc.date.accept: 2022
dc.date.accessioned: 2022
dc.date.available: 2022
dc.date.issued: 2022
dc.description.abstract: Limited parallel data is a major bottleneck for morphologically rich Low-Resource Languages (LRLs), resulting in Neural Machine Translation (NMT) systems of poor quality. Language representation learning in a self-supervised sequence-to-sequence fashion has become a new paradigm that exploits the largely available monolingual data and alleviates the parallel data scarcity issue in NMT. Any language pair supported by a Self-supervised Multilingual Sequence-to-sequence Pre-trained (SMSP) model can be fine-tuned on top of that model with a small amount of parallel data. This study shows the viability of fine-tuning such SMSP models for an extremely low-resource, domain-specific NMT setting. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning of SMSP models. To demonstrate this, we select the Sinhala, Tamil and English languages in an extremely low-resource setting in the domain of official government documents. This research explores ways to extend SMSP models to new domains and to improve the fine-tuning process of SMSP models so as to obtain high-quality translations in an extremely low-resource setting. We propose two novel approaches: (1) continual pre-training of the SMSP model in a self-supervised manner with domain-specific monolingual data to incorporate new domains, and (2) multistage fine-tuning of the SMSP model with in-domain and out-domain parallel data. Our experiments with Sinhala (Si), Tamil (Ta) and English (En) show that directly fine-tuning (single-step) the SMSP model mBART for LRLs significantly outperforms state-of-the-art Transformer-based NMT models for all language pairs in all six bilingual directions. We gain a +7.17 BLEU score on Si→En translation and a +6.74 BLEU score for the Ta→En direction. Most importantly, for non-English-centric Si-Ta fine-tuning, we surpass the state-of-the-art Transformer-based NMT model by +4.11 BLEU on Ta→Si and +2.78 BLEU on Si→Ta. Moreover, our proposed approaches improve performance by around +1 BLEU over the strong single-step direct mBART fine-tuning in all six directions. Finally, we propose a multi-model ensemble that further improves performance in all cases, yielding the overall best model with a +2 BLEU improvement.
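
For illustration, the sketch below shows what the single-step (direct) fine-tuning of mBART described in the abstract could look like with the Hugging Face Transformers library, using Sinhala→English as the example direction. It is a minimal sketch only: the checkpoint name (facebook/mbart-large-50), the toy sentence pairs and the hyperparameters are assumptions for demonstration and do not reproduce the thesis's data, configuration or results.

# Minimal illustrative sketch: single-step fine-tuning of mBART for Si->En.
# Assumptions: the "facebook/mbart-large-50" checkpoint, the toy sentence
# pairs and the hyperparameters below are placeholders, not the thesis setup.
import torch
from torch.optim import AdamW
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50"
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="si_LK", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Toy parallel data standing in for the official-document corpus.
src_texts = ["<Sinhala source sentence 1>", "<Sinhala source sentence 2>"]
tgt_texts = ["English target sentence 1.", "English target sentence 2."]

# text_target (recent transformers versions) tokenizes the targets as labels.
batch = tokenizer(src_texts, text_target=tgt_texts, padding=True,
                  truncation=True, max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):                      # a real run needs many more steps
    outputs = model(**batch)                # labels are included in the batch
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

# Translate with the fine-tuned model, forcing English as the target language.
model.eval()
inputs = tokenizer(["<Sinhala test sentence>"], return_tensors="pt")
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    max_length=128, num_beams=5)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

The proposed continual pre-training on domain-specific monolingual data and the multistage fine-tuning on out-domain followed by in-domain parallel data would wrap additional training phases around the same model and tokenizer objects before this final in-domain step.
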
dc.identifier.accno: TH5032
dc.identifier.citation: Thillainathan, S. (2022). Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository, University of Moratuwa. http://dl.lib.uom.lk/handle/123/21664
dc.identifier.degree: MSc in Computer Science and Engineering by Research
dc.identifier.department: Department of Computer Science and Engineering
dc.identifier.faculty: Engineering
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/21664
dc.language.iso: en
dc.subject: PRE-TRAINING
dc.subject: FINE-TUNING
dc.subject: LOW-RESOURCE LANGUAGES
dc.subject: MBART
dc.subject: PRE-TRAINED LANGUAGE MODELS
dc.subject: NEURAL MACHINE TRANSLATION
dc.subject: INFORMATION TECHNOLOGY - Dissertation
dc.subject: COMPUTER SCIENCE - Dissertation
dc.subject: COMPUTER SCIENCE & ENGINEERING - Dissertation
dc.title: Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation
dc.type: Thesis-Full-text

Files

Original bundle (3 files)

Name: TH5032-1.pdf
Size: 134.24 KB
Format: Adobe Portable Document Format
Description: Pre-Text

Name: TH5032-2.pdf
Size: 107.47 KB
Format: Adobe Portable Document Format
Description: Post-Text
Name: TH5032.pdf
Size: 1.4 MB
Format: Adobe Portable Document Format
Description: Full thesis

License bundle (1 file)

Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission