Utilizing multilingual encoders to improve large language models for low-resource languages

dc.contributor.author: Puranegedara, I
dc.contributor.author: Chathumina, T
dc.contributor.author: Ranathunga, N
dc.contributor.author: De Silva, N
dc.contributor.author: Ranathunga, S
dc.contributor.author: Thayaparan, M
dc.date.accessioned: 2025-12-09T08:50:45Z
dc.date.issued: 2025
dc.description.abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While there are methods to align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the existing baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
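
The abstract describes fusing every intermediate encoder layer with softmax-normalised weights and projecting the result into the LLM's embedding space. The PyTorch sketch below illustrates the Global Softmax variant of that idea only; the class name GlobalSoftmaxFusion and the dimensions enc_dim and llm_dim are illustrative assumptions, not the authors' implementation, and the Transformer Softmax strategy would replace the per-layer scalars with token-conditioned weights.

# Minimal sketch, assuming stacked hidden states from every mT5 encoder layer.
import torch
import torch.nn as nn

class GlobalSoftmaxFusion(nn.Module):
    def __init__(self, num_layers: int, enc_dim: int, llm_dim: int):
        super().__init__()
        # One learnable scalar per encoder layer, normalised with softmax.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        # Linear map from the encoder space into the LLM embedding space.
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, enc_dim)
        weights = torch.softmax(self.layer_logits, dim=0)            # (num_layers,)
        fused = (weights.view(-1, 1, 1, 1) * hidden_states).sum(0)   # (batch, seq_len, enc_dim)
        return self.proj(fused)                                      # (batch, seq_len, llm_dim)

# Toy usage: 12 encoder layers, batch of 2, 8 tokens, 512-d encoder -> 4096-d LLM space.
fusion = GlobalSoftmaxFusion(num_layers=12, enc_dim=512, llm_dim=4096)
states = torch.randn(12, 2, 8, 512)
llm_inputs = fusion(states)   # shape (2, 8, 4096)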
dc.identifier.conference: Moratuwa Engineering Research Conference 2025
dc.identifier.department: Engineering Research Unit, University of Moratuwa
dc.identifier.email: imalsha.20@cse.mrt.ac.lk
dc.identifier.email: themira.20@cse.mrt.ac.lk
dc.identifier.email: nisal.20@cse.mrt.ac.lk
dc.identifier.email: NisansaDds@cse.mrt.ac.lk
dc.identifier.email: s.ranathunga@massey.ac.nz
dc.identifier.email: mokanarangan.thayaparan@open.ac.uk
dc.identifier.faculty: Engineering
dc.identifier.isbn: 979-8-3315-6724-8
dc.identifier.pgnos: pp. 641-646
dc.identifier.proceeding: Proceedings of Moratuwa Engineering Research Conference 2025
dc.identifier.uri: https://dl.lib.uom.lk/handle/123/24546
dc.language.iso: en
dc.publisher: IEEE
dc.subject: Large Language Models
dc.subject: Multilingual Understanding
dc.subject: Zero-Shot Transfer
dc.subject: Layer Fusion
dc.subject: Low-Resource Languages
dc.title: Utilizing multilingual encoders to improve large language models for low-resource languages
dc.type: Conference-Full-text

Files

Original bundle

Name: 1571154310.pdf
Size: 1.49 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
