SinLlama - a large language model for Sinhala

dc.contributor.author: Aravinda, HWK
dc.contributor.author: Sirajudeen, R
dc.contributor.author: Karunathilake, S
dc.contributor.author: De Silva, N
dc.contributor.author: Kaur, R
dc.contributor.author: Ranathunga, S
dc.date.accessioned: 2025-12-09T09:10:01Z
dc.date.issued: 2025
dc.description.abstract: Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). Therefore, it is imperative that existing LLMs are further trained to cover such languages. In this research, we extend an existing multilingual LLM (Llama-3-8B) to provide better coverage for Sinhala. We enhanced the LLM tokenizer with Sinhala-specific vocabulary and performed continual pre-training on a 10-million-sentence Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed the base and instruct variants of Llama-3-8B by a significant margin.
dc.identifier.conference: Moratuwa Engineering Research Conference 2025
dc.identifier.department: Engineering Research Unit, University of Moratuwa
dc.identifier.email: aravinda.20@cse.mrt.ac.lk
dc.identifier.email: rashad.20@cse.mrt.ac.lk
dc.identifier.email: samith.20@cse.mrt.ac.lk
dc.identifier.email: NisansaDds@cse.mrt.ac.lk
dc.identifier.email: rishemjit.kaur@csio.res.in
dc.identifier.email: s.ranathunga@massey.ac.nz
dc.identifier.faculty: Engineering
dc.identifier.isbn: 979-8-3315-6724-8
dc.identifier.pgnos: pp. 617-622
dc.identifier.proceeding: Proceedings of Moratuwa Engineering Research Conference 2025
dc.identifier.uri: https://dl.lib.uom.lk/handle/123/24550
dc.language.iso: en
dc.subject: Sinhala
dc.subject: low-resource languages
dc.subject: large language models
dc.subject: continual pretraining
dc.subject: LLM
dc.subject: Llama
dc.subject: text classification
dc.title: SinLlama - a large language model for Sinhala
dc.type: Conference-Full-text

Files

Original bundle

Name: 1571154281.pdf
Size: 1.44 MB
Format: Adobe Portable Document Format
