SinLlama - a large language model for Sinhala
| dc.contributor.author | Aravinda, HWK | |
| dc.contributor.author | Sirajudeen, R | |
| dc.contributor.author | Karunathilake, S | |
| dc.contributor.author | De Silva, N | |
| dc.contributor.author | Kaur, R | |
| dc.contributor.author | Ranathunga, S | |
| dc.date.accessioned | 2025-12-09T09:10:01Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). It is therefore imperative that existing LLMs be further trained to cover such languages. In this research, we extend an existing multilingual LLM (Llama-3-8B) to provide better coverage of Sinhala. We enhanced the LLM tokenizer with Sinhala-specific vocabulary and performed continual pre-training on a 10-million-sentence Sinhala corpus, resulting in the SinLlama model. This is the first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed the base and instruct variants of Llama-3-8B by a significant margin. | |
| dc.identifier.conference | Moratuwa Engineering Research Conference 2025 | |
| dc.identifier.department | Engineering Research Unit, University of Moratuwa | |
| dc.identifier.email | aravinda.20@cse.mrt.ac.lk | |
| dc.identifier.email | rashad.20@cse.mrt.ac.lk | |
| dc.identifier.email | samith.20@cse.mrt.ac.lk | |
| dc.identifier.email | NisansaDds@cse.mrt.ac.lk | |
| dc.identifier.email | rishemjit.kaur@csio.res.in | |
| dc.identifier.email | s.ranathunga@massey.ac.nz | |
| dc.identifier.faculty | Engineering | |
| dc.identifier.isbn | 979-8-3315-6724-8 | |
| dc.identifier.pgnos | pp. 617-622 | |
| dc.identifier.proceeding | Proceedings of Moratuwa Engineering Research Conference 2025 | |
| dc.identifier.uri | https://dl.lib.uom.lk/handle/123/24550 | |
| dc.language.iso | en | |
| dc.subject | Sinhala | |
| dc.subject | low-resource languages | |
| dc.subject | large language models | |
| dc.subject | continual pretraining | |
| dc.subject | LLM | |
| dc.subject | Llama | |
| dc.subject | text classification | |
| dc.title | SinLlama - a large language model for Sinhala | |
| dc.type | Conference-Full-text |
