Language detection in Sinhala English code mix data [abstract]

Thayasivam, U

dc.contributor.author	Thayasivam, U
dc.date.accessioned	2025-07-23T04:58:59Z
dc.date.issued	2019
dc.description	The following papers were published based on the results of this research project. [1] I. Smith and U. Thayasivam, "Language Detection in Sinhala-English Code-mixed Data," 2019 International Conference on Asian Language Processing (IALP), Shanghai, China, 2019, pp. 228-233, doi: 10.1109/IALP48816.2019.9037680
dc.description.abstract	Identifying languages in text data has become important due to the widespread use of multiple languages on the internet. Processing data with both Sinhala and English words, known as code-mixed data, poses a challenge. This research focuses on developing an effective method to detect Sinhala and English words in such code-mixed sentences, a novel approach at the time of this study. To achieve this goal, a new method is introduced, the first of its kind in this specific area of research. The dataset created for this study is also shared for the benefit of fellow researchers. While existing models handle Singlish Unicode characters well, there is a gap in identifying Sinhala words in sentences that include English words (code-mixed data). The outputs of this research are two models for language detection in code-mixed data. The first model, using an XGB classifier, achieves an accuracy of 92.1%. The second model, employing a Conditional Random Field (CRF) model, attains a notable F1-score of 0.94 for sequence labelling. These models address a crucial need, providing reliable tools to distinguish Sinhala and English words in the complex landscape of code-mixed data. This research not only enhances our understanding of language detection but also contributes to the broader field of natural language processing in multilingual contexts.
dc.description.sponsorship	Senate Research Committee
dc.identifier.accno	SRC203
dc.identifier.srgno	SRC/ST/2019/49
dc.identifier.uri	https://dl.lib.uom.lk/handle/123/23913
dc.language.iso	en
dc.subject	SENATE RESEARCH COMMITTEE – Research Report
dc.subject	LANGUAGE DETECTION
dc.subject	SINHALA ENGLISH CODE MIX DATA
dc.subject	CODE MIX DATA
dc.title	Language detection in Sinhala English code mix data [abstract]
dc.type	SRC-Report

Files

Original bundle

Name: SRC203 - Dr. T Uthayasanker SRCST201949 Cls.pdf
Size: 985.96 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

Senate Research Committee – Reports