Language detection in Sinhala English code mix data [abstract]

dc.contributor.author: Thayasivam, U
dc.date.accessioned: 2025-07-23T04:58:59Z
dc.date.issued: 2019
dc.description: The following papers were published based on the results of this research project. [1] I. Smith and U. Thayasivam, "Language Detection in Sinhala-English Code-mixed Data," 2019 International Conference on Asian Language Processing (IALP), Shanghai, China, 2019, pp. 228-233, doi: 10.1109/IALP48816.2019.9037680
dc.description.abstract: Identifying languages in text data has become important due to the widespread use of multiple languages on the internet. Processing data with both Sinhala and English words, known as code-mixed data, poses a challenge. This research focuses on developing an effective method to detect Sinhala and English words in such code-mixed sentences, a novel approach at the time of this study. To achieve this goal, a new method is introduced, the first of its kind in this specific area of research. The dataset created for this study is also shared for the benefit of fellow researchers. While existing models handle Singlish Unicode characters well, there is a gap in identifying Sinhala words in sentences that include English words (code-mixed data). The outputs of this research are two models for language detection in code-mixed data. The first model, using an XGB classifier, achieves an accuracy of 92.1%. The second model, employing a Conditional Random Field (CRF) model, attains a notable F1-score of 0.94 for sequence labelling. These models address a crucial need, providing reliable tools to distinguish Sinhala and English words in the complex landscape of code-mixed data. This research not only enhances our understanding of language detection but also contributes to the broader field of natural language processing in multilingual contexts.
dc.description.sponsorship: Senate Research Committee
dc.identifier.accno: SRC203
dc.identifier.srgno: SRC/ST/2019/49
dc.identifier.uri: https://dl.lib.uom.lk/handle/123/23913
dc.language.iso: en
dc.subject: SENATE RESEARCH COMMITTEE – Research Report
dc.subject: LANGUAGE DETECTION
dc.subject: SINHALA ENGLISH CODE MIX DATA
dc.subject: CODE MIX DATA
dc.title: Language detection in Sinhala English code mix data [abstract]
dc.type: SRC-Report

Files

Original bundle

Name: SRC203 - Dr. T Uthayasanker SRCST201949 Cls.pdf
Size: 985.96 KB
Format: Adobe Portable Document Format
