Language detection in Sinhala English code mix data [abstract]

Date

2019

Abstract

Identifying the language of text has become important because of the widespread use of multiple languages on the internet. Text that mixes Sinhala and English words, known as code-mixed data, is particularly challenging to process. This research develops a method for detecting Sinhala and English words in code-mixed sentences, which was the first of its kind in this area at the time of the study. The dataset created for the study is shared for the benefit of other researchers. While existing models handle Singlish Unicode characters well, there is a gap in identifying Sinhala words in sentences that also contain English words. The research produces two models for language detection in code-mixed data. The first, an XGB classifier, achieves an accuracy of 92.1%. The second, a Conditional Random Field (CRF) model, attains an F1-score of 0.94 for sequence labelling. These models provide reliable tools for distinguishing Sinhala and English words in code-mixed data and contribute to the broader field of natural language processing in multilingual contexts.
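
As a rough illustration of word-level language tagging and of the F1 metric reported above, the following Python sketch builds per-token feature dictionaries of the kind commonly fed to a sequence labeller and computes a per-label F1-score. The abstract does not list the features actually used, so the feature set, the function names (`word_features`, `sentence_features`, `f1_for_label`), the label names and the example sentence are all assumptions made for illustration only.

```python
from typing import Dict, List


def word_features(sentence: List[str], i: int) -> Dict[str, object]:
    """Hypothetical features for the i-th token (not the thesis's feature set)."""
    word = sentence[i]
    lower = word.lower()
    return {
        "lower": lower,
        "prefix2": lower[:2],
        "suffix2": lower[-2:],
        "suffix3": lower[-3:],
        "length": len(word),
        "has_digit": any(ch.isdigit() for ch in word),
        # The Sinhala Unicode block spans U+0D80 to U+0DFF.
        "has_sinhala_script": any("\u0d80" <= ch <= "\u0dff" for ch in word),
        # Neighbouring tokens give the sequence model its context.
        "prev": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }


def sentence_features(sentence: List[str]) -> List[Dict[str, object]]:
    """One feature dictionary per token, in sentence order."""
    return [word_features(sentence, i) for i in range(len(sentence))]


def f1_for_label(gold: List[str], pred: List[str], label: str) -> float:
    """Token-level F1 for one label: harmonic mean of precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up romanised code-mixed sentence with assumed labels.
    tokens = ["mama", "office", "ekata", "yanawa"]
    gold = ["si", "en", "si", "si"]
    pred = ["si", "en", "en", "si"]  # one deliberate error for the demo
    print(sentence_features(tokens)[1])
    print(f1_for_label(gold, pred, "en"))
```

Feature dictionaries in this shape could be passed to a CRF toolkit or vectorised for a gradient-boosted classifier; which toolkit and which features the study used is not stated in the abstract.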

Description

The following paper was published based on the results of this research project. [1] I. Smith and U. Thayasivam, "Language Detection in Sinhala-English Code-mixed Data," 2019 International Conference on Asian Language Processing (IALP), Shanghai, China, 2019, pp. 228-233, doi: 10.1109/IALP48816.2019.9037680.
