Sinhala-English language detection in code-mixed data

dc.contributor.advisor: Thayasivam U
dc.contributor.author: Smith JRI
dc.date.accept: 2020
dc.date.accessioned: 2020
dc.date.available: 2020
dc.date.issued: 2020
dc.description.abstract: Text processing is a highly demanding research area within natural language processing. The knowledge gathered through text processing is used in a variety of other domains, such as artificial intelligence, optical reading and chatbots. Language detection in text has also become a popular area of study because of the use of multiple languages on the internet. Language identification is particularly difficult in bilingual (a mix of two languages) and multilingual (a mix of more than two languages) data. Accordingly, this research presents a method to detect tokens written in Sinhala and English in code-mixed data. To the best of the author's knowledge at the time of writing, this is the first such study conducted on Sinhala-English code-mixed data. To be precise, it is the first attempt to build a machine learning model on Sinhala-English code-mixed data written using Latin alphabetic characters. If the code-mixed data contains Unicode characters, language detection is straightforward and can be achieved using a simple Python program. However, when the whole sentence is written in Latin characters, ambiguity increases and detecting the language is no longer straightforward; this study attempts to build a model that addresses this ambiguity. In practice, Sri Lankans use Sinhala words together with English on social media platforms for communication, posting reviews, commenting and so on. Many methods exist to detect Singlish words, especially those written in Unicode characters, yet the accuracy of these models in distinguishing Sinhala tokens from English tokens in code-mixed text is questionable. Therefore, this study presents a language detection model using machine learning and natural language processing techniques.
Accordingly, two models are introduced to identify Sinhala-English code-mixed data gathered from social media platforms, and another model is introduced to identify languages at the word level using state-of-the-art techniques. In addition, the dataset of Sinhala-English code-mixed data was published in ICTER 2019 [50] for use in similar studies, and the final study was published in IALP 2019, held in China [51]. (en_US)
dc.identifier.accno: TH4291 (en_US)
dc.identifier.degree: MSc in Computer Science and Engineering (en_US)
dc.identifier.department: Department of Computer Science and Engineering (en_US)
dc.identifier.faculty: Engineering (en_US)
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/16490
dc.language.iso: en (en_US)
dc.subject: COMPUTER SCIENCE – Dissertations (en_US)
dc.subject: COMPUTER SCIENCE AND ENGINEERING - Dissertations (en_US)
dc.subject: TEXT PROCESSING (en_US)
dc.subject: NATURAL LANGUAGE PROCESSING (en_US)
dc.subject: MULTI LANGUAGE LEARNING (en_US)
dc.subject: MACHINE LEARNING - Sinhala-English Code-Mixed Data (en_US)
dc.subject: UNICODE CHARACTERS - Singlish (en_US)
dc.subject: LANGUAGE DETECTION (en_US)
dc.title: Sinhala-English language detection in code-mixed data (en_US)
dc.type: Thesis-Full-text (en_US)
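
The abstract notes that when code-mixed text contains Unicode Sinhala characters, detection is straightforward and can be done with a simple Python program. The following is a minimal sketch of that easy case only, not the thesis's model. It assumes whitespace tokenization and relies on the Sinhala Unicode block (U+0D80 to U+0DFF); the function names `script_of` and `tag_tokens` and the label strings are hypothetical, chosen for illustration.

```python
# Illustrative sketch only: tags each token by writing system, not by language.
# Function names and labels are assumptions, not taken from the thesis.

SINHALA_START = "\u0d80"  # first code point of the Sinhala Unicode block
SINHALA_END = "\u0dff"    # last code point of the Sinhala Unicode block


def script_of(token: str) -> str:
    """Return 'sinhala', 'latin', 'mixed' or 'other' for a single token."""
    has_sinhala = any(SINHALA_START <= ch <= SINHALA_END for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_sinhala and has_latin:
        return "mixed"
    if has_sinhala:
        return "sinhala"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, other scripts


def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Split on whitespace and tag every token with its script."""
    return [(tok, script_of(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    for token, script in tag_tokens("මම school යනවා"):
        print(token, script)
```

A token tagged `latin` here may still be either English or Sinhala written in Latin characters. A script check cannot separate those two cases, which is the ambiguity the thesis addresses with machine learning models.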

Files

Original bundle

Name: TH4291-1.pdf; Size: 165.29 KB; Format: Adobe Portable Document Format; Description: Pre-text
Name: TH4291-2.pdf; Size: 88.16 KB; Format: Adobe Portable Document Format; Description: Post-text
Name: TH4291.pdf; Size: 1.01 MB; Format: Adobe Portable Document Format; Description: Full-thesis