Show simple item record

dc.contributor.advisor Thayasivam U
dc.contributor.author Smith JRI
dc.date.accessioned 2020
dc.date.available 2020
dc.date.issued 2020
dc.identifier.uri http://dl.lib.uom.lk/handle/123/16490
dc.description.abstract Text processing is a highly active research area within natural language processing. The knowledge gathered through text processing is used in a variety of other domains, such as artificial intelligence, optical reading, and chatbots. Language detection in text has also become a popular topic of study because of the use of multiple languages on the internet. Language identification is particularly difficult in bilingual (a mix of two languages) and multilingual (a mix of more than two languages) data. Accordingly, this research presents a method to detect tokens written in Sinhala and English in code-mixed data. To the best of the author's knowledge at the time of writing, this is the first such study conducted on Sinhala-English code-mixed data. To be precise, it is the first attempt to build a machine learning model on Sinhala-English code-mixed data written in Latin alphabetic characters. If the code-mixed data contains Unicode (Sinhala-script) characters, language detection is straightforward and can be achieved with a simple Python program. However, when the whole sentence is written in Latin characters, ambiguity increases and detecting the language is no longer straightforward; this study attempts to build a model that addresses this ambiguity. In practice, Sri Lankans use Sinhala words together with English on social media platforms for communication, posting reviews, commenting, and so on. Although there are many methods to detect Singlish words, especially those written in Unicode characters, the accuracy of these models in distinguishing Sinhala tokens from English tokens in code-mixed text data is questionable. Therefore, this study presents a language detection model using machine learning and natural language processing techniques.
Accordingly, two models are introduced to identify Sinhala-English code-mixed data gathered from social media platforms, along with another model that identifies languages at the word level using state-of-the-art techniques. In addition, the dataset of Sinhala-English code-mixed data was published at ICTER 2019 [50] for use in similar studies, and the final study was published at IALP 2019, held in China [51]. en_US
dc.language.iso en en_US
dc.subject COMPUTER SCIENCE – Dissertations en_US
dc.subject COMPUTER SCIENCE AND ENGINEERING - Dissertations en_US
dc.subject TEXT PROCESSING en_US
dc.subject NATURAL LANGUAGE PROCESSING en_US
dc.subject MULTI LANGUAGE LEARNING en_US
dc.subject MACHINE LEARNING - Sinhala-English Code-Mixed Data en_US
dc.subject UNICODE CHARACTERS - Singlish en_US
dc.subject LANGUAGE DETECTION en_US
dc.title Sinhala-English language detection in code-mixed data en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc in Computer Science and Engineering en_US
dc.identifier.department Department of Computer Science and Engineering en_US
dc.date.accept 2020
dc.identifier.accno TH4291 en_US
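
The abstract notes that when code-mixed text contains Sinhala-script Unicode characters, language detection can be done with a simple Python program. The following is a minimal sketch of that idea, not the thesis's own code: it labels each whitespace-separated token by checking whether its characters fall in the Unicode Sinhala block (U+0D80 to U+0DFF). The function names `detect_script` and `tag_tokens` and the label strings are assumptions made for illustration.

```python
# Illustrative sketch only: script-based token labelling.
# Function names and labels are hypothetical, not taken from the thesis.

SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF  # Unicode Sinhala block


def detect_script(token: str) -> str:
    """Label a token by the script of the characters it contains."""
    has_sinhala = any(SINHALA_START <= ord(ch) <= SINHALA_END for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_sinhala and has_latin:
        return "mixed"
    if has_sinhala:
        return "sinhala"
    if has_latin:
        return "latin"
    return "other"


def tag_tokens(text: str) -> list:
    """Split on whitespace and pair each token with its script label."""
    return [(tok, detect_script(tok)) for tok in text.split()]


if __name__ == "__main__":
    # Sinhala-script tokens are separable from English by code point alone.
    print(tag_tokens("මම school යනවා"))
    # Romanized input is all Latin, so this check cannot tell the languages apart.
    print(tag_tokens("mama school yanawa"))
```

On romanized input every token comes back as "latin", which is the ambiguity the abstract describes and the reason a learned word-level model is needed for Sinhala written in Latin characters.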


