Institutional-Repository, University of Moratuwa.  

Word level language identification of code-mixing text in social media using NLP


dc.contributor.advisor Sumathipala S
dc.contributor.author Shanmugalingam K
dc.date.accessioned 2019
dc.date.available 2019
dc.date.issued 2019
dc.identifier.citation Shanmugalingam, K. (2019). Word level language identification of code-mixing text in social media using NLP [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/15810
dc.identifier.uri http://dl.lib.mrt.ac.lk/handle/123/15810
dc.description.abstract Automatically analyzing and extracting useful information from noisy social media content is receiving increasing attention from the research community. People now readily mix their native language with English to express their thoughts on social media, using Unicode characters written in Roman script. Such noisy code-mixed text is characterized by a high percentage of spelling mistakes from phonetic typing, wordplay, creative spelling, abbreviations, meta tags, and so on. Identifying languages at the word level has therefore become a necessary part of analyzing noisy social media content. It could also serve as an intermediate language identifier for chatbot applications that use native languages. This study used Tamil-English and Sinhala-English code-mixed text from social media. Natural Language Processing (NLP) and Machine Learning (ML) techniques were used to identify language tags at the word level. The proposed novel approach is implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency for Tamil-English code-mixed text, and features such as Sinhala Unicode characters written in Roman script, dictionaries, and term frequency for Sinhala-English code-mixed text. Several machine learning classifiers, namely Support Vector Machines (SVM), Naive Bayes, Logistic Regression, Random Forest, and Decision Trees, were used in the model evaluation process. Ten-fold cross-validation was used to evaluate performance on word-level language tags. The highest accuracy was 89.46% with the SVM classifier for Tamil-English (Tanglish) code-mixed text and 90.5% with the Random Forest classifier for Sinhala-English (Singlish) code-mixed text.
In testing on unseen data, the Tanglish model with SVM and the Singlish model with Random Forest achieved accuracies of 93.87% and 95.83% respectively. The Tanglish model with SVM achieved F-measures of 0.965 and 0.894 for the ‘tam’ and ‘eng’ tags respectively. The Singlish model with Random Forest achieved F-measures of 0.975 and 0.929 for the ‘sin’ and ‘eng’ tags respectively. This is evidence that, most of the time, the Tanglish model with SVM and the Singlish model with Random Forest predict language labels correctly at the word level. en_US
dc.language.iso en en_US
dc.subject COMPUTATIONAL MATHEMATICS-Dissertations en_US
dc.subject ARTIFICIAL INTELLIGENCE-Dissertations en_US
dc.subject NATURAL LANGUAGE PROCESSING en_US
dc.subject MACHINE LEARNING en_US
dc.subject MACHINE LEARNING-Support Vector Machines en_US
dc.subject SOCIAL MEDIA en_US
dc.subject SOCIAL MEDIA-Code-Mixed Text en_US
dc.subject ENGLISH LANGUAGE-Social Media en_US
dc.subject SINHALA LANGUAGE-Social Media en_US
dc.title Word level language identification of code-mixing text in social media using NLP en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty IT en_US
dc.identifier.degree MSc in Artificial Intelligence en_US
dc.identifier.department Department of Computational Mathematics en_US
dc.date.accept 2019
dc.identifier.accno TH3879 en_US
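
The abstract names dictionaries, double consonants, and term frequency among the word-level features used for Tamil-English text. The following is only a minimal sketch of how such features might be computed for each token before being passed to a classifier; it is not the thesis's implementation. The word list ENGLISH_WORDS, the regular expression, and the function names word_features and featurize are all hypothetical, and the Unicode-character feature is omitted because the record does not describe it.

```python
import re
from collections import Counter

# Hypothetical toy English word list (assumption); the thesis's
# actual dictionaries are not given in this record.
ENGLISH_WORDS = {"the", "is", "movie", "super", "good", "very"}

# Matches a doubled consonant letter, e.g. "ll" in "nalla".
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def word_features(word, term_counts):
    """Return an illustrative feature dict for one romanized token."""
    w = word.lower()
    return {
        "in_english_dict": w in ENGLISH_WORDS,
        "has_double_consonant": bool(DOUBLE_CONSONANT.search(w)),
        "term_frequency": term_counts[w],
    }


def featurize(sentence):
    """Split a sentence on whitespace and pair each token with its features."""
    tokens = sentence.split()
    counts = Counter(t.lower() for t in tokens)
    return [(t, word_features(t, counts)) for t in tokens]


if __name__ == "__main__":
    for token, feats in featurize("movie romba nalla movie"):
        print(token, feats)
```

Feature dicts of this kind could then be vectorized and evaluated with any of the classifiers listed in the abstract under ten-fold cross-validation; term frequency here is counted within a single sentence purely for illustration.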

