Word level language identification of code-mixing text in social media using NLP

dc.contributor.advisorSumathipala S
dc.contributor.authorShanmugalingam K
dc.date.accept2019
dc.date.accessioned2019
dc.date.available2019
dc.date.issued2019
dc.description.abstractAutomatically analyzing and extracting useful information from noisy social media content is receiving growing attention from the research community. People now readily mix their native language with English to express their thoughts on social media, using Unicode characters written in Roman script. Such noisy code-mixed text is characterized by a high percentage of spelling mistakes arising from phonetic typing, wordplay, creative spelling, abbreviations, meta tags, and so on. Identifying languages at the word level is therefore a necessary step in analyzing noisy social media content. It could also serve as an intermediate language identifier for chatbot applications that use native languages. This study used Tamil-English and Sinhala-English code-mixed text from social media. Natural Language Processing (NLP) and Machine Learning (ML) techniques were used to identify language tags at the word level. The proposed novel approach is implemented as a machine learning classifier. For Tamil-English code-mixed text, it is based on features such as Tamil Unicode characters written in Roman script, dictionaries, double consonants, and term frequency; for Sinhala-English code-mixed text, it is based on features such as Sinhala Unicode characters written in Roman script, dictionaries, and term frequency. Several machine learning classifiers, namely Support Vector Machines (SVM), Naive Bayes, Logistic Regression, Random Forest, and Decision Trees, were used in the model evaluation process. Ten-fold cross-validation was used to evaluate performance on word-level language tags. The highest accuracy was 89.46%, obtained with the SVM classifier for Tamil-English (Tanglish) code-mixed text, and 90.5%, obtained with the Random Forest classifier for Sinhala-English (Singlish) code-mixed text.
In testing on unseen data, the Tanglish model with SVM and the Singlish model with Random Forest achieved accuracies of 93.87% and 95.83% respectively. The Tanglish model with SVM achieved F-measures of 0.965 and 0.894 for the ‘tam’ and ‘eng’ tags respectively. The Singlish model with Random Forest achieved F-measures of 0.975 and 0.929 for the ‘sin’ and ‘eng’ tags respectively. This is evidence that the Tanglish model with SVM and the Singlish model with Random Forest predict word-level language labels correctly most of the time.en_US
dc.identifier.accnoTH3879en_US
dc.identifier.citationShanmugalingam, K. (2019). Word level language identification of code-mixing text in social media using NLP [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.mrt.ac.lk/handle/123/15810
dc.identifier.degreeMSc in Artificial Intelligenceen_US
dc.identifier.departmentDepartment of Computational Mathematicsen_US
dc.identifier.facultyITen_US
dc.identifier.urihttp://dl.lib.mrt.ac.lk/handle/123/15810
dc.language.isoenen_US
dc.subjectCOMPUTATIONAL MATHEMATICS-Dissertationsen_US
dc.subjectARTIFICIAL INTELLIGENCE-Dissertationsen_US
dc.subjectNATURAL LANGUAGE PROCESSINGen_US
dc.subjectMACHINE LEARNINGen_US
dc.subjectMACHINE LEARNING-Support Vector Machinesen_US
dc.subjectSOCIAL MEDIAen_US
dc.subjectSOCIAL MEDIA-Code-Mixed Texten_US
dc.subjectENGLISH LANGUAGE-Social Mediaen_US
dc.subjectSINHALA LANGUAGE-Social Mediaen_US
dc.titleWord level language identification of code-mixing text in social media using NLPen_US
dc.typeThesis-Full-texten_US
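
The abstract describes word-level features (dictionary lookup, double consonants) and reports a per-tag F-measure. The following is a minimal Python sketch of how such features and that metric might be computed. It is not the thesis implementation: the tiny word list, the feature names, and both helper functions are hypothetical, and the classifier itself (SVM or Random Forest) is left out.

```python
import re

# Hypothetical mini word list; the dictionaries used in the thesis are not given here.
ENGLISH_WORDS = {"the", "is", "good", "movie", "very", "super"}

# Matches any doubled consonant letter, e.g. the "ll" in "nalla".
DOUBLE_CONSONANT = re.compile(r"([bcdfghjklmnpqrstvwxyz])\1")


def word_features(word, english_words=ENGLISH_WORDS):
    """Return an illustrative feature dict for one Romanized token."""
    w = word.lower()
    return {
        "in_english_dict": w in english_words,
        "has_double_consonant": bool(DOUBLE_CONSONANT.search(w)),
        "length": len(w),
        "suffix2": w[-2:],
    }


def f_measure(gold, predicted, tag):
    """Harmonic mean of precision and recall for a single language tag."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == tag and p == tag)
    fp = sum(1 for g, p in pairs if g != tag and p == tag)
    fn = sum(1 for g, p in pairs if g == tag and p != tag)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a fuller pipeline, feature dicts like these would be vectorized and passed to a classifier evaluated with ten-fold cross-validation, with `f_measure` computed separately for each tag such as ‘tam’ and ‘eng’.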
