Word level language identification of code mixing text in social media using nlp

Shanmugalingam, K; Sumathipala, S; Premachandra, C

dc.contributor.author	Shanmugalingam, K
dc.contributor.author	Sumathipala, S
dc.contributor.author	Premachandra, C
dc.contributor.editor	Wijesiriwardana, CP
dc.date.accessioned	2022-12-05T05:41:26Z
dc.date.available	2022-12-05T05:41:26Z
dc.date.issued	2018
dc.description.abstract	Understanding social media content has been a primary research topic since the dawn of social networking. A particular challenge is the contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes, creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. People also readily mix two or more languages to express their thoughts on social media, so automatic language identification at the word level becomes a necessary part of analyzing noisy social media content and supports its automated analysis. This study uses Tamil-English code-mixed data from popular social media posts and comments and assigns word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) technologies. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Different machine learning classifiers, such as Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.	en_US
dc.identifier.citation	K. Shanmugalingam, S. Sumathipala and C. Premachandra, "Word Level Language Identification of Code Mixing Text in Social Media using NLP," 2018 3rd International Conference on Information Technology Research (ICITR), 2018, pp. 1-5, doi: 10.1109/ICITR.2018.8736127.	en_US
dc.identifier.conference	3rd International Conference on Information Technology Research 2018	en_US
dc.identifier.department	Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa.	en_US
dc.identifier.doi	doi: 10.1109/ICITR.2018.8736127	en_US
dc.identifier.email	s.shanshiya@gmail.com	en_US
dc.identifier.email	sagaras@uom.lk	en_US
dc.identifier.email	chintaka@shibaura-it.ac.jp	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.proceeding	Proceedings of the 3rd International Conference in Information Technology Research 2018	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/19646
dc.identifier.year	2018	en_US
dc.language.iso	en	en_US
dc.publisher	Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa, Sri Lanka	en_US
dc.relation.uri	https://ieeexplore.ieee.org/document/8736127	en_US
dc.subject	Code-mixing	en_US
dc.subject	NLP	en_US
dc.subject	Machine learning	en_US
dc.subject	language identification	en_US
dc.title	Word level language identification of code mixing text in social media using nlp	en_US
dc.type	Conference-Full-text	en_US
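
The abstract names four feature families for word-level tagging (Tamil Unicode characters, dictionaries, double consonants, and term frequency) but this record does not define them. The following is a minimal sketch of how such per-word features might be computed, not the authors' implementation. The word lists, the function names, and the reading of the "Tamil Unicode" feature as a check against the Tamil Unicode block (U+0B80 to U+0BFF) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy dictionaries; the paper's actual word lists are not part of this record.
ENGLISH_WORDS = {"the", "movie", "super", "very", "good", "is"}
TAMIL_ROMAN_WORDS = {"padam", "romba", "nalla", "irukku", "illa"}

VOWELS = set("aeiou")


def has_double_consonant(word):
    """Return True if the same consonant letter appears twice in a row (e.g. 'kk')."""
    w = word.lower()
    return any(
        a == b and a.isalpha() and a not in VOWELS
        for a, b in zip(w, w[1:])
    )


def has_tamil_script(word):
    """Return True if any character lies in the Tamil Unicode block (assumed reading)."""
    return any("\u0b80" <= ch <= "\u0bff" for ch in word)


def word_features(word, term_counts):
    """Build one feature dictionary for a single token."""
    w = word.lower()
    return {
        "in_english_dict": w in ENGLISH_WORDS,
        "in_tamil_dict": w in TAMIL_ROMAN_WORDS,
        "double_consonant": has_double_consonant(w),
        "tamil_script": has_tamil_script(word),
        "term_frequency": term_counts.get(w, 0),
    }


def sentence_features(sentence):
    """Split on whitespace and return a list of (token, features) pairs."""
    tokens = sentence.split()
    # Term frequency is counted within the given text only (an assumption).
    term_counts = Counter(t.lower() for t in tokens)
    return [(t, word_features(t, term_counts)) for t in tokens]


if __name__ == "__main__":
    for token, feats in sentence_features("padam romba nalla irukku super movie"):
        print(token, feats)
```

Feature dictionaries of this shape could then be vectorized and passed to any of the classifiers listed in the abstract. The choice of vectorizer and classifier settings is left open here because the record does not specify them.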

Collections

ICITR - 2018
