Abstract:
Text classification is one of the core areas of machine learning applications, and it continues to improve over time. Hate speech classification in e-content has become very popular over the last few years, with search engines and social media platforms being the main interested parties. “Racist content” is a subcategory of “hate speech”, and many studies have been carried out in this area. Surprisingly, however, racist text classification in the Sinhala language has received little attention. This study focuses on building a machine learning pipeline for detecting racist comments written in the Sinhala language. Sinhala racist text classification techniques can be used to remove unwanted text from social media, pages, blogs, and many other text sources. Although this study was carried out to address Sinhala racist text classification in social media, it is also valid for any text source that contains Sinhala text as Unicode text.
As the initial step, previous similar studies were reviewed, and the techniques they used and the results they obtained were documented. One similar study was selected as the baseline, and its performance measures were set as the lower bound of the performance target. Next, an architecture for the pipeline was designed and a methodology was selected. In this methodology, the dataset was first extracted and preprocessed. Stemming, stop word removal, and conversion of text to basic character words were the main preprocessing steps. Features were then extracted and, because of the small number of racist samples, an oversampling method was used to increase the training data. Six machine learning classifiers were selected: Random Forest, Naïve Bayes, SVM, Logistic Regression, AdaBoost, and XGBoost. All the classifiers were trained with the oversampled data, which was a major point of improvement in the results. To increase the performance of these classifiers further, hyperparameter tuning was performed. Ensemble techniques, namely bagging, boosting, and voting, were also used to increase performance. After the best classifiers were selected from bagging and boosting, the best three classifiers were used as the input to a voting classifier to obtain the final results.
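The oversampling and voting steps described above can be illustrated with a minimal, standard-library-only sketch. The abstract does not state which oversampling method or voting scheme was used, so this sketch assumes plain random oversampling (duplicating minority-class samples) and hard majority voting. The function names `random_oversample` and `hard_vote` are hypothetical and are not taken from the thesis.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (assumed method)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [t for t, y in zip(texts, labels) if y == label]
        for _ in range(target - count):
            out_texts.append(rng.choice(pool))
            out_labels.append(label)
    return out_texts, out_labels


def hard_vote(predictions):
    """Combine predictions by majority vote.

    `predictions` holds one list per classifier, all of the same length;
    the result holds one combined label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data: label 1 marks the (rare) racist class, label 0 everything else.
texts = ["c1", "c2", "c3", "c4", "c5"]
labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = random_oversample(texts, labels)

# Hypothetical outputs of three tuned classifiers on three test comments.
final = hard_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
```

In a sketch like this, oversampling would be applied to the training split only, so that duplicated samples do not leak into the evaluation data. Using an odd number of voters on a two-class problem, as with the three classifiers mentioned above, avoids tied votes.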
The study shows that each preprocessing step improves the performance of the classifiers and the results. Each classifier behaves differently at each step, and this study highlights these differences and the sensitivity of each classifier to various changes. In the oversampling and hyperparameter tuning steps, all the classifiers reached a stable level. Hyperparameter tuning was identified as a critical step in Sinhala text classification. The best results were produced by the voting classifier, which was selected as the best model in the study. The proposed pipeline highlighted the performance differences between the classifiers and outperformed the baseline in accuracy. The study demonstrates that racist text classification is possible with the selected classifiers and accurate data. It also indicates that a better Sinhala racist text classification pipeline can be built using other classifiers with the help of ensemble techniques and enhanced preprocessing techniques.
Citation:
Senadheera, C.P.B. (2021). An optimized machine learning pipeline for detecting racist comments written in Sinhala language [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/20480