The Impact of data cleaning and model selection for depression prediction: a comparative study
| dc.contributor.author | Munasinghe, SG | |
| dc.contributor.author | Thayasivam, U | |
| dc.contributor.editor | Gunawardena, S | |
| dc.date.accessioned | 2025-11-24T09:11:38Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Machine learning (ML) plays a crucial role in health prediction [1], offering data-driven insights for early detection and diagnosis. With the increasing availability of internet-based data, ML models can potentially leverage digital footprints, such as social media activity, online surveys, and behavioral patterns, to predict mental health conditions like depression. However, while these advances create new opportunities for mental health monitoring, the effectiveness of ML-based prediction largely depends on the quality of the underlying training data [2]. Real-world datasets often contain noise, inconsistencies, and mislabeled instances, making data preprocessing a critical step in predictive modeling. Prior research highlights that data cleaning significantly influences model generalization [3], but an important question remains: how can data be cleaned to improve model performance without removing informative instances? Furthermore, the choice of machine learning model plays a key role in handling such variations in data quality. We evaluate four classification models, namely K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and XGBoost, under different data preprocessing conditions to assess how cleaning affects their performance. By analyzing these models, we aim to provide insights into the relationship between data preprocessing, model robustness, and classification accuracy in mental health prediction. The rest of this paper is structured as follows: Section II discusses related work on data preprocessing and model selection. Section III details our methodology, including data sources, preprocessing steps, and model training. Section IV presents experimental results and analysis. Section V concludes the study and outlines future research directions. | |
| dc.identifier.conference | Applied Data Science & Artificial Intelligence (ADScAI) Symposium 2025 | |
| dc.identifier.department | Department of Computer Science & Engineering | |
| dc.identifier.doi | https://doi.org/10.31705/ADScAI.2025.10 | |
| dc.identifier.email | geesan.22@cse.mrt.ac.lk | |
| dc.identifier.email | rtuthaya@cse.mrt.ac.lk | |
| dc.identifier.faculty | Engineering | |
| dc.identifier.place | Moratuwa, Sri Lanka | |
| dc.identifier.proceeding | Proceedings of Applied Data Science & Artificial Intelligence Symposium 2025 | |
| dc.identifier.uri | https://dl.lib.uom.lk/handle/123/24461 | |
| dc.language.iso | en | |
| dc.publisher | Department of Computer Science and Engineering | |
| dc.subject | Data Cleaning | |
| dc.subject | Model Selection | |
| dc.subject | Classification | |
| dc.subject | Machine Learning | |
| dc.title | The Impact of data cleaning and model selection for depression prediction: a comparative study | |
| dc.type | Conference-Extended-Abstract |
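
The abstract's central question, whether cleaning mislabeled training data helps a classifier, can be illustrated with a small sketch. This is not the authors' pipeline: the data are synthetic two-dimensional blobs, the cleaning rule is Edited Nearest Neighbours (a swapped-in, well-known noise filter), and only a hand-written KNN is compared, not Decision Tree, Random Forest, or XGBoost. All function names (`make_blobs`, `knn_predict`, `edited_nearest_neighbours`, `run_comparison`) and all parameter values are assumptions made for illustration.

```python
import random
from collections import Counter


def make_blobs(n, flip_rate, rng):
    """Return (features, label) pairs drawn from two 2-D Gaussian blobs.

    Each label is flipped with probability flip_rate to mimic mislabeled rows.
    """
    data = []
    for _ in range(n):
        true_label = rng.randint(0, 1)
        center = 2.0 if true_label == 1 else -2.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        label = 1 - true_label if rng.random() < flip_rate else true_label
        data.append((features, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean)."""
    nearest = sorted(
        train,
        key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def edited_nearest_neighbours(train, k=3):
    """Keep only rows whose label agrees with the vote of their k nearest other rows."""
    kept = []
    for i, (x, label) in enumerate(train):
        others = train[:i] + train[i + 1:]
        if knn_predict(others, x, k) == label:
            kept.append((x, label))
    return kept


def accuracy(train, test, k):
    """Fraction of test rows whose KNN prediction matches the test label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def run_comparison(seed=0, k=1):
    """Compare KNN accuracy when trained on raw versus cleaned data."""
    rng = random.Random(seed)
    # Assumption: 20% of training labels are wrong; evaluation labels are clean.
    train = make_blobs(300, flip_rate=0.2, rng=rng)
    test = make_blobs(200, flip_rate=0.0, rng=rng)
    cleaned = edited_nearest_neighbours(train, k=3)
    return {
        "rows_raw": len(train),
        "rows_cleaned": len(cleaned),
        "acc_raw": accuracy(train, test, k),
        "acc_cleaned": accuracy(cleaned, test, k),
    }


if __name__ == "__main__":
    print(run_comparison())
```

The filter also discards some correctly labeled rows, which is the trade-off the abstract raises: `rows_cleaned` shows how much data was given up for the change in accuracy. On real data, the same comparison would need to be repeated per model and per cleaning condition.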
