The Impact of data cleaning and model selection for depression prediction: a comparative study

Munasinghe, SG; Thayasivam, U

dc.contributor.author	Munasinghe, SG
dc.contributor.author	Thayasivam, U
dc.contributor.editor	Gunawardena, S
dc.date.accessioned	2025-11-24T09:11:38Z
dc.date.issued	2025
dc.description.abstract	Health prediction is a field in which machine learning (ML) plays a crucial role [1], offering data-driven insights for early detection and diagnosis. With the increasing availability of internet-based data, ML models can potentially leverage digital footprints, such as social media activity, online surveys, and behavioral patterns, to predict mental health conditions like depression. However, while these advancements create new opportunities for mental health monitoring, the effectiveness of ML-based prediction largely depends on the quality of the underlying training data [2]. Real-world datasets often contain noise, inconsistencies, and mislabeled instances, making data preprocessing a critical step in predictive modeling. Prior research highlights that data cleaning significantly influences model generalization [3], but an important question remains: how can data be cleaned to improve model performance without removing informative data? Furthermore, the choice of machine learning model plays a key role in handling such variations in data quality. We evaluate four classification models, namely K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and XGBoost, under different data preprocessing conditions to assess how cleaning affects their performance. By analyzing these models, we aim to provide insights into the relationship between data preprocessing, model robustness, and classification accuracy in mental health prediction. The rest of this paper is structured as follows: Section II discusses related work on data preprocessing and model selection. Section III details our methodology, including data sources, preprocessing steps, and model training. Section IV presents experimental results and analysis. Section V concludes the study and outlines future research directions.
dc.identifier.conference	Applied Data Science & Artificial Intelligence (ADScAI) Symposium 2025
dc.identifier.department	Department of Computer Science & Engineering
dc.identifier.doi	https://doi.org/10.31705/ADScAI.2025.10
dc.identifier.email	geesan.22@cse.mrt.ac.lk
dc.identifier.email	rtuthaya@cse.mrt.ac.lk
dc.identifier.faculty	Engineering
dc.identifier.place	Moratuwa, Sri Lanka
dc.identifier.proceeding	Proceedings of Applied Data Science & Artificial Intelligence Symposium 2025
dc.identifier.uri	https://dl.lib.uom.lk/handle/123/24461
dc.language.iso	en
dc.publisher	Department of Computer Science and Engineering
dc.subject	Data Cleaning
dc.subject	Model Selection
dc.subject	Classification
dc.subject	Machine Learning
dc.title	The Impact of data cleaning and model selection for depression prediction: a comparative study
dc.type	Conference-Extended-Abstract

Files

Original bundle

Name: Paper 10 - ADScAI 2025.pdf
Size: 101.67 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

ADScAI - 2025