The Impact of data cleaning and model selection for depression prediction: a comparative study

dc.contributor.authorMunasinghe, SG
dc.contributor.authorThayasivam, U
dc.contributor.editorGunawardena, S
dc.date.accessioned2025-11-24T09:11:38Z
dc.date.issued2025
dc.description.abstractMachine learning (ML) plays a crucial role in health prediction [1], offering data-driven insights for early detection and diagnosis. With the increasing availability of internet-based data, ML models have the potential to leverage digital footprints, such as social media activity, online surveys, and behavioral patterns, to predict mental health conditions like depression. However, while these advances create new opportunities for mental health monitoring, the effectiveness of ML-based prediction depends largely on the quality of the underlying training data [2]. Real-world datasets often contain noise, inconsistencies, and mislabeled instances, making data preprocessing a critical step in predictive modeling. Prior research highlights that data cleaning significantly influences model generalization [3], but an important question remains: how can data be cleaned to improve model performance without removing informative instances? Furthermore, the choice of machine learning model plays a key role in handling such variations in data quality. We evaluate four classification models, namely K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and XGBoost, under different data preprocessing conditions to assess how cleaning affects their performance. By analyzing these models, we aim to provide insights into the relationship between data preprocessing, model robustness, and classification accuracy in mental health prediction. The rest of this paper is structured as follows: Section II discusses related work on data preprocessing and model selection. Section III details our methodology, including data sources, preprocessing steps, and model training. Section IV presents experimental results and analysis. Section V concludes the study and outlines future research directions.
dc.identifier.conferenceApplied Data Science & Artificial Intelligence (ADScAI) Symposium 2025
dc.identifier.departmentDepartment of Computer Science & Engineering
dc.identifier.doihttps://doi.org/10.31705/ADScAI.2025.10
dc.identifier.emailgeesan.22@cse.mrt.ac.lk
dc.identifier.emailrtuthaya@cse.mrt.ac.lk
dc.identifier.facultyEngineering
dc.identifier.placeMoratuwa, Sri Lanka
dc.identifier.proceedingProceedings of Applied Data Science & Artificial Intelligence Symposium 2025
dc.identifier.urihttps://dl.lib.uom.lk/handle/123/24461
dc.language.isoen
dc.publisherDepartment of Computer Science and Engineering
dc.subjectData Cleaning
dc.subjectModel Selection
dc.subjectClassification
dc.subjectMachine Learning
dc.titleThe Impact of data cleaning and model selection for depression prediction: a comparative study
dc.typeConference-Extended-Abstract
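
The comparison described in the abstract (the same classifier trained on raw versus cleaned data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' method: the data are synthetic two-class points with deliberately flipped labels, the cleaning rule drops any training point whose label disagrees with the majority of its nearest neighbors, and only KNN is shown, written in plain Python, where the study also evaluates Decision Tree, Random Forest, and XGBoost. The names `make_data`, `clean`, and `compare` are hypothetical.

```python
import random
from collections import Counter


def make_data(n, flip_rate, rng):
    """Synthetic 2-D two-class data; a fraction flip_rate of labels is corrupted."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label else -2.0
        x = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        if rng.random() < flip_rate:
            label = 1 - label  # simulate a mislabeled instance
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def clean(train, k):
    """Assumed cleaning rule: keep a point only if its k nearest *other*
    points agree with its label (neighbor-based noise filtering)."""
    kept = []
    for i, (x, label) in enumerate(train):
        others = train[:i] + train[i + 1:]
        if knn_predict(others, x, k) == label:
            kept.append((x, label))
    return kept


def accuracy(train, test, k):
    """Fraction of test points that KNN trained on `train` labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def compare(seed=0, flip_rate=0.3, k_clean=5):
    """Train KNN on raw and on cleaned data; evaluate on noise-free test data."""
    rng = random.Random(seed)
    train = make_data(200, flip_rate, rng)
    test = make_data(200, 0.0, rng)
    cleaned = clean(train, k_clean)
    results = {}
    for k in (1, 5):
        results[k] = {
            "raw": accuracy(train, test, k),
            "cleaned": accuracy(cleaned, test, k),
        }
    return len(train), len(cleaned), results


if __name__ == "__main__":
    n_train, n_cleaned, results = compare()
    print(f"training points: {n_train} raw, {n_cleaned} after cleaning")
    for k, scores in results.items():
        print(f"k={k}: raw={scores['raw']:.2f} cleaned={scores['cleaned']:.2f}")
```

The sketch also hints at the trade-off the abstract raises: a neighbor-based filter removes some correctly labeled points along with the mislabeled ones, so the number of retained training points is worth reporting next to the accuracy.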

Files

Original bundle

Name: Paper 10 - ADScAI 2025.pdf
Size: 101.67 KB
Format: Adobe Portable Document Format
