Impact of data normalization on the accuracy of supervised learning models

Original Article

DOI:

https://doi.org/10.33936/riemat.v10i2.7853

Keywords:

supervised learning, data normalization, cross-validation, binary classification, class imbalance

Abstract

Feature normalization is a key step in supervised classification, especially when features are measured on heterogeneous scales. This study evaluates the impact of two normalization strategies (MinMax and Z-Score) on the performance of three models (Logistic Regression, SVC/SVM, and Decision Tree) applied to four datasets obtained from the UCI Machine Learning Repository: Adult Income, Heart Disease, Student Performance Math, and Student Performance Portuguese. The models were trained using stratified cross-validation (k=5) and compared in terms of accuracy, precision, recall, F1-score, and ROC-AUC. Z-Score normalization had a marked effect on the Adult Income dataset, improving the performance of Logistic Regression (F1-score: 0.426 to 0.666; ROC-AUC: 0.641 to 0.904). On the Heart Disease dataset, the models performed well even without normalization, but Z-Score still improved the metrics of SVC/SVM (F1-score: 0.741 to 0.881; ROC-AUC: 0.785 to 0.922). However, these differences did not reach statistical significance according to the Wilcoxon test (p≈0.0625), although they constitute moderate evidence. In the Student Performance datasets, the effects of normalization were minimal and not statistically significant, which can be explained by the variables already being on comparable scales. Overall, the results confirm that normalization does not affect all algorithms equally and that its impact is most evident in socioeconomic and clinical contexts, where variables tend to span very different scales. These findings offer practical guidance for data preprocessing in areas such as health, education, and industry.
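As a rough illustration of the two normalization strategies and the paired Wilcoxon comparison named in the abstract, the following standard-library Python sketch may help. It is not the authors' code: the function names (min_max, z_score, wilcoxon_exact_p) are hypothetical, the per-fold F1 values are invented for illustration and are not results from the study, and the test shown is the exact two-sided signed-rank test under the assumption of no zero or tied differences.

```python
from itertools import product
from statistics import mean, pstdev


def min_max(values):
    """MinMax scaling: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Z-Score scaling: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def wilcoxon_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Assumes no zero differences and no ties among absolute differences.
    """
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns and count those at least as extreme.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return extreme / 2 ** n


# Toy feature on a wide scale (hypothetical values).
ages = [20, 35, 50]
print(min_max(ages))   # [0.0, 0.5, 1.0]
print(z_score(ages))

# Invented per-fold F1 scores for k=5 folds, NOT taken from the study.
f1_raw = [0.40, 0.42, 0.44, 0.41, 0.43]
f1_scaled = [0.65, 0.68, 0.66, 0.69, 0.64]
diffs = [s - r for s, r in zip(f1_scaled, f1_raw)]
print(wilcoxon_exact_p(diffs))  # 0.0625
```

With five paired fold scores, the exact two-sided test has only 2^5 = 32 equally likely sign patterns, so its smallest attainable p-value is 2/32 = 0.0625, reached when every fold improves. If the reported comparison was run on five fold-level scores (an assumption, since the abstract does not say), p≈0.0625 is the strongest result such a test can give, which is consistent with the abstract describing it as moderate evidence rather than statistical significance.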

Published

2025-10-12