Impact of data normalization on the accuracy of supervised learning models
Original Article
DOI: https://doi.org/10.33936/riemat.v10i2.7853
Keywords: supervised learning, data normalization, cross-validation, binary classification, class imbalance
Abstract
Feature normalization is a key step in supervised classification, especially when features are measured on heterogeneous scales. This study evaluates the impact of two normalization strategies (MinMax and Z-Score) on the performance of three models (Logistic Regression, SVC/SVM, and Decision Tree) applied to four datasets from the UCI Machine Learning Repository: Adult Income, Heart Disease, Student Performance Math, and Student Performance Portuguese. The models were trained using stratified cross-validation (k=5) and compared in terms of accuracy, precision, recall, F1-score, and ROC-AUC. Z-Score normalization had a pronounced effect on the Adult Income dataset, improving the performance of Logistic Regression (F1-score: 0.426 to 0.666; ROC-AUC: 0.641 to 0.904). On the Heart Disease dataset, the models performed well even without normalization, although SVC/SVM improved with Z-Score (F1-score: 0.741 to 0.881; ROC-AUC: 0.785 to 0.922). These differences did not reach statistical significance according to the Wilcoxon test (p≈0.0625), although they do constitute moderate evidence. In the Student Performance datasets, the effects of normalization were minimal and not statistically significant, which can be explained by the fact that the variables were already on comparable scales. The results confirm that normalization does not affect all algorithms equally and that its impact is more evident in socioeconomic and clinical contexts, where variables tend to use very different scales. This evidence provides practical guidance for data preprocessing in areas such as health, education, and industry.
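The two scaling strategies and the reported Wilcoxon result can be illustrated with a minimal sketch. This is not the authors' code: the function names, the example feature values, and the per-fold differences are hypothetical. The sketch assumes that Z-Score uses the population standard deviation, that scalers are fitted on training data only, and that the Wilcoxon signed-rank test was applied to the five paired per-fold scores. Under that last assumption, 0.0625 = 2/2^5 is the smallest two-sided p-value an exact signed-rank test can return with five pairs, which would explain why an improvement seen in every fold still falls short of 0.05.

```python
from itertools import product
from statistics import mean, pstdev


def fit_minmax(train):
    """Return x -> (x - min) / (max - min), with min and max taken from train."""
    lo, hi = min(train), max(train)
    span = (hi - lo) or 1.0  # guard against a constant feature
    return lambda x: (x - lo) / span


def fit_zscore(train):
    """Return x -> (x - mean) / std, with mean and std taken from train.

    Uses the population standard deviation (an assumption about the convention).
    """
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against a constant feature
    return lambda x: (x - mu) / sigma


def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Assumes no zero differences and no tied absolute differences.
    """
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = n * (n + 1) // 2 - w_plus
    observed = min(w_plus, w_minus)
    # Under the null hypothesis all 2**n sign patterns are equally likely.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) <= observed
    )
    return min(1.0, 2 * count / 2 ** n)


# Hypothetical feature values from a training fold (not taken from the study).
train_ages = [20, 30, 40, 50, 60]
print(fit_minmax(train_ages)(45))  # 0.625
print(fit_zscore(train_ages)(40))  # 0.0

# Hypothetical per-fold F1 gains (normalized minus raw), all positive.
fold_gains = [0.12, 0.15, 0.13, 0.16, 0.14]
print(wilcoxon_exact_two_sided(fold_gains))  # 0.0625
```

Fitting the scaler on the training fold and reusing it on the held-out fold keeps information from the held-out data out of the preprocessing step.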
License
Copyright (c) 2025 Mariuxi Guillen Intriago, Roberth Alcívar Cevallos

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.