Impact of data normalization on the accuracy of supervised learning models
Original Article
DOI:
https://doi.org/10.33936/riemat.v10i2.7853
Keywords:
supervised learning, data normalization, cross-validation, binary classification, class imbalance
Abstract
Feature normalization is a key step in supervised classification, especially when the data have heterogeneous scales. This study evaluates the impact of two normalization strategies (MinMax and Z-Score) on the performance of three models, Logistic Regression, SVC/SVM, and Decision Tree, applied to four datasets obtained from the UCI Machine Learning Repository: Adult Income, Heart Disease, Student Performance Math, and Student Performance Portuguese. The models were trained with stratified cross-validation (k=5) and compared in terms of accuracy, precision, recall, F1-score, and ROC-AUC. Z-Score normalization had a substantial effect on the Adult Income dataset, improving the performance of Logistic Regression (F1-score: 0.426 to 0.666; ROC-AUC: 0.641 to 0.904). On the Heart Disease dataset, performance was already good without normalization, yet SVC/SVM with Z-Score still improved (F1-score: 0.741 to 0.881; ROC-AUC: 0.785 to 0.922). However, these differences did not reach statistical significance under the Wilcoxon test (p≈0.0625), although they do constitute moderate evidence. On the Student Performance datasets, the effects of normalization were minimal and not statistically significant, which may be explained by the variables already being on comparable scales. Overall, the results confirm that normalization does not affect all algorithms equally: its impact is most evident in socioeconomic and clinical contexts, where variables tend to span very different scales. This evidence offers practical guidance for data preprocessing in areas such as health, education, and industry.
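The two transforms and the significance test named in the abstract can be illustrated with a short, standard-library-only sketch. It is not the authors' code: the function names (`fit_minmax`, `apply_zscore`, `wilcoxon_exact_p`, and so on) are hypothetical, and the ages and per-fold gains are made-up toy values. The sketch assumes that scaling parameters are learned on the training portion of each fold and then applied to the held-out portion. It also assumes that the Wilcoxon test compared five paired values, for example one score per fold; the abstract does not say what was paired.

```python
from itertools import product
from statistics import mean, pstdev


def fit_minmax(column):
    """Learn the minimum and maximum from training values only."""
    return min(column), max(column)


def apply_minmax(column, params):
    """Map values to [0, 1] relative to the training range."""
    lo, hi = params
    span = hi - lo
    # A constant training column carries no information; map it to 0.0.
    return [(x - lo) / span if span else 0.0 for x in column]


def fit_zscore(column):
    """Learn the mean and population standard deviation from training values only."""
    return mean(column), pstdev(column)


def apply_zscore(column, params):
    """Center on the training mean and divide by the training deviation."""
    mu, sigma = params
    return [(x - mu) / sigma if sigma else 0.0 for x in column]


def wilcoxon_exact_p(diffs):
    """Two-sided exact Wilcoxon signed-rank p-value.

    Assumes every difference is non-zero and all absolute values are distinct.
    """
    n = len(diffs)
    # Rank the differences by absolute size (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    observed = min(w_plus, w_minus)
    total = n * (n + 1) // 2
    # Under the null hypothesis each rank is positive or negative with equal
    # probability, so enumerate all 2**n sign assignments.
    hits = 0
    for signs in product((0, 1), repeat=n):
        pos = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(pos, total - pos) <= observed:
            hits += 1
    return hits / 2 ** n


# Toy feature column (made-up ages): parameters come from the training split.
train_age = [20, 30, 40, 50, 60]
test_age = [30, 70]
print(apply_minmax(test_age, fit_minmax(train_age)))  # [0.25, 1.25]
print(apply_zscore(test_age, fit_zscore(train_age)))

# Made-up per-fold gains (scaled minus unscaled), all in the same direction.
fold_gains = [0.05, 0.08, 0.02, 0.11, 0.06]
print(wilcoxon_exact_p(fold_gains))  # 0.0625
```

Two points follow from the sketch. A held-out value outside the training range maps outside [0, 1] under MinMax (1.25 above), which is expected when the scaler is fit on training data only. With five pairs, the exact two-sided signed-rank test cannot return a p-value below 2/32 = 0.0625, even when every pair improves. Under the five-pair assumption, the reported p≈0.0625 would therefore be the strongest result the test can produce at that sample size, which is consistent with reading it as moderate evidence rather than as a null finding.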
License
Copyright 2025 Mariuxi Guillen Intriago, Roberth Alcívar Cevallos

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.