Impact of data normalization on the accuracy of supervised learning models

Original Article

Authors

Mariuxi Guillen Intriago Facultad de Posgrado, Universidad Técnica de Manabí. Portoviejo, Ecuador. https://orcid.org/0009-0005-6923-0538
Roberth Alcívar Cevallos Facultad de Posgrado, Universidad Técnica de Manabí. Portoviejo, Ecuador. https://orcid.org/0000-0001-6282-8493

DOI:

https://doi.org/10.33936/riemat.v10i2.7853

Keywords:

supervised learning, data normalization, cross-validation, binary classification, class imbalance

Abstract

Feature normalization is a key step in supervised classification, especially when the data have heterogeneous scales. This study evaluates the impact of two normalization strategies (MinMax and Z-Score) on the performance of three models (Logistic Regression, SVC/SVM, and Decision Tree) applied to four datasets: Adult Income, Heart Disease, Student Performance Math, and Student Performance Portuguese, all obtained from the UCI Machine Learning Repository. The models were trained with stratified cross-validation (k=5) and compared in terms of accuracy, precision, recall, F1-score, and ROC-AUC. The results showed that Z-Score normalization had a substantial effect on the Adult Income dataset, improving the performance of Logistic Regression (F1-score: 0.426 to 0.666; ROC-AUC: 0.641 to 0.904). In contrast, the Heart Disease dataset showed good performance even without normalization, although SVC/SVM still improved its metrics with Z-Score (F1-score: 0.741 to 0.881; ROC-AUC: 0.785 to 0.922). However, these differences did not reach statistical significance according to the Wilcoxon test (p≈0.0625), although they do constitute moderate evidence. In the Student Performance datasets, the effects of normalization were minimal and not statistically significant, which may be explained by the variables already being on comparable scales. Finally, the results confirm that normalization does not affect all algorithms equally: its impact is most evident in socioeconomic and clinical contexts, where variables often span very different scales. This evidence offers practical guidance for data preprocessing in areas such as health, education, and industry.
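The two normalization strategies named in the abstract can be illustrated with a minimal standard-library sketch. The function names and the sample values are hypothetical and are not taken from the study. MinMax maps each feature to [0, 1] via (x - min) / (max - min), while Z-Score centers it at zero with unit variance via (x - mean) / std; the population standard deviation is assumed here.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def z_score_scale(values):
    """Center values at 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical features on very different scales (illustrative only).
age = [25, 38, 52, 61]
capital_gain = [0, 0, 5000, 99999]

for name, column in (("age", age), ("capital_gain", capital_gain)):
    print(name, min_max_scale(column), z_score_scale(column))
```

In a cross-validated setting it is generally advisable to compute the scaling statistics on the training folds only and reuse them on the held-out fold; the abstract does not state how the study handled this.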
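The reported p≈0.0625 has a simple arithmetic reading, under one assumption that the abstract does not confirm: that the Wilcoxon signed-rank test paired the five per-fold scores obtained with and without normalization. In that case, 2/2^5 = 0.0625 is the smallest two-sided p-value an exact test can produce, so the 0.05 threshold is unreachable however large the improvement. The sketch below enumerates all sign patterns to show this; the function name and the fold differences are hypothetical.

```python
from itertools import product


def wilcoxon_exact_p(differences):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Assumes no zero differences and no ties among absolute differences.
    """
    n = len(differences)
    # Rank absolute differences from 1 (smallest) to n (largest).
    abs_sorted = sorted(abs(d) for d in differences)
    ranks = {a: i + 1 for i, a in enumerate(abs_sorted)}
    w_plus = sum(ranks[abs(d)] for d in differences if d > 0)
    total = n * (n + 1) // 2
    w_obs = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-fold F1 gains, all positive (not the study's data).
fold_gains = [0.21, 0.25, 0.23, 0.26, 0.24]
print(wilcoxon_exact_p(fold_gains))  # 0.0625
```

If this reading is right, the test would need more paired observations (more folds or repeated cross-validation) before it could reject at the 5% level.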


Published

2025-10-12

Issue

Vol. 10 No. 2 (2025): July - December

Section

Articles

License

Copyright 2025 Mariuxi Guillen Intriago, Roberth Alcívar Cevallos

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
