Topic identification from news blog in Spanish language

Authors

L. Pacheco-Guevara, R. Reátegui, P. Valdiviezo-Díaz

DOI:

https://doi.org/10.33936/isrtic.v6i1.4514

Keywords:

LDA, Topic modeling, news, blog

Abstract

A large amount of news now exists in digital format and needs to be classified or labeled automatically according to its content. Latent Dirichlet Allocation (LDA) is an unsupervised technique that automatically creates topics based on the words in documents. This work applies LDA to analyze and extract topics from digital news in Spanish. A total of 198 digital news items were collected from a university news blog. The data were pre-processed and represented in vector spaces, and values of k (the number of topics) were selected based on a coherence metric. A TF-IDF matrix combined with unigrams and bigrams produced topics with a variety of terms, related to university activities such as study programs, research, and projects for innovation and social responsibility. Furthermore, manual validation showed that the terms in the topics correspond to hashtags written by the communication professionals.
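
The two representation choices named in the abstract, TF-IDF weighting over unigrams plus bigrams and coherence-based comparison of topics, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The abstract does not state which coherence metric was used, so UMass coherence is assumed here as one common choice, and the smoothed idf formula is likewise an assumption. The function names (`ngrams`, `tfidf`, `umass_coherence`) and the three toy documents are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return the unigrams of one document plus its bigrams (joined with '_')."""
    out = list(tokens)
    if n_max >= 2:
        out += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return out


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * (ln((1 + N) / (1 + df)) + 1), a smoothed idf
    chosen for this sketch (an assumption, not taken from the paper).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return weights


def umass_coherence(top_terms, docs):
    """UMass coherence of one topic (assumed metric).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over pairs where w_j is ranked above
    w_i; D counts documents containing all given terms. Every term in
    top_terms must occur in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_count(*terms):
        return sum(all(t in s for t in terms) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_terms)):
        for j in range(i):
            joint = doc_count(top_terms[i], top_terms[j])
            score += math.log((joint + 1) / doc_count(top_terms[j]))
    return score


# Hypothetical, already pre-processed toy documents (not data from the paper).
docs_tokens = [
    ["universidad", "proyecto", "investigacion"],
    ["universidad", "proyecto", "innovacion"],
    ["carrera", "estudiantes", "universidad"],
]
docs = [ngrams(tokens) for tokens in docs_tokens]
weights = tfidf(docs)

# Terms that co-occur across documents score higher (closer to zero).
related = umass_coherence(["universidad", "proyecto"], docs)
unrelated = umass_coherence(["proyecto", "carrera"], docs)
print(related, unrelated)
```

Under these assumptions, one way to choose k is to fit LDA for several candidate values, average the coherence of the resulting topics for each, and keep the value with the highest average.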

Published

2022-05-27

How to Cite

Pacheco-Guevara, L., Reátegui, R. and Valdiviezo-Díaz, P. 2022. Topic identification from news blog in Spanish language. Informática y Sistemas. 6, 1 (May 2022), 31–37. DOI: https://doi.org/10.33936/isrtic.v6i1.4514.

Section

Regular Papers