Topic identification from news blog in Spanish language
DOI:
https://doi.org/10.33936/isrtic.v6i1.4514
Keywords:
LDA, Topic modeling, news, blog
Abstract
A large amount of news now exists in digital format and needs to be classified or labeled automatically according to its content. Latent Dirichlet Allocation (LDA) is an unsupervised technique that automatically creates topics based on the words in documents. The present work applies LDA to analyze and extract topics from digital news in Spanish. A total of 198 digital news items were collected from a university news blog. The data were pre-processed and represented in vector spaces, and k values (the number of topics) were selected based on a coherence metric. A TF-IDF matrix combined with unigrams and bigrams produced topics with a variety of terms, related to university activities such as study programs, research, and projects for innovation and social responsibility. Furthermore, manual validation showed that the terms in the topics correspond to the hashtags written by the communication professionals.
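As a rough illustration of two steps named in the abstract, the sketch below builds a unigram-plus-bigram TF-IDF representation and scores a list of topic terms for coherence. It is not the authors' code. The abstract does not say which coherence metric or TF-IDF variant was used, so the UMass-style coherence and the smoothed IDF here are assumptions, and the toy corpus and function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return the unigrams of one tokenized document plus its n-grams up to n_max."""
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs):
    """Weight each term by tf * (ln((1 + N) / (1 + df)) + 1), a smoothed IDF (assumed variant)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return weights


def umass_coherence(top_terms, docs):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over pairs with j < i."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for i in range(1, len(top_terms)):
        for j in range(i):
            d_j = sum(1 for s in doc_sets if top_terms[j] in s)
            d_ij = sum(1 for s in doc_sets if top_terms[i] in s and top_terms[j] in s)
            if d_j:  # skip terms that never occur, to avoid dividing by zero
                score += math.log((d_ij + 1) / d_j)
    return score


# Hypothetical toy corpus: already tokenized, lowercased, stop words removed.
docs_raw = [
    ["proyecto", "investigacion", "universidad"],
    ["proyecto", "investigacion", "innovacion"],
    ["carrera", "estudiantes", "universidad"],
    ["carrera", "estudiantes", "matricula"],
]
docs = [ngrams(d) for d in docs_raw]
weights = tfidf(docs)

# Terms that co-occur in the same documents score higher than terms that never do.
coherent = umass_coherence(["proyecto", "investigacion"], docs)
mixed = umass_coherence(["proyecto", "estudiantes"], docs)
print(coherent > mixed)
```

In a full pipeline, one would fit an LDA model for each candidate k with a topic-modeling library, average a coherence score like this over the top terms of every topic, and keep the k with the best average. The fitting step is omitted here.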
License
Copyright (c) 2022 Ruth María Reátegui Rojas, Priscila Marisela Valdiviezo Díaz, Lizbeth Carolina Pacheco Guevara

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Articles submitted to this journal for publication will be released for open access under a Creative Commons Attribution Non-Commercial No Derivative Works licence (http://creativecommons.org/licenses/by-nc-nd/4.0).
The authors retain copyright and are therefore free to share, copy, distribute, perform, and publicly communicate the work under the following conditions: Give credit for the work in the manner specified by the author and indicate if changes were made (you may do so in any reasonable way, but not in a way that suggests that the author endorses your use of his or her work). Do not use the work for commercial purposes. In case of remixing, transformation, or development, the modified material may not be distributed.



