Exhaustive automatic topic classification of the Reuters-21578 corpus with supervised machine learning

Authors

  • Juan Manuel Arengas Acosta, Universidad de Guanajuato
  • Rafael Guzman Cabrera, Universidad de Guanajuato
  • Misael López Ramírez, Universidad de Guanajuato
  • Anderson Smith Florez Fuentes, Universidad de Guanajuato

DOI:

https://doi.org/10.13053/cys-29-1-4391

Keywords:

Document Classification, Machine Learning, Learning Models.

Abstract

Automatic text classification is an established research discipline that combines natural language processing (NLP) techniques with machine learning algorithms to categorize large volumes of text documents efficiently. This work proposes an approach that integrates current preprocessing techniques with classical supervised learning algorithms to improve classification accuracy on the Reuters-21578 corpus. The study comprises a literature review, the implementation of preprocessing techniques (tokenization, lemmatization, stopword removal, lowercase conversion, and special character removal), and an exploration of supervised learning algorithms (Logistic Regression, Support Vector Machines, Naïve Bayes, Random Forest, and k-nearest neighbors). Experiments were conducted with various configurations that combine the preprocessing techniques, feature selection methods such as TF-IDF, and the aforementioned algorithms. The results for the tested scenarios show that integrating these techniques and algorithms significantly improves text classification accuracy, yielding a configuration suited to the Reuters-21578 corpus that reaches an accuracy of up to 98.6%. The resulting empirical methodology is rigorous and efficient, and can be applied to other document corpora in text format.
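
The kind of pipeline the abstract describes (preprocessing, TF-IDF weighting, then a supervised classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical toy stopword list, invented example sentences rather than Reuters-21578 documents, a smoothed IDF formula chosen for illustration, and a 1-nearest-neighbor rule standing in for the classifiers listed above. Lemmatization is omitted because it needs an external dictionary or library.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for", "at"}


def preprocess(text):
    """Lowercase, replace special characters with spaces, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen in training are ignored."""
    tf = Counter(tok for tok in tokens if tok in idf)
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors (their cosine similarity)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def predict_1nn(text, train_vectors, train_labels, idf):
    """Return the label of the most similar training document (k-NN with k = 1)."""
    query = tfidf_vector(preprocess(text), idf)
    scores = [cosine(query, vec) for vec in train_vectors]
    return train_labels[scores.index(max(scores))]


# Invented toy documents and labels, not taken from Reuters-21578.
train_texts = [
    "Wheat exports rose sharply in the quarter.",
    "Corn and wheat harvest forecasts were cut.",
    "Crude oil prices fell on OPEC output news.",
    "Oil refiners raised crude purchase prices.",
]
train_labels = ["grain", "grain", "crude", "crude"]

tokenized = [preprocess(t) for t in train_texts]
idf = fit_idf(tokenized)
train_vectors = [tfidf_vector(toks, idf) for toks in tokenized]

print(predict_1nn("OPEC agreed to cut crude oil output",
                  train_vectors, train_labels, idf))
```

A realistic experiment would replace the toy data with the corpus's own train/test split, add lemmatization, and compare several classifiers on held-out documents. The sketch only shows how the stages connect.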

Author Biography

Rafael Guzman Cabrera, Universidad de Guanajuato

Professor (Titular B), Department of Electrical Engineering

Published

2025-03-27

Section

Articles