Sentiment Analysis of Public Opinion in Spanish-Speaking Countries Using BERT
Abstract
This study applies Sentiment Analysis (SA) to Spanish using the pre-trained language model Bidirectional Encoder Representations from Transformers (BERT), and evaluates how effectively BERT detects sentiment polarity in Spanish text. An experimental study was conducted on the TASS 2019 corpus, which contains tweets from five Spanish-speaking countries (Mexico, Costa Rica, Spain, Peru, and Uruguay). After the texts were cleaned, a BERT model was fine-tuned to classify three sentiment polarities (positive, negative, and neutral). Two techniques were applied to correct class imbalance: traditional oversampling and synthetic data generation using ChatGPT. With synthetic-data balancing, the model achieved 87% accuracy on the Mexican sample. The most notable finding, however, was that balancing with oversampling yielded 97% accuracy with balanced metrics and generalized better than the previous methods. These results suggest that oversampling is the most robust of the balancing strategies considered for analyzing opinions expressed online. The approach allows models to capture regional linguistic variation, which can support informed strategic decision-making.
Keywords
Sentiment analysis, BERT, oversampling
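The abstract does not state how the traditional oversampling was implemented. As an illustration only, the following minimal sketch assumes random oversampling, in which minority-class examples are duplicated at random until every class matches the size of the largest one. The function name `random_oversample`, the fixed seed, and the toy tweets are hypothetical and are not taken from the study or from the TASS 2019 corpus.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class (assumed technique)."""
    rng = random.Random(seed)

    # Group the texts by their sentiment label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Every class is grown to the size of the majority class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap for this class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical, made-up examples (not corpus data).
texts = [
    "Me encanta este lugar",
    "Qué buen servicio",
    "Excelente día",
    "Muy feliz hoy",
    "Pésima atención",
    "No me gustó nada",
    "Hoy es martes",
]
labels = [
    "positive", "positive", "positive", "positive",
    "negative", "negative",
    "neutral",
]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

In a setup of this kind, the balancing would normally be applied to the training split only, so that duplicated tweets do not also appear in the evaluation data.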