Cyberbullying Detection in a Multi-classification Codemixed Dataset

Sahinur Rahman-Laskar, Gauri Gupta, Ritika Badhani, David Eduardo Pinto-Avendaño

Abstract


In an era characterized by digital communication and social media, cyberbullying has emerged as a social concern affecting individuals of all ages. It refers to the use of digital communication tools, such as social media and messaging apps, to harass, intimidate, or harm someone. Codemixed cyberbullying involves mixing multiple languages in online communication, which can make it difficult for content moderators and automated systems to detect and address cyberbullying effectively. The challenges include the limited availability of standard codemixed datasets, especially for Indian languages. This paper investigates cyberbullying detection in Hinglish, a codemixed language of Hindi and English. We have created a novel multi-class Hinglish dataset, annotated across seven cyberbullying categories: age, gender, religion, mockery, abusive, offensive, and not cyberbullying, and explored different machine learning-based models. We have performed a comparative analysis based on standard evaluation metrics and achieved a state-of-the-art result on a multi-class codemixed Hinglish dataset.

Keywords


Cyberbullying, codemixed, Hinglish, machine learning
