Gender Prediction in English-Hindi Code-Mixed Social Media Content: Corpus and Baseline System

Authors

  • Ankush Khandelwal International Institute of Information Technology, Language Technologies Research Centre, Hyderabad
  • Sahil Swami International Institute of Information Technology, Language Technologies Research Centre, Hyderabad
  • Syed Sarfaraz Akhtar International Institute of Information Technology, Language Technologies Research Centre, Hyderabad
  • Manish Shrivastava International Institute of Information Technology, Language Technologies Research Centre, Hyderabad

DOI:

https://doi.org/10.13053/cys-22-4-3061

Keywords:

Author profiling, code-mixing, language detection, linguistics, SVM, random forest

Abstract

The rapid expansion in the use of social media networking sites has produced a huge amount of unprocessed user-generated data that can be used for text mining. Author profiling, the problem of automatically determining aspects of an author such as gender and age group from a text, is gaining popularity in computational linguistics. Most past research in author profiling has concentrated on English texts [1, 2]. However, many users switch languages while posting on social media, a practice called code-mixing, which raises challenges for text classification and author profiling such as spelling variation, non-grammatical structure and transliteration [3]. Very few annotated English-Hindi code-mixed datasets of social media content are available online [4]. In this paper, we analyze the task of predicting an author's gender in code-mixed content and present a corpus of English-Hindi texts collected from Twitter and annotated with the author's gender. We also explore language identification for every word in this corpus. We present a supervised classification baseline system that uses various machine learning algorithms to identify the gender of an author from a text, based on character-level and word-level features.
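To make "character-level and word-level features" concrete, the following is a minimal sketch of one common way to build such features: counting character and word n-grams. The abstract does not specify the n-gram ranges, tokenization or feature set actually used, so the function names, the default ranges and the example sentence here are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max (assumed range)."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            # Prefix keys with "c:" so they cannot collide with word features.
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams using whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def extract_features(text):
    """Merge character-level and word-level counts into one feature map."""
    features = char_ngrams(text)
    features.update(word_ngrams(text))  # Counter.update adds counts
    return features


# Hypothetical code-mixed sentence, invented for illustration only.
example = "yaar this movie was bahut accha"
features = extract_features(example)
```

Feature maps of this kind could then be vectorized (for instance with scikit-learn's `DictVectorizer`) and passed to a classifier such as the SVM or random forest named in the keywords. Character n-grams are a reasonable fit for code-mixed text because they tolerate the spelling variation that transliteration introduces.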

Published

2018-12-30