An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
DOI: https://doi.org/10.13053/cys-24-4-3151
Keywords: Language identification, transliteration, character embedding, word embedding, NLP, machine learning
Abstract
This paper describes the application of the code-mixing index to Indian social media texts and compares the complexity of identifying the language at word level using a BLSTM neural model. Transliteration is one of the important and relatively less mature areas of Natural Language Processing; in code-mixed data it raises issues such as language identification, script specification, and missing sounds. Social media platforms are now widely used by people to express their opinions and interests, and the language used there is frequently code-mixed, i.e., a mixture of two or more languages in which one language is written in the script of another. To process such code-mixed text, identifying the language of each word is an essential step. The major contribution of this work is a technique for identifying the language of Hindi-English code-mixed data drawn from three social media platforms: Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on cBoW and Skip-gram models for language identification in code-mixed data, using popular word embedding features to represent each word. Much recent research addresses language identification, but word-level language identification in a transliterated environment remains an open problem for code-mixed data. We implement a deep learning model based on a BLSTM that predicts the language of a word in the sequence based on the specific words that come before it, together with a multichannel neural network combining CNN and BLSTM for word-level language identification of code-mixed data in which English and Roman-transliterated Hindi are used, evaluated in combination with cBoW and Skip-gram embeddings. The BLSTM context-capture module of the proposed system achieves better accuracy with the word embedding model than with character embeddings on our two test sets. The problem is modeled jointly within the deep learning design, and we present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.
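To make the architecture described above concrete, the sketch below shows a minimal word-level language-identification tagger in the spirit of the abstract: word embeddings (which could be initialized from pretrained Skip-gram or cBoW vectors) feeding a bidirectional LSTM that emits one language label per token. This is not the authors' released code; the vocabulary size, embedding dimension, tag set, and sequence length are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact configuration) of a
# BLSTM word-level language tagger for code-mixed text.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 100      # assumed embedding dimension
MAX_LEN = 50         # assumed maximum sentence length in tokens
NUM_TAGS = 3         # e.g., {English, Hindi, Other}; the tag set is an assumption

def build_blstm_tagger(pretrained_vectors=None):
    """Bidirectional LSTM sequence tagger over word embeddings."""
    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    # Embedding layer; optionally initialized from Skip-gram / cBoW vectors.
    if pretrained_vectors is not None:
        embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                                 weights=[pretrained_vectors],
                                 mask_zero=True)(inputs)
    else:
        embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(inputs)
    # The bidirectional LSTM returns a contextual vector for every token,
    # so each word's label can depend on the words around it.
    context = layers.Bidirectional(
        layers.LSTM(64, return_sequences=True))(embed)
    # Per-token softmax over the language tags.
    outputs = layers.TimeDistributed(
        layers.Dense(NUM_TAGS, activation="softmax"))(context)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    model = build_blstm_tagger()
    model.summary()
    # Dummy data only to show the expected shapes:
    # x: (batch, MAX_LEN) word indices, y: (batch, MAX_LEN, 1) tag indices.
    x = np.random.randint(1, VOCAB_SIZE, size=(8, MAX_LEN))
    y = np.random.randint(0, NUM_TAGS, size=(8, MAX_LEN, 1))
    model.fit(x, y, epochs=1, verbose=0)
```

The multichannel variant mentioned in the abstract would add a parallel character-level CNN channel whose output is concatenated with the word embedding before the BLSTM; the sketch above shows only the word-embedding channel.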
License
I exclusively transfer to the journal "Computación y Sistemas", published by the Centro de Investigación en Computación (CIC), the Copyright of the above-mentioned article, and I likewise accept that it will not be transferred to any other publication, in any format, language, or medium, existing (including electronic and multimedia) or yet to be developed.
I certify that the article has not been previously disclosed or simultaneously submitted to another publication, and that it does not contain material whose publication would violate the Copyright or other property rights of any person, company, or institution. I further certify that I have authorization from the institution or company where I work or study to publish this Work.
The corresponding author accepts responsibility for the publication of the Work on behalf of each and every one of the authors.
This Transfer is subject to the following reservations:
- The authors retain all property rights (such as patent rights) to this Work, except for the publication rights transferred to the CIC by this document.
- The authors retain the right to publish the Work, in whole or in part, in any book of which they are authors or editors, and to make personal use of this work in conferences, courses, personal web pages, etc.