A Domain Specific Parallel Corpus and Enhanced English-Assamese Neural Machine Translation

Authors

  • Sahinur Rahman-Laskar, National Institute of Technology Silchar
  • Riyanka Manna, Adamas University
  • Partha Pakray, National Institute of Technology Silchar
  • Sivaji Bandyopadhyay, National Institute of Technology Silchar

DOI:

https://doi.org/10.13053/cys-26-4-4423

Keywords:

English-Assamese, low-resource, neural machine translation, parallel corpus, data augmentation, prior alignment, language model

Abstract

Machine translation is the automatic translation of text from one natural language to another. Neural machine translation is a widely accepted corpus-based approach to machine translation. However, it requires an adequate amount of training data, and domain-wise parallel corpora are needed to improve translation performance and coverage across domains. In this work, we have prepared a domain-specific parallel corpus for the low-resource English-Assamese language pair, covering the Agriculture, Government Office, Judiciary, Social Media, Tourism, COVID-19, Sports, and Literature domains. Moreover, we have tackled the data scarcity and word-order divergence problems via data augmentation and the prior alignment concept. We have also contributed an Assamese pre-trained language model (LM) and Assamese word embeddings built using Assamese monolingual data, as well as a bilingual dictionary-based post-processing step, to enhance transformer-based neural machine translation. We have achieved state-of-the-art results in both the forward (English-to-Assamese) and backward (Assamese-to-English) translation directions.

Author Biographies

Sahinur Rahman-Laskar, National Institute of Technology Silchar

Department of Computer Science and Engineering

Riyanka Manna, Adamas University

Department of Computer Science and Engineering

Partha Pakray, National Institute of Technology Silchar

Department of Computer Science and Engineering

Sivaji Bandyopadhyay, National Institute of Technology Silchar

Department of Computer Science and Engineering

Published

2022-12-25

Section

Articles