Pre-processing English-Hindi Corpus for Statistical Machine Translation

Karunesh Kumar Arora; Shyam S. Agrawal

doi:10.13053/cys-21-4-2697

Authors

Karunesh Kumar Arora, Centre for Development of Advanced Computing
Shyam S. Agrawal, KIIT Group of Institutions, Sohna Road, Bhondsi, Gurugram

Keywords:

Statistical Machine Translation, Preprocessing, Normalization, Named Entity handling

Abstract

A corpus may be considered the fuel for data-driven approaches to machine translation. Building a parallel corpus is a labour-intensive task, which makes it a costly and scarce resource. The full potential of the available data needs to be exploited, and this can be ensured by removing the different types of inconsistencies faced throughout the NLP domain. This paper describes experiments on corpus text pre-processing for building a baseline Statistical Machine Translation (SMT) system. The pre-processing is divided into two stages: (i) the first handles the orthographic representation of content, and (ii) the second handles non-lexical words. The first stage covers punctuation symbols, casing, word spellings and their normalization, while the second stage covers the handling of numbers and named entities (NEs), applied on the best settings observed in the first stage. The motivation behind these experiments was to derive a relationship and gauge the extent of corpus pre-processing, thereby building a considerably optimized baseline SMT system. This baseline system would provide a platform for further experiments with different syntactic and semantic factors in the future. The findings presented here are for the English-Hindi language pair; however, the concept of pre-processing is language neutral and can be carried over to any other language pair. For English to Hindi translation, the best performance is reported when punctuation symbols are retained, the English corpus is lower-cased and the Hindi corpus is spell-normalized. In the second stage of experiments, the handling of numbers and named entities is described, wherein these are mapped to unique class labels. The impact of these experiments is explained, along with their appropriateness for the language pair concerned.
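
As a rough illustration of two of the steps named in the abstract (lower-casing the English side and mapping numbers to a class label), a minimal Python sketch might look like the following. The label string `__NUM__`, the function names and the regular expression are assumptions made for this example and are not taken from the paper; Hindi spell normalization and named entity handling are not shown, since the abstract does not describe how they were done.

```python
import re

# Hypothetical class label; the paper does not state the label strings it uses.
NUM_LABEL = "__NUM__"

# Assumed number pattern: runs of ASCII or Devanagari digits (U+0966 to U+096F),
# optionally joined by '.' or ',' as in 1,250.50.
DIGITS = "0-9\u0966-\u096F"
NUMBER_RE = re.compile(rf"[{DIGITS}]+(?:[.,][{DIGITS}]+)*")


def preprocess_english(sentence: str) -> str:
    """Lower-case the sentence, then replace each number with the class label."""
    return NUMBER_RE.sub(NUM_LABEL, sentence.lower())


def preprocess_hindi(sentence: str) -> str:
    """Replace each number with the class label (Devanagari has no letter case)."""
    return NUMBER_RE.sub(NUM_LABEL, sentence)


if __name__ == "__main__":
    print(preprocess_english("India won 3 medals in 2016."))
    print(preprocess_hindi("भारत ने २०१६ में 3 पदक जीते"))
```

Punctuation is left untouched here, in line with the reported best setting of retaining punctuation symbols. In a real pipeline the replaced numbers would need to be stored so they can be restored after translation; that bookkeeping is omitted from this sketch.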

Author Biography

Karunesh Kumar Arora, Centre for Development of Advanced Computing

Speech and Natural Language Processing, Joint Director and Group Co-ordinator

Published

2017-12-23

Issue

Vol. 21 No. 4 (2017): Advances in Human Language Technologies (Guest Editor: A. Gelbukh)

Section

Articles of the Thematic Issue

License

Hereby I transfer exclusively to the Journal "Computación y Sistemas", published by the Computing Research Center (CIC-IPN), the Copyright of the aforementioned paper. I also accept that these rights will not be transferred to any other publication, in any other format, language or other existing means of developing.

I certify that the paper has not been previously disclosed or simultaneously submitted to any other publication, and that it does not contain material whose publication would violate the Copyright or other proprietary rights of any person, company or institution. I certify that I have the permission from the institution or company where I work or study to publish this work.

The representative author accepts the responsibility for the publication of this paper on behalf of each and every one of the authors.

This transfer is subject to the following conditions:

The authors retain all ownership rights (such as patent rights) of this work, except for the publishing rights transferred to the CIC through this document.
Authors retain the right to publish the work in whole or in part in any book of which they are the authors or publishers. They can also make use of this work in conferences, courses, personal web pages, and so on.
Authors may include the work as part of their thesis, for non-profit distribution only.
