A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits

Authors

  • Laurent Jakubina Université de Montréal
  • Philippe Langlais Université de Montréal

DOI:

https://doi.org/10.13053/cys-20-3-2465

Keywords:

Comparable Corpora, Bilingual Lexicon Induction, Distributional Approaches, Rare Word Translation

Abstract

Identifying translations in comparable corpora is a challenge that has long attracted researchers. It has applications in several areas, including Machine Translation and Cross-lingual Information Retrieval. In this study, we compare three state-of-the-art approaches to this task: the so-called context-based projection method, the projection of monolingual word embeddings, and a method dedicated to identifying translations of rare words. We carefully explore the hyper-parameters of each method and measure their impact on the task of identifying the French translations of English words in Wikipedia. Contrary to standard practice, we designed a test case in which we do not resort to heuristics to pre-select the target vocabulary among which to find translations, thereby pushing each method to its limit. We show that all the approaches we tested have a clear bias toward frequent words. In fact, the best approach we tested could identify the translation of a third of a set of frequent test words, while it could only translate around 10% of rare words.

Author Biographies

Laurent Jakubina, Université de Montréal

Is a PhD student at the Department of Computer Science and Operations Research (DIRO) of the Université de Montréal. Under the supervision of Professor Philippe Langlais, he studies word alignment methods with a view to improving the identification of translations in big data.

Philippe Langlais, Université de Montréal

Is a professor at the Department of Computer Science and Operations Research (DIRO) of the Université de Montréal. As a member of the RALI laboratory since 1998, he has been actively involved in the development of bilingual applications, including Machine Translation.

Published

2016-09-30