Generation of Bilingual Dictionaries using Structural Properties
Abstract
Building bilingual dictionaries from Wikipedia
has been extensively studied in the area of computational
linguistics. These dictionaries play a crucial role in
Natural Language Processing (NLP) applications like
Cross-Lingual Information Retrieval, Machine Translation
and Named Entity Recognition. To build these
dictionaries, most existing approaches use
information present in Wikipedia titles, infoboxes and
categories. Interestingly, few use the structural
properties of a document, such as sections and
subsections. In this work, we exploit the structural properties of
documents to build a bilingual English-Hindi dictionary.
The main intuition behind this approach is that
documents in different languages discussing the same
topic are likely to have similar structural elements.
Though we present our experiments only for Hindi, our
approach is language-independent and can be easily
extended to other languages. The major contribution of
our work is that the dictionary contains both translations and
transliterations of words, which to a large extent include
Named Entities. We evaluate our dictionary using manually
computed precision. Our approach generated a large list of 72k
tokens with a precision of 0.75.
Keywords
Bilingual dictionary, comparable corpora, structural elements.