POS Tagging without a Tagger: Using Aligned Corpora for Transferring Knowledge to Under-Resourced Languages
DOI:
https://doi.org/10.13053/cys-20-4-2430
Keywords:
POS tagging, alignment, parallel corpus, under-resourced languages.
Abstract
Almost all languages lack sufficient resources and tools for developing Human Language Technologies (HLT). These technologies mostly target languages for which large resources and tools are available. In this paper, we show that under-resourced languages can benefit from these available resources and tools to develop their own HLT, taking as an example POS tagging, one of the most fundamental Natural Language Processing tasks: it assigns each word a tag that highlights its syntactic features, taking the surrounding context into account. The solution we propose is based on using an aligned parallel corpus as a bridge between a resource-rich language and an under-resourced language. This kind of corpus is usually available. The resource-rich side of the corpus is annotated first. These POS annotations are then exploited to predict the annotation of the under-resourced side through alignment training. After this training step, we obtain a matching table between the two languages, which is then used to annotate an input text. The approach was tested on one language pair: English as the resource-rich language and Arabic as the under-resourced language. We used the IWSLT10 training corpus and the English TreeTagger. Evaluated on a test corpus extracted from IWSLT08, the approach obtains an F-score of 89% and can be extended to other NLP tasks.
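The pipeline outlined in the abstract (tag the resource-rich side, carry the tags across word alignments, collect a matching table, then annotate new text) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_matching_table` and `annotate`, the `UNK` fallback tag, and the transliterated toy sentences are all assumptions made for illustration. The word alignments are assumed to be already computed by an external aligner, and the sketch picks the most frequent projected tag per word, ignoring context.

```python
from collections import Counter, defaultdict


def build_matching_table(corpus):
    """Project source-side POS tags onto target words through alignments.

    corpus: iterable of (source_tagged, target_tokens, alignment) triples, where
    source_tagged is a list of (word, tag) pairs, target_tokens is a list of
    words, and alignment is a list of (source_index, target_index) pairs.
    Returns a mapping: target word -> Counter of projected tags.
    """
    table = defaultdict(Counter)
    for source_tagged, target_tokens, alignment in corpus:
        for src_idx, tgt_idx in alignment:
            _, tag = source_tagged[src_idx]
            table[target_tokens[tgt_idx]][tag] += 1
    return table


def annotate(tokens, table, unknown_tag="UNK"):
    """Tag each token with its most frequently projected tag (assumed heuristic)."""
    tagged = []
    for token in tokens:
        counts = table.get(token)
        tag = counts.most_common(1)[0][0] if counts else unknown_tag
        tagged.append((token, tag))
    return tagged


# Toy, hand-made data (hypothetical): English side already tagged,
# target side transliterated, alignments given as index pairs.
toy_corpus = [
    ([("a", "DT"), ("new", "JJ"), ("book", "NN")], ["kitab", "jadid"], [(2, 0), (1, 1)]),
    ([("the", "DT"), ("book", "NN")], ["alkitab"], [(1, 0)]),
]

table = build_matching_table(toy_corpus)
print(annotate(["kitab", "jadid", "qalam"], table))
# [('kitab', 'NN'), ('jadid', 'JJ'), ('qalam', 'UNK')]
```

Words never seen in the aligned corpus fall back to the placeholder tag here; a real system would need some strategy for them, which the abstract does not specify.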
Published
2016-12-18
Section
Articles
License
Hereby I transfer exclusively to the Journal "Computación y Sistemas", published by the Computing Research Center (CIC-IPN), the Copyright of the aforementioned paper. I also accept that these rights will not be transferred to any other publication, in any other format, language or other existing means of developing.
I certify that the paper has not been previously disclosed or simultaneously submitted to any other publication, and that it does not contain material whose publication would violate the Copyright or other proprietary rights of any person, company or institution. I certify that I have the permission from the institution or company where I work or study to publish this work.
The representative author accepts the responsibility for the publication of this paper on behalf of each and every one of the authors.
This transfer is subject to the following conditions:
- The authors retain all ownership rights (such as patent rights) of this work, except for the publishing rights transferred to the CIC, through this document.
- Authors retain the right to publish the work in whole or in part in any book of which they are the authors or publishers. They can also make use of this work in conferences, courses, personal web pages, and so on.
- Authors may include the work as part of their thesis, for non-profit distribution only.