Tokenizer Adapted for the Nasa Yuwe Language
Abstract
In Colombia, the government recognizes ethnic and cultural diversity as a social right. This diversity finds expression, among other ways, in the large number of indigenous languages that have been kept alive for centuries. However, efforts to conserve and preserve these languages have generally fallen short. This is the case with Nasa Yuwe, the endangered language spoken by the Nasa, or Páez, indigenous community. In this situation, technology offers a strategic opportunity for the adaptation, ownership and development of Nasa Yuwe within the social and cultural environment of the Nasa people. Such technology includes computational techniques that support the exchange of information through information retrieval (IR) activities, which open new possibilities for Nasa people to interact in Nasa Yuwe. It has therefore become necessary to adapt stages of the IR process to this language. This paper presents a process for adapting a tokenizer to texts written in Nasa Yuwe, using the precision-recall curve as the evaluation and comparison measure. The results show: all stages in the process of adapting the standard tokenizer to produce the Nasa version; the Nasa tokenizer and its results on texts written in Nasa Yuwe; and an analysis of the baseline precision-recall curve in contrast to that of the Nasa tokenizer.
Keywords
Nasa indigenous community, Nasa Yuwe language, tokenizer for Nasa Yuwe, information retrieval for texts written in Nasa Yuwe.