Yuhmu Database: A Corpus of Tonal Speech Lacking Conventional Writing

Eric Ramos-Aguilar, J. Arturo Olvera-López, Ivan Olmos-Pineda

Abstract


This paper presents the development and analysis of a digital audio database of words pronounced in Yuhmu, a tonal and endangered variant of the Otomi language spoken in Ixtenco, Tlaxcala, Mexico. The database comprises over 8,000 word recordings, including both correct and incorrect pronunciations, which native speakers evaluated through perceptual judgments. Statistical analyses reveal linguistic diversity in the phonetic components. Additionally, three experimental methodologies were implemented to evaluate the database: automatic segmentation of Mel-scale spectrograms using cosine distance, pronunciation classification via a multilayer perceptron, and implicit segmentation based on cosine distance thresholds. The results show good accuracy and successful detection of phonetic boundaries, comparable to those of methods applied to languages with a strong digital presence. This database constitutes a fundamental resource for the analysis of underdocumented tonal indigenous languages, highlighting the importance of preserving linguistic diversity. The controlled acoustic conditions and phonetic variability present in the database provide a solid foundation for future interdisciplinary studies in computational linguistics, machine learning, and language preservation.
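As a rough illustration of threshold-based segmentation with cosine distance, the sketch below marks a boundary wherever two consecutive spectrogram frames differ by more than a threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the threshold value of 0.3, and the toy input are all hypothetical, and a real pipeline would first compute Mel-scale spectrogram frames from the audio.

```python
import numpy as np


def cosine_distances(frames: np.ndarray) -> np.ndarray:
    """Cosine distance between each pair of consecutive frames (rows)."""
    a, b = frames[:-1], frames[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against all-zero (silent) frames to avoid division by zero;
    # such pairs end up with similarity 0, i.e. distance 1.
    sims = np.sum(a * b, axis=1) / np.maximum(norms, 1e-12)
    return 1.0 - sims


def detect_boundaries(frames: np.ndarray, threshold: float = 0.3) -> list:
    """Return indices of frames that start a new segment.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    d = cosine_distances(frames)
    # d[i] compares frame i with frame i + 1, so i + 1 starts the new segment.
    return (np.where(d > threshold)[0] + 1).tolist()


# Toy input standing in for Mel-spectrogram frames (rows = frames,
# columns = Mel bands): three identical frames followed by three
# frames with energy in a different band.
frames = np.array([[1.0, 0.0, 0.0]] * 3 + [[0.0, 1.0, 0.0]] * 3)
print(detect_boundaries(frames))  # [3]
```

Cosine distance ignores overall frame energy and responds only to changes in spectral shape, which is one plausible reason to prefer it over Euclidean distance when recording levels vary.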

Keywords


Low-resource languages, Yuhmu Mexican language, phonetic representation of speech, database
