Predictive Text for Agglutinative and Polysynthetic Languages
Abstract
This paper investigates the use of morphological segmentation for language modelling and predictive text. Motivated by the goal of providing effective and ergonomic text prediction for low-resource languages,
we examine the task of predictive text entry for five under-resourced and Indigenous languages: Bashkir, Chuvash, K’iche’, Mari and Chukchi. We give an overview of the segmentation methods used,
both statistical and morphological. These methods are used to create datasets from unannotated corpora for language modelling. We train models and measure normalized word-level and character-level perplexities. To
evaluate the models, we use a predictive text entry task and measure the keystroke savings rate. We provide two variants of the evaluation algorithm, which differ in how they process the latest user input: one uses it as the start of a new prediction, the other uses only the unsegmented
part of it as the start of a new prediction. The best score is achieved with BPE for K’iche’ (16.08), with morphological segmentation second best (15.30). We find that no single segmentation model performs best in both the language modelling and predictive text tasks. To determine the best-performing one and to test our hypothesis about predictive text ergonomics, we plan to carry out an end-user predictive text evaluation.
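
As a rough illustration of the keystroke savings rate mentioned above, the following sketch simulates a user who types a word character by character and accepts a suggestion as soon as the intended word appears. It is a minimal sketch under stated assumptions, not the evaluation algorithm used in this paper: the names `keystrokes_for_word`, `keystroke_savings_rate` and `make_prefix_suggester` are hypothetical, suggestions are whole words rather than segments, accepting a suggestion is assumed to cost one keystroke, and spaces are not counted.

```python
from typing import Callable, List

# A suggester maps the prefix typed so far to a list of candidate words.
Suggester = Callable[[str], List[str]]


def keystrokes_for_word(word: str, suggest: Suggester) -> int:
    """Count keystrokes needed to enter `word` with the help of suggestions."""
    for typed in range(1, len(word)):
        if word in suggest(word[:typed]):
            # Assumption: accepting a suggestion costs one extra keystroke.
            return typed + 1
    # No useful suggestion appeared in time: the word is typed in full.
    return len(word)


def keystroke_savings_rate(words: List[str], suggest: Suggester) -> float:
    """Percentage of keystrokes saved relative to typing every character."""
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        raise ValueError("need at least one non-empty word")
    used = sum(keystrokes_for_word(w, suggest) for w in words)
    return 100.0 * (total_chars - used) / total_chars


def make_prefix_suggester(vocabulary: List[str], k: int = 3) -> Suggester:
    """Toy stand-in for a language model: first k vocabulary words with the prefix."""
    def suggest(prefix: str) -> List[str]:
        return [w for w in vocabulary if w.startswith(prefix)][:k]
    return suggest


if __name__ == "__main__":
    suggester = make_prefix_suggester(["prediction", "predictive", "text"])
    print(keystroke_savings_rate(["predictive", "text"], suggester))
```

In a real evaluation the toy prefix suggester would be replaced by a trained language model, and the simulation would operate on segments as well as whole words.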
Keywords
Morphological segmentation, agglutinative languages, polysynthetic languages, evaluation, language modelling, predictive text