Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features

Zahurul Islam, Alexander Mehler

Abstract


This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For the evaluation, we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well as their linguistic counterparts, even when several linguistic levels are explored at once.


Keywords


Text readability, Wikipedia, entropy, information transmission, evaluation of features.
