Corpus-based Sentence Deletion and Split Decisions for Spanish Text Simplification

Sanja Štajner, Biljana Drndarevic, Horacio Saggion

Abstract


This study addresses the automatic simplification of Spanish texts to make them more accessible to people with cognitive disabilities. A corpus of original and manually simplified news articles was analysed to identify and quantify the operations that a text simplification system should implement. The articles were then compared at the sentence and text level by means of automatic feature extraction and several machine learning classification algorithms, using three groups of features (POS frequencies, syntactic information, and text complexity measures), with the aim of identifying features that help separate original documents from their simplified equivalents. Finally, the study investigated whether these features can be used to decide which simplification operation (split, delete, or reduce) to carry out at the sentence level. Automatic classification of original sentences into those to be kept and those to be eliminated outperformed the classification previously conducted on the same corpus. Kept sentences were further classified into those to be split or significantly reduced in length and those to be left largely unchanged, with an overall F-measure of up to 0.92. Both experiments were conducted and compared on two sets of features: all features, and the best subset returned by an attribute selection algorithm.
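
As a reading aid for the F-measure reported above, the following minimal sketch computes the per-class F-measure and a class-frequency-weighted average for a sentence classification task. The labels and predictions are invented toy data, the function names are hypothetical, and the weighted average is only an assumption about what "overall" means; none of this reproduces the authors' experiments or results.

```python
def f_measure(gold, predicted, positive):
    """Harmonic mean of precision and recall for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f_measure(gold, predicted):
    """Average of per-class F-measures, weighted by class frequency in gold.

    Assumption: this is one common way to report an overall score;
    the abstract does not state which averaging was used.
    """
    total = len(gold)
    return sum(
        gold.count(label) / total * f_measure(gold, predicted, label)
        for label in set(gold)
    )


# Toy data (illustrative only): gold decisions and classifier output
# for six original sentences.
toy_gold = ["keep", "keep", "keep", "delete", "delete", "keep"]
toy_pred = ["keep", "keep", "delete", "delete", "keep", "keep"]

if __name__ == "__main__":
    for label in ("keep", "delete"):
        print(label, round(f_measure(toy_gold, toy_pred, label), 3))
    print("weighted", round(weighted_f_measure(toy_gold, toy_pred), 3))
```

The same functions apply unchanged to the second decision (split or reduce versus leave unchanged) by swapping in those labels.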

Keywords


Spanish text simplification, supervised learning, sentence classification.
