Ground Truth Spanish Automatic Extractive Text Summarization Bounds

Griselda Areli Matias Mendoza; Yulia Ledeneva; René Arnulfo García Hernández; Mikhail Alexandrov; Ángel Hernández Castañeda

doi:10.13053/cys-24-3-3484

Authors

Griselda Areli Matias Mendoza, Universidad Autónoma del Estado de México
Yulia Ledeneva, Autonomous University of Barcelona, Department of System Analysis and Informatics
René Arnulfo García Hernández, Universidad Autónoma del Estado de México
Mikhail Alexandrov, Autonomous University of Barcelona, Department of System Analysis and Informatics
Ángel Hernández Castañeda, Universidad Autónoma del Estado de México

DOI:

https://doi.org/10.13053/cys-24-3-3484

Keywords:

Spanish automatic text summarization, ROUGE, ROUGE-C, Jensen-Shannon divergence, TER corpus

Abstract

Textual information is growing at an accelerating rate in the languages most spoken by native Internet users, such as Chinese, Spanish, English, Arabic, Hindi, Portuguese, Bengali, and Russian, among others. This growth calls for innovation in Automatic Text Summarization (ATS) methods, which extract the essential information so that a reader does not have to read the entire text. The most competent methods are Extractive ATS (EATS) methods, which select essential parts of the document (sentences, phrases, or paragraphs) to compose a summary. Over the last 60 years of EATS research, the creation of standard corpora with human-generated summaries, together with evaluation methods that correlate highly with human judgments, has helped to increase the number of new state-of-the-art methods. However, these methods are supported mainly for English, leaving aside other equally important languages such as Spanish, which is the second most spoken language by native speakers and the third most used on the Internet. In this paper, a standard corpus for Spanish EATS (SAETS) is created to evaluate state-of-the-art methods and systems for the Spanish language. The main contribution is a proposal for the configuration and evaluation of five state-of-the-art methods, five systems, and four heuristics using three evaluation methods (ROUGE, ROUGE-C, and Jensen-Shannon divergence). It is the first time that Jensen-Shannon divergence is used to evaluate EATS. The paper presents the ground truth bounds for the Spanish language, given by the heuristics baseline:first, baseline:random, topline, and concordance. In addition, a ranking of 30 evaluation tests of the state-of-the-art methods and systems is calculated, forming a benchmark for SAETS.
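To make two of the ingredients named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace tokenization, unigram distributions, and a word budget for the summary; the function names `unigram_distribution`, `jensen_shannon_divergence`, and `baseline_first` are hypothetical. It shows how a Jensen-Shannon divergence between a source text and a candidate summary could be computed, and how a baseline:first heuristic might take the leading sentences of a document.

```python
import math
from collections import Counter


def unigram_distribution(text):
    """Relative frequencies of lowercase, whitespace-separated tokens (an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution M = (P + Q) / 2
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def baseline_first(sentences, max_words):
    """Keep leading sentences until the next one would exceed the word budget."""
    summary, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_words:
            break
        summary.append(sentence)
        used += n
    return summary
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared vocabulary), so a lower value suggests that the summary's word distribution is closer to that of the source.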

Published

2020-09-29

Issue

Vol. 24 No. 3 (2020)

Section

Articles

License

I transfer exclusively to the journal “Computación y Sistemas”, published by the Centro de Investigación en Computación (CIC), the copyright of the aforementioned article, and I accept that it will not be transferred to any other publication, in any format, language, or medium, whether existing (including electronic and multimedia) or yet to be developed.

I certify that the article has not been previously disclosed or simultaneously submitted to another publication, and that it contains no material whose publication would violate the copyright or other proprietary rights of any person, company, or institution. I further certify that I have authorization from the institution or company where I work or study to publish this Work.

The representative author accepts responsibility for the publication of the Work on behalf of each and every one of the authors.

This Transfer is subject to the following reservations:

The authors retain all proprietary rights (such as patent rights) in this Work, except for the publication rights transferred to the CIC by means of this document.
The authors retain the right to publish the Work, in whole or in part, in any book of which they are authors or editors, and to make personal use of this Work in conferences, courses, personal web pages, etc.
