SIMTEX: An Approach for Detecting and Measuring Textual Similarity based on Discourse and Semantics

Authors

  • Iria da Cunha, University Institute for Applied Linguistics
  • Jorge Vivaldi, University Institute for Applied Linguistics
  • Juan Manuel Torres-Moreno, LIA/Agorantic/Université d’Avignon et des Pays de Vaucluse
  • Gerardo Sierra, Universidad Nacional Autónoma de México

DOI:

https://doi.org/10.13053/cys-18-3-2033

Keywords:

Textual similarity, discourse, semantics, paraphrase.

Abstract

Automatic systems for detecting and measuring textual similarity are currently being developed for application to different tasks in the field of Natural Language Processing (NLP). Currently, these systems use surface linguistic features or statistical information; few researchers use deep linguistic information. In this work, we present an algorithm for detecting and measuring textual similarity that takes into account information offered by the discourse relations of Rhetorical Structure Theory (RST) and the lexical-semantic relations included in EuroWordNet. We apply the algorithm, called SIMTEX, to texts written in Spanish, but the methodology is potentially language-independent.

Author Biographies

Iria da Cunha, University Institute for Applied Linguistics

holds a Ph.D. in Applied Linguistics from the Universitat Pompeu Fabra (UPF) in Barcelona. She currently holds a Juan de la Cierva research contract within the IULATERM (Lexicon and Technology) group at the University Institute for Applied Linguistics (IULA). She is also an associate lecturer at the Faculty of Translation and Interpretation of UPF. Her main research lines are discourse parsing, automatic summarization, specialized discourse analysis, and terminology.

Jorge Vivaldi, University Institute for Applied Linguistics

obtained his Ph.D. degree from the Polytechnical University of Catalonia with a dissertation focused on extracting terms from written texts in the biomedical area. He is currently a researcher at the University Institute for Applied Linguistics, Universitat Pompeu Fabra in Barcelona, where he is responsible for the coordination of several projects dealing with corpus processing and information extraction. His areas of interest are mainly related to natural language processing, covering both resource compilation and tool development.

Juan Manuel Torres-Moreno, LIA/Agorantic/Université d’Avignon et des Pays de Vaucluse

obtained his Ph.D. degree in Computer Science (Neural Networks) from the Institut National Polytechnique de Grenoble and his HDR degree from the Laboratoire Informatique d’Avignon (LIA). He is currently a full professor at the LIA (Université d’Avignon et des Pays de Vaucluse), where he is responsible for the NLP team (TALNE) and for the coordination of projects on information extraction. His areas of interest are mainly related to NLP, information extraction, and automatic text summarization.

Gerardo Sierra, Universidad Nacional Autónoma de México

is a National Researcher of Mexico. He leads the Grupo de Ingeniería Lingüística at the Instituto de Ingeniería of the Universidad Nacional Autónoma de México (UNAM). He holds a Ph.D. in Computational Linguistics from the University of Manchester Institute of Science and Technology (UMIST), UK. His research focuses on language engineering and includes computational lexicography, concept extraction, corpus linguistics, text mining, and forensic linguistics.

Published

2014-09-29