Dependency vs. Constituent Based Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition

Authors

  • Hiram Calvo Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)
  • Andrea Segura-Olivares Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)
  • Alejandro García Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)

DOI:

https://doi.org/10.13053/cys-18-3-2044

Keywords:

Paraphrase recognition, Microsoft Research Paraphrase Corpus, similarity measures, syntactic n-grams, constituent analysis, dependency analysis.

Abstract

Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques have been used to solve this problem. Most works use n-grams to measure word overlap; however, syntactic n-grams have scarcely been explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. An in-depth study presents the strengths and weaknesses of each approach, as well as a common error analysis. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.
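As a rough illustration of the distinction drawn in the abstract (not the authors' implementation), the sketch below contrasts traditional word bigrams with dependency-based syntactic bigrams on an active/passive paraphrase pair. The dependency heads are written by hand rather than produced by a parser, the Jaccard coefficient stands in for whichever similarity measures the paper actually uses, and the function names `word_ngrams`, `dependency_ngrams`, and `jaccard` are hypothetical.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Traditional n-grams: windows of n adjacent words in surface order."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dependency_ngrams(tokens: List[str], heads: List[int], n: int) -> Set[NGram]:
    """Syntactic n-grams: paths of n words following head links in a dependency tree.

    heads[i] is the index of the head of token i, or -1 for the root.
    Each n-gram is listed head-first (governor before dependent).
    """
    grams: Set[NGram] = set()
    for i in range(len(tokens)):
        path = [i]
        # Climb from the token towards the root until the path has n words.
        while len(path) < n and heads[path[-1]] != -1:
            path.append(heads[path[-1]])
        if len(path) == n:
            grams.add(tuple(tokens[j] for j in reversed(path)))
    return grams


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard coefficient, used here only as a stand-in similarity measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hand-annotated (assumed) dependency analyses of a paraphrase pair.
tokens_1 = ["the", "cat", "chased", "the", "mouse"]
heads_1 = [1, 2, -1, 4, 2]
tokens_2 = ["the", "mouse", "was", "chased", "by", "the", "cat"]
heads_2 = [1, 3, 3, -1, 6, 6, 3]

word_sim = jaccard(word_ngrams(tokens_1, 2), word_ngrams(tokens_2, 2))
dep_sim = jaccard(
    dependency_ngrams(tokens_1, heads_1, 2),
    dependency_ngrams(tokens_2, heads_2, 2),
)
print(round(word_sim, 2))  # 0.25
print(round(dep_sim, 2))   # 0.67
```

In this toy pair the surface word order changes between the two sentences, so few adjacent-word bigrams are shared, while most head-dependent pairs survive the rewording. Whether this advantage holds on real data, and how dependency n-grams compare with constituent n-grams, is the question the paper evaluates.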

Author Biographies

Hiram Calvo, Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)

Hiram Calvo obtained his PhD degree in Computer Science (with honors) in 2006 from the Centro de Investigación en Computación (CIC) of the Instituto Politécnico Nacional (IPN), Mexico. His thesis was about the Spanish syntax analyzer DILUCT. He was awarded the Lázaro Cárdenas Prize in 2006; the prize was presented personally by the President of Mexico. He completed a postdoctoral stay at the Nara Institute of Science and Technology, Japan, from 2008 to 2010. He has been a lecturer at CIC-IPN since 2006. His research interests are lexical semantics, similarity measures, and statistical language models.

Andrea Segura-Olivares, Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)

Andrea Segura-Olivares is an MSc student at CIC-IPN. Her research interests include Natural Language Processing and intelligent Human-Computer Interfaces. She obtained her BSc degree from the Escuela Superior de Cómputo (ESCOM-IPN) in 2012 with honours.

Alejandro García, Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)

Alejandro García is an MSc student at CIC-IPN. His research interests include Natural Language Processing and intelligent videogame design with natural language interaction. He obtained his BSc degree from the Escuela Superior de Cómputo (ESCOM-IPN) in 2012 with honours.

Published

2014-09-29