Semantic Textual Similarity Methods, Tools, and Applications: A Survey

Goutam Majumder; Partha Pakray; Alexander Gelbukh; David Pinto

doi:10.13053/cys-20-4-2506

Authors

Goutam Majumder National Institute of Technology Mizoram, Aizawl
Partha Pakray National Institute of Technology Mizoram, Aizawl
Alexander Gelbukh Instituto Politécnico Nacional, CIC, México City
David Pinto Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science, México

DOI:

https://doi.org/10.13053/cys-20-4-2506

Keywords:

WordNet taxonomy, natural language processing, semantic textual similarity, information content, random walk, statistical similarity, cosine similarity, term-based similarity, character-based similarity, n-gram, Jaccard similarity, WordNet similarity.

Abstract

Measuring Semantic Textual Similarity (STS) between words/terms, sentences, paragraphs, and documents plays an important role in computer science and computational linguistics. It also has many applications in fields such as biomedical informatics and geoinformation. In this paper, we present a survey of different methods of textual similarity and report on the availability of software and tools that are useful for STS. In natural language processing (NLP), STS is an important component of many tasks such as document summarization, word sense disambiguation, short answer grading, and information retrieval and extraction. We divide the measures of semantic similarity into three broad categories: (i) topological/knowledge-based, (ii) statistical/corpus-based, and (iii) string-based. More emphasis is given to the methods related to the WordNet taxonomy, because topological methods play an important role in understanding the intended meaning of an ambiguous word, which is very difficult to process computationally. We also propose a new method for measuring semantic similarity between sentences. The proposed method uses the advantages of taxonomy methods and merges this information into a language model. It considers the WordNet synsets for lexical relationships between nodes/words, and a unigram language model is implemented over a large corpus to assign the information content value between two nodes of different classes.
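As a rough illustration of the string-based/term-based category named in the keywords, the following minimal sketch computes Jaccard similarity over token sets and cosine similarity over term-frequency vectors. It is not the method proposed in the paper: the whitespace tokenizer and the function names `tokenize`, `jaccard_similarity`, and `cosine_similarity` are assumptions made only for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Jaccard similarity: size of the token-set intersection over the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no terms with anything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    s1 = "a cat sat"
    s2 = "a cat ran"
    print(jaccard_similarity(s1, s2))
    print(cosine_similarity(s1, s2))
```

Both measures look only at surface tokens, so they cannot relate synonyms such as "car" and "automobile"; that gap is what the knowledge-based and corpus-based categories are meant to address.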

Author Biographies

Goutam Majumder, National Institute of Technology Mizoram, Aizawl

Received his M.Tech. degree in Computer Science & Engineering from Tripura University (A Central University), India, as a first rank holder. He is currently a Ph.D. scholar and Assistant Professor at the Department of Computer Science & Engineering of the National Institute of Technology Mizoram. His research interests are image processing and natural language processing. He also worked as a research associate in the Bio-Metrics Laboratory of the Computer Science & Engineering Department of Tripura University (A Central University).

Partha Pakray, National Institute of Technology Mizoram, Aizawl

Received his Ph.D. degree in Computer Science and Engineering from Jadavpur University, India. He is currently Head and Assistant Professor at the Department of Computer Science and Engineering of the National Institute of Technology Mizoram. He twice received a fellowship from the European Research Consortium for Informatics and Mathematics (ERCIM) and worked at the Norwegian University of Science and Technology, Norway, and at Masaryk University, Czech Republic, as a postdoctoral fellow. He also worked at the Xerox Research Centre Europe (XRCE) as a research intern. He has published 50 research publications in various areas of NLP.

Alexander Gelbukh, Instituto Politécnico Nacional, CIC, México City

Received his M.Sc. degree in Mathematics from the Lomonosov Moscow State University, Russia, and his Ph.D. in Computer Science from VINITI, Russia. He is currently a Research Professor and Head of the Natural Language Processing Laboratory of the Center for Computing Research (Centro de Investigación en Computación, CIC) of the Instituto Politécnico Nacional (IPN), México. He is a former President of the Mexican Society of Artificial Intelligence (SMIA), a Member of the Mexican Academy of Sciences, and a National Researcher of México (SNI) at excellence level 2. He is the author or coauthor of more than 500 research publications in natural language processing and artificial intelligence.

David Pinto, Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science, México

Received his Ph.D. in computer science, in the area of artificial intelligence and pattern recognition, from the Polytechnic University of Valencia, Spain, in 2008. At present he is a full-time professor at the Faculty of Computer Science of the Benemérita Universidad Autónoma de Puebla (BUAP), where he leads the Ph.D. program on Language & Knowledge Engineering. His areas of interest include clustering, information retrieval, cross-lingual NLP tasks, and computational linguistics in general. He has published more than 100 research publications in NLP and artificial intelligence.
