Semantic Textual Similarity Methods, Tools, and Applications: A Survey

Goutam Majumder; Partha Pakray; Alexander Gelbukh; David Pinto

doi:10.13053/cys-20-4-2506

Authors

Goutam Majumder National Institute of Technology Mizoram, Aizawl
Partha Pakray National Institute of Technology Mizoram, Aizawl
Alexander Gelbukh Instituto Politécnico Nacional, CIC
David Pinto Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science

DOI:

https://doi.org/10.13053/cys-20-4-2506

Keywords:

WordNet taxonomy, natural language processing, semantic textual similarity, information content, random walk, statistical similarity, cosine similarity, term-based similarity, character-based similarity, n-gram, Jaccard similarity, WordNet similarity.

Abstract

Measuring Semantic Textual Similarity (STS) between words/terms, sentences, paragraphs, and documents plays an important role in computer science and computational linguistics. It also has many applications in fields such as Biomedical Informatics and Geoinformation. In this paper, we present a survey of different methods of textual similarity, and we also report on the availability of software and tools that are useful for STS. In natural language processing (NLP), STS is an important component of many tasks, such as document summarization, word sense disambiguation, short answer grading, and information retrieval and extraction. We split the measures of semantic similarity into three broad categories: (i) Topological/Knowledge-based, (ii) Statistical/Corpus-based, and (iii) String-based. More emphasis is given to the methods related to the WordNet taxonomy, because topological methods play an important role in understanding the intended meaning of an ambiguous word, which is very difficult to process computationally. We also propose a new method for measuring semantic similarity between sentences. The proposed method uses the advantages of taxonomy methods and merges this information into a language model. It considers the WordNet synsets for lexical relationships between nodes/words, and a uni-gram language model is implemented over a large corpus to assign the information content value between two nodes of different classes.
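As a rough illustration of the measures named in the keywords and abstract, the sketch below computes Jaccard similarity (term-based), cosine similarity over raw term-frequency vectors, and a unigram information content value. It is not the method proposed in the paper: the function names, the whitespace tokenizer, and the definition IC(w) = -log p(w) estimated from raw corpus counts are assumptions made for this example only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Term-based overlap: |A & B| / |A | B| over the two token sets."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def information_content(word, corpus_counts):
    """Unigram information content, IC(w) = -log p(w).

    Assumes `word` occurs at least once in `corpus_counts`;
    no smoothing is applied in this sketch.
    """
    total = sum(corpus_counts.values())
    return -math.log(corpus_counts[word] / total)
```

For instance, `jaccard_similarity("a cat sat", "a cat ran")` shares two of four distinct tokens, and rarer words in the corpus counts receive a higher information content than frequent ones.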

Author Biographies

Goutam Majumder, National Institute of Technology Mizoram, Aizawl

Received his M. Tech degree in Computer Science & Engineering from Tripura University (A Central University), India, as a first rank holder. He is currently a Ph.D. scholar and Assistant Professor at the Department of Computer Science & Engineering of the National Institute of Technology Mizoram. His research interests are image processing and natural language processing. He also worked as a research associate in the Bio-Metrics Laboratory of the Computer Science & Engineering Department of Tripura University (A Central University).

Partha Pakray, National Institute of Technology Mizoram, Aizawl

Received his Ph.D. degree in Computer Science and Engineering from Jadavpur University, India. He is currently Head and Assistant Professor at the Department of Computer Science and Engineering of the National Institute of Technology Mizoram. He twice received a fellowship from the European Research Consortium for Informatics and Mathematics (ERCIM) and worked at the Norwegian University of Science and Technology, Norway, and Masaryk University, Czech Republic, as a postdoctoral fellow. He also worked at the Xerox Research Centre Europe (XRCE) as a research intern. He has published 50 research publications in various areas of NLP.

Alexander Gelbukh, Instituto Politécnico Nacional, CIC

Received his M.Sc. degree in Mathematics from the Lomonosov Moscow State University, Russia, and his Ph.D. in Computer Science from VINITI, Russia. He is currently a Research Professor and Head of the Natural Language Processing Laboratory of the Center for Computing Research (Centro de Investigación en Computación, CIC) of the Instituto Politécnico Nacional (IPN), México. He is a former President of the Mexican Society of Artificial Intelligence (SMIA), a Member of the Mexican Academy of Sciences, and a National Researcher of México (SNI) at excellence level 2. He is author or coauthor of more than 500 research publications in natural language processing and artificial intelligence.

David Pinto, Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science

Received his PhD in computer science in the area of artificial intelligence and pattern recognition at the Polytechnic University of Valencia, Spain, in 2008. At present he is a full-time professor at the Faculty of Computer Science of the Benemérita Universidad Autónoma de Puebla (BUAP), where he leads the PhD program on Language & Knowledge Engineering. His areas of interest include clustering, information retrieval, cross-lingual NLP tasks, and computational linguistics in general. He has published more than 100 research publications in NLP and artificial intelligence.
