IJIRST – International Journal for Innovative Research in Science & Technology | Volume 3 | Issue 12 | May 2017 | ISSN (online): 2349-6010
A Survey on Semantic Similarity Measures

Aditi Gupta
PG Student
Department of Computer Science & Engineering
JSS Academy of Technical Education, Noida

Mr. Ajay Kumar
Assistant Professor
Department of Computer Science & Engineering
JSS Academy of Technical Education, Noida

Dr. Jyoti Gautam
Head of Department & Associate Professor
Department of Computer Science & Engineering
JSS Academy of Technical Education, Noida
Abstract
Measuring the semantic similarity between words, sentences and concepts is an important task in information retrieval, document clustering, web mining and word sense disambiguation. Semantic similarity is a measure of the extent to which two concepts are similar, based on the likeness of their meaning. This survey discusses the existing similarity measures by partitioning them into two approaches: corpus-based and knowledge-based. The features, performance, advantages and disadvantages of the various semantic similarity measures are discussed. The aim of this paper is to provide an effective evaluation of these measures and to help researchers select the measure best suited to their requirements.

Keywords: Semantic similarity, Corpus-based similarity, Knowledge-based similarity, Semantic relatedness

I. INTRODUCTION
Semantic similarity plays an important role in the fields of data processing, artificial intelligence and data mining. It is also useful in information management, especially in environments such as the semantic web, where data may originate from different sources and has to be integrated in a flexible way. Semantic similarity is a measure of the extent to which two concepts are similar, based on the likeness of their meaning. The concepts can be words, sentences or paragraphs. Semantic similarity places concepts at distances from one another in a semantic space such that the smaller the distance, the greater the similarity. In other words, semantic similarity identifies the characteristics that concepts have in common.

Concepts can be similar in two ways: lexically or semantically. Words that share a similar character sequence are lexically similar, whereas semantics is concerned with the meaning of words. Different words or concepts are semantically similar if their meanings are the same even when their lexical structures differ. The techniques used to find the semantic similarity between words can be extended to find the similarity between phrases, sentences or paragraphs. Lexical similarity is computed using string-based algorithms, while semantic similarity is computed using corpus-based and knowledge-based algorithms.

Semantic similarity measures are used intensively in knowledge-based and semantic information retrieval systems to identify an optimal match between user query terms and documents. They are also used in word sense disambiguation to identify the correct sense of a term in a given context. Semantic similarity and semantic relatedness are different but related concepts. For example, "mother" and "child" are related terms but are not similar, since they have different meanings.

The survey is structured as follows: Section II highlights related work in the field of semantic similarity measures. Section III describes corpus-based similarity measures and Section IV describes knowledge-based similarity measures. Section V summarizes the various semantic similarity measures in tabular form. Section VI presents the conclusions of the survey.

II. RELATED WORK

This section presents a survey of semantic similarity measures that have already been proposed. Table 1 summarizes the related work in the field of semantic similarity measures.
Table – 1: Related work in semantic similarity measures

| S.No | Title | Authors | Year | Objective |
|------|-------|---------|------|-----------|
| 1 | Assessment of semantic similarity of concepts defined in ontology [24] | Parisa D. and Marek Reformat | 2013 | A method is proposed to determine the similarity between concepts defined in an ontology, focusing on the relations between concepts and their semantic relations. |
| 2 | An ontology-based semantic similarity measure for biomedical data: application to radiology reports [18] | Thusitha Mabotuwana, Michael C. Lee and Eric V. | 2013 | A notion of semantic similarity is used to overcome the limitations of direct concept matching. |
| 3 | A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts [19] | Rimma Pivovarov and Noemie Elhadad | 2012 | A comprehensive method is proposed which computes a similarity score for a concept pair by combining data-driven and ontology-driven knowledge. |
| 4 | Ontology-based semantic similarity: a new feature-based approach [25] | David Sanchez, Montserrat Batet, David Isern and Aida Valls | 2012 | Ontology-based approaches such as edge counting, feature-based approaches and measures based on information content are classified, and a new ontology-based measure relying on the exploitation of taxonomical features is proposed. |
| 5 | Unsupervised semantic similarity computation between terms using web documents [26] | Elias Iosif and Alexandros Potamianos | 2012 | The proposed method requires only a context-based metric computed over web document search results and does not require other resources such as page counts, ontologies or snippets. |
| 6 | Semantic similarity estimation in the biomedical domain: an ontology-based information-theoretic perspective [20] | David Sanchez and Montserrat Batet | 2011 | Along with an information-content-based semantic measure, ontology-based edge-counting measures are redefined in terms of information content. |
| 7 | Measuring semantic similarity between biomedical concepts within multiple ontologies [23] | Hisham Al-Mubaid and Hoa A. Nguyen | 2009 | The proposed algorithm is based on three features: a cross-modified path length between concepts, a new feature for the common specificity of concepts in the ontology, and the local granularity of ontology clusters. |
| 8 | An approach for measuring semantic similarity between words using multiple information sources [21] | Y. Li, Z. A. Bandar and D. McLean | 2003 | A new measure of semantic similarity is proposed which combines information from multiple sources non-linearly. |
III. CORPUS BASED SIMILARITY

A corpus is a large collection of written and spoken material used to study and describe a language. A huge collection of documents and information is available online, and it can be used to determine the semantic relatedness among different words or concepts. Figure 1 shows the various corpus-based similarity measures.
Fig. 1: Corpus-based similarity measures
Latent Semantic Analysis (LSA): LSA is one of the earliest semantic similarity methods and was proposed by Landauer and Dumais [27]. It is based on the assumption that words with similar meanings occur in similar pieces of text. The LSA model analyzes the text, calculates term frequencies and creates a weighted matrix in which rows represent the unique terms and columns represent the text samples; each cell holds the frequency of a term in the corresponding text sample. A mathematical technique called singular value decomposition (SVD) is then used to reduce the dimensionality, reducing the number of columns while preserving the number of rows. Terms are compared by evaluating the cosine of the angle between the vectors formed by any two rows. The LSA model overcomes limitations of the traditional vector space model such as high dimensionality and sparseness.

Generalized Latent Semantic Analysis (GLSA): GLSA extends the LSA approach by focusing on term vectors rather than the document-term representation. In GLSA [3], unlike in LSA, the weighted matrix is constructed in the last step to provide weights for the term vectors. GLSA measures the semantic relatedness between terms and then applies a dimensionality reduction technique. Whereas LSA uses only the cosine function to measure the similarity between term vectors, GLSA can combine any measure of semantic association between terms with any method of dimensionality reduction. GLSA is therefore considered a better approach than LSA.
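As an illustration of the LSA pipeline described above, the following minimal sketch builds a term-document matrix, applies a truncated SVD and compares two terms with the cosine measure. The toy corpus, the number of retained dimensions and the use of scikit-learn are illustrative assumptions, not part of the surveyed method.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: real LSA experiments use large document collections.
docs = [
    "the dog chased the cat",
    "the dog and the cat are pets",
    "the stock market fell sharply today",
]

# Term-document matrix: rows are terms, columns are text samples.
vectorizer = CountVectorizer().fit(docs)
td = vectorizer.transform(docs).T.toarray().astype(float)

# Truncated SVD keeps the strongest latent dimensions (here k = 2).
U, s, Vt = np.linalg.svd(td, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]

# Compare two terms by the cosine of the angle between their row vectors.
vocab = list(vectorizer.get_feature_names_out())
dog, cat = vocab.index("dog"), vocab.index("cat")
print(cosine_similarity(term_vectors[dog:dog + 1], term_vectors[cat:cat + 1])[0, 0])
```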
Hyperspace Analogue to Language (HAL): HAL uses word co-occurrences to form a semantic space. In this approach [1, 2], a matrix is created in which both rows and columns represent words, and each element records the strength of association between the corresponding row and column words. As the text is analyzed, a focus word is selected and compared with the neighboring words, which are termed co-occurring words. The co-occurrence strength is inversely proportional to the distance from the focus word, and these values are stored in the matrix. Dimensionality reduction can be applied by dropping columns with low entropy. Moreover, HAL captures the pattern of word occurrences by treating co-occurrences differently depending on whether they appear before or after the focus word.

Semantic Similarity using Web Search Engines: This approach [16] uses web search results to find the semantic relatedness between two words, relying on the page counts and snippets returned by a web search engine. Suppose the similarity between two words A and B is to be computed. The method queries A and B separately and records their page counts; in the same way, the combined query "A AND B" is issued and its page count is recorded. The snippets are used to find the term frequencies of A and B, and the observed lexical patterns are ranked according to their ability to semantically relate different words. The similarity scores derived from page counts and snippets are then integrated using a support vector machine (SVM) to evaluate the semantic similarity between the words (a small sketch of the page-count scores is given at the end of this section).

Similarity based on Corpus Statistics and Lexical Taxonomy: In this approach [9], a lexical taxonomy is integrated with statistical information derived from a corpus. The method enhances the edge-based approach to similarity calculation by adding the information used by node-based methods. Since both edge-based and node-based calculations are performed, the algorithm is claimed to outperform other computational methods.

Explicit Semantic Analysis (ESA): In this method [4], the semantic relatedness between any two arbitrary texts is computed. The meaning of a word is represented as a weighted vector built from Wikipedia: words are mapped to high-dimensional vectors whose components are the TF-IDF weights of the word with respect to Wikipedia articles. The semantic similarity between the resulting vectors is computed using the cosine similarity measure.

Cross-Language Explicit Semantic Analysis (CL-ESA): This is a multilingual generalization of ESA. In this approach [5], the semantic similarity of two documents written in different languages is computed using the cosine similarity function.
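To make the page-count component of the web-search approach concrete, here is a small sketch of two commonly used page-count scores (a Jaccard-style overlap and a PMI-style association) in the spirit of [16]. The counts, the index size N and the noise threshold are hypothetical, and the snippet-pattern extraction and SVM combination stages are omitted.

```python
import math

# Hypothetical page counts for the queries "A", "B" and "A AND B".
# Real values would come from a search engine API; these are made up.
N = 1e10          # assumed number of pages indexed by the engine
C = 5             # co-occurrence counts at or below this are treated as noise
h_a, h_b, h_ab = 3_400_000, 2_100_000, 450_000

def web_jaccard(ha, hb, hab, c=C):
    """Jaccard-style overlap computed from page counts."""
    return 0.0 if hab <= c else hab / (ha + hb - hab)

def web_pmi(ha, hb, hab, n=N, c=C):
    """Pointwise mutual information estimated from page counts."""
    if hab <= c:
        return 0.0
    return math.log2((hab / n) / ((ha / n) * (hb / n)))

print(web_jaccard(h_a, h_b, h_ab))  # overlap-based score
print(web_pmi(h_a, h_b, h_ab))      # association strength in bits
```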
IV. KNOWLEDGE BASED SIMILARITY

Ontologies, taxonomies and semantic networks are knowledge representation forms used in information retrieval, and various methods use them to find the semantic similarity between terms or concepts. Knowledge-based similarity measures use information derived from semantic networks to identify the degree of similarity between words. WordNet is the most widely used semantic network: it is a large lexical database of English in which verbs, nouns, adverbs and adjectives are grouped into sets of similar meaning known as synsets, and these synsets are linked to one another by conceptual-semantic and lexical relations. Knowledge-based measures can be classified into measures of semantic similarity and measures of semantic relatedness.

Semantic similarity approaches use information content to evaluate similarity and are also termed node-based approaches. Semantic relatedness approaches deal with the more general notion of relatedness rather than with likeness of meaning; they are also termed edge-based approaches and use conceptual distance to evaluate similarity. Semantic similarity is itself a kind of relatedness between words. There are various measures of semantic similarity; some are based on information content and the rest on path length. These measures are shown in Figure 2.
Fig. 2: Knowledge Based Semantic Measures
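Before turning to the individual measures, the following minimal sketch (assuming NLTK's WordNet interface and the arbitrarily chosen example word "car") shows the synset and is-a structure that knowledge-based measures operate on.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

car = wn.synsets('car')[0]          # first sense, i.e. the synset car.n.01
print(car.definition())             # the gloss of the synset
print(car.lemma_names())            # synonymous lemmas grouped in the synset
print(car.hypernyms())              # more general concepts linked by is-a relations
print(car.hypernym_paths()[0])      # one path from the taxonomy root down to this synset
```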
Resnik (res) [8] proposed quantifying the information content shared by concepts in order to measure their similarity: the more information two concepts share, the more similar they are. The measure returns the information content of the Least Common Subsumer (the most informative shared subsumer), and its value is always positive. The Lin (lin) [6] and Jiang and Conrath (jcn) [9] measures extend this idea by also using the information content of the concepts themselves: lin normalizes the information content of the Least Common Subsumer by the sum of the information content of the two concepts, whereas jcn is based on the difference between that sum and twice the information content of the Least Common Subsumer. Leacock and Chodorow (lch) [10] measure similarity using path length: the score is derived from the shortest path that connects the senses of the two words. Wu and Palmer (wup) [11] return a score based on the depths of the two senses in the taxonomy and the depth of their Least Common Subsumer. Path Length (path) returns a score based on the shortest path between the word senses in the is-a taxonomy. Hirst and St-Onge (hso) [12] measure semantic relatedness by finding lexical chains that connect the word senses. Lesk (lesk) [13] measures semantic relatedness by finding overlaps between the glosses of the synsets of the two terms; the relatedness score is the sum of the squares of the overlap lengths. Vector pairs (vector) [14] measure semantic relatedness by creating a co-occurrence matrix and representing each word as a vector in that space.
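The measures named above are available in off-the-shelf tools. The sketch below (an illustrative use of NLTK's WordNet interface, with the Brown information-content file and the word pair dog/cat chosen arbitrarily) computes the path, lch, wup, res, lin and jcn scores; the hso, lesk and vector relatedness measures are provided by the separate WordNet::Similarity package rather than by NLTK.

```python
# Requires: nltk.download('wordnet') and nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content from the Brown corpus

print(dog.path_similarity(cat))            # shortest is-a path
print(dog.lch_similarity(cat))             # Leacock-Chodorow
print(dog.wup_similarity(cat))             # Wu-Palmer
print(dog.res_similarity(cat, brown_ic))   # Resnik: IC of the least common subsumer
print(dog.lin_similarity(cat, brown_ic))   # Lin: 2*IC(LCS) / (IC(c1) + IC(c2))
print(dog.jcn_similarity(cat, brown_ic))   # Jiang-Conrath: 1 / (IC(c1)+IC(c2)-2*IC(LCS))
```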
V. SEMANTIC SIMILARITY MEASURES

Table 2 gives a review of various semantic similarity measures.

Table – 2: Semantic similarity measures

| Method | Principle | Measure | Feature | Advantage | Disadvantage |
|--------|-----------|---------|---------|-----------|--------------|
| Path based | Length of the path linking the word senses | Shortest Path | Number of edges between the concepts | Simple measure | Different pairs with equal shortest path length have the same similarity |
| Path based | Length of the path linking the word senses | Wu & Palmer | Path length augmented by the subsumer's path to the root | Simple measure | Different pairs with the same lowest common subsumer and equal path length have the same similarity |
| Path based | Length of the path linking the word senses | L&C | Number of edges between the concepts | Simple measure | Different pairs with equal shortest path length have the same similarity |
| Information content based | Concepts sharing common information are similar | Resnik | Information content of the lowest common subsumer | Simple measure | Different pairs with the same lowest common subsumer have the same similarity |
| Information content based | Concepts sharing common information are similar | Lin | Information content of the lowest common subsumer and of the compared concepts | Considers the information content of the compared concepts | Different pairs with the same sum of information content have the same similarity |
| Feature based | Concepts having common features are similar | Tversky | Compares the features of the concepts | Considers concept features while computing similarity | Computationally complex |
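As a companion to the feature-based row of Table 2, the following minimal sketch illustrates Tversky-style feature comparison; the feature sets and the weighting parameters are invented for illustration and are not taken from the surveyed papers.

```python
def tversky_similarity(features_a, features_b, alpha=0.5, beta=0.5):
    """Ratio-model similarity: shared features raise the score,
    distinctive features (weighted by alpha and beta) lower it."""
    a, b = set(features_a), set(features_b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0

# Hypothetical feature sets for two concepts.
car = {"wheels", "engine", "doors", "steering wheel"}
bicycle = {"wheels", "pedals", "handlebar"}
print(tversky_similarity(car, bicycle))   # about 0.29 with alpha = beta = 0.5
```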
VI. CONCLUSION

Semantic similarity plays a vital role in the fields of information retrieval, natural language processing and artificial intelligence. Based on a thorough literature study, this paper gives a brief review of various semantic similarity measures. Two types of measure are discussed: corpus-based and knowledge-based. Corpus-based similarity measures determine the similarity between concepts using information obtained from large corpora; algorithms such as LSA, GLSA, HAL, ESA, CL-ESA, web search engine based similarity and the combination of corpus statistics with a lexical taxonomy are discussed. Knowledge-based measures use information gained from semantic networks to compute the similarity between concepts; semantic similarity algorithms such as res, lin, jcn, lch and wup, and semantic relatedness algorithms such as hso, lesk and vector are discussed. This paper provides an evaluation of all these measures and helps researchers select the measure best suited to their requirements.

VII. REFERENCES
[1] Lund, K., Burgess, C. & Atchley, R. A. (1995). Semantic and associative priming in a high-dimensional semantic space. Cognitive Science Proceedings (LEA), 660-665.
[2] Lund, K. & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28(2), 203-208.
[3] Matveeva, I., Levow, G., Farahat, A. & Royer, C. (2005). Generalized latent semantic analysis for term representation. In Proceedings of RANLP.
[4] Gabrilovich, E. & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6-12.
[5] Martin, P., Benno, S. & Maik, A. (2008). A Wikipedia-based multilingual retrieval model. In Proceedings of the 30th European Conference on IR Research (ECIR), pp. 522-530.
[6] Lin, D. (1998). Extracting collocations from text corpora. In Workshop on Computational Terminology, Montreal, Canada, pp. 57-63.
[7] Mihalcea, R., Corley, C. & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the American Association for Artificial Intelligence, Boston, MA.
[8] Resnik, P. (1995). Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada.
[9] Jiang, J. & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan.
[10] Leacock, C. & Chodorow, M. (1998). Combining local context and WordNet sense similarity for word sense identification. In WordNet: An Electronic Lexical Database. MIT Press.
[11] Wu, Z. & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico.
[12] Hirst, G. & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, pp. 305-332. MIT Press.
[13] Banerjee, S. & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, pp. 136-145.
[14] Patwardhan, S. (2003). Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master's thesis, University of Minnesota, Duluth.
[15] Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327-352.
[16] Bollegala, D., Matsuo, Y. & Ishizuka, M. (2007). Measuring semantic similarity between words using web search engines. In Proceedings of WWW, pp. 757-766.
[17] Slimani, T. (2013). Description and evaluation of semantic similarity measures approaches. International Journal of Computer Applications, 80(10).
[18] Mabotuwana, T., et al. (2013). An ontology-based similarity measure for biomedical data: application to radiology reports. Journal of Biomedical Informatics. http://dx.doi.org/10.1016/j.jbi.2013.06.013
[19] Pivovarov, R. & Elhadad, N. (2012). A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts. Journal of Biomedical Informatics, 45(3), 471-481.
[20] Sanchez, D. & Batet, M. (2011). Semantic similarity estimation in the biomedical domain: an ontology-based information-theoretic perspective. Journal of Biomedical Informatics, 44, 749-759. doi:10.1016/j.jbi.2011.03.013
[21] Li, Y., Bandar, Z. A. & McLean, D. (2003). An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 15(4).
[23] Al-Mubaid, H. & Nguyen, H. A. (2009). Measuring semantic similarity between biomedical concepts within multiple ontologies. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 39(4).
[24] Parisa, D. & Reformat, M. (2013). Assessment of semantic similarity of concepts defined in ontology. Information Sciences. doi:10.1016/j.ins.2013.06.056
[25] Sanchez, D., Batet, M., Isern, D. & Valls, A. (2012). Ontology-based semantic similarity: a new feature-based approach. Expert Systems with Applications, 39, 7718-7728.
[26] Iosif, E. & Potamianos, A. (2012). Unsupervised semantic similarity computation between terms using web documents. IEEE Transactions on Knowledge and Data Engineering, 22(11).
[27] Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104.