Google Similarity Distance
Presented by: Akshay Kumar, Pankaj Prateek
Are these similar?
– Number ‘1’ vs. color ‘red’
– Number ‘1’ vs. ‘small’
– Horse vs. rider
– True vs. false
– ‘Mona Lisa’ vs. ‘Virgin of the Rocks’
We need some universal similarity measure!
Information Distance
E(x, y): given two strings x and y, the information distance between them is the length of the shortest binary program, in the reference universal computing system, that computes output y from input x and also output x from input y. Up to a negligible logarithmic additive term,
$$E(x, y) = K(x, y) - \min\{K(x), K(y)\}$$
where K(x) is the Kolmogorov complexity of x and K(x, y) is the length of the shortest program that outputs the pair x, y.
Information Distance
– It is a universal distance: it minorizes every other admissible distance, so it captures the dominant feature in which the two strings are similar.
– Not a good measure of similarity by itself, because it is absolute rather than relative: if two short strings differ by an information distance that is large compared to their lengths, they are not similar; if two very long strings differ by the same distance, they are very similar.
Normalized Information Distance
To make the information distance express similarity between two strings, normalize it by the complexity of the strings:
$$NID(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}$$
Problem: Kolmogorov complexity is uncomputable!
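Normalized Compression Distance
To get around the uncomputability of K(x), replace it with C(x), the length of x after compression by a real-world compressor C:
$$NCD(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$
where xy is the concatenation of x and y.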
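A minimal sketch of the compression distance, assuming zlib as the compressor C and the usual form NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}. The function names and sample inputs are illustrative and not taken from the slides; any real compressor could stand in for zlib.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """C(x): length in bytes of x after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: two near-identical texts and one block of seeded noise.
text_a = b"the quick brown fox jumps over the lazy dog. " * 20
text_b = b"the quick brown fox jumps over the lazy cat. " * 20
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(900))

print("similar texts:", ncd(text_a, text_b))
print("text vs noise:", ncd(text_a, noise))
```

The similar texts should score much closer to 0 than the text/noise pair, which should land near 1. Real compressors are imperfect, so values slightly outside [0, 1] are possible.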
Normalized Google Distance
NGD: a way of expressing NCD, with Google page counts taking the place of the compressor
Premise:
– The number of web pages indexed by Google is so vast that page counts approximate the actual relative frequencies of search terms
– Invariant to the growing size of the Google database
Uses the probabilities of search terms to define a similarity distance
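Normalized Google Distance
$$NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}$$
where f(x) is the number of pages containing x, f(x, y) is the number of pages containing both x and y, and N is a normalizing factor, in practice the total number of pages indexed by Google.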
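As a worked illustration, NGD can be computed directly from page counts. The sketch below assumes the page-count definition NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}) from the Cilibrasi and Vitanyi paper. The function name `ngd`, its parameters, and the handling of a zero co-occurrence count are assumptions made for the example. The 'horse'/'rider' counts are illustrative figures, not live search results.

```python
import math


def ngd(f_x: int, f_y: int, f_xy: int, n_pages: int) -> float:
    """Normalized Google Distance from page counts.

    f_x, f_y : pages containing each term
    f_xy     : pages containing both terms
    n_pages  : normalizing factor N (assumed: total number of indexed pages)
    """
    if f_xy == 0:
        # Convention assumed here: terms that never co-occur are infinitely far apart.
        return math.inf
    log_fx, log_fy = math.log(f_x), math.log(f_y)
    numerator = max(log_fx, log_fy) - math.log(f_xy)
    denominator = math.log(n_pages) - min(log_fx, log_fy)
    return numerator / denominator


# Illustrative counts for 'horse', 'rider', and pages containing both terms.
horse_rider = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(horse_rider, 3))  # 0.443
```

The base of the logarithm cancels in the ratio, so natural logarithms are used here for convenience.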
Errors/Shortcomings
– Not applicable to small datasets.
– The definition of N is still problematic: the NGD can be greater than 1 (Kraft’s inequality is violated).
– The number of occurrences of a term on a page is ignored.
– Page-count analysis ignores the position of a word on a page: even though two words appear on the same page, they may not be related.
– The page count of a polysemous word may combine all of its senses, e.g. Apple.
– Every page is taken to be equally probable (it is NOT; this depends on the page rank).
– Given the scale and noise of the WWW, some words may occur on a page arbitrarily (by random chance).
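The point that NGD can exceed 1 can be checked numerically. The sketch below assumes the page-count definition NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}). The counts are hypothetical, chosen so that two common terms almost never co-occur.

```python
import math

# Hypothetical counts: two frequent terms that share a single page.
f_x = f_y = 1_000_000
f_xy = 1
n_pages = 8_000_000_000  # assumed normalizing factor N

numerator = max(math.log(f_x), math.log(f_y)) - math.log(f_xy)
denominator = math.log(n_pages) - min(math.log(f_x), math.log(f_y))
ngd_value = numerator / denominator

print(ngd_value > 1.0)  # True
```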
Suggestions
– Use the snippets returned along with web search results to determine the similarity of words. Caveat: only the snippets of the top-ranking pages can be processed efficiently, and there is no guarantee that they contain all the information needed to measure semantic similarity.
– Combine information from web searches with information from other resources (Wikipedia, WordNet, etc.) to compute the similarity measure.
SVM – NGD Learning
– Given: a set of n-dimensional data points along with their classes, i.e. pairs (x_i, y_i). Only two classes are considered here for simplicity.
– Divide this set into two parts: a training set and a testing set.
– Determine a dividing curve that separates the points of the two classes, using the training set.
– Validate the result on the testing set.
Image Source: http://groups.csail.mit.edu/ddmg/drupal/sites/default/files/images/2.preview.png
SVM – NGD Learning
– Start with a set of training words
– Choose a set of anchor words of cardinality n (much smaller than the set of training words)
– Convert each training word into an n-dimensional vector whose i-th dimension is the NGD between that word and the i-th anchor word
– Train an SVM on these vectors
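The word-to-vector step can be sketched as follows. Everything here is illustrative: the anchor words, the training words, their labels, and the distance values in `TOY_NGD` are made up for the example and are not measured page-count results. The distance function is passed in as a parameter, so a real NGD lookup could replace the toy table.

```python
from typing import Callable, List, Sequence

# Made-up distances standing in for real NGD values (hypothetical data).
TOY_NGD = {
    ("red", "color"): 0.2, ("red", "number"): 0.8,
    ("blue", "color"): 0.25, ("blue", "number"): 0.85,
    ("three", "color"): 0.9, ("three", "number"): 0.15,
    ("seven", "color"): 0.95, ("seven", "number"): 0.1,
}


def toy_ngd(word: str, anchor: str) -> float:
    """Look up the illustrative distance between a word and an anchor."""
    return TOY_NGD[(word, anchor)]


def to_feature_vector(
    word: str,
    anchors: Sequence[str],
    distance: Callable[[str, str], float],
) -> List[float]:
    """i-th component = distance between `word` and the i-th anchor word."""
    return [distance(word, anchor) for anchor in anchors]


ANCHORS = ["color", "number"]            # n = 2 anchor words
TRAINING = [("red", 1), ("blue", 1),     # label 1: colors
            ("three", 0), ("seven", 0)]  # label 0: numbers

features = [to_feature_vector(w, ANCHORS, toy_ngd) for w, _ in TRAINING]
labels = [label for _, label in TRAINING]

for (word, _), vector in zip(TRAINING, features):
    print(word, vector)
```

The resulting `features` and `labels` lists can then be handed to any SVM implementation, for example the `fit(X, y)` method of scikit-learn's `sklearn.svm.SVC`. That step is left out here to keep the sketch free of dependencies.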
References
– Cilibrasi, Rudi L., and Paul M. B. Vitanyi. "The Google similarity distance." IEEE Transactions on Knowledge and Data Engineering 19.3 (2007): 370-383.
– Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. "Measuring semantic similarity between words using web search engines." Proceedings of WWW 2007: 757-766.
– http://en.wikipedia.org/wiki/Normalized_Google_distance