Algorithmic Detection of Semantic Similarity WWW 2005.

1 Algorithmic Detection of Semantic Similarity WWW 2005

2 Outline
- Abstract
- Introduction
- Semantic Similarity
  - Tree-Based Similarity
  - Graph-Based Similarity
- Evaluation
  - Analysis of Differences
  - Validation by User Study
- Applications
  - Combining Content and Link Similarity
  - Evaluating Ranking Function
- Discussion

3 Abstract
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. The assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web.
Focus on human-generated metadata, namely topical directories:
- Measure semantic relationships among massive numbers of pairs of Web pages or topics.
- The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived.

4 Introduction
Open Directory Project (ODP)
- http://dmoz.org
- A large human-edited directory of the Web.
- ODP classifies millions of URLs in a topical ontology. Ontologies help to make sense of a set of objects.
- ODP provides a rich source from which measurements of semantic similarity between Web pages can be obtained.
- ODP has various types of cross-reference links between categories, so a node may have multiple parent nodes, and even cycles are present.

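5 Semantic Similarity: Tree-Based Similarity
$\sigma_s^T(t_1, t_2) = \frac{2 \log \Pr[t_0(t_1, t_2)]}{\log \Pr[t_1] + \log \Pr[t_2]}$
where $t_0(t_1, t_2)$ is the lowest common ancestor topic of $t_1$ and $t_2$ in the tree, and $\Pr[t]$ is the prior probability of topic $t$, computed as the fraction of pages stored in the subtree rooted at node $t$, subtree($t$).
Given two documents $d_1$ and $d_2$ in a topic taxonomy, the semantic similarity between them is estimated as $\sigma_s^T(\mathrm{topic}(d_1), \mathrm{topic}(d_2))$, where topic($d$) is the topic node containing $d$.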
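As a rough illustration of the tree-based measure, the sketch below computes 2 log Pr[t0] / (log Pr[t1] + log Pr[t2]) on a tiny taxonomy. The topic names, page counts, and helper names are hypothetical, and the special case for a zero denominator is an assumption added for the sketch.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "Top": None,
    "Science": "Top",
    "Arts": "Top",
    "Physics": "Science",
    "Biology": "Science",
}
# Hypothetical number of pages stored directly in each topic.
PAGES = {"Top": 0, "Science": 10, "Arts": 50, "Physics": 20, "Biology": 20}


def ancestors(topic):
    """Return the path from topic up to the root, topic first."""
    path = []
    while topic is not None:
        path.append(topic)
        topic = PARENT[topic]
    return path


def prior(topic):
    """Pr[t]: fraction of all pages stored in the subtree rooted at topic."""
    total = sum(PAGES.values())
    in_subtree = sum(n for t, n in PAGES.items() if topic in ancestors(t))
    return in_subtree / total


def lowest_common_ancestor(t1, t2):
    """Return the deepest topic that is an ancestor of both t1 and t2."""
    seen = set(ancestors(t1))
    for t in ancestors(t2):
        if t in seen:
            return t
    raise ValueError("topics are not in the same tree")


def tree_similarity(t1, t2):
    """Tree-based similarity: 2 log Pr[t0] / (log Pr[t1] + log Pr[t2])."""
    denominator = math.log(prior(t1)) + math.log(prior(t2))
    if denominator == 0.0:  # assumption: both subtrees hold every page
        return 1.0
    t0 = lowest_common_ancestor(t1, t2)
    return 2 * math.log(prior(t0)) / denominator
```

In this toy tree, two siblings under "Science" get a positive score, while topics whose only common ancestor is the root get zero, because the root has prior probability 1 and its logarithm vanishes.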

6 Semantic Similarity: Graph-Based Similarity - 1/8
Extending $\sigma_s^T$ to an ontology graph raises two questions:
i. How to find the most specific common ancestor of a pair of topics in a graph.
ii. How to extend the definition of the subtree rooted at a topic to the graph case.

7 Semantic Similarity: Graph-Based Similarity - 2/8
The ODP ontology is a directed graph $G = (V, E)$ where:
- $V$ is a set of nodes, representing topics containing documents;
- $E$ is a set of edges between nodes in $V$, partitioned into three subsets:
  - $T$: "is-a" links
  - $S$: "symbolic" cross-links
  - $R$: "related" cross-links

8 Semantic Similarity: Graph-Based Similarity - 3/8
Different types of edges have different meanings and should be used accordingly. One way to distinguish the roles of different edges is to assign them weights, and to vary these weights according to the edge's type.
The weight setting we have adopted for the edges in the ODP graph is as follows:
- $w_{ij} = \alpha$ for $(i, j) \in T$, $w_{ij} = \beta$ for $(i, j) \in S$, and $w_{ij} = \gamma$ for $(i, j) \in R$.
- We set $\alpha = \beta = 1$ because symbolic links seem to be treated as first-class taxonomy ("is-a") links in the ODP Web interface.
- $\gamma = 0.5$

9 Semantic Similarity: Graph-Based Similarity - 4/8
Defining the weighted ontology graph $W$:
- Let $w_{ij} > 0$ if and only if there is an edge of some type between topics $t_i$ and $t_j$.
- Let $t_i\downarrow$ be the family of topics $t_j$ such that either $i = j$ or there is a directed path $(e_1, \ldots, e_n)$ in the graph $G$ from $t_i$ to $t_j$ in which at most one edge from $S$ or $R$ participates.
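The family of topics reachable from a topic with at most one cross-link can be found with a breadth-first search that tracks how many S or R edges a path has used. The following is a minimal sketch; the edge list and the function name `cone` are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical toy graph: (source, target, edge_type), with edge_type in
# "T" ("is-a"), "S" ("symbolic") or "R" ("related").
EDGES = [
    ("Top", "Science", "T"),
    ("Top", "Arts", "T"),
    ("Science", "Physics", "T"),
    ("Arts", "Music", "T"),
    ("Physics", "Music", "R"),   # cross-link
    ("Music", "Science", "S"),   # cross-link that closes a cycle
]


def cone(start, edges=EDGES):
    """Topics reachable from start using at most one S or R edge."""
    out = {}
    for src, dst, kind in edges:
        out.setdefault(src, []).append((dst, kind))
    # A search state is (topic, number of cross-links used so far).
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        topic, used = queue.popleft()
        for dst, kind in out.get(topic, []):
            new_used = used + (kind != "T")
            state = (dst, new_used)
            if new_used <= 1 and state not in seen:
                seen.add(state)
                queue.append(state)
    return {topic for topic, _ in seen}
```

Tracking the cross-link count in the search state keeps the search finite even when the graph contains cycles, since each (topic, count) pair is visited at most once.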

10 Semantic Similarity: Graph-Based Similarity - 5/8
In order to make the implicit membership relations explicit, we represent the graph structure by means of adjacency matrices. Matrix $T$ is used to represent the hierarchical structure of the ontology. The full graph is $G = T \vee S \vee R$.
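A small sketch of how such adjacency matrices might be assembled, using the edge weights 1, 1, and 0.5 for "is-a", "symbolic", and "related" links. Two points are assumptions not stated on the slide: the combination of the three matrices is read as an element-wise maximum (fuzzy union), and T carries 1s on its diagonal so that every topic fully contains itself. The topics and edges are hypothetical.

```python
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.5  # weights for T, S and R edges

TOPICS = ["Top", "Science", "Arts", "Physics", "Music"]  # hypothetical
EDGES = [
    ("Top", "Science", "T"),
    ("Top", "Arts", "T"),
    ("Science", "Physics", "T"),
    ("Arts", "Music", "T"),
    ("Physics", "Music", "R"),
    ("Music", "Science", "S"),
]


def adjacency(kind, weight, reflexive=False):
    """Build the weighted adjacency matrix for one edge type."""
    index = {t: i for i, t in enumerate(TOPICS)}
    n = len(TOPICS)
    matrix = [[0.0] * n for _ in range(n)]
    if reflexive:  # assumption: every topic belongs fully to itself
        for i in range(n):
            matrix[i][i] = 1.0
    for src, dst, edge_kind in EDGES:
        if edge_kind == kind:
            matrix[index[src]][index[dst]] = weight
    return matrix


def fuzzy_union(*matrices):
    """Element-wise maximum of equally sized matrices."""
    return [[max(cells) for cells in zip(*rows)] for rows in zip(*matrices)]


T = adjacency("T", ALPHA, reflexive=True)
S = adjacency("S", BETA)
R = adjacency("R", GAMMA)
G = fuzzy_union(T, S, R)
```

Dense nested lists are used only to keep the sketch short; a graph with hundreds of thousands of topics would need a sparse representation.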

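11 Semantic Similarity: Graph-Based Similarity - 6/8
The MaxProduct fuzzy composition function $\odot$ is defined on matrices as follows:
$[A \odot B]_{ij} = \max_k (A_{ik} \cdot B_{kj})$
Let $T^{(0)} = T$ and $T^{(r+1)} = T^{(0)} \odot T^{(r)}$. We define the closure of $T$, denoted $T^+$, as follows:
$T^+ = \lim_{r \to \infty} T^{(r)}$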
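The composition and closure can be sketched in a few lines. The sketch assumes that MaxProduct composition means entry (i, j) is the maximum over k of A[i][k] * B[k][j], that all weights lie in [0, 1], and that T has 1s on its diagonal; under those assumptions the iteration reaches a fixed point after finitely many steps. The example matrix is hypothetical.

```python
def max_product(a, b):
    """MaxProduct fuzzy composition: entry (i, j) is max_k a[i][k] * b[k][j]."""
    n = len(a)
    return [
        [max(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def closure(t):
    """Iterate T(r+1) = T(0) (.) T(r) until the matrix stops changing."""
    current = t
    while True:
        nxt = max_product(t, current)
        if nxt == current:
            return current
        current = nxt


# Hypothetical three-topic chain 0 -> 1 -> 2 with "is-a" weight 1 and
# 1s on the diagonal (each topic contains itself).
T = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]
T_PLUS = closure(T)
```

For the chain above, the closure adds the indirect membership of topic 2 under topic 0, which is how ancestor relations become explicit in the matrix.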

12 Semantic Similarity: Graph-Based Similarity - 7/8
$W = T^+ \odot G \odot T^+$

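13 Semantic Similarity: Graph-Based Similarity - 8/8
The semantic similarity between two topics $t_1$ and $t_2$ in an ontology graph can now be estimated as follows:
$\sigma_s^G(t_1, t_2) = \max_k \frac{2 \cdot \min(W_{k1}, W_{k2}) \cdot \log \Pr[t_k]}{\log(\Pr[t_1|t_k] \cdot \Pr[t_k]) + \log(\Pr[t_2|t_k] \cdot \Pr[t_k])}$
- The probability $\Pr[t_k]$ represents the prior probability that any document is classified under topic $t_k$ and is computed as:
  $\Pr[t_k] = \frac{\sum_{t_j \in V} W_{kj} \cdot |t_j|}{|U|}$
- The posterior probability $\Pr[t_i|t_k]$ represents the probability that any document will be classified under topic $t_i$ given that it is classified under $t_k$, and is computed as follows:
  $\Pr[t_i|t_k] = \frac{\sum_{t_j \in V} \min(W_{ij}, W_{kj}) \cdot |t_j|}{\sum_{t_j \in V} W_{kj} \cdot |t_j|}$
Here $|t_j|$ is the number of documents stored in topic $t_j$ and $|U|$ is the total number of documents in the ontology.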
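The sketch below shows one way to compute the graph-based measure from a membership matrix W, where W[k][j] is the degree to which topic j falls under topic k. It assumes the published form of the formulas: the prior Pr[t_k] is the W-weighted document count under t_k divided by the total document count, the posterior Pr[t_i|t_k] replaces each weight with min(W[i][j], W[k][j]) and divides by the weighted count under t_k, and the score for a candidate ancestor k is 2 min(W[k][t1], W[k][t2]) log Pr[t_k] divided by the sum of log(Pr[t1|t_k] Pr[t_k]) and log(Pr[t2|t_k] Pr[t_k]). The matrices, document counts, and function names are hypothetical, and every topic queried is assumed to hold at least one document.

```python
import math


def prior(W, sizes, k):
    """Pr[t_k]: membership-weighted fraction of documents under topic k."""
    return sum(w * n for w, n in zip(W[k], sizes)) / sum(sizes)


def posterior(W, sizes, i, k):
    """Pr[t_i | t_k]: fraction of topic k's documents also under topic i."""
    shared = sum(min(wi, wk) * n for wi, wk, n in zip(W[i], W[k], sizes))
    under_k = sum(wk * n for wk, n in zip(W[k], sizes))
    return shared / under_k


def graph_similarity(W, sizes, t1, t2):
    """Best score over every topic k that covers both t1 and t2."""
    best = 0.0
    for k in range(len(sizes)):
        membership = min(W[k][t1], W[k][t2])
        if membership == 0.0:
            continue  # topic k does not cover both topics
        p_k = prior(W, sizes, k)
        denominator = (math.log(posterior(W, sizes, t1, k) * p_k)
                       + math.log(posterior(W, sizes, t2, k) * p_k))
        if denominator == 0.0:
            continue  # assumption: skip the degenerate 0/0 case
        best = max(best, 2 * membership * math.log(p_k) / denominator)
    return best


# Hypothetical data. Topics: 0 Top, 1 Science, 2 Arts, 3 Physics, 4 Biology.
SIZES = [0, 10, 50, 20, 20]  # documents stored directly in each topic
W_TREE = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
# Same tree plus a "related" cross-link Arts -> Physics of weight 0.5.
W_CROSS = [row[:] for row in W_TREE]
W_CROSS[2][3] = 0.5
```

With the pure tree, Arts and Physics share only the root and score zero; adding the cross-link gives them a positive score, which is the kind of relationship the tree-based measure cannot see.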

14 Evaluation – Analysis of Differences
The portion of the ODP graph we used for our analysis consists of more than half a million topic nodes (only the World and Regional categories were discarded). Computing semantic similarity for each pair of nodes in such a huge graph required more than 5,000 CPU hours on IU's Analysis and Visualization of Instrument-Driven Data (AVIDD) supercomputer facility. The computed graph-based semantic similarity measurements, in compressed format, occupy more than 1 TB of IU's Massive Data Storage System.

15 Evaluation – Analysis of Differences
Each coordinate encodes how many pairs of pages in the ODP have semantic similarities falling in the corresponding bin. Significant numbers of pairs yield $\sigma_s^G > \sigma_s^T$, indicating that the graph-based measure indeed captures semantic relationships that are missed by the tree-based measure.

16 Evaluation – Analysis of Differences

17 Evaluation – Validation by User Study
Human judgments:
- 38 volunteer subjects
- a 30-minute experiment
- 30 questions about similarity between Web pages
A total of 6 target Web pages were randomly selected from the ODP directory. For each target Web page we presented a series of 5 pairs of candidate Web pages.

19 Evaluation – Validation by User Study

20 Applications
Similarity measures for a pair of pages:
- $\sigma_c$: based on textual content, with TF-IDF.
- $\sigma_l$: based on hyperlinks, with LF-IDF. Hyperlinks (URLs) are used in place of words (terms). A page's link vector is composed of its outlinks, inlinks, and the page's own URL.
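To make the parallel between content and link similarity concrete, the sketch below applies one frequency-times-IDF weighting and cosine similarity to link "documents" built from outlinks, inlinks, and the page's own URL. The weighting log(N / df) is one common variant and is an assumption; the slides do not give the exact formula. The URLs and helper names are hypothetical.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Inverse document frequency log(N / df) for every feature."""
    n = len(documents)
    df = Counter(f for doc in documents for f in set(doc))
    return {f: math.log(n / count) for f, count in df.items()}


def weighted_vector(document, idf):
    """Frequency-times-IDF weights for one bag of features."""
    return {f: tf * idf[f] for f, tf in Counter(document).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_features(url, outlinks, inlinks):
    """Link 'document' of a page: outlinks, inlinks and its own URL."""
    return list(outlinks) + list(inlinks) + [url]


# Hypothetical pages: url -> (outlinks, inlinks).
PAGES = {
    "http://a.example/": (["http://c.example/"], []),
    "http://b.example/": (["http://c.example/"], []),
    "http://c.example/": ([], ["http://a.example/", "http://b.example/"]),
    "http://d.example/": ([], []),
}
LINK_DOCS = {u: link_features(u, out, inl) for u, (out, inl) in PAGES.items()}
IDF = idf_weights(list(LINK_DOCS.values()))
VECTORS = {u: weighted_vector(doc, IDF) for u, doc in LINK_DOCS.items()}
```

Passing lists of words instead of lists of URLs to the same functions yields a content similarity, so the two measures differ only in what counts as a feature.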

21 Applications
Text and links were extracted from the $1.12 \times 10^6$ Web pages of the ODP ontology; $\sigma_c \in [0, 1]$ and $\sigma_l \in [0, 1]$ were computed for each of $1.26 \times 10^{12}$ pairs of pages.
Combining Content and Link Similarity: we considered a number of simple functions $f(\sigma_c, \sigma_l)$.
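The specific combination functions are not preserved in this transcript. Purely as an illustration of what simple functions of content and link similarity could look like, the sketch below defines three hypothetical candidates; the function names, the mixing weight `alpha`, and the `threshold` value are assumptions and are not taken from the slides.

```python
def linear(sigma_c, sigma_l, alpha=0.5):
    """Weighted average of content and link similarity."""
    return alpha * sigma_c + (1 - alpha) * sigma_l


def product(sigma_c, sigma_l):
    """High only when both content and link similarity are high."""
    return sigma_c * sigma_l


def thresholded(sigma_c, sigma_l, threshold=0.1):
    """Link similarity, kept only if content similarity passes a cut-off."""
    return sigma_l if sigma_c >= threshold else 0.0
```

Each candidate maps inputs in [0, 1] to an output in [0, 1], so its ranking of page pairs can be compared directly against a semantic similarity measure on the same scale.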

23 Applications: Evaluating Ranking Function

24 Discussion
The graph-based semantic similarity measure predicts human judgments of relatedness with significantly greater accuracy than the tree-based measure.
Ranking algorithms based on semantic similarity can be applied to arbitrary combinations of:
- text analysis (e.g. LSA, query expansion, tag weighting, etc.)
- link analysis (e.g. authority, PageRank, SiteRank, etc.)
- any other features available to a search engine (e.g. freshness, click-through rate, etc.)

25 Discussion
We are currently exploring alternative ways to approximate semantic similarity by integrating content and link similarity. The evaluations outlined here have focused on purely local text and link analysis:
- We have not looked at the role of more global link analysis such as PageRank.
- We have not used text analysis techniques such as LSA.

