Algorithmic Detection of Semantic Similarity WWW 2005.


Algorithmic Detection of Semantic Similarity WWW 2005

2 Outline
Abstract
Introduction
Semantic Similarity
  • Tree-Based Similarity
  • Graph-Based Similarity
Evaluation
  • Analysis of Differences
  • Validation by User Study
Applications
  • Combining Content and Link Similarity
  • Evaluating Ranking Function
Discussion

3 Abstract
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. The assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web.
Focus on human-generated metadata:
• Namely, topical directories.
• Measure semantic relationships among massive numbers of pairs of Web pages or topics.
• The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived.

4 Introduction
Open Directory Project (ODP)
• A large human-edited directory of the Web.
• ODP classifies millions of URLs in a topical ontology. Ontologies help to make sense of a set of objects.
• ODP provides a rich source from which measurements of semantic similarity between Web pages can be obtained.
• ODP has various types of cross-reference links between categories, so that a node may have multiple parent nodes, and even cycles are present.

5 Semantic Similarity
Tree-based similarity: given two documents d_1 and d_2 in a topic taxonomy, classified under topics t_1 and t_2, the semantic similarity between them is estimated from two quantities:
• t_0(t_1, t_2), the lowest common ancestor topic of t_1 and t_2 in the tree;
• Pr[t], the prior probability of topic t, computed by counting the fraction of pages stored in the subtree rooted at node t (subtree(t)).
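The equation on this slide did not survive in the transcript. A minimal sketch, assuming the measure takes the usual information-theoretic (Lin-style) form 2·log Pr[t_0] / (log Pr[t_1] + log Pr[t_2]), is shown below. The toy taxonomy, the page counts, and all function names are hypothetical and only illustrate how t_0 and Pr[t] enter the computation.

```python
import math

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {"Top": None, "Science": "Top", "Arts": "Top",
          "Physics": "Science", "Biology": "Science"}
# Hypothetical number of pages stored directly under each topic.
PAGES = {"Top": 0, "Science": 2, "Arts": 4, "Physics": 1, "Biology": 1}


def ancestors(topic):
    """Return the path from a topic up to the root, the topic itself first."""
    path = []
    while topic is not None:
        path.append(topic)
        topic = PARENT[topic]
    return path


def prior(topic):
    """Pr[t]: fraction of all pages stored in subtree(t)."""
    total = sum(PAGES.values())
    in_subtree = sum(n for t, n in PAGES.items() if topic in ancestors(t))
    return in_subtree / total


def lowest_common_ancestor(t1, t2):
    """t_0(t_1, t_2): the deepest topic that lies above both t1 and t2."""
    above_t2 = set(ancestors(t2))
    return next(t for t in ancestors(t1) if t in above_t2)


def tree_similarity(t1, t2):
    """Assumed Lin-style form: 2 log Pr[t0] / (log Pr[t1] + log Pr[t2])."""
    t0 = lowest_common_ancestor(t1, t2)
    denominator = math.log(prior(t1)) + math.log(prior(t2))
    if denominator == 0.0:  # both topics are the root, whose prior is 1
        return 1.0
    return 2.0 * math.log(prior(t0)) / denominator


sim_siblings = tree_similarity("Physics", "Biology")
sim_distant = tree_similarity("Physics", "Arts")
sim_self = tree_similarity("Physics", "Physics")
```

Under this form, two topics whose only common ancestor is the root get similarity 0, and a topic compared with itself gets 1. The sketch assumes every topic being compared has at least one page in its subtree, since log Pr[t] is undefined otherwise.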

6 Semantic Similarity Graph-Based Similarity - 1/8
Extending the tree-based measure to an ontology graph raises two questions:
• i. how to find the most specific common ancestor of a pair of topics in a graph;
• ii. how to extend the definition of the subtree rooted at a topic to the graph case.

7 Semantic Similarity Graph-Based Similarity - 2/8
The ODP ontology is a directed graph G = (V, E) where:
• V is a set of nodes, representing topics containing documents;
• E is a set of edges between nodes in V, partitioned into three subsets:
  • T: “is-a” links
  • S: “symbolic” cross-links
  • R: “related” cross-links

8 Semantic Similarity Graph-Based Similarity - 3/8
Different types of edges have different meanings and should be used accordingly. One way to distinguish the role of different edges is to assign them weights, and to vary these weights according to the edge’s type.
The weight setting we have adopted for the edges in the ODP graph is as follows:
• w_ij = α for (i, j) ∈ T, w_ij = β for (i, j) ∈ S, and w_ij = γ for (i, j) ∈ R.
• We set α = β = 1 because symbolic links seem to be treated as first-class taxonomy (“is-a”) links in the ODP Web interface.
• γ = 0.5

9 Semantic Similarity Graph-Based Similarity - 4/8
Weighted ontology graph:
• Let w_ij > 0 if and only if there is an edge of some type between topics t_i and t_j.
• Let t_i↓ be the family of topics t_j such that either i = j or there is a path (e_1, ..., e_n) from t_i to t_j. That is, t_j ∈ t_i↓ if there is a directed path in the graph G from t_i to t_j in which at most one edge from S or R participates.

10 Semantic Similarity Graph-Based Similarity - 5/8
In order to make the implicit membership relations explicit, we represent the graph structure by means of adjacency matrices. Matrix T is used to represent the hierarchical structure of an ontology. The whole graph is represented by G = T ∨ S ∨ R.
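The matrix representation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: ∨ is read as an element-wise maximum (fuzzy union), T is given 1s on its diagonal so that every topic belongs to its own subtree, and the four-topic graph and helper names are hypothetical. The weights are the α = β = 1 and γ = 0.5 values given on the slides.

```python
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.5  # edge weights for T, S and R links


def adjacency(n, edges, weight, diagonal=0.0):
    """Build an n x n matrix with `weight` on each (i, j) edge."""
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = diagonal
    for i, j in edges:
        matrix[i][j] = weight
    return matrix


def fuzzy_union(*matrices):
    """Element-wise maximum, read here as the fuzzy 'or' (assumption)."""
    return [[max(values) for values in zip(*rows)] for rows in zip(*matrices)]


# Hypothetical topics: 0 = Top, 1 = Science, 2 = Arts, 3 = Physics.
T = adjacency(4, [(0, 1), (0, 2), (1, 3)], ALPHA, diagonal=1.0)  # "is-a"
S = adjacency(4, [(2, 3)], BETA)                                 # "symbolic"
R = adjacency(4, [(1, 2)], GAMMA)                                # "related"
G = fuzzy_union(T, S, R)
```

Keeping T, S and R as separate matrices preserves the edge type, which matters later because the hierarchical closure is taken over T alone.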

11 Semantic Similarity Graph-Based Similarity - 6/8
The MaxProduct fuzzy composition function ⊙ is defined on matrices. Let T^(0) = T and T^(r+1) = T^(0) ⊙ T^(r). The closure of T, denoted T^+, is defined in terms of this sequence.
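The two equations on this slide are missing from the transcript. A minimal sketch, assuming the standard max-product composition [A ⊙ B]_ij = max_k (A_ik · B_kj) and assuming the closure is the fixed point reached by iterating T^(r+1) = T^(0) ⊙ T^(r), could look like this. The function names and the three-topic chain are hypothetical.

```python
def max_product(a, b):
    """MaxProduct composition: entry (i, j) is max over k of a[i][k] * b[k][j]."""
    n = len(a)
    return [[max(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def closure(t, max_iter=1000):
    """Iterate T(r+1) = T(0) (.) T(r) until the matrix stops changing."""
    current = t
    for _ in range(max_iter):
        following = max_product(t, current)
        if following == current:
            return current
        current = following
    raise RuntimeError("closure did not converge within max_iter steps")


# Hypothetical chain of topics 0 -> 1 -> 2, with 1s on the diagonal.
T = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
T_plus = closure(T)
```

In the result, entry (0, 2) becomes 1 because topic 2 is reachable from topic 0 through "is-a" links, so under this reading T^+ marks every descendant of a topic rather than only its direct children.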

12 Semantic Similarity Graph-Based Similarity - 7/8
W = T^+ ⊙ G ⊙ T^+

13 Semantic Similarity Graph-Based Similarity - 8/8
The semantic similarity between two topics t_1 and t_2 in an ontology graph can now be estimated from two probabilities:
• The prior probability Pr[t_k] that any document is classified under topic t_k.
• The posterior probability Pr[t_i|t_k] that any document will be classified under topic t_i given that it is classified under t_k.
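The three equations on this slide were images and are absent from the transcript. The block below is a reconstruction under assumptions and should be checked against the WWW 2005 paper before being relied on. It assumes W_kj is the fuzzy membership of topic t_j in the subtree rooted at t_k, |t_j| is the number of documents stored directly under t_j, and |U| is the total number of documents in the ontology.

```latex
% Assumed form of the graph-based similarity: maximise over candidate ancestors t_k.
\sigma_s^G(t_1, t_2) = \max_k
  \frac{2 \cdot \min(W_{k1}, W_{k2}) \cdot \log \Pr[t_k]}
       {\log\bigl(\Pr[t_1 \mid t_k] \cdot \Pr[t_k]\bigr)
        + \log\bigl(\Pr[t_2 \mid t_k] \cdot \Pr[t_k]\bigr)}

% Assumed prior: membership-weighted share of all documents.
\Pr[t_k] = \frac{\sum_{t_j \in V} W_{kj} \, |t_j|}{|U|}

% Assumed posterior: overlap of the fuzzy subtrees of t_i and t_k.
\Pr[t_i \mid t_k] =
  \frac{\sum_{t_j \in V} \min(W_{ij}, W_{kj}) \, |t_j|}
       {\sum_{t_j \in V} W_{kj} \, |t_j|}
```

When the graph is a tree and W holds only 0s and 1s, the best t_k is the lowest common ancestor and the expression reduces to a ratio of log priors, which is the sense in which the graph measure generalizes the tree-based one.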

14 Evaluation – Analysis of Differences
The portion of the ODP graph we used for our analysis consists of more than half a million topic nodes (only the World and Regional categories were discarded). Computing semantic similarity for each pair of nodes in such a huge graph required more than 5,000 CPU hours on IU’s Analysis and AVIDD supercomputer facility. The computed graph-based semantic similarity measurements, in compressed format, occupy more than 1 TB of IU’s Massive Data Storage System.

15 Evaluation – Analysis of Differences
Each coordinate encodes how many pairs of pages in the ODP have semantic similarities falling in the corresponding bin. Significant numbers of pairs yield a graph-based similarity higher than their tree-based similarity, indicating that the graph-based measure indeed captures semantic relationships that are missed by the tree-based measure.

16 Evaluation – Analysis of Differences

17 Evaluation – Validation by User Study
Human judgments:
• 38 volunteer subjects
• a 30-minute experiment
• 30 questions about similarity between Web pages
A total of 6 target Web pages were randomly selected from the ODP directory. For each target Web page we presented a series of 5 pairs of candidate Web pages.

19 Evaluation – Validation by User Study

20 Applications
Similarity measures for a pair of pages:
• σ_c, based on textual content with TF-IDF.
• σ_l, based on hyperlinks with LF-IDF. Hyperlinks (URLs) are used in place of words (terms). A page’s link vector is composed of its outlinks, inlinks, and the page’s own URL.
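Both measures can be sketched with the same code, since LF-IDF only swaps URLs in for words. The slides do not state the exact weighting or comparison function, so this sketch assumes raw term frequency times log(N/df) and cosine similarity between the resulting vectors. The function names and the example URLs are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each token by its count times log(N / document frequency)."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({token: count * math.log(n / df[token])
                        for token, count in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical link vectors: outlinks, inlinks, and the page's own URL.
link_docs = [
    ["a.example", "b.example", "page1.example"],
    ["a.example", "b.example", "page2.example"],
    ["c.example", "page3.example"],
]
vectors = tfidf_vectors(link_docs)
sim_shared = cosine(vectors[0], vectors[1])
sim_disjoint = cosine(vectors[0], vectors[2])
sim_self = cosine(vectors[0], vectors[0])
```

Passing lists of words instead of lists of URLs gives the content measure σ_c under the same assumptions. Both values fall in [0, 1] because all weights are non-negative.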

21 Applications
Text and links were extracted from the 1.12 × 10^6 Web pages of the ODP ontology, and σ_c ∈ [0, 1] and σ_l ∈ [0, 1] were computed for each of 1.26 × 10^12 pairs of pages.
Combining Content and Link Similarity: we considered a number of simple functions f(σ_c, σ_l) of the two measures.

23 Applications Evaluating Ranking Function

24 Discussion
The graph-based semantic similarity measure predicts human judgments of relatedness with significantly greater accuracy than the tree-based measure.
Ranking algorithms based on semantic similarity can be applied to arbitrary combinations of:
• text analysis (e.g. LSA, query expansion, tag weighting, etc.)
• link analysis (e.g. authority, PageRank, SiteRank, etc.)
• any other features available to a search engine (e.g. freshness, click-through rate, etc.)

25 Discussion
We are currently exploring alternative ways to approximate semantic similarity by integrating content and link similarity. The evaluations outlined here have focused on purely local text and link analysis:
• We have not looked at the role of more global link analysis such as PageRank.
• We have not used text analysis techniques such as LSA.