Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 21: Link Analysis

Outline
❶ Anchor Text
❷ PageRank
❸ HITS (Hyperlink-Induced Topic Search): Hubs & Authorities

The web as a directed graph
 Assumption 1: A hyperlink is a quality signal. The hyperlink d1 → d2 indicates that d1's author deems d2 high-quality and relevant.
 Assumption 2: The anchor text describes the content of d2. We use "anchor text" somewhat loosely here for the text surrounding the hyperlink.
 Example: "You can find cheap cars <a href=...>here</a>."
 Anchor text: "You can find cheap cars here"

[text of d2] only vs. [text of d2] + [anchor text → d2]
 Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only.
 Example: query "IBM"
 Matches IBM's copyright page
 Matches many spam pages
 Matches the IBM Wikipedia article
 May not match the IBM home page, if the IBM home page is mostly graphics!
 Searching on [anchor text → d2] is better for the query "IBM": in this representation, the page with the most occurrences of "IBM" is the IBM home page.

Anchor text containing "IBM" pointing to the IBM home page
[Figure: many pages whose anchor text "IBM" points to the IBM home page.]

Indexing anchor text
 Thus: anchor text is often a better description of a page's content than the page itself.
 Anchor text can be weighted more highly than document text.

Outline
❶ Anchor Text
❷ PageRank
❸ HITS: Hubs & Authorities

PageRank
 PageRank assigns to every node (page) in the web graph a numerical score between 0 and 1, known as its PageRank.
 The idea behind PageRank is that pages visited more often are more important.

Model behind PageRank: random walk
 Imagine a web surfer doing a random walk on the web:
 Start at a random page.
 At each step, go out of the current page along one of the links on that page, equiprobably.

Dead ends
 The web is full of dead ends (pages with no out-links).
 The random walk can get stuck in dead ends.
 Remedy: the teleport operation.

Teleporting
 At a dead end, jump to a random web page with probability 1/N each (N is the total number of nodes in the web graph).
 At a non-dead end, with probability 10%, jump to a random web page (each with probability 0.1/N).
 We use the teleport operation in two ways:
 1. When at a node with no out-links.
 2. At any node that has out-links, the surfer invokes the teleport operation with probability α and the standard random walk with probability 1 − α.
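
The teleporting random surfer is easy to simulate directly. Below is a minimal sketch (not from the slides; the three-page graph and the helper name random_surfer are made up for illustration), using the 10% teleport probability mentioned above:

```python
import random

def random_surfer(graph, n_steps=100_000, alpha=0.1, seed=0):
    """Estimate long-term visit rates by simulating the teleporting
    random walk. graph maps each page to its list of out-links."""
    rng = random.Random(seed)
    pages = list(graph)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)                   # start at a random page
    for _ in range(n_steps):
        if not graph[current] or rng.random() < alpha:
            current = rng.choice(pages)           # teleport (always, at a dead end)
        else:
            current = rng.choice(graph[current])  # follow a random out-link
        visits[current] += 1
    return {p: v / n_steps for p, v in visits.items()}

# Hypothetical three-page graph; d1 is a dead end.
print(random_surfer({"d0": ["d1", "d2"], "d1": [], "d2": ["d0"]}))
```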

Markov chains
 A Markov chain is characterized by an N × N transition probability matrix P such that (a) each entry is in the interval [0, 1] and (b) the entries in each row of P add up to 1.
 The entry P_ij tells us the probability that the state at the next time step is j, conditioned on the current state being i.
 Each entry P_ij is known as a transition probability and depends only on the current state i.
 A matrix with non-negative entries that satisfies (b) is known as a stochastic matrix; its largest eigenvalue is 1.

How to get the probability matrix P
 The adjacency matrix A of the web graph is defined as follows: if there is a hyperlink from page i to page j, then A_ij = 1, otherwise A_ij = 0.
 We can derive the transition probability matrix P from the matrix A:
 1. If a row of A has no 1's, then replace each element of that row by 1/N.
 2. Divide each 1 in A by the number of 1's in its row.
 3. Multiply the resulting matrix by 1 − α.
 4. Add α/N to every entry of the resulting matrix, to obtain P.
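
These four steps translate directly into a few lines of numpy. A sketch (not from the slides); for a concrete input it borrows the 3 × 3 adjacency matrix that appears later in the HITS section:

```python
import numpy as np

def transition_matrix(A, alpha):
    """Turn a 0/1 adjacency matrix A into the PageRank transition
    probability matrix P, following the four steps above."""
    P = np.asarray(A, dtype=float).copy()
    N = P.shape[0]
    P[P.sum(axis=1) == 0] = 1.0 / N          # 1. dead-end rows become uniform
    P = P / P.sum(axis=1, keepdims=True)     # 2. divide each 1 by its row count
    P = (1 - alpha) * P                      # 3. multiply by 1 - alpha
    P = P + alpha / N                        # 4. add alpha/N to every entry
    return P

A = [[0, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]
P = transition_matrix(A, alpha=0.21)
assert np.allclose(P.sum(axis=1), 1.0)       # P is stochastic: rows sum to 1
```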

Web graph example (α = 0.21)
[Tables: the 7 × 7 adjacency matrix A of the example web graph (pages d0–d6), followed by the three transformation steps: dividing each 1 in A by the number of 1's in its row, multiplying the resulting matrix by 1 − α, and adding α/N to every entry to obtain the transition probability matrix P. The numeric entries were not preserved in this transcript.]

Ergodic Markov chains
 A Markov chain is said to be ergodic if there exists a positive integer T0 such that, for all pairs of states i, j: if the chain is started at time 0 in state i, then for all t > T0 the probability of being in state j at time t is greater than 0.
 The teleport operation makes the web graph ergodic.
 For any ergodic Markov chain, there is a unique steady-state probability vector that is the principal left eigenvector of P.
 The steady-state probability for a state is the PageRank of the corresponding web page.

Change in probability vector
 If the probability vector is x = (x1, ..., xN) at this step, what is it at the next step?
 Recall that row i of the transition probability matrix P tells us where we go next from state i.
 So from x, our next state is distributed as xP.

Steady state in vector notation
 The steady state is simply a vector π = (π1, π2, ..., πN) of probabilities.
 (We use π to distinguish it from the notation for the probability vector x.)
 πi is the long-term visit rate (or PageRank) of page di.
 So we can think of PageRank as a very long vector, one entry per page.

How do we compute the steady state vector?
 In other words: how do we compute PageRank?
 Recall: π = (π1, π2, ..., πN) is the PageRank vector, the vector of steady-state probabilities...
 ... and if the distribution in this step is x, then the distribution in the next step is xP.
 But π is the steady state! So: π = πP.
 Solving this matrix equation gives us π.
 π is the principal left eigenvector for P, that is, the left eigenvector with the largest eigenvalue.
 All transition probability matrices have largest eigenvalue 1.

One way of computing the PageRank π
 Start with any distribution x, e.g., the uniform distribution.
 After one step, we're at xP.
 After two steps, we're at xP².
 After k steps, we're at xP^k.
 Algorithm: multiply x by increasing powers of P until convergence.
 This method is called power iteration.
 Recall: regardless of where we start, we eventually reach the steady state π.
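
A minimal power-iteration sketch (not from the slides), run on the two-state transition matrix of the example that follows:

```python
import numpy as np

def pagerank(P, tol=1e-10):
    """Multiply x by increasing powers of P until convergence."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])  # start from the uniform distribution
    while True:
        x_next = x @ P                         # one more step of the walk
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])
print(pagerank(P))                             # -> [0.25 0.75]
```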

Power iteration: example
 What is the PageRank / steady state in this example?
[Figure: a two-state web graph; its transition probabilities are given on the next slide.]

Computing PageRank: power example
The two states d1 and d2 have transition probabilities P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7. Starting from x = (0, 1):

 t     x1 = Pt(d1)   x2 = Pt(d2)
 t0    0             1            = x
 t1    0.3           0.7          = xP
 t2    0.24          0.76         = xP²
 t3    0.252         0.748        = xP³
 ...
 t∞    0.25          0.75         = xP^∞

PageRank vector = π = (π1, π2) = (0.25, 0.75)

π is the principal left eigenvector of P
For this example, P = ((0.1, 0.9), (0.3, 0.7)) has eigenvalues 1 and −0.2. Writing the state at step k as (Sk, Tk) = (S0, T0) P^k and decomposing the start vector along the left eigenvectors gives the closed form:
 Sk = (S0 + T0) · 1^k · 0.25 + (0.75·S0 − 0.25·T0) · (−0.2)^k
 Tk = (S0 + T0) · 1^k · 0.75 − (0.75·S0 − 0.25·T0) · (−0.2)^k
Since S0 + T0 = 1 and (−0.2)^k → 0, the chain converges to (0.25, 0.75) from any start vector.
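
The same fixed point falls out of an off-the-shelf eigensolver: the steady state is the principal left eigenvector of P, i.e. the principal right eigenvector of P transposed. A sketch (not from the slides):

```python
import numpy as np

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])
eigvals, eigvecs = np.linalg.eig(P.T)      # left eigenvectors of P
v = eigvecs[:, np.argmax(eigvals.real)].real
print(eigvals.real)                        # -> [ 1.  -0.2], as derived above
print(v / v.sum())                         # normalize: -> [0.25 0.75]
```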

Example web graph
[Figure: example web graph with pages d0, ..., d6, used in the following slides.]

Transition matrix with teleporting (α = 0.14)
[Table: 7 × 7 transition probability matrix P for the example web graph; the numeric entries were not preserved in this transcript.]

Power method vectors xP^k
[Table: the vectors x, xP, xP², ..., xP¹³ for the example web graph, one row per page d0–d6; the numeric entries were not preserved in this transcript.]

Example web graph: PageRank
[Table: PageRank values for d0–d6; only the largest value survived transcription: PageRank(d6) = 0.31.]

Outline
❶ Anchor Text
❷ PageRank
❸ HITS: Hubs & Authorities

HITS: Hyperlink-Induced Topic Search
 Premise: there are two different types of relevance on the web.
 Relevance type 1: hubs. A hub page is a good list of links to pages answering the information need.
 Relevance type 2: authorities. An authority page is a direct answer to the information need.
 Most approaches to search (including PageRank ranking) don't make the distinction between these two very different types of relevance.

Hubs and authorities: definition
 A good hub page for a topic links to many authority pages for that topic.
 A good authority page for a topic is linked to by many hub pages for that topic.
 Circular definition — we will turn this into an iterative computation.

How to compute hub and authority scores
 Do a regular web search first.
 Call the search result the root set.
 Find all pages that are linked to by, or link to, pages in the root set.
 Call this larger set the base set.
 Finally, compute hubs and authorities for the base set (which we'll view as a small web graph).
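
In code, the construction might look like the sketch below (not from the slides; search, out_links, and in_links are hypothetical helpers standing in for a text search engine and forward/backward link indexes):

```python
def build_base_set(query, search, out_links, in_links):
    """Root set = top search results; base set = root set plus every
    page one link away from it, in either direction."""
    root_set = set(search(query))            # regular web search first
    base_set = set(root_set)
    for page in root_set:
        base_set.update(out_links(page))     # pages the root set links to
        base_set.update(in_links(page))      # pages that link to the root set
    return base_set
```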

Root set and base set (1)
[Figure sequence: the root set; then the nodes that root set nodes link to; then the nodes that link to root set nodes; together these form the base set.]

Root set and base set (2)
 The root set typically has a few hundred nodes; the base set may have up to 5000 nodes.
 Computation of the base set, as shown on the previous slide:
 Follow out-links by parsing the pages in the root set.
 Find a page d's in-links by searching for all pages containing a link to d.

Hub and authority scores
 Compute for each page d in the base set a hub score h(d) and an authority score a(d).
 Initialization: for all d: h(d) = 1, a(d) = 1.
 Iteratively update all h(d), a(d).
 After convergence:
 Output pages with highest h scores as top hubs.
 Output pages with highest a scores as top authorities.
 So we output two ranked lists.

Iterative update
 For all d: h(d) = Σ_{d→y} a(y), the sum of the authority scores of the pages d links to.
 For all d: a(d) = Σ_{y→d} h(y), the sum of the hub scores of the pages linking to d.
 Iterate these two steps until convergence.
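
A direct sketch of the iteration (not from the slides). The per-step normalization is an addition for numerical stability; only the relative order of scores matters:

```python
import numpy as np

def hits(A, n_iter=50):
    """Iterate the two update rules on adjacency matrix A,
    where A[i, j] = 1 iff page i links to page j."""
    N = A.shape[0]
    h = np.ones(N)                     # initialization: all scores 1
    a = np.ones(N)
    for _ in range(n_iter):
        h = A @ a                      # h(d) = sum of a(y) over links d -> y
        a = A.T @ h                    # a(d) = sum of h(y) over links y -> d
        h = h / h.sum()                # rescale so the scores stay bounded
        a = a / a.sum()
    return h, a

A = np.array([[0, 1, 0],
              [1, 1, 1],
              [1, 0, 0]])              # the 3x3 example from the next slides
h, a = hits(A)
```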

Hubs & authorities: comments
 HITS can pull together good pages regardless of page content.
 Once the base set is assembled, we only do link analysis, no text matching.
 Pages in the base set often do not contain any of the query words.
 Danger: topic drift — the pages found by following links may not be related to the original query.

Proof of convergence
 We define an N × N adjacency matrix A. (We called this the link matrix earlier.)
 For 1 ≤ i, j ≤ N, the matrix entry A_ij tells us whether there is a link from page i to page j (A_ij = 1) or not (A_ij = 0).
 Example:

      d1  d2  d3
 d1   0   1   0
 d2   1   1   1
 d3   1   0   0

Write update rules as matrix operations
 Define the hub vector h = (h1, ..., hN) as the vector of hub scores; hi is the hub score of page di.
 Similarly for a, the vector of authority scores.
 Now we can write h(d) = Σ_{d→y} a(y) as a matrix operation: h = Aa ...
 ... and we can write a(d) = Σ_{y→d} h(y) as a = A^T h.
 HITS algorithm in matrix notation:
 Compute h = Aa.
 Compute a = A^T h.
 Iterate until convergence.

HITS as an eigenvector problem
 HITS algorithm in matrix notation. Iterate:
 Compute h = Aa.
 Compute a = A^T h.
 By substitution we get: h = A·A^T·h and a = A^T·A·a.
 Thus, h is an eigenvector of A·A^T and a is an eigenvector of A^T·A.
 So the HITS algorithm is actually a special case of the power method, and hub and authority scores are components of eigenvectors.
 HITS and PageRank both formalize link analysis as eigenvector problems.
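
Since h and a are principal eigenvectors of A·A^T and A^T·A, an eigensolver gives them in one call each. A sketch (not from the slides), reusing the 3 × 3 example:

```python
import numpy as np

def principal_eigenvector(M):
    """Principal eigenvector of a symmetric matrix M, scaled to sum 1."""
    eigvals, eigvecs = np.linalg.eigh(M)        # eigh: M is symmetric
    v = np.abs(eigvecs[:, np.argmax(eigvals)])  # sign of an eigenvector is arbitrary
    return v / v.sum()

A = np.array([[0, 1, 0],
              [1, 1, 1],
              [1, 0, 0]], dtype=float)
h = principal_eigenvector(A @ A.T)              # hub scores
a = principal_eigenvector(A.T @ A)              # authority scores
```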

Example web graph: matrix A
[Table: 7 × 7 adjacency matrix A for the example web graph (pages d0–d6); the 0/1 entries were not preserved in this transcript.]

Hub vectors h0, and hi = A·ai for i ≥ 1
[Table: hub vectors h0 through h5 for pages d0–d6; the numeric entries were not preserved in this transcript.]

Authority vectors ai = A^T·h(i−1), i ≥ 1
[Table: authority vectors a1 through a7 for pages d0–d6; the numeric entries were not preserved in this transcript.]

Example web graph: final scores
[Table: converged authority score a and hub score h for each of d0–d6; the numeric entries were not preserved in this transcript.]

Top-ranked pages
 Pages with highest in-degree: d2, d3, d6
 Pages with highest out-degree: d2, d6
 Pages with highest PageRank: d6
 Pages with highest hub score: d6 (close: d2)
 Pages with highest authority score: d3

PageRank vs. HITS: discussion
 PageRank can be precomputed; HITS has to be computed at query time.
 HITS is too expensive in most application scenarios.
 PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
 Claim: On the web, a good hub almost always is also a good authority.