Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.

Slides:



Advertisements
Similar presentations
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Advertisements

The math behind PageRank A detailed analysis of the mathematical aspects of PageRank Computational Mathematics class presentation Ravi S Sinha LIT lab,
Information Networks Link Analysis Ranking Lecture 8.
Graphs, Node importance, Link Analysis Ranking, Random walks
Link Analysis: PageRank
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
Experiments with MATLAB Experiments with MATLAB Google PageRank Roger Jang ( 張智星 ) CSIE Dept, National Taiwan University, Taiwan
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
1 Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presented by Yongqiang Li Adapted from
Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
Link Analysis, PageRank and Search Engines on the Web
Presented By: Wang Hao March 8 th, 2011 The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd.
Link Structure and Web Mining Shuying Wang
Singular Value Decomposition and Data Management
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg,1997) Use edge-weighted, directed graphs to model social networks Status/Prestige In-degree is a good.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
Stochastic Approach for Link Structure Analysis (SALSA) Presented by Adam Simkins.
Presented By: - Chandrika B N
R OBERTO B ATTITI, M AURO B RUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Feb 2014.
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn.
1 Random Walks on Graphs: An Overview Purnamrita Sarkar, CMU Shortened and modified by Longin Jan Latecki.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
DATA MINING LECTURE 13 Pagerank, Absorbing Random Walks Coverage Problems.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Overview of Web Ranking Algorithms: HITS and PageRank
Lecture #10 PageRank CS492 Special Topics in Computer Science: Distributed Algorithms and Systems.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
Analysis of Link Structures on the World Wide Web and Classified Improvements Greg Nilsen University of Pittsburgh April 2003.
DATA MINING LECTURE 11 Classification Naïve Bayes Graphs And Centrality.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
PageRank Algorithm -- Bringing Order to the Web (Hu Bin)
1 CS 430: Information Discovery Lecture 5 Ranking.
Random Sampling Algorithms with Applications Kyomin Jung KAIST Aug ERC Workshop.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Roberto Battiti, Mauro Brunato
Quality of a search engine
15-499:Algorithms and Applications
Search Engines and Link Analysis on the Web
Link-Based Ranking Seminar Social Media Mining University UC3M
PageRank and Markov Chains
DTMC Applications Ranking Web Pages & Slotted ALOHA
Greg Nilsen University of Pittsburgh April 2003
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Graph and Link Mining.
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Link Analysis Ranking

How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it?

Naïve ranking of query results Given query q Rank the web pages p in the index based on sim(p,q) Scenarios where this is not such a good idea?

Why Link Analysis? First generation search engines – view documents as flat text files – could not cope with size, spamming, user needs Example: Honda website, keywords: automobile manufacturer Second generation search engines – Ranking becomes critical – use of Web specific data: Link Analysis – shift from relevance to authoritativeness – a success story for the network analysis

Link Analysis: Intuition A link from page p to page q denotes endorsement – page p considers page q an authority on a subject – mine the web graph of recommendations – assign an authority value to every page

Link Analysis Ranking Algorithms Start with a collection of web pages Extract the underlying hyperlink graph Run the LAR algorithm on the graph Output: an authority weight for each node w w w w w

Algorithm input Query dependent: rank a small subset of pages related to a specific query – HITS (Kleinberg 98) was proposed as query dependent Query independent: rank the whole Web – PageRank (Brin and Page 98) was proposed as query independent

Query-dependent LAR Given a query q, find a subset of web pages S that are related to S Rank the pages in S based on some ranking criterion

Query-dependent input Root Set

Query-dependent input Root Set IN OUT

Query dependent input Root Set IN OUT

Query dependent input Root Set IN OUT Base Set

Properties of a good seed set S S is relatively small. S is rich in relevant pages. S contains most (or many) of the strongest authorities.

How to construct a good seed set S For query q first collect the t highest-ranked pages for q from a text-based search engine to form set Γ S = Γ Add to S all the pages pointing to Γ Add to S all the pages that pages from Γ point to

Link Filtering Navigational links: serve the purpose of moving within a site (or to related sites) → → → Filter out navigational links – same domain name – same IP address

How do we rank the pages in seed set S? In degree? Intuition Problems

Hubs and Authorities [K98] Authority is not necessarily transferred directly between authorities Pages have double identity – hub identity – authority identity Good hubs point to good authorities Good authorities are pointed by good hubs hubs authorities

HITS Algorithm Initialize all weights to 1. Repeat until convergence – O operation : hubs collect the weight of the authorities – I operation: authorities collect the weight of the hubs – Normalize weights under some norm

HITS and eigenvectors The HITS algorithm is a power-method eigenvector computation – in vector terms a t = A T h t-1 and h t = Aa t-1 – so a t = A T Aa t-1 and h t = AA T h t-1 – The authority weight vector a is the eigenvector of A T A and the hub weight vector h is the eigenvector of AA T – Why do we need normalization? The vectors a and h are singular vectors of the matrix A

Singular Value Decomposition r : rank of matrix A σ 1 ≥ σ 2 ≥ … ≥σ r : singular values (square roots of eig-vals AA T, A T A) : left singular vectors (eig-vectors of AA T ) : right singular vectors (eig-vectors of A T A) [n×r][r×r][r×n]

21 Singular Value Decomposition Linear trend v in matrix A: – the tendency of the row vectors of A to align with vector v – strength of the linear trend: Av SVD discovers the linear trends in the data u i, v i : the i-th strongest linear trends σ i : the strength of the i-th strongest linear trend σ1σ1 σ2σ2 v1v1 v2v2  HITS discovers the strongest linear trend in the authority space

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect ∙2

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect ∙ 2

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect ∙ 2 2

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect 3 2n 3 n ∙ 2 n after n iterations weight of node p is proportional to the number of (BF) n paths that leave node p

HITS and the TKC effect The HITS algorithm favors the most dense community of hubs and authorities – Tightly Knit Community (TKC) effect after normalization with the max element as n → ∞

Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the pages in Q ordered according to order π What are the advantages of such an approach?

InDegree algorithm Rank pages according to in-degree – w i = |B(i)| 1.Red Page 2.Yellow Page 3.Blue Page 4.Purple Page 5.Green Page w=1 w=2 w=3 w=2

PageRank algorithm [BP98] Good authorities should be pointed by good authorities Random walk on the web graph – pick a page at random – with probability 1- α jump to a random page – with probability α follow a random outgoing link Rank according to the stationary distribution 1.Red Page 2.Purple Page 3.Yellow Page 4.Blue Page 5.Green Page

Markov chains A Markov chain describes a discrete time stochastic process over a set of states according to a transition probability matrix – P ij = probability of moving to state j when at state i ∑ j P ij = 1 (stochastic matrix) Memorylessness property: The next state of the chain depends only at the current state and not on the past of the process (first order MC) – higher order MCs are also possible S = {s 1, s 2, … s n } P = {P ij }

Random walks Random walks on graphs correspond to Markov Chains – The set of states S is the set of nodes of the graph G – The transition probability matrix is the probability that we follow an edge from one node to another

An example v1v1 v2v2 v3v3 v4v4 v5v5

State probability vector The vector q t = (q t 1,q t 2, …,q t n ) that stores the probability of being at state i at time t – q 0 i = the probability of starting from state i q t = q t-1 P

An example v1v1 v2v2 v3v3 v4v4 v5v5 q t+1 1 = 1/3 q t 4 + 1/2 q t 5 q t+1 2 = 1/2 q t 1 + q t 3 + 1/3 q t 4 q t+1 3 = 1/2 q t 1 + 1/3 q t 4 q t+1 4 = 1/2 q t 5 q t+1 5 = q t 2

Stationary distribution A stationary distribution for a MC with transition matrix P, is a probability distribution π, such that π = πP A MC has a unique stationary distribution if – it is irreducible the underlying graph is strongly connected – it is aperiodic for random walks, the underlying graph is not bipartite The probability π i is the fraction of times that we visited state i as t → ∞ The stationary distribution is an eigenvector of matrix P – the principal left eigenvector of P – stochastic matrices have maximum eigenvalue 1

Computing the stationary distribution The Power Method – Initialize to some distribution q 0 – Iteratively compute q t = q t-1 P – After enough iterations q t ≈ π – Power method because it computes q t = q 0 P t Why does it converge? – follows from the fact that any vector can be written as a linear combination of the eigenvectors q 0 = v 1 + c 2 v 2 + … c n v n Rate of convergence – determined by λ 2 t

The PageRank random walk Vanilla random walk – make the adjacency matrix stochastic and run a random walk

The PageRank random walk What about sink nodes? – what happens when the random walk moves to a node without any outgoing inks?

The PageRank random walk Replace these row vectors with a vector v – typically, the uniform vector P’ = P + dv T

The PageRank random walk How do we guarantee irreducibility? – add a random jump to vector v with prob α typically, to a uniform vector P’’ = αP’ + (1-α)uv T, where u is the vector of all 1s

Effects of random jump Guarantees irreducibility Motivated by the concept of random surfer Offers additional flexibility – personalization – anti-spam Controls the rate of convergence – the second eigenvalue of matrix P’’ is α

A PageRank algorithm Performing vanilla power method is now too expensive – the matrix is not sparse q 0 = v t = 1 repeat t = t +1 until δ < ε Efficient computation of y = (P’’) T x

Random walks on undirected graphs In the stationary distribution of a random walk on an undirected graph, the probability of being at node i is proportional to the (weighted) degree of the vertex Random walks on undirected graphs are not “interesting”

Research on PageRank Specialized PageRank – personalization [BP98] instead of picking a node uniformly at random favor specific nodes that are related to the user – topic sensitive PageRank [H02] compute many PageRank vectors, one for each topic estimate relevance of query with each topic produce final PageRank as a weighted combination Updating PageRank [Chien et al 2002] Fast computation of PageRank – numerical analysis tricks – node aggregation techniques – dealing with the “Web frontier”

Previous work The problem of identifying the most important nodes in a network has been studied before in social networks and bibliometrics The idea is similar – A link from node p to node q denotes endorsement – mine the network at hand – assign an centrality/importance/standing value to every node

Social network analysis Evaluate the centrality of individuals in social networks – degree centrality the (weighted) degree of a node – distance centrality the average (weighted) distance of a node to the rest in the graph – betweenness centrality the average number of (weighted) shortest paths that use node v

Counting paths – Katz 53 The importance of a node is measured by the weighted sum of paths that lead to this node A m [i,j] = number of paths of length m from i to j Compute converges when b < λ 1 (A) Rank nodes according to the column sums of the matrix P

Bibliometrics Impact factor (E. Garfield 72) – counts the number of citations received for papers of the journal in the previous two years Pinsky-Narin 76 – perform a random walk on the set of journals – P ij = the fraction of citations from journal i that are directed to journal j

References [BP98] S. Brin, L. Page, The anatomy of a large scale search engine, WWW 1998 [K98] J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, G. Pinski, F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management, 12(1976), pp L. Katz. A new status index derived from sociometric analysis. Psychometrika 18(1953). R. Motwani, P. Raghavan, Randomized Algorithms S. Kamvar, T. Haveliwala, C. Manning, G. Golub, Extrapolation methods for Accelerating PageRank Computation, WWW2003 A. Langville, C. Meyer, Deeper Inside PageRank, Internet Mathematics