Link Analysis 2 Page Rank Variants

Slides:



Advertisements
Similar presentations
Topic-Sensitive PageRank Presented by : Bratislav V. Stojanović University of Belgrade School of Electrical Engineering Page 1/29.
Advertisements

Information Networks Link Analysis Ranking Lecture 8.
Analysis of Large Graphs: TrustRank and WebSpam
Link Analysis: PageRank and Similar Ideas. Recap: PageRank Rank nodes using link structure PageRank: – Link voting: P with importance x has n out-links,
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis: PageRank
CS345 Data Mining Page Rank Variants. Review Page Rank  Web graph encoded by matrix M N £ N matrix (N = number of web pages) M ij = 1/|O(j)| iff there.
Link Analysis David Kauchak cs160 Fall 2009 adapted from:
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Algorithmic and Economic Aspects of Networks Nicole Immorlica.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
Link Analysis, PageRank and Search Engines on the Web
Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Topic-Sensitive PageRank Taher H. Haveliwala. PageRank Importance is propagated A global ranking vector is pre-computed.
Singular Value Decomposition and Data Management
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis HITS Algorithm PageRank Algorithm.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Web Spam Detection with Anti- Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University.
Link Analysis 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis Mining Massive Datasets.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Author(s): Rahul Sami and Paul Resnick, 2009 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution.
Lecture 2: Extensions of PageRank
Scaling Personalized Web Search Authors: Glen Jeh, Jennfier Widom Stanford University Written in: 2003 Cited by: 923 articles Presented by Sugandha Agrawal.
Query Suggestion Naama Kraus Slides are based on the papers: Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering Boldi, Bonchi,
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
Copyright © D.S.Weld12/3/2015 8:49 PM1 Link Analysis CSE 454 Advanced Internet Systems University of Washington.
Ranking Link-based Ranking (2° generation) Reading 21.
Understanding Google’s PageRank™ 1. Review: The Search Engine 2.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
Web Mining Link Analysis Algorithms Page Rank. Ranking web pages  Web pages are not equally “important” v  Inlinks.
Jeffrey D. Ullman Stanford University.  Web pages are important if people visit them a lot.  But we can’t watch everybody using the Web.  A good surrogate.
Machine Learning: Ensemble Methods
The PageRank Citation Ranking: Bringing Order to the Web
Quality of a search engine
15-499:Algorithms and Applications
Search Engines and Link Analysis on the Web
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Hubs and Authorities Jeffrey D. Ullman.
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Link-Based Ranking Seminar Social Media Mining University UC3M
DTMC Applications Ranking Web Pages & Slotted ALOHA
CSE 454 Advanced Internet Systems University of Washington
CSE 454 Advanced Internet Systems University of Washington
Lecture 22 SVD, Eigenvector, and Web Search
CSE 454 Advanced Internet Systems University of Washington
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CSE 454 Advanced Internet Systems University of Washington
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Junghoo “John” Cho UCLA
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Link Analysis II Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.
Link Analysis Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.
Presentation transcript:

Link Analysis 2 Page Rank Variants CS345 Data Mining Link Analysis 2 Page Rank Variants Anand Rajaraman, Jeffrey D. Ullman

Topics This lecture Many-walkers model Tricks for speeding convergence Topic-Specific Page Rank

Random walk interpretation At time 0, pick a page on the web uniformly at random to start the walk Suppose at time t, we are at page j At time t+1 With probability , pick a page uniformly at random from O(j) and walk to it With probability 1-, pick a page on the web uniformly at random and teleport into it Page rank of page p = “steady state” probability that at any given time, the random walker is at page p

Many random walkers Alternative, equivalent model Imagine a large number M of independent, identical random walkers (MÀN) At any point in time, let M(p) be the number of random walkers at page p The page rank of p is the fraction of random walkers that are expected to be at page p i.e., E[M(p)]/M.

Speeding up convergence Exploit locality of links Pages tend to link most often to other pages within the same host or domain Partition pages into clusters host, domain, … Compute local page rank for each cluster can be done in parallel Compute page rank on graph of clusters Initial rank of a page is the product of its local rank and the rank of its cluster Use as starting vector for normal page rank computation 2-3x speedup

In Pictures Ranks of clusters 1.5 2.05 0.05 Intercluster weights 2.0 0.1 Local ranks Initial eigenvector 3.0 0.15

Other tricks Adaptive methods Extrapolation Typically, small speedups ~20-30%

Problems with page rank Measures generic popularity of a page Biased against topic-specific authorities Ambiguous queries e.g., jaguar This lecture Uses a single measure of importance Other models e.g., hubs-and-authorities Next lecture Susceptible to Link spam Artificial link topographies created in order to boost page rank

Topic-Specific Page Rank Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When the random walker teleports, he picks a page from a set S of web pages S contains only pages that are relevant to the topic E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org) For each teleport set S, we get a different rank vector rS

Matrix formulation Aij = Mij + (1-)/|S| if i 2 S Aij = Mij otherwise Show that A is stochastic We have weighted all pages in the teleport set S equally Could also assign different weights to them

Example Suppose S = {1},  = 0.8 1 2 3 Node Iteration 0 1 2… stable 0.2 Suppose S = {1},  = 0.8 1 0.5 1 0.4 0.8 2 3 Node Iteration 0 1 2… stable 1 1.0 0.2 0.52 0.294 2 0 0.4 0.08 0.118 3 0 0.4 0.08 0.327 4 0 0 0.32 0.261 4 Note how we initialize the page rank vector differently from the unbiased page rank case.

How well does TSPR work? Experimental results [Haveliwala 2000] Picked 16 topics Teleport sets determined using DMOZ E.g., arts, business, sports,… “Blind study” using volunteers 35 test queries Results ranked using Page Rank and TSPR of most closely related topic E.g., bicycling using Sports ranking In most cases volunteers preferred TSPR ranking

Which topic ranking to use? User can pick from a menu Use Bayesian classification schemes to classify query into a topic Can use the context of the query E.g., query is launched from a web page talking about a known topic History of queries e.g., “basketball” followed by “jordan” User context e.g., user’s My Yahoo settings, bookmarks, …

Evaporation model Alternative, equivalent interpretation of page rank Instead of random teleport Assume random surfers “evaporate” from each page at rate (1-) per time step those surfers vanish from the system New random surfers enter the system at the teleport set pages Total of (1-)M at each step System reaches stable state evaporation at each time step = number of new surfers at each time step

Evaporation-based computation 0.2 Suppose S = {1},  = 0.8 1 0.4 0.8 2 3 Node Iteration 0 1 2… stable 1 0.2 0.2 0.264 0.294 2 0 0.08 0.08 0.118 3 0 0.08 0.08 0.327 4 0 0 0.064 0.261 4 Note how we initialize the page rank vector differently in this model

Scaling with topics and users Suppose we wanted to cover 1000’s of topics Need to compute 1000’s of different rank vectors Need to store and retrieve them efficiently at query time For good performance vectors must fit in memory Even harder when we consider personalization Each user has their own teleport vector One page rank vector per user!

Tricks Determine a set of basis vectors so that any rank vector is a linear combination of basis vectors Encode basis vectors compactly as partial vectors and a hubs skeleton At runtime perform a small amount of computation to derive desired rank vector elements

Linearity Theorem Let S be a teleport set and rS be the corresponding rank vector For page i2S, let ri be the rank vector corresponding to the teleport set {i} ri is a vector with N entries rS = (1/|S|) åi2S ri Why is linearity important? Instead of 2N biased page rank vectors we need to store N vectors

Linearity example Let us compute r{1,2} for  = 0.8 1 2 3 4 5 1 2 3 4 0.4 0.8 0.1 1 2 3 4 5 Let us compute r{1,2} for  = 0.8 Node Iteration 0 1 2… stable 1 0.1 0.1 0.164 0.300 2 0.1 0.14 0.172 0.323 3 0 0.04 0.04 0.120 4 0 0.04 0.056 0.130 5 0 0.04 0.056 0.130

Linearity example r{1,2} r1 r2 (r1+r2)/2 1 2 3 4 5 0.300 0.323 0.120 0.130 0.407 0.239 0.163 0.096 0.192 0.407 0.077 0.163 0.300 0.323 0.120 0.130

Intuition behind proof Let’s use the many-random-walkers model with M random walkers Let us color a random walker with color i if his most recent teleport was to page i At time t, we expect M/|S| of the random walkers to be colored i At any page j, we would therefore expect to find (M/|S|)ri(j) random walkers colored i So total number of random walkers at page j = (M/|S|)åi2Sri(j)

Basis Vectors Suppose T = union of all teleport sets of interest Call it the teleport universe We can compute the rank vector corresponding to any teleport set SµT as a linear combination of the vectors ri for i2T We call these vectors the basis vectors for T We can also compute rank vectors where we assign different weights to teleport pages

Decomposition Still too many basis vectors E.g., |T| might be in the thousands N|T| values Decompose basis vectors into partial vectors and hubs skeleton

Tours Consider a random walker with teleport set {i} Suppose walker is currently at node j The random walker’s tour is the sequence of nodes on the walker’s path since the last teleport E.g., i,a,b,c,a,j Nodes can repeat in tours – why? Interior nodes of the tour = {a,b,c} Start node = {i}, end node = {j} A page can be both start node and interior node, etc

Tour splitting Consider random walker with teleport set {i}, biased rank vector ri ri(j) = probability random walker reaches j by following some tour with start node i and end node j Consider node k Can have k = j but not k = i i k j

Tour splitting ri(j) = rik(j) + ri~k(j) Let rik(j) be the probability that random surfer reaches page j through a tour that includes page k as an interior or end node. Let ri~k(j) be the probability that random surfer reaches page j through a tour that does not include k as an interior or end node. ri(j) = rik(j) + ri~k(j) i k j

Example Let us compute r1~2 for  = 0.8 Note that many entries are 0.2 1 2 3 4 5 0.4 0.8 0.2 0.4 1 2 0.4 0.8 0.8 0.4 0.4 0.8 3 4 5 Let us compute r1~2 for  = 0.8 Node Iteration 0 1 2… stable 1 0.2 0.2 0.264 0.294 2 0 0 0 0 3 0 0.08 0.08 0.118 4 0 0 0 0 5 0 0 0 0 Note that many entries are zeros

Example Let us compute r2~2 for  = 0.8 1 2 3 4 5 Node Iteration 0.2 0.4 1 2 0.4 0.8 0.8 0.4 0.4 0.8 3 4 5 Let us compute r2~2 for  = 0.8 Node Iteration 0 1 2… stable 1 0 0 0.064 0.094 2 0.2 0.2 0.2 0.2 3 0 0 0 0.038 4 0 0.08 0.08 0.08 5 0 0.08 0.08 0.08

Rank composition Notice: r12(3) = r1(3) – r1~2(3) = 0.163 - 0.118 = 0.045 r1(2) * r2~2(3) = 0.239 * 0.038 = 0.009 = 0.2 * 0.045 = (1-)*r12(3) r12(3) = r1(2) r2~2(3)/ (1-)

Rank composition rk~k(j) ri(k) j i k rik(j) = ri(k)rk~k(j)/(1-)

Hubs Instead of a single page k, we can use a set H of “hub” pages Define ri~H(j) as set of tours from i to j that do not include any node from H as interior nodes or end node

Hubs example r2~H r1~H H = {1,2}  = 0.8 1 2 3 4 5 Node Iteration 0.2 0.2 0.4 1 2 H = {1,2} 0.4 0.8  = 0.8 0.8 0.4 0.4 0.8 3 4 5 r2~H Node Iteration 0 1 stable 1 0 0 0 2 0.2 0.2 0.2 3 0 0 0 4 0 0.08 0.08 5 0 0.08 0.08 r1~H Node Iteration 0 1 stable 1 0.2 0 0.2 2 0 0 0 3 0 0.08 0.08 4 0 0 0 5 0 0 0

Rank composition with hubs rh~H(j) wi(h) j i ri(j) = ri~H(j) + riH(j) H riH(j) = åh2Hwi(h)rh~H(j)/(1-b) ri~H(j) wi(h) = ri(h) if i = h wi(h) = ri(h) - (1-b) if i = h

Hubs rule example r2(3) = r2~H(3) + r2H(3) = 0 + r2H(3) 1 2 3 4 5 H H = {1,2}  = 0.8 r2(3) = r2~H(3) + r2H(3) = 0 + r2H(3) = [r2(1)r1~H(3)]/0.2+[(r2(2)-0.2)r2~H(3)]/0.2 = [0.192*0.08]/0.2+[(0.407-0.2)*0]/0.2 = 0.077

Hubs Start with H = T, the teleport universe Add nodes to H such that given any pair of nodes i and j, there is a high probability that H separates i and j i.e., ri~H(j) is zero for most i,j pairs Observation: high page rank nodes are good separators and hence good hub nodes

Hubs skeleton To compute ri(j) we need: ri~H(j) rh~H(j) ri(h) j i H ri~H(j) To compute ri(j) we need: ri~H(j) for all i2H, j2V called the partial vector Sparse ri(h) for all h2H called the hubs skeleton

Storage reduction Say |T| = 1000, |H|=2000, N = 1 billion Store all basis vectors 1000*1 billion = 1 trillion nonzero values Use partial vectors and hubs skeleton Suppose each partial vector has N/200 nonzero entries Partial vectors = 2000*N/200 = 10 billion nonzero values Hubs skeleton = 2000*2000 = 4 million values Total = approx 10 billion nonzero values Approximately 100x compression