CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure.


Class Presentations
- In-class, Tuesday and Thursday next week
- 2-person teams: 6 minutes, up to 6 slides (3 minutes and 3 slides per person)
- 1-person teams: 4 minutes, up to 4 slides
- PowerPoint or PDF is fine; needs to be emailed by 12 noon on the day of the presentation
- Order of presentations will be announced later in the week

Web Mining
- Web = a potentially enormous "data set" for data mining
- 3 primary aspects of "Web mining":
  1. Web page content, e.g., clustering Web pages based on their text content
  2. Web connectivity, e.g., characterizing distributions of path lengths between pages, or determining the importance of pages from graph structure
  3. Web usage, e.g., understanding user behavior from Web logs
- All 3 are interconnected/interdependent, e.g., Google (and most search engines) use both content and connectivity
- These slides: Web connectivity

The Web Graph
- G = (V, E), where V = set of all Web pages and E = set of all hyperlinks
- Number of nodes? Difficult to estimate, since crawling the Web is highly non-trivial; more than 10 billion
- Number of edges? |E| = O(|V|), i.e., the mean number of outlinks per page is a small constant

The Web Graph
- The Web graph is inherently dynamic: nodes and edges are continually appearing and disappearing
- Interested in general properties of the Web graph:
  - What is the distribution of the number of in-links and out-links?
  - What is the distribution of the number of pages per site?
  - Many of these distributions are typically power laws
  - How far apart are 2 randomly selected pages on the Web, i.e., what is the "average distance" between 2 random pages?
  - And so on...
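The graph properties listed above can be computed directly on a small example. The following is a minimal sketch, assuming a hypothetical four-page graph stored as a dictionary of outlinks (the graph and all variable names are illustrative assumptions, not data from the slides). It tabulates in-degree and out-degree histograms and uses breadth-first search to obtain the average distance between pairs of pages.

```python
from collections import Counter, deque

# Hypothetical toy graph (an assumption for illustration): page -> list of outlinks.
outlinks = {1: [2], 2: [3, 4], 3: [1, 2], 4: [2]}

out_degree = {page: len(targets) for page, targets in outlinks.items()}
in_degree = Counter(t for targets in outlinks.values() for t in targets)

# Histograms: how many pages have a given out-degree / in-degree.
out_hist = Counter(out_degree.values())
in_hist = Counter(in_degree[page] for page in outlinks)

def distances_from(source):
    """Breadth-first search: number of clicks from source to each reachable page."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for nxt in outlinks[page]:
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

# Average distance over all ordered pairs of distinct, connected pages.
pair_distances = [d for s in outlinks for t, d in distances_from(s).items() if t != s]
mean_distance = sum(pair_distances) / len(pair_distances)

print(dict(in_hist), dict(out_hist), round(mean_distance, 3))
```

On a real crawl the same quantities would be estimated from a sample of pages, since running a breadth-first search from every node does not scale to billions of pages.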

Social Networks
- Social networks = graphs
  - V = set of "actors" (e.g., students in a class)
  - E = set of interactions (e.g., collaborations)
  - Typically small graphs, e.g., |V| = 10 or 50
- Long history of social network analysis (e.g., at UCI)
- Quantitative data analysis techniques that can automatically extract "structure" or information from graphs
  - E.g., who is the most important "actor" in a network?
  - E.g., are there clusters in the network?
- Comprehensive reference: S. Wasserman and K. Faust, Social Network Analysis, Cambridge University Press, 1994.

Node Importance in Social Networks
- General idea: some nodes are more important than others in terms of the structure of the graph
- In a directed graph, "in-degree" may be a useful indicator of importance
  - e.g., for a citation network among authors (or papers), in-degree is the number of citations => "importance"
- However, "in-degree" is only a first-order measure, in that it implicitly assumes that all edges are of equal importance

Recursive Notions of Node Importance
- $w_{ij}$ = weight of the link from node $i$ to node $j$
  - assume $\sum_j w_{ij} = 1$ and that the weights are non-negative
  - e.g., default choice: $w_{ij} = 1/\text{outdegree}(i)$, so more outlinks => less importance attached to each
- Define $r_j$ = importance of node $j$ in a directed graph:
  $r_j = \sum_i w_{ij} r_i, \quad i, j = 1, \ldots, n$
- The importance of a node is a weighted sum of the importance of the nodes that point to it
  - Makes intuitive sense
  - Leads to a set of recursive linear equations
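As a small illustration of these recursive equations, the sketch below builds the default weights $w_{ij} = 1/\text{outdegree}(i)$ for a hypothetical four-node graph and checks that a candidate importance vector satisfies $r_j = \sum_i w_{ij} r_i$. The graph, the candidate vector, and the function name `importance_update` are assumptions made for illustration; they are not taken from the slides. Exact fractions are used so the check involves no rounding.

```python
from fractions import Fraction

# Hypothetical 4-node graph (not the one on the slides): node -> outlinks.
outlinks = {1: [2], 2: [3, 4], 3: [1, 2], 4: [2]}
nodes = sorted(outlinks)

# w[i][j] = 1/outdegree(i) if i links to j, else 0, so every row sums to 1.
w = {i: {j: Fraction(0) for j in nodes} for i in nodes}
for i, targets in outlinks.items():
    for j in targets:
        w[i][j] = Fraction(1, len(targets))

def importance_update(r):
    """Return the vector whose j-th entry is sum_i w[i][j] * r[i]."""
    return {j: sum(w[i][j] * r[i] for i in nodes) for j in nodes}

# A candidate solution for this particular graph, normalized to sum to 1.
r = {1: Fraction(1, 9), 2: Fraction(4, 9), 3: Fraction(2, 9), 4: Fraction(2, 9)}

# True when the candidate is a fixed point of the importance equations.
is_solution = importance_update(r) == r
print(is_solution)
```

In this toy graph node 2 receives links from three of the four nodes and ends up with the largest importance value.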

Simple Example
[Figure: a directed graph on nodes 1, 2, 3, 4 and its weight matrix W]

Matrix-Vector form
- Recall $r_j$ = importance of node $j$: $r_j = \sum_i w_{ij} r_i$, $i, j = 1, \ldots, n$
- e.g., $r_2 = w_{12} r_1 + w_{22} r_2 + w_{32} r_3 + w_{42} r_4$ = dot product of the r vector with column 2 of W
- Let r = n x 1 vector of importance values for the n nodes
- Let W = n x n matrix of link weights
- => we can rewrite the importance equations as $r = W^T r$

Eigenvector Formulation
- Need to solve the importance equations for unknown r, with known W: $r = W^T r$
- We recognize this as a standard eigenvalue problem, i.e., $A r = \lambda r$ (where $A = W^T$), with $\lambda$ = an eigenvalue = 1 and r = the eigenvector corresponding to $\lambda = 1$

Eigenvector Formulation
- Need to solve for r in $(W^T - I) r = 0$
- Note: W is a stochastic matrix, i.e., its entries are non-negative and its rows sum to 1
- Results from linear algebra tell us that:
  (a) W and $W^T$ have the same eigenvalues (this holds for any square matrix), although in general not the same eigenvectors
  (b) Since W is a stochastic matrix, the largest of these eigenvalues (in magnitude) is always 1
  (c) the vector r is the eigenvector of $W^T$ corresponding to this largest eigenvalue, $\lambda = 1$
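For a small graph, $(W^T - I) r = 0$ can be solved directly. The sketch below is one way to do it, under stated assumptions: the weight matrix `W` is a hypothetical four-node example (not the matrix from the slides), and because $W^T - I$ is singular, one equation is replaced by the normalization $\sum_j r_j = 1$ before running Gauss-Jordan elimination with exact fractions. The function name `solve_importance` is illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Hypothetical row-stochastic weight matrix W (an assumption for illustration).
W = [
    [0, 1, 0, 0],
    [0, 0, half, half],
    [half, half, 0, 0],
    [0, 1, 0, 0],
]

def solve_importance(W):
    """Solve (W^T - I) r = 0 together with sum(r) = 1 by Gauss-Jordan elimination."""
    n = len(W)
    # Row j of (W^T - I) has entries W[i][j] - [i == j] for i = 0..n-1.
    A = [[Fraction(W[i][j]) - (1 if i == j else 0) for i in range(n)] for j in range(n)]
    b = [Fraction(0)] * n
    # The system is singular, so replace the last equation by the normalization.
    A[-1] = [Fraction(1)] * n
    b[-1] = Fraction(1)
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(row for row in range(col, n) if A[row][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col and A[row][col] != 0:
                factor = A[row][col] / A[col][col]
                A[row] = [x - factor * y for x, y in zip(A[row], A[col])]
                b[row] -= factor * b[col]
    return [b[k] / A[k][k] for k in range(n)]

r = solve_importance(W)
print(r)
```

Elimination of this kind costs on the order of $n^3$ operations, which is fine for four nodes but not for a graph the size of the Web.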

Solution for the Simple Example
- Solving for the eigenvector of $W^T$ with eigenvalue 1 we get r = [ ]
- Results are quite intuitive, e.g., 2 is "most important"

PageRank Algorithm: Applying this idea to the Web
1. Crawl the Web to get nodes (pages) and links (hyperlinks) [highly non-trivial problem!]
2. Weights from each page = 1/(# of outlinks)
3. Solve for the eigenvector r (for $\lambda = 1$) of the weight matrix
Computational problem:
- Solving an eigenvector equation scales as $O(n^3)$
- For the entire Web graph n > 10 billion (!!)
- So direct solution is not feasible
- Can use the power method (iterative): $r^{(k+1)} = W^T r^{(k)}$ for k = 1, 2, ...

Power Method for solving for r
- $r^{(k+1)} = W^T r^{(k)}$
- Define a suitable starting vector $r^{(1)}$, e.g., all entries 1/n, or all entries = indegree(node)/|E|, etc.
- Each iteration is a matrix-vector multiplication => $O(n^2)$. Problematic? No: since W is highly sparse (Web pages have limited outdegree), each iteration is effectively O(n)
- For sparse W, the iterations typically converge quite quickly:
  - the rate of convergence depends on the "spectral gap": how quickly does $\text{error}(k) = (|\lambda_2| / |\lambda_1|)^k$ go to 0 as a function of k?
  - if $|\lambda_2|$ is close to 1 (= $\lambda_1$) then convergence is slow
  - empirically: Web graph with 300 million pages -> 50 iterations to convergence (Brin and Page, 1998)
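The power method can be sketched in a few lines when the graph is stored as adjacency lists, which is what makes each iteration proportional to the number of links rather than $n^2$. The following is a minimal illustration, assuming a hypothetical four-node graph that is irreducible and aperiodic and in which every node has at least one outlink; the function name `power_method`, the tolerance, and the stopping rule are illustrative choices, not details from the slides.

```python
def power_method(outlinks, tol=1e-12, max_iter=1000):
    """Iterate r <- W^T r with w_ij = 1/outdegree(i), using sparse adjacency lists."""
    nodes = sorted(outlinks)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}  # starting vector: all entries 1/n
    for k in range(1, max_iter + 1):
        new_r = {v: 0.0 for v in nodes}
        for i, targets in outlinks.items():
            share = r[i] / len(targets)  # node i splits its importance evenly
            for j in targets:
                new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:
            return r, k
    return r, max_iter

# Hypothetical graph (an assumption for illustration): node -> outlinks.
outlinks = {1: [2], 2: [3, 4], 3: [1, 2], 4: [2]}
r, iterations = power_method(outlinks)
print(r, iterations)
```

For this toy graph the second-largest eigenvalue of the weight matrix has magnitude 0.5, so the error shrinks roughly by half per iteration and the loop stops well before `max_iter`.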

Basic Principles of Markov Chains
- Discrete-time, finite-state, first-order Markov chain with K states
- Transition matrix A = K x K matrix
  - Entry $a_{ij} = P(\text{state}_t = j \mid \text{state}_{t-1} = i)$, $i, j = 1, \ldots, K$
  - Rows sum to 1 (since $\sum_j P(\text{state}_t = j \mid \text{state}_{t-1} = i) = 1$)
  - Note that $P(\text{state}_t \mid \ldots)$ depends only on $\text{state}_{t-1}$
- $P_0$ = initial state probability distribution, with entries $P(\text{state}_0 = i)$, $i = 1, \ldots, K$

Simple Example of a Markov Chain
- K = 3
- A = [3 x 3 transition matrix, shown as a figure]
- $P_0$ = [1/3 1/3 1/3]

Steady-State (Equilibrium) Distribution for a Markov Chain
- Irreducibility: a Markov chain is irreducible if there is a directed path from any node to any other node
- Steady-state distribution $\pi$ for an irreducible Markov chain*:
  - $\pi_i$ = probability that, in the long run, the chain is in state $i$
  - The $\pi$'s are solutions to $\pi = A^T \pi$
- Note that this is exactly the same as our earlier recursive equations for node importance in a graph!
* Note: technically, for a meaningful solution to exist for $\pi$, A must be both irreducible and aperiodic
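To make the steady-state idea concrete, the sketch below propagates a state distribution through a hypothetical 3-state chain and shows that two different starting distributions end up at the same $\pi$. The transition matrix `A` is an assumption chosen for illustration (it is irreducible and aperiodic), not the matrix from the slides, and the helper names `step` and `long_run` are illustrative.

```python
# Hypothetical 3-state transition matrix (rows sum to 1); not the matrix from the slides.
A = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

def step(p, A):
    """One step of the chain: new_p[j] = sum_i p[i] * A[i][j], i.e. new_p = A^T p."""
    K = len(A)
    return [sum(p[i] * A[i][j] for i in range(K)) for j in range(K)]

def long_run(p0, A, steps=200):
    """Distribution over states after the given number of steps, starting from p0."""
    p = list(p0)
    for _ in range(steps):
        p = step(p, A)
    return p

pi_uniform = long_run([1 / 3, 1 / 3, 1 / 3], A)  # start from the uniform distribution
pi_state1 = long_run([1.0, 0.0, 0.0], A)         # start in state 1 with certainty
print(pi_uniform, pi_state1)
```

Both runs approach the distribution [0.25, 0.5, 0.25], which satisfies $\pi = A^T \pi$ for this particular matrix.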

Markov Chain Interpretation of PageRank
- W is a stochastic matrix (rows sum to 1) by definition
  - can interpret W as defining the transition probabilities in a Markov chain
  - $w_{ij}$ = probability of transitioning from node i to node j
- Markov chain interpretation: $r = W^T r$ -> these are the equations for the steady-state probabilities of a Markov chain
- page importance <=> steady-state Markov probabilities <=> eigenvector

The Random Surfer Interpretation
- Recall that for the Web model, we set $w_{ij} = 1/\text{outdegree}(i)$
- Thus, using W to compute the importance of Web pages is equivalent to a model where:
  - We have a random surfer who surfs the Web for an infinitely long time
  - At each page the surfer randomly selects an outlink to the next page
  - "importance" of a page = fraction of visits the surfer makes to that page
  - this is intuitive: pages that have better connectivity will be visited more often
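The random surfer model can be simulated directly. The sketch below is a rough illustration, assuming a hypothetical four-page graph in which every page has at least one outlink; the function name `random_surfer`, the seed, and the number of steps are arbitrary illustrative choices. It counts the fraction of visits to each page over a long (but finite) walk.

```python
import random

def random_surfer(outlinks, steps, seed=0):
    """Follow random outlinks for a fixed number of steps and return visit fractions."""
    rng = random.Random(seed)
    page = rng.choice(sorted(outlinks))      # arbitrary starting page
    visits = {v: 0 for v in outlinks}
    for _ in range(steps):
        page = rng.choice(outlinks[page])    # pick one outlink uniformly at random
        visits[page] += 1
    return {v: count / steps for v, count in visits.items()}

# Hypothetical graph (an assumption for illustration): page -> outlinks.
outlinks = {1: [2], 2: [3, 4], 3: [1, 2], 4: [2]}
freq = random_surfer(outlinks, steps=200_000)
print(freq)
```

For this graph the exact importance values are 1/9, 4/9, 2/9 and 2/9, and the simulated visit fractions should land close to them, with the remaining gap due to sampling noise.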

Potential Problems
- Page 1 is a "sink" (no outlink)
- Pages 3 and 4 together are also a "sink" (no outlink from the system {3, 4})
- Markov chain theory tells us that no unique steady-state solution exists: depending on where you start you will end up at 1 or {3, 4}
- The Markov chain is "reducible"

Making the Web Graph Irreducible
- One simple solution to our problem is to modify the Markov chain:
  - With probability $\alpha$ the random surfer jumps to any random page in the system (each page with probability 1/n, conditioned on such a jump)
  - With probability $1 - \alpha$ the random surfer selects an outlink (randomly from the set of available outlinks)
- The resulting transition graph is fully connected => Markov system is irreducible => steady-state solutions exist
- Typically $\alpha$ is chosen to be between 0.1 and 0.2 in practice
- But now the graph is dense! However, the power iterations can be written as:
  $r^{(k+1)} = (1 - \alpha) W^T r^{(k)} + (\alpha / n) \mathbf{1}$, where $\mathbf{1}$ is the n x 1 vector of ones
  - Complexity is still O(n) per iteration for sparse W
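The modified iteration can be sketched without ever forming the dense matrix: the jump term contributes the same constant $\alpha/n$ to every page, and only the sparse link structure is traversed. The code below is a minimal illustration under stated assumptions: the graph is a hypothetical reducible example, every page is assumed to have at least one outlink (pages with no outlinks would need separate handling that this sketch omits), and the function name `pagerank`, the default `alpha = 0.15`, and the tolerance are illustrative choices.

```python
def pagerank(outlinks, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate r <- (1 - alpha) W^T r + (alpha / n) 1 using sparse adjacency lists.

    Assumes every page has at least one outlink.
    """
    nodes = sorted(outlinks)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Random-jump term: every page receives alpha / n.
        new_r = {v: alpha / n for v in nodes}
        # Link-following term: page i passes (1 - alpha) * r[i] to its outlinks.
        for i, targets in outlinks.items():
            share = (1 - alpha) * r[i] / len(targets)
            for j in targets:
                new_r[j] += share
        if sum(abs(new_r[v] - r[v]) for v in nodes) < tol:
            return new_r
        r = new_r
    return r

# Hypothetical reducible graph: pages 1 and 2 only link to each other,
# so pages 3 and 4 cannot be reached from them by following links.
outlinks = {1: [2], 2: [1], 3: [1, 4], 4: [3]}
r = pagerank(outlinks)
print(r)
```

Without the jump term the walk on this graph would eventually be trapped in {1, 2}; with it, pages 3 and 4 keep a small but non-zero score.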

The PageRank Algorithm
- S. Brin and L. Page, The anatomy of a large-scale hypertextual search engine, in Proceedings of the 7th WWW Conference, 1998.
- PageRank = the method on the previous slide, applied to the entire Web graph
  - Crawl the Web; store both connectivity and content
  - Calculate (off-line) the "pagerank" r for each Web page using the power iteration method
- How can this be used to answer Web queries?
  - Terms in the search query are used to limit the set of pages of possible interest
  - Pages are then ordered for the user via precomputed pageranks
  - The Google search engine combines r with text-based measures
  - This was the first demonstration that link information could be used for content-based search on the Web

Link Structure Helps in Web Search
[Figure: search quality of engines SE1, etc., compared with TFIDF; Singhal and Kaszkiel, WWW Conference, 2001]
- SE1, etc., indicate different (anonymized) commercial search engines, all using link structure (e.g., PageRank) in their rankings
- TFIDF is a search method (state of the art at the time) that does not use any link structure

PageRank architecture at Google
- Ranking of pages is more important than the exact values of $p_i$
- Pre-compute and store the PageRank of each page
  - PageRank is independent of any query or textual content
- Ranking scheme combines PageRank with textual match
  - Unpublished
  - Many empirical parameters and human effort
  - Criticism: ad hoc coupling of query relevance and graph importance
- Massive engineering effort
  - Continually crawling the Web and updating page ranks

Link Manipulation

Conclusions
- The PageRank algorithm was the first algorithm for link-based search
  - Many extensions and improvements since then (see papers on the class Web page)
  - Same idea used in social networks for determining importance
- Real-world search involves many other aspects besides PageRank
  - E.g., use of logistic regression for ranking: learns how to predict the relevance of a page (represented by a bag of words) relative to a query, using historical click data
  - See the paper by Joachims on the class Web page
- Additional slides (optional): HITS algorithm, Kleinberg, 1998

ADDITIONAL OPTIONAL SLIDES

PageRank: Limitations
- "rich get richer" syndrome
  - not as "democratic" as originally (nobly) claimed; certainly not 1 vote per "WWW citizen"
  - also: crawling frequency tends to be based on pagerank
  - for detailed grumblings, see etc.
- not query-sensitive
  - the random walk is the same regardless of query topic, whereas a real random surfer has some topic interests
- non-uniform jumping vector needed
  - would enable personalization (but requires faster eigenvector convergence)
  - topic of ongoing research
- ad hoc mix of PageRank and keyword match score
  - done in two steps for efficiency, not quality motivations

HITS: Hub and Authority Rankings
- J. Kleinberg, Authoritative sources in a hyperlinked environment, Proceedings of ACM SODA Conference, 1998.
- HITS = Hypertext Induced Topic Selection
- Every page u has two distinct measures of merit: its hub score h[u] and its authority score a[u]
- Recursive quantitative definitions of hub and authority scores
- Relies on query-time processing:
  - To select the base set $V_q$ of links for query q, constructed by
    - selecting a sub-graph R from the Web (root set) relevant to the query
    - selecting any node u which neighbors any $r \in R$ via an inbound or outbound edge (expanded set)
  - To deduce hubs and authorities that exist in a sub-graph of the Web

Authority and Hubness
- a(1) = h(2) + h(3) + h(4)
- h(1) = a(5) + a(6) + a(7)

Authority and Hubness Convergence
- Recursive dependency:
  $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  where pa[v] = parents of v (pages that link to v) and ch[v] = children of v (pages that v links to)
- Using linear algebra, we can prove that a(v) and h(v) converge (when the scores are normalized after each update)
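The recursive hub and authority updates can be sketched as a simple loop. The code below is an illustration under stated assumptions: the graph is a hypothetical example with three hub-like pages pointing at two authority-like pages, scores are rescaled to unit Euclidean length after each round, and a fixed number of iterations stands in for a proper convergence test. The function name `hits` and these choices are illustrative, not details from the slides.

```python
import math

def hits(outlinks, iterations=50):
    """Repeat a(v) <- sum of h(w) over parents w, then h(v) <- sum of a(w) over children w."""
    nodes = sorted(set(outlinks) | {v for targets in outlinks.values() for v in targets})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: each page collects the hub scores of pages linking to it.
        auth = {v: 0.0 for v in nodes}
        for w, children in outlinks.items():
            for v in children:
                auth[v] += hub[w]
        # Hub update: each page collects the authority scores of pages it links to.
        hub = {v: sum(auth[c] for c in outlinks.get(v, [])) for v in nodes}
        # Normalize both vectors to unit Euclidean length so the scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in auth.values()))
        h_norm = math.sqrt(sum(x * x for x in hub.values()))
        auth = {v: x / a_norm for v, x in auth.items()}
        hub = {v: x / h_norm for v, x in hub.items()}
    return auth, hub

# Hypothetical graph (an assumption for illustration): pages 1-3 link to pages 4 and 5.
outlinks = {1: [4, 5], 2: [4, 5], 3: [4], 4: [], 5: []}
auth, hub = hits(outlinks)
print(auth, hub)
```

In this example page 4 ends up with the highest authority score because all three hubs point to it, and pages 1 and 2 are better hubs than page 3 because they point to both authorities.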

HITS Example
- Find a base subgraph:
  - Start with a root set R = {1, 2, 3, 4}: nodes relevant to the topic
  - Expand the root set R to include all the children and a fixed number of parents of nodes in R
  - => a new set S (base subgraph)

HITS Example Results
[Figure: authority and hubness weights for the example]

Stability of HITS vs PageRank (5 trials)
[Figure: HITS and PageRank results over 5 trials, each with a randomly deleted 30% of papers]

HITS vs PageRank: Stability
- e.g. [Ng & Zheng & Jordan, IJCAI-01 & SIGIR-01]
- HITS can be very sensitive to a change in a small fraction of nodes/edges in the link structure
- PageRank is much more stable, due to random jumps
- They propose HITS as a bidirectional random walk:
  - with probability d, randomly (p = 1/n) jump to a node
  - with probability 1 - d:
    - odd timestep: take a random outlink from the current node
    - even timestep: go backward on a random inlink of the node
  - this HITS variant seems much more stable as d is increased
  - issue: tuning d (d = 1 is most stable but useless for ranking)

Recommended Books