1
CS 440 Database Management Systems
Graph Data & PageRank
2
How is the Web different from a database of documents?
3
How is the Web different from a database of documents?
Hypertext vs. text: many additional clues
  graph vs. set
  anchor text vs. text: what do others say about you?
Geographically distributed vs. centralized
  so you need to build a crawler
Precision is valued more than recall
  quality is more important than quantity, especially for “broad” queries
Spamming
Hoaxes
and more …
4
Web data and query
Data model
  directed graph
  nodes: Web pages
  links: hyperlinks
  all nodes belong to the same type
Query
  a set of terms
Answer
  ranked list of relevant and important pages
  quantifying a subjective quality
This is the basic data/query model
  more complex models exist, e.g., assigning types to pages
5
Web search before Google
Web as a set of documents
Relevance: content-based retrieval
  documents match queries by contents
  q: ’clinton’
  pages with more occurrences of ‘clinton’ rank higher
Importance???
  contents: what documents say about themselves
  many spam pages and much unreliable information in the results
Directory services were used
  Yahoo! was one of the leaders
Google co-founders were told “nobody will use a keyword interface”.
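A minimal sketch of the content-based ranking described on this slide, assuming relevance is simply the number of times the query term occurs in a page. The page names and texts are hypothetical, invented only for illustration; the example shows why a page that repeats the term can take the top spot.

```python
def term_count_rank(pages, term):
    """Rank page names by how often `term` occurs in their text (most first)."""
    term = term.lower()
    scores = {name: text.lower().split().count(term) for name, text in pages.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical pages, not taken from the slides.
pages = {
    "news": "clinton signs the budget bill",
    "bio": "bill clinton biography clinton early years",
    "spam": "clinton clinton clinton clinton buy cheap watches",
}

ranking = term_count_rank(pages, "clinton")
print(ranking)  # the keyword-stuffed page comes first
```

Because the score depends only on what a page says about itself, the hypothetical "spam" page outranks the others.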
6
Google: PageRank
From the Stanford Digital Libraries project, 1996-98
Published the paper: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7) (1998)
Tried to sell to Infoseek in 1997
Google was founded in 1998 by Brin and Page
7
Web: Adjacency Matrix
Web: G = {V, E}
V = {x, y, z}, |V| = n
E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
A: n x n matrix: Aij = 1 if page i links to page j, 0 if not

                    target node
                    x  y  z
               x  [ 1  1  1 ]
source node    y  [ 0  0  1 ]  = A
               z  [ 1  1  0 ]
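As a small sketch, the matrix A can be built directly from the edge set E on this slide. The function name `adjacency_matrix` and the list-of-lists representation are assumptions made for illustration.

```python
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"), ("y", "z"), ("z", "x"), ("z", "y")]


def adjacency_matrix(nodes, edges):
    """Return A with A[i][j] = 1 if node i links to node j, else 0."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for src, dst in edges:
        A[index[src]][index[dst]] = 1
    return A


A = adjacency_matrix(nodes, edges)
for name, row in zip(nodes, A):
    print(name, row)  # one row per source node
```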
8
Transposed Adjacency Matrix
Adjacency matrix A: what does row j represent?
  row j in A: what nodes does node j link to?
Transpose At:
  row j in At: what nodes link to node j?

          x  y  z                  x  y  z
     x  [ 1  1  1 ]           x  [ 1  0  1 ]
A =  y  [ 0  0  1 ]     At =  y  [ 1  0  1 ]
     z  [ 1  1  0 ]           z  [ 1  1  0 ]
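A short sketch of the two row readings, assuming A is the matrix for the edge set given on the adjacency-matrix slide. The helper names `transpose` and `linked_names` are hypothetical.

```python
nodes = ["x", "y", "z"]
# A[i][j] = 1 if node i links to node j (rows are source nodes).
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def linked_names(row, nodes):
    """Names of the nodes whose entry in this row is 1."""
    return [name for name, bit in zip(nodes, row) if bit]


At = transpose(A)
out_links_of_y = linked_names(A[1], nodes)   # row y of A: pages y links to
in_links_of_y = linked_names(At[1], nodes)   # row y of At: pages linking to y
print(out_links_of_y, in_links_of_y)
```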
9
PageRank: importance of pages
PageRank (or importance), defined recursively:
  a page P is important if important pages link to it
  importance of P: contributed proportionally by the pages that link to it (its back-links)
Example:
  vx = 1/2 vx + 1/2 vz
  vy = 1/2 vz
  vz = 1/2 vx + 1 vy
Random-surfer interpretation:
  a surfer randomly follows links to navigate
  PageRank = the probability that the surfer will visit the page
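The random-surfer reading can be checked with a small simulation. This is a sketch under assumptions: the out-links are the ones implied by the example equations (x links to x and z, y links to z, z links to x and y), and the function name `surf`, the step count, and the seed are arbitrary choices. Solving the three equations together with vx + vy + vz = 1 gives (0.4, 0.2, 0.4), so the visit frequencies should land close to those values.

```python
import random

# Out-links implied by the example equations (an assumption, see above).
out_links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}


def surf(out_links, steps, seed=0):
    """Follow random links for `steps` steps; return the visit frequency per page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(out_links))
    visits = {name: 0 for name in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {name: count / steps for name, count in visits.items()}


freq = surf(out_links, steps=200_000)
print(freq)  # roughly 0.4 for x, 0.2 for y, 0.4 for z
```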
10
Computing PageRank
Importance-propagation equation: v = M v
Computation: by relaxation
Linked-from (At) or links-to matrix (A)?
  linked-from matrix: importance flows from the “source” nodes to the target nodes
  column-normalized: column x covers all the pages that x points to
  sum of each column = 1, since a source node distributes its importance to all the nodes that it links to
Transition matrix for the example:

          x    y    z
     x  [ 1/2   0   1/2 ]
M =  y  [  0    0   1/2 ]
     z  [ 1/2   1    0  ]

Relaxation, starting from v = (1, 1, 1):
  vx: 1, 1, 5/4, … 6/5
  vy: 1, 1/2, 3/4, … 3/5
  vz: 1, 3/2, 1, … 6/5
v: fixpoint
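The relaxation can be sketched as repeated multiplication by M. The matrix below is the one implied by the example equations vx = 1/2 vx + 1/2 vz, vy = 1/2 vz, vz = 1/2 vx + 1 vy; the function names `step` and `relax` and the number of rounds are assumptions for illustration. Exact fractions are used so the first iterates can be compared with the values on the slide.

```python
from fractions import Fraction as F

# Column-normalized linked-from matrix: column j holds where node j sends its importance.
M = [
    [F(1, 2), F(0), F(1, 2)],  # vx = 1/2 vx + 1/2 vz
    [F(0),    F(0), F(1, 2)],  # vy = 1/2 vz
    [F(1, 2), F(1), F(0)],     # vz = 1/2 vx + 1 vy
]


def step(M, v):
    """One relaxation step: return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relax(M, v, rounds):
    """Apply `step` a fixed number of times."""
    for _ in range(rounds):
        v = step(M, v)
    return v


v0 = [F(1), F(1), F(1)]
v1 = step(M, v0)   # (1, 1/2, 3/2)
v2 = step(M, v1)   # (5/4, 3/4, 1)
v_far = [float(x) for x in relax(M, v0, 60)]
print(v1, v2, v_far)  # v_far approaches (6/5, 3/5, 6/5)
```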
11
Problems: Dead Ends
Dead ends:
  a page without successors has nowhere to send its importance
  eventually, what would happen to v?
Example:
  va = 0 va + 0 vb
  vb = 1 va + 0 vb
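A quick way to answer the question is to iterate the example equations. In this sketch a links to b and b has no out-links, as the equations imply; the starting vector (1, 1) and the name `step` are assumptions. The total importance drains away after two steps.

```python
# Column a sends everything to b; column b (the dead end) sends nothing.
M_dead = [[0.0, 0.0],
          [1.0, 0.0]]


def step(M, v):
    """One relaxation step: return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


v = [1.0, 1.0]
history = [v]
for _ in range(3):
    v = step(M_dead, v)
    history.append(v)

for k, vk in enumerate(history):
    print(k, vk)  # importance leaks out until v is all zeros
```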
12
Problems: Spider Trap
Spider traps:
  a group of pages without out-of-group links will trap a spider inside
  what would happen to v?
Example:
  va = 1/2 va + 0 vb
  vb = 1/2 va + 1 vb
Solutions??
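Iterating the example equations shows the effect. In this sketch a links to itself and to b, while b links only to itself, as the equations imply; the starting vector (1, 1), the round count, and the name `step` are assumptions. The total stays at 2, but all of it ends up in the trap.

```python
# Column a splits its importance between a and b; column b keeps everything.
M_trap = [[0.5, 0.0],
          [0.5, 1.0]]


def step(M, v):
    """One relaxation step: return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


v = [1.0, 1.0]
for _ in range(50):
    v = step(M_trap, v)

print(v)  # va shrinks toward 0 while vb absorbs all the importance
```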
13
Solutions: surfer’s random jump
Surfer can randomly jump to a new page without following links
Teleportation: v = d M v + (1-d) e / n
  M: transition matrix
  e: a vector with all 1’s
  n: number of nodes in the graph
  d: damping factor (set to .85 in paper)
  (1-d) e / n: models the probability of randomly jumping to this page
  another interpretation: “tax” the importance of each page and distribute it to all pages
Per-page form, as written in the paper (without the 1/n normalization):
  PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  PR(A): PageRank of page A
  T1, ... Tn: pages that point to A
  C(Ti): out-degree of Ti (# of outlinks)
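A minimal sketch of the update v = d M v + (1-d) e / n, applied to the two-node spider-trap example (a links to a and b, b links only to b). The function name `pagerank`, the uniform starting vector, and the fixed number of rounds are assumptions; d = 0.85 follows the value quoted from the paper.

```python
def pagerank(M, d=0.85, rounds=200):
    """Iterate v = d M v + (1 - d) e / n from a uniform start."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(rounds):
        v = [d * sum(m * x for m, x in zip(row, v)) + (1 - d) / n for row in M]
    return v


# Spider-trap transition matrix (columns sum to 1).
M_trap = [[0.5, 0.0],
          [0.5, 1.0]]

v_trap = pagerank(M_trap)
print(v_trap)  # a keeps a nonzero share instead of draining to 0
```

With the random jump, page a settles at 3/23 and page b at 20/23 rather than 0 and everything. The sketch assumes every column of M sums to 1; a dead-end column would still lose part of the total.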
14
Anti-Spamming
Spamming:
  attempts to create artifacts that “please” search engines so that ranking will be high
  e.g., commercial “search engine optimization” services
Google anti-spam device:
  unlike other search engines, tends to believe what others say about you, through links and anchor texts
  recursive importance also helps: importance (not just links) propagates
Still, not a perfect solution
15
What you should know
Web data and query model
PageRank formula and algorithm
Dead ends and spider traps
Teleportation