CS 440 Database Management Systems Graph Data & PageRank
How is the Web different from a database of documents?
- Hypertext vs. text: many additional clues
  - graph vs. set
  - anchor text vs. text: what do others say about you?
- Geographically distributed vs. centralized
  - so you need to build a crawler
- Precision is valued more than recall
  - quality is more important than quantity, especially for “broad” queries
- Spamming
- Hoaxes
- and more …
Web data and query
- Data model: directed graph
  - nodes: Web pages
  - links: hyperlinks
  - all nodes belong to the same type
- Query: a set of terms
- Answer: ranked list of relevant and important pages
  - quantifying a subjective quality
- This is the basic data/query model
  - more complex models exist, e.g., assigning types to pages
Web search before Google
- Web as a set of documents
- Relevance: content-based retrieval
  - documents match queries by their contents
  - q: ‘clinton’ ranks pages with more occurrences of ‘clinton’ higher
- Importance???
  - contents: what documents say about themselves
  - much spam and unreliable information in the results
- Directory services were used
  - Yahoo! was one of the leaders
- Google co-founders were told “nobody will use a keyword interface”
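To see why purely content-based ranking is easy to game, here is a minimal sketch that ranks documents by how often the query term occurs. The function name `rank_by_term_count` and the three toy documents are made up for illustration; real content-based retrieval uses more refined scoring.

```python
def rank_by_term_count(docs, term):
    """Rank document ids by how often `term` occurs in their text."""
    term = term.lower()
    counts = {doc_id: text.lower().split().count(term)
              for doc_id, text in docs.items()}
    # Highest count first: the page that repeats the term most wins.
    return sorted(counts, key=lambda doc_id: counts[doc_id], reverse=True)

# Hypothetical documents, invented for this example.
docs = {
    "news": "clinton visits ohio",
    "spam": "clinton clinton clinton buy now",
    "bio": "a biography of clinton by clinton scholars",
}

ranking = rank_by_term_count(docs, "clinton")
print(ranking)
```

The page that merely repeats the term comes out on top, which is the weakness that link-based importance is meant to address.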
Google: PageRank
- From the Stanford Digital Libraries project, 1996-98
- Tried to sell to Infoseek in 1997
- Paper: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
- Google founded in 1998 by Brin and Page
Web: Adjacency Matrix
- Web: G = (V, E)
  - V = {x, y, z}, |V| = n
  - E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
- A: n x n matrix, A_ij = 1 if page i links to page j, 0 if not
  - rows are source nodes, columns are target nodes, both in the order x, y, z

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
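The construction of A from the edge set can be sketched as follows. The helper name `adjacency_matrix` is an assumption made for this example; the nodes and edges are the ones listed on the slide.

```python
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"), ("z", "x"), ("z", "y")]

def adjacency_matrix(nodes, edges):
    """Return A with A[i][j] = 1 if node i links to node j, else 0."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for src, dst in edges:
        A[index[src]][index[dst]] = 1  # row = source, column = target
    return A

A = adjacency_matrix(nodes, edges)
for name, row in zip(nodes, A):
    print(name, row)
```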
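Transposed Adjacency Matrix
- Adjacency matrix A: what does row j represent?
  - row j in A: which nodes does node j link to?
- Transpose A^t:
  - row j in A^t: which nodes link to node j?

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\qquad
A^{t} = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]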
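A small sketch of the two row readings, using the three-node example matrix. The helpers `transpose` and `neighbors` are illustrative names, not part of the lecture.

```python
nodes = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(M):
    """Swap rows and columns of a square 0/1 matrix."""
    return [list(col) for col in zip(*M)]

def neighbors(matrix, nodes, name):
    """Names of the nodes flagged with 1 in the row that belongs to `name`."""
    row = matrix[nodes.index(name)]
    return [nodes[j] for j, bit in enumerate(row) if bit]

At = transpose(A)
out_links_of_y = neighbors(A, nodes, "y")   # row of A: pages y links to
in_links_of_y = neighbors(At, nodes, "y")   # row of A^t: pages linking to y
print(out_links_of_y, in_links_of_y)
```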
PageRank: importance of pages
- PageRank (or importance) is defined recursively
  - a page P is important if important pages link to it
  - importance of P: contributed proportionally by the pages that link to it (its back links)
- Example (graph: x links to x and z; y links to z; z links to x and y):
  - v_x = 1/2 v_x + 1/2 v_z
  - v_y = 1/2 v_z
  - v_z = 1/2 v_x + 1 v_y
- Random-surfer interpretation:
  - the surfer randomly follows links to navigate
  - PageRank = the probability that the surfer will visit the page
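The coefficients in the example equations come from each page splitting its importance evenly among the pages it links to. A minimal sketch, assuming the graph implied by those equations (x links to x and z, y links to z, z links to x and y); the name `transition_matrix` is illustrative:

```python
from fractions import Fraction

nodes = ["x", "y", "z"]
# Out-links read off the example equations (an assumption of this sketch).
out_links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def transition_matrix(nodes, out_links):
    """M[i][j] = share of node j's importance that flows to node i."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in out_links.items():
        share = Fraction(1, len(targets))  # even split over out-links
        for dst in targets:
            M[index[dst]][index[src]] = share
    return M

M = transition_matrix(nodes, out_links)
for name, row in zip(nodes, M):
    print(name, [str(entry) for entry in row])
```

Row x of M holds the coefficients of the equation for v_x, and each column sums to 1.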
Computing PageRank
- Importance-propagation equation: v = M v

\[
v = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} v
\]

- Transition matrix M: linked-from (A^t) or links-to (A) matrix?
  - linked-from matrix, since importance flows from the “source” nodes to the target nodes
  - column-normalized: column x is all that x points to
  - sum of column = 1, since the source node distributes its importance to all the nodes that it links to
- Computation: by relaxation

  v    iter 1   iter 2   iter 3   …   fixpoint
  x    1        1        5/4      …   6/5
  y    1        1/2      3/4      …   3/5
  z    1        3/2      1        …   6/5
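The relaxation can be sketched as repeated multiplication by the transition matrix, starting from all ones. The names `step` and `relax` are illustrative; exact fractions are used so the values can be compared directly with the iteration table.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-normalized linked-from matrix of the example, rows/columns in order x, y, z.
M = [[half, 0, half],
     [0, 0, half],
     [half, 1, 0]]

def step(M, v):
    """One relaxation step: return M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def relax(M, v, iterations):
    """Apply `step` repeatedly and keep every intermediate vector."""
    history = [v]
    for _ in range(iterations):
        v = step(M, v)
        history.append(v)
    return history

history = relax(M, [Fraction(1)] * 3, 2)
for v in history:
    print([str(value) for value in v])

# The fixpoint from the table is unchanged by a further step.
fixpoint = [Fraction(6, 5), Fraction(3, 5), Fraction(6, 5)]
print(step(M, fixpoint) == fixpoint)
```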
Problems: Dead Ends
- Dead ends:
  - a page without successors has nowhere to send its importance
  - eventually, what would happen to v?
- Example (a links to b; b has no out-links):
  - v_a = 0 v_a + 0 v_b
  - v_b = 1 v_a + 0 v_b
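A minimal sketch of the dead-end example, assuming the same update v = M v and a start vector of all ones; `step` is an illustrative helper. Importance that reaches b is never passed on, so it leaks out of the system.

```python
# Two-page example: a links to b, b has no out-links.
# Rows/columns in order a, b; the column for b is all zeros.
M_dead = [[0, 0],
          [1, 0]]

def step(M, v):
    """One relaxation step: return M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

v = [1, 1]
history = [v]
for _ in range(2):
    v = step(M_dead, v)
    history.append(v)

print(history)
```

After two steps every entry of v is 0, so no ranking information survives.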
Problems: Spider Trap
- Spider traps:
  - a group of pages without out-of-group links will trap a spider inside
  - what would happen to v?
- Example (a links to a and b; b links only to itself):
  - v_a = 1/2 v_a + 0 v_b
  - v_b = 1/2 v_a + 1 v_b
- Solutions??
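A minimal sketch of the spider-trap example, again assuming the update v = M v from a start vector of all ones; `step` is an illustrative helper. The total importance is preserved, but it all drains into the trap page b.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Two-page example: a links to a and b, b links only to itself.
# Rows/columns in order a, b.
M_trap = [[half, 0],
          [half, 1]]

def step(M, v):
    """One relaxation step: return M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

v = [Fraction(1), Fraction(1)]
for _ in range(10):
    v = step(M_trap, v)

print([str(value) for value in v])
```

Each step halves v_a and moves the other half to v_b, so v approaches (0, 2).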
Solutions: surfer’s random jump (teleportation)
- The surfer can randomly jump to a new page without following links

  v = d M v + (1 - d) e / n

  - M: transition matrix
  - e: a vector with all 1’s
  - n: number of nodes in the graph
  - d: damping factor (set to .85 in the paper)
  - (1 - d) e / n: models the probability of randomly jumping to this page
- Another interpretation: “tax” the importance of each page and distribute it to all pages
- Per-page form of the same equation:

  PR(A) = (1 - d) / n + d (PR(T1)/C(T1) + … + PR(Tk)/C(Tk))

  - PR(A): PageRank of page A
  - T1, …, Tk: pages that point to A
  - C(Ti): out-degree of Ti (number of out-links)
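A minimal sketch of the teleportation update v = d M v + (1 - d) e / n, applied to the two-page spider trap (a links to a and b, b links only to itself). The name `teleport_step` and the choice of 50 iterations are assumptions of this example; v is started at 1/n per page so that it sums to 1, which matches the e / n term.

```python
from fractions import Fraction

def teleport_step(M, v, d):
    """One update: d * (M v) plus an equal jump share (1 - d) / n for every page."""
    n = len(v)
    jump = (1 - d) / n
    return [d * sum(M[i][j] * v[j] for j in range(n)) + jump for i in range(n)]

half = Fraction(1, 2)
M_trap = [[half, 0],
          [half, 1]]          # rows/columns in order a, b
d = Fraction(17, 20)          # damping factor 0.85

v = [Fraction(1, 2), Fraction(1, 2)]
for _ in range(50):
    v = teleport_step(M_trap, v, d)

# Solving v = d M v + (1 - d) e / n by hand for this graph gives (3/23, 20/23).
fixpoint = [Fraction(3, 23), Fraction(20, 23)]
print([round(float(value), 4) for value in v])
```

With the random jump, page a keeps a nonzero share of importance instead of being drained completely by the trap.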
Anti-Spamming
- Spamming:
  - an attempt to create artifacts that “please” search engines so that ranking will be high
  - e.g., commercial “search engine optimization” services
- Google anti-spam device:
  - unlike other search engines, tends to believe what others say about you, through links and anchor texts
  - recursive importance also helps: importance (not just links) propagates
- Still, not a perfect solution
What you should know
- Web data and query model
- PageRank formula and algorithm
- Dead ends and spider traps
- Teleportation