CS 440 Database Management Systems


CS 440 Database Management Systems Graph Data & PageRank

How is the Web different from a database of documents?
- Hypertext vs. text: many additional clues
  - graph vs. set
  - anchor text vs. text: what do others say about you?
- Geographically distributed vs. centralized, so you need to build a crawler
- Precision is valued more than recall
  - quality is more important than quantity, especially for "broad" queries
- Spamming
- Hoaxes
- and more …

Web data and query
- Data model: directed graph
  - nodes: Web pages
  - links: hyperlinks
  - all nodes belong to the same type
- Query: a set of terms
- Answer: ranked list of relevant and important pages
  - quantifying a subjective quality
- This is the basic data/query model
  - more complex models exist, e.g., assigning types to pages

Web search before Google
- Web as a set of documents
- Relevance: content-based retrieval
  - documents match queries by content
  - q: 'clinton' → pages with more occurrences of 'clinton' rank higher
- Importance???
  - content: what documents say about themselves
  - results contained a lot of spam and unreliable information
- Directory services were used
  - Yahoo! was one of the leaders
- Google co-founders were told "nobody will use a keyword interface"

Google: PageRank
- From the Stanford Digital Libraries project, 1996-98
- Published the paper in 1998: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
- Tried to sell to Infoseek in 1997
- Google founded in 1998 by Brin and Page

Web: Adjacency Matrix
- Web: G = (V, E)
  - V = {x, y, z}, |V| = n
  - E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
- A: n x n matrix with A_ij = 1 if page i links to page j, 0 if not

                   target node
                   x  y  z
  source node  x   1  1  1
               y   0  0  1
               z   1  1  0
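As a rough illustration of the definition above, the following sketch builds A from the edge set E on this slide. The node order (x, y, z) and the helper name adjacency_matrix are assumptions made for the example, not part of the slides.

```python
# Build the adjacency matrix for the example graph on this slide.
# Assumption: nodes are indexed in the order x, y, z.
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"), ("z", "x"), ("z", "y")]

def adjacency_matrix(nodes, edges):
    """Return A with A[i][j] = 1 if node i links to node j, else 0."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for src, dst in edges:
        A[index[src]][index[dst]] = 1  # row = source, column = target
    return A

A = adjacency_matrix(nodes, edges)
for name, row in zip(nodes, A):
    print(name, row)
```

A dense list of lists is fine for three nodes; for a graph the size of the Web, a sparse representation (for example, a per-node list of out-links) would be the more realistic choice.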

Transposed Adjacency Matrix
- Adjacency matrix A: what does row j represent?
  - row j in A: which nodes does node j link to?
- Transpose A^T:
  - row j in A^T: which nodes link to node j?

  (rows and columns in the order x, y, z)

        1 1 1            1 0 1
  A  =  0 0 1     A^T =  1 0 1
        1 1 0            1 1 0
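The two readings of a row can be checked with a small sketch. The matrix is the one from this slide; the helper names links_from and links_to are hypothetical and chosen only for this example.

```python
# A is the adjacency matrix from the slide; node order x, y, z is assumed.
nodes = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

At = transpose(A)

def links_from(j):
    """Row j of A: the nodes that node j links to."""
    return [nodes[k] for k, bit in enumerate(A[j]) if bit]

def links_to(j):
    """Row j of A^T: the nodes that link to node j."""
    return [nodes[k] for k, bit in enumerate(At[j]) if bit]

print(links_from(1))  # out-links of y
print(links_to(1))    # in-links of y
```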

PageRank: importance of pages
- PageRank (or importance) is defined recursively: a page P is important if important pages link to it
  - importance of P: contributed proportionally by the pages that link to it (its back-links)
- Example:
    v_x = 1/2 v_x + 1/2 v_z
    v_y = 1/2 v_z
    v_z = 1/2 v_x + 1 v_y
- Random-surfer interpretation
  - surfer randomly follows links to navigate
  - PageRank = the probability that the surfer will visit the page
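The random-surfer interpretation can be tried out with a short simulation. This is only a sketch: the out-link lists below are inferred from the coefficients of the example equations (x splits its importance between x and z, y sends everything to z, z splits between x and y), and the start page, step count, and seed are arbitrary assumptions.

```python
import random

# Out-links inferred from the example equations on this slide (an assumption).
out_links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def simulate(out_links, steps, seed=0):
    """Follow random links for `steps` steps; return visit frequencies."""
    rng = random.Random(seed)
    page = "x"  # arbitrary start page
    visits = {p: 0 for p in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])  # pick one out-link uniformly
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate(out_links, 200_000)
print(freq)
```

The frequencies should settle near 2/5, 1/5, 2/5 for x, y, z, which are the solutions of the three equations scaled to sum to 1.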

Computing PageRank
- Importance-propagation equation: v = M v
- Linked-from (A^T) or links-to (A) matrix?
  - linked-from matrix: since importance flows from the "source" nodes to the target nodes
- Column-normalized: column x is all that x points to
  - sum of column = 1, since the source node distributes its importance to all the nodes that it links to
- Transition matrix:

        1/2   0   1/2
  v  =   0    0   1/2   v
        1/2   1    0

- Computation: by relaxation

  iteration   1    2     3     …   fixpoint
  v_x         1    1     5/4   …   6/5
  v_y         1    1/2   3/4   …   3/5
  v_z         1    3/2   1     …   6/5
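The relaxation on this slide can be reproduced with exact fractions. This is a minimal sketch, assuming the start vector (1, 1, 1) shown in the first column of the iteration table; the function name step and the iteration count of 60 are choices made for the example.

```python
from fractions import Fraction as F

# Column-normalized transition matrix from the slide (order x, y, z).
M = [[F(1, 2), F(0), F(1, 2)],
     [F(0),    F(0), F(1, 2)],
     [F(1, 2), F(1), F(0)]]

def step(M, v):
    """One relaxation step: return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [F(1), F(1), F(1)]   # iteration 1 in the slide's table
history = [v]
for _ in range(60):
    v = step(M, v)
    history.append(v)

print([str(x) for x in history[1]])   # iteration 2
print([str(x) for x in history[2]])   # iteration 3
print([float(x) for x in v])          # close to the fixpoint
```

Because every column of M sums to 1, each step preserves the total importance (3 here), which is why the fixpoint 6/5, 3/5, 6/5 also sums to 3.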

Problems: Dead Ends
- Dead ends: a page without successors has nowhere to send its importance
  - eventually, what would happen to v?
- Example (a links to b; b has no out-links):
    v_a = 0 v_a + 0 v_b
    v_b = 1 v_a + 0 v_b
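One way to explore the question on this slide is to iterate the two example equations directly. The sketch below assumes a start vector of (1, 1), matching the convention of the earlier relaxation example; the matrix is read off the coefficients of the example.

```python
# Transition matrix for the dead-end example (order a, b).
# Column b is all zeros because b has no out-links.
M = [[0, 0],
     [1, 0]]

def step(M, v):
    """One relaxation step: return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [1, 1]          # assumed start vector
history = [v]
for _ in range(2):
    v = step(M, v)
    history.append(v)

for i, vec in enumerate(history, start=1):
    print(i, vec)
```

In this run the importance held by b is dropped at every step, so all importance leaks out of the system and v becomes the zero vector.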

Problems: Spider Trap
- Spider traps: a group of pages without out-of-group links will trap a spider inside
  - what would happen to v?
- Example (a links to itself and to b; b links only to itself):
    v_a = 1/2 v_a + 0 v_b
    v_b = 1/2 v_a + 1 v_b
- Solutions??
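The spider-trap example can be iterated the same way. This sketch again assumes a start vector of (1, 1) and uses exact fractions; the matrix is read off the coefficients of the example equations.

```python
from fractions import Fraction as F

# Transition matrix for the spider-trap example (order a, b).
# b only links to itself, so column b keeps all of b's importance.
M = [[F(1, 2), F(0)],
     [F(1, 2), F(1)]]

def step(M, v):
    """One relaxation step: return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [F(1), F(1)]    # assumed start vector
for _ in range(50):
    v = step(M, v)

print(float(v[0]), float(v[1]))
```

The total importance is preserved, but a's share halves at every step, so essentially all of it ends up trapped in b.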

Solutions: surfer's random jump (teleportation)
- Surfer can randomly jump to a new page without following links
- v = d M v + (1-d) e / n
  - M: transition matrix; e: a vector of all 1's; n: number of nodes in the graph
  - d: damping factor (set to .85 in the paper); 1-d models the probability of randomly jumping to a page
  - another interpretation: "tax" the importance of each page and distribute it to all pages
- Per-page form as written in the paper: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  - PR(A): PageRank of page A
  - T1, ..., Tn: pages that point to A; C(Ti): out-degree of Ti (# of out-links)
  - the paper's form has no 1/n factor, so its values are those of v scaled by n
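The teleportation equation v = d M v + (1-d) e / n can be iterated directly. The sketch below applies it to the two-node spider-trap example (a links to itself and b, b links only to itself); the function name pagerank, the uniform start vector, and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(M, d=0.85, iterations=100):
    """Iterate v = d M v + (1 - d) e / n from a uniform start vector."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [d * sum(m * x for m, x in zip(row, v)) + (1 - d) / n
             for row in M]
    return v

# Column-normalized transition matrix of the spider-trap example (order a, b).
M = [[0.5, 0.0],
     [0.5, 1.0]]

v_a, v_b = pagerank(M)
print(v_a, v_b)
```

With the random jump, page a keeps a nonzero share (0.075 / 0.575 of the total for d = 0.85) instead of decaying to zero. A graph with dead ends would still lose importance under this iteration unless the dead-end columns are handled separately.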

Anti-Spamming
- Spamming: attempts to create artifacts that "please" search engines so that ranking will be high
  - e.g., commercial "search engine optimization" services
- Google's anti-spam device:
  - unlike other search engines, it tends to believe what others say about you, through links and anchor text
  - recursive importance also helps: importance (not just links) propagates
- Still, not a perfect solution

What you should know
- Web data and query model
- PageRank formula and algorithm
- Dead ends and spider traps
- Teleportation