1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/2005 22 Nov.

Presentation transcript:

Slide 1: HEINZ NIXDORF INSTITUTE, University of Paderborn, Algorithms and Complexity. Search Algorithms, Winter Semester 2004/2005, 22 Nov 2004, 6th Lecture. Christian Schindelhauer

Slide 2: Chapter III, Searching the Web (22 Nov 2004)

Slide 3: Searching the Web
• Introduction
• The Anatomy of a Search Engine
• Google's PageRank algorithm
  – The simple algorithm
  – Periodicity and convergence
• Kleinberg's HITS algorithm
  – The algorithm
  – Convergence
• The Structure of the Web
  – Pareto distributions
  – Search in Pareto-distributed graphs

Slide 4: Overview of Search Engines (March 2002)
• Number of documents

Search Engine     Showdown Estimate (millions)   Claim (millions)
Google            968                            1,500
WiseNut           579                            1,500
AllTheWeb         […]                            […]
Northern Light    […]                            […]
AltaVista         […]                            […]
Hotbot            […]                            […]
MSN Search        […]                            […]

Slide 5: Overview of Search Engines (Dec. 2002)
• Number of documents

Search Engine     Showdown Estimate (millions)   Claim (millions)
Google            3,033                          3,083
AlltheWeb         2,106                          2,116
AltaVista         1,689                          1,000
WiseNut           1,453                          1,500
Hotbot            1,147                          3,000
MSN Search        1,018                          3,000
Teoma             1,[…]                          […]
NLResearch        […]                            […]
Gigablast         275                            150

Slide 6: Problems of Searching the Web
• Currently (Nov 2004) there are more than 8 billion web pages
  – […] words cover more than 95% of each text
  – There are many more web pages than words
  – Users hardly ever look through more than 40 results
• The problem is not to find a pattern, but to find the most important pages
• Problems:
  – Important pages do not contain the search pattern
      www.porsche.com does not contain "sports car" or even "car"
      www.google.com does not contain "web search engine"
      www.airbus.com does not contain "airplane"
  – Certain pages contain nearly every word (dictionaries)
  – Names are misleading
      http://[…] is not the web site of the White House
      www.theonion.com is not about vegetables
  – Certain patterns can be found everywhere, e.g. page, web, windows, ...

Slide 7: How to Rank Web Pages
• The main problem in searching the web is ranking pages by importance
• Links are very helpful:
  – Links are usually introduced by humans on purpose
  – The context of a link gives some clues about the meaning of the web page
  – Pages that many people link to are probably very important
  – Most search engines rely on links
• Other approach: ontology of words
  – Compare the combination of words with the search word
  – Good for comparing texts
  – Difficult if single-word patterns are given

Slide 8: The Anatomy of a Web Search Engine
• "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems, Vol. 30, 1-6, p. […], 1998
• Design of the prototype
  – Stanford University 1998
• Key components:
  – Web crawler
  – Indexer
  – PageRank
  – Searcher
• Main difference between Google and other search engines (in 1998)
  – The PageRank mechanism

Slide 9: Simplified PageRank Algorithm
• Simplified PageRank algorithm
  – Rank of a web page: R(u) ∈ [0,1]
  – Important pages hand their rank down to the pages they link to
  – c is a normalisation factor such that ||R||_1 = 1, i.e. all page ranks sum to 1
  – Predecessor nodes of u: B_u
  – Successor nodes of u: F_u
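The update rule itself did not survive in the transcript. The following is a minimal sketch, assuming the usual simplified rule R(u) = c · Σ_{v ∈ B_u} R(v) / |F_v|; the function name, the dictionary layout and the three-page graph are hypothetical and only illustrate the definitions above.

```python
def simplified_pagerank(links, rounds=50):
    """links maps each page v to the list of pages it links to (F_v)."""
    pages = sorted(links)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(rounds):
        new = {u: 0.0 for u in pages}
        for v in pages:
            for u in links[v]:
                # v hands an equal share of its rank to each successor
                new[u] += rank[v] / len(links[v])
        total = sum(new.values())
        # the normalisation factor c rescales the ranks to sum to 1
        rank = {u: r / total for u, r in new.items()}
    return rank

# Hypothetical example graph: a -> b, a -> c, b -> c, c -> a
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(example_links)
```

On this small graph the ranks settle near 0.4, 0.2 and 0.4 for a, b and c; page b is ranked lowest because it only receives half of a's rank.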

Slide 10: The Simplified PageRank Algorithm and an Example

Slide 11: Matrix Representation
• R ← c M R, where R is the vector (R(1), R(2), ..., R(n)) and M denotes an n × n matrix (the matrix itself is not preserved in this transcript)
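A sketch of one plausible form of this matrix, assuming M[u][v] = 1/|F_v| if page v links to page u and 0 otherwise, which matches the column sums required of a stochastic matrix later in the lecture. The helper names and the example graph are hypothetical.

```python
def link_matrix(links, pages):
    """Build M with M[u][v] = 1/|F_v| if v links to u, else 0 (assumed form)."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for v, successors in links.items():
        for u in successors:
            M[index[u]][index[v]] = 1.0 / len(successors)
    return M

def mat_vec(M, x):
    """Plain matrix-vector product M x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
M = link_matrix(example_links, ["a", "b", "c"])
fixed = mat_vec(M, [0.4, 0.2, 0.4])  # one step R <- M R with c = 1
```

For this graph the vector (0.4, 0.2, 0.4) is reproduced by one step, so it is a fixed point of the recursion.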

Slide 12: The Simplified PageRank Algorithm
• Does it converge?
• If it converges, does it converge to a single result?
• Is the result reasonable?

Slide 13: The Eigenvector and Eigenvalue of the Matrix
• For a non-zero vector x, an n × n matrix M and a number λ:
  – If M x = λ x, then x is called an eigenvector and λ an eigenvalue of M
• Every n × n matrix M has at most n distinct eigenvalues
• Compute the eigenvalues by eigendecomposition: M x = λ x ⇔ (M - λ I) x = 0, where I is the identity matrix
  – This equation has non-trivial solutions only if det(M - λ I) = 0
  – This leads to a polynomial equation of degree n, which always has n complex solutions λ_1, λ_2, ..., λ_n, counted with multiplicity (fundamental theorem of algebra)
  – Solving the linear equations (M - λ_i I) x = 0 yields the eigenvectors
• An eigenvector of the matrix is a fixed point of the recursion of the simplified PageRank algorithm (up to the normalisation factor c)
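For a 2 × 2 matrix the characteristic equation det(M - λ I) = 0 is the quadratic λ² - tr(M) λ + det(M) = 0, so the recipe above can be followed by hand. The sketch below does this for a hypothetical 2 × 2 matrix chosen only for illustration.

```python
import cmath

def eigenvalues_2x2(M):
    """Roots of lambda^2 - trace*lambda + det = 0 for a 2x2 matrix."""
    (a, b), (c, d) = M
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

M = [[0.5, 0.25],
     [0.5, 0.75]]          # hypothetical example; each column sums to 1
lam1, lam2 = eigenvalues_2x2(M)

# Solving (M - 1*I) x = 0 by hand gives x proportional to (1, 2).
x = [1 / 3, 2 / 3]
Mx = [M[0][0] * x[0] + M[0][1] * x[1], M[1][0] * x[0] + M[1][1] * x[1]]
```

Here the eigenvalues are 1 and 0.25, and M x equals x for the eigenvector belonging to eigenvalue 1.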

Slide 14: Stochastic Matrices
• Consider n discrete states and a sequence of random variables X_1, X_2, ... over this set of states
• The sequence X_1, X_2, ... is a Markov chain if P[X_{t+1} = j | X_1 = i_1, ..., X_t = i_t] = P[X_{t+1} = j | X_t = i_t], i.e. the next state depends only on the current state
• A stochastic matrix M is the transition matrix of a finite Markov chain, also called a Markov matrix:
  – The elements of M must be real numbers in [0, 1]
  – Each column of M sums to 1
• Observation for the matrix M of the simplified PageRank algorithm
  – M is stochastic if all nodes have at least one outgoing link
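The two conditions on a stochastic matrix can be checked directly. This is a small sketch with a hypothetical helper name and tolerance; the second example matrix has a zero column, which is what a page without outgoing links produces.

```python
def is_column_stochastic(M, tol=1e-12):
    """True if all entries lie in [0, 1] and every column sums to 1."""
    n = len(M)
    entries_ok = all(0.0 <= M[i][j] <= 1.0 for i in range(n) for j in range(n))
    columns_ok = all(
        abs(sum(M[i][j] for i in range(n)) - 1.0) <= tol for j in range(n)
    )
    return entries_ok and columns_ok

with_links = [[0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]
with_sink = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]  # third page has no links
```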

Slide 15: The Random Surfer
• Consider the following algorithm
  – Start at a random web page chosen according to a probability distribution
  – Repeat the following for t rounds:
      If there is no link on this page, exit and produce no output
      Choose a link of the web page uniformly at random
      Follow that link and go to that web page
  – Output the web page
Lemma: The probability that web page i is output by the random surfer after t rounds, started with the probability distribution x_1, ..., x_n, is given by the i-th entry of the output of the simplified PageRank algorithm iterated for t rounds without normalisation.
Proof: follows by applying the definition of Markov chains.
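The lemma can be checked on a small graph by enumerating every path the surfer can take and comparing the result with t unnormalised iteration steps. This is a sketch under stated assumptions: the graph (including the sink d), the start distribution and the function names are hypothetical.

```python
def surfer_by_paths(links, start, t):
    """Exact output distribution, summing over all surfer paths of t steps."""
    out = {p: 0.0 for p in links}

    def walk(page, prob, rounds_left):
        if rounds_left == 0:
            out[page] += prob
            return
        for target in links[page]:   # a page without links: exit, no output
            walk(target, prob / len(links[page]), rounds_left - 1)

    for page, prob in start.items():
        walk(page, prob, t)
    return out

def iterate_without_normalization(links, start, t):
    """t rounds of the simplified rank update, with c fixed to 1."""
    x = dict(start)
    for _ in range(t):
        new = {p: 0.0 for p in links}
        for v in links:
            for u in links[v]:
                new[u] += x[v] / len(links[v])
        x = new
    return x

links = {"a": ["b", "d"], "b": ["c"], "c": ["a"], "d": []}   # d is a sink
start = {p: 0.25 for p in links}
by_paths = surfer_by_paths(links, start, 4)
by_iteration = iterate_without_normalization(links, start, 4)
```

Both computations agree entry by entry, and the entries sum to less than 1 because the probability of runs that end in the sink produces no output.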

Slide 16: Eigenvalues of Stochastic Matrices
• Notation
  – The L1-norm of a vector x is defined as ||x||_1 = Σ_i |x_i|
  – x ≥ 0 if for all i: x_i ≥ 0
  – x ≤ 0 if for all i: x_i ≤ 0
• Lemma: For every stochastic matrix M and every vector x we have
  – ||M x||_1 ≤ ||x||_1
  – ||M x||_1 = ||x||_1 if x ≥ 0 or x ≤ 0
• Eigenvalues of M: |λ_i| ≤ 1
• Theorem: For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that x ≥ 0 and ||x||_1 = 1
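A numerical spot check of the lemma, not a proof: for a hypothetical 2 × 2 stochastic matrix, a non-negative vector keeps its L1-norm, while a vector with mixed signs can lose norm because positive and negative entries cancel.

```python
def l1_norm(x):
    """Sum of the absolute values of the entries."""
    return sum(abs(v) for v in x)

def mat_vec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

M = [[0.5, 0.25],
     [0.5, 0.75]]          # hypothetical example; each column sums to 1

nonneg = [0.2, 0.8]        # x >= 0
mixed = [1.0, -1.0]        # neither x >= 0 nor x <= 0

norm_nonneg_after = l1_norm(mat_vec(M, nonneg))
norm_mixed_after = l1_norm(mat_vec(M, mixed))
```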

Slide 17: The Problem of Periodicity, Example

Slide 18: Periodicity, Example 2
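The example figures of these two slides are not preserved. As a stand-in, and not a reconstruction of the original examples, the sketch below uses the smallest periodic case: two pages that link only to each other.

```python
def mat_vec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

# Hypothetical two-page graph: page 1 links to page 2 and page 2 links to page 1.
M = [[0.0, 1.0],
     [1.0, 0.0]]

x = [1.0, 0.0]             # all rank starts on page 1
history = [x]
for _ in range(4):
    x = mat_vec(M, x)
    history.append(x)
```

The rank keeps jumping between the two pages, so the iteration never settles for this start vector, although the uniform vector (0.5, 0.5) is a fixed point.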

Slide 19: Periodic Matrices
• Definition
  – A square matrix M such that the matrix power M^(k+1) = M for a positive integer k is called a periodic matrix
  – If k is the least such integer, then the matrix is said to have period k
  – If k = 1, then M^2 = M and M is called idempotent
• Fact
  – For periodic matrices with period at least 2 there are vectors x such that the sequence M^k x does not converge as k → ∞
• Definition
  – The directed graph G = (V, E) of an n × n matrix M consists of the node set V = {1, ..., n} and has the edges E = {(i, j) | M_ij ≠ 0}
  – A path is a sequence of edges (u_1, u_2), (u_2, u_3), (u_3, u_4), ..., (u_t, u_{t+1}) of a graph
  – A graph cycle is a path where the start node is the end node
  – A strongly connected subgraph S is a maximal subgraph such that every graph cycle starting and ending in a node of S is contained in S

Slide 20: Necessary and Sufficient Conditions for Periodicity
• Theorem (necessary condition)
  – If the stochastic matrix M is periodic with period t ≥ 2, then the graph G of M has a strongly connected subgraph S of at least two nodes such that every directed graph cycle within S has a length of the form i · t for a natural number i
• Theorem (sufficient condition)
  – Let the graph consist of one strongly connected subgraph, and
  – let L_1, L_2, ..., L_m be the lengths of all directed graph cycles of length at most n
  – Then M is non-periodic if and only if gcd(L_1, L_2, ..., L_m) = 1
• Notation:
  – gcd(L_1, L_2, ..., L_m) = greatest common divisor of the numbers L_1, L_2, ..., L_m
• Corollary
  – If the graph is strongly connected and there exists a graph cycle of length 1 (i.e. a loop), then M is non-periodic
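The gcd in the sufficient condition can be computed without listing the cycles. The sketch below swaps in a standard breadth-first-search shortcut rather than the enumeration suggested by the theorem: in a strongly connected graph, the gcd of all cycle lengths equals the gcd of dist(u) + 1 - dist(v) over all edges (u, v), where dist is the BFS distance from a fixed start node. The function name is hypothetical and the input is assumed to be strongly connected.

```python
from collections import deque
from math import gcd

def cycle_length_gcd(M):
    """gcd of all cycle lengths in the graph of M (assumed strongly connected)."""
    n = len(M)
    # Edge (i, j) exists when M[i][j] != 0, following the graph definition.
    succ = [[j for j in range(n) if M[i][j] != 0] for i in range(n)]
    dist = [None] * n
    dist[0] = 0
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    g = 0
    for u in range(n):
        for v in succ[u]:
            g = gcd(g, dist[u] + 1 - dist[v])
    return g

two_cycle = [[0.0, 1.0], [1.0, 0.0]]                                 # only cycles of length 2
mixed_cycles = [[0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]  # cycles of length 2 and 3
with_loop = [[0.5, 1.0], [0.5, 0.0]]                                 # contains a loop
```

A result of 1 corresponds to the non-periodic case of the theorem; the loop example illustrates the corollary.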

Slide 21: Disadvantages of the Simplified PageRank Algorithm
• The web graph has sinks, i.e. pages without links
  ⇒ M is not a stochastic matrix
• The web graph is periodic
  ⇒ Convergence is uncertain
• The web graph is not strongly connected
  ⇒ Several convergence vectors are possible
• Rank sinks
  – Strongly connected subgraphs absorb all the weight of their predecessors
  – All predecessors pointing to a web page lose their weight

Slide 22: The (Non-Simplified) PageRank Algorithm
• Add to each sink links to all web pages
• Uniformly and randomly choose a web page
  – With some probability q < 1, perform a step of the simplified PageRank algorithm
  – With probability 1 - q, restart from the first step (and choose a random web page)
• Note: M is stochastic
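The two modifications can be sketched as follows. The value q = 0.85, the number of rounds, the function name and the example graph are assumptions for illustration; the slide only requires q < 1.

```python
def pagerank(links, q=0.85, rounds=100):
    """PageRank with sink repair and random restart (q is an assumed value)."""
    pages = sorted(links)
    n = len(pages)
    # A sink gets links to all web pages.
    out = {v: (links[v] if links[v] else pages) for v in pages}
    rank = {u: 1.0 / n for u in pages}
    for _ in range(rounds):
        # With probability 1 - q the surfer restarts at a uniformly random page.
        new = {u: (1.0 - q) / n for u in pages}
        # With probability q the surfer follows a random link of the current page.
        for v in pages:
            share = q * rank[v] / len(out[v])
            for u in out[v]:
                new[u] += share
        rank = new
    return rank

# Hypothetical graph: a and b link to each other, c is a sink.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": []})
```

No explicit normalisation is needed here: the repaired matrix is stochastic, so the ranks keep summing to 1 in every round.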

Slide 23: Properties of the PageRank Algorithm
• The graph of the matrix is strongly connected
• There are graph cycles of length 1
Theorem: For non-periodic matrices of strongly connected graphs, the Markov chain converges to a unique eigenvector with eigenvalue 1.
• PageRank converges to this unique eigenvector

Slide 24: Thanks for your attention
End of 6th lecture
Next lecture: Mon 29 Nov 2004, […] am, FU 116
Next exercise class: Mon 22 Nov 2004, 1.15 pm, F0.530, or Wed 24 Nov 2004, 1.00 pm, E2.316