
1 Search Algorithms, Winter Semester 2004/2005
HEINZ NIXDORF INSTITUTE, University of Paderborn, Algorithms and Complexity
8th Lecture, 06 Dec 2004
Christian Schindelhauer, schindel@upb.de

2 Chapter III
Organization, 06 Dec 2004
Mid Term Exam

3 Mid Term Exam
• Wednesday, 8 Dec 2004, 1pm-1.45pm, F1.110
• 4 parts
  – 1. Short questions testing general understanding
  – 2.-4. Show that you understand
      Text search algorithms
      Searching in compressed text
      The Pagerank algorithm
• If you have successfully presented an exercise:
  – Only the best 3 of 4 parts count
• If you fail, or if you receive a bad grade
  – then the oral exam at the end of the semester will cover the complete lecture
• If you are happy with your grade
  – this grade counts for half of the complete lecture
  – if you succeed in the oral exam (over the rest of the lecture)

4 Chapter III
Searching the Web, 06 Dec 2004

5 Searching the Web
• Introduction
• The Anatomy of a Search Engine
• Google's Pagerank algorithm
  – The Simple Algorithm
  – Periodicity and convergence
• Kleinberg's HITS algorithm
  – The algorithm
  – Convergence
• The Structure of the Web
  – Pareto distributions
  – Search in Pareto-distributed graphs

6 Simplified PageRank Algorithm
• Simplified PageRank algorithm
  – Rank of a web page: $R(u) \in [0,1]$
  – Important pages hand their rank down to the pages they link to:
    $R(u) = c \sum_{v \in B_u} \frac{R(v)}{|F_v|}$
  – $c$ is a normalisation factor such that $\|R\|_1 = 1$, i.e. the page ranks sum to 1
  – Predecessor nodes: $B_u$
  – Successor nodes: $F_u$

7 The Simplified Pagerank Algorithm and an Example

8 Matrix Representation
$R \leftarrow c\,M\,R$, where $R$ is the vector $(R(1), R(2), \ldots, R(n))$ and $M$ denotes the following $n \times n$ matrix:
$M_{uv} = 1/|F_v|$ if $v \in B_u$, and $M_{uv} = 0$ otherwise.
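The update $R \leftarrow c\,M\,R$ can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `simplified_pagerank`, the dictionary representation of the link graph, and the three-page example are assumptions made here, and the sketch requires every page to have at least one outgoing link.

```python
def simplified_pagerank(links, rounds=50):
    """Iterate R <- c * M * R on a link graph.

    links maps every page to the list of pages it links to (its successors F_v).
    Assumption of this sketch: every page has at least one outgoing link.
    """
    pages = sorted(links)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(rounds):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            # Page v hands an equal share of its rank to each successor.
            share = rank[v] / len(links[v])
            for u in links[v]:
                new_rank[u] += share
        # The factor c rescales the vector so that the ranks sum to 1.
        total = sum(new_rank.values())
        rank = {u: r / total for u, r in new_rank.items()}
    return rank


# Hypothetical example graph: a -> b, a -> c, b -> c, c -> a.
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(example)
print(ranks)
```

On this small graph the iteration settles because the graph is strongly connected and non-periodic; for other graphs convergence is not guaranteed.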

9 Stochastic Matrices
• Consider n discrete states and a sequence of random variables $X_1, X_2, \ldots$ over this set of states
• The sequence $X_1, X_2, \ldots$ is a Markov chain if
  $P[X_{t+1} = j \mid X_t = i_t, \ldots, X_1 = i_1] = P[X_{t+1} = j \mid X_t = i_t]$
• A stochastic matrix M is the transition matrix of a finite Markov chain, also called a Markov matrix:
  – The elements of M must be real numbers in [0, 1]
  – Every column of M sums to 1
• Observation for the matrix M of the simplified Pagerank algorithm
  – M is stochastic if all nodes have at least one outgoing link

10 The Random Surfer for Simplified Pagerank
• Consider the following algorithm
  – Start at a random web page chosen according to a probability distribution
  – Repeat the following for t rounds
      If there is no link on this page, exit and produce no output
      Choose a link of the web page uniformly at random
      Follow that link and go to this web page
  – Output the web page
Lemma: The probability that a web page i is output by the random surfer after t rounds, started with probability distribution $x_1, \ldots, x_n$, is described by the i-th entry of the output of the simplified Pagerank algorithm iterated for t rounds without normalization.
Proof: follows by applying the definition of Markov chains.
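The lemma can be checked empirically with a small simulation. The sketch below is an illustration under assumptions made here: the function names, the three-page graph with a sink `d`, the seed, and the sample size are all hypothetical choices. It compares the surfer's output frequencies with the unnormalized iteration.

```python
import random


def iterate_unnormalized(links, start, rounds):
    """Apply the simplified Pagerank step `rounds` times without normalization."""
    dist = dict(start)
    for _ in range(rounds):
        nxt = {page: 0.0 for page in links}
        for page, prob in dist.items():
            for target in links[page]:  # a page without links passes nothing on
                nxt[target] += prob / len(links[page])
        dist = nxt
    return dist


def random_surfer(links, start, rounds, rng):
    """Return the page reached after `rounds` steps, or None if the surfer exits."""
    pages = list(start)
    page = rng.choices(pages, weights=[start[p] for p in pages])[0]
    for _ in range(rounds):
        if not links[page]:
            return None  # no link on this page: exit without output
        page = rng.choice(links[page])
    return page


# Hypothetical graph: a -> b, a -> d, b -> a, and d is a sink.
links = {"a": ["b", "d"], "b": ["a"], "d": []}
start = {"a": 1 / 3, "b": 1 / 3, "d": 1 / 3}
rounds, samples = 3, 20000
rng = random.Random(0)

exact = iterate_unnormalized(links, start, rounds)
counts = {page: 0 for page in links}
for _ in range(samples):
    result = random_surfer(links, start, rounds, rng)
    if result is not None:
        counts[result] += 1
empirical = {page: counts[page] / samples for page in links}
print(exact, empirical)
```

Because the graph contains a sink, the entries of `exact` sum to less than 1: the missing mass is the probability that the surfer exits without output.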

11 Eigenvalues of Stochastic Matrices
• Notation
  – The L1 norm of a vector x is defined as $\|x\|_1 = \sum_i |x_i|$
  – $x \ge 0$ if for all i: $x_i \ge 0$
  – $x \le 0$ if for all i: $x_i \le 0$
• Lemma: For every stochastic matrix M and every vector x we have
  – $\|Mx\|_1 \le \|x\|_1$
  – $\|Mx\|_1 = \|x\|_1$ if $x \ge 0$ or $x \le 0$
• Eigenvalues of M: $|\lambda_i| \le 1$
• Theorem: For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that $x \ge 0$ and $\|x\|_1 = 1$

12 Periodicity - Example

13 Necessary and Sufficient Conditions for Periodicity
• Theorem (necessary condition)
  – If the stochastic matrix M is periodic with periodicity $t \ge 2$, then for the graph G of M there exists a strongly connected subgraph S of at least two nodes such that every directed cycle within S has a length of the form $i \cdot t$ for a natural number i.
• Theorem (sufficient condition)
  – Let the graph consist of one strongly connected subgraph and
  – let $L_1, L_2, \ldots, L_m$ be the lengths of all directed cycles of length at most n
  – Then M is non-periodic if and only if $\gcd(L_1, L_2, \ldots, L_m) = 1$
• Notation:
  – $\gcd(L_1, L_2, \ldots, L_m)$ = greatest common divisor of the numbers $L_1, L_2, \ldots, L_m$
• Corollary
  – If the graph is strongly connected and there exists a cycle of length 1 (i.e. a loop), then M is non-periodic.
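The gcd of all cycle lengths can be computed without enumerating cycles. The sketch below does not follow the slide literally: it swaps in a standard breadth-first-search technique, in which the gcd of `dist[u] + 1 - dist[v]` over all edges `(u, v)` equals the gcd of all cycle lengths in a strongly connected graph. The function name `period` and the example graphs are assumptions made here.

```python
from collections import deque
from math import gcd


def period(links):
    """Return the gcd of all directed cycle lengths.

    Assumption of this sketch: the graph is strongly connected, so every
    node is reachable from the start node.
    """
    start = next(iter(links))
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search from the start node
        u = queue.popleft()
        for v in links[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    g = 0
    for u in links:
        for v in links[u]:
            # Tree edges contribute 0; every other edge closes cycles.
            g = gcd(g, dist[u] + 1 - dist[v])
    return g


# Hypothetical examples: a directed 4-cycle, and the same cycle with a loop.
cycle = {1: [2], 2: [3], 3: [4], 4: [1]}
cycle_with_loop = {1: [1, 2], 2: [3], 3: [4], 4: [1]}
print(period(cycle), period(cycle_with_loop))
```

A result of 1 means the matrix is non-periodic, which matches the corollary for the graph with a loop.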

14 Disadvantages of the Simplified Pagerank Algorithm
• The Web graph has sinks, i.e. pages without links
  – M is not a stochastic matrix
• The Web graph is periodic
  – Convergence is uncertain
• The Web graph is not strongly connected
  – Several convergence vectors are possible
• Rank sinks
  – Strongly connected subgraphs absorb all weight of the predecessors
  – All predecessors pointing to a web page lose their weight

15 The (Non-Simplified) Pagerank Algorithm
• Add to a sink links to all web pages
• Uniformly and randomly choose a web page
  – With some probability q < 1 perform a step of the simplified Pagerank algorithm
  – With probability 1-q start with the first step (and choose a random web page)
• Note: M is stochastic

16 The Effect of Filling Up the Matrix
Example: $q = 0.96$, $n = 4$, start vector $x = (1,0,0,0)^T$
[Figure: graph on the nodes 1, 2, 3, 4 and the product $Mx$ for the resulting $4 \times 4$ matrix $M$ with entries 0.01 and 0.97]
Observation:
  – All entries of $M$ are at least $(1-q)/n$ (by definition; here 0.01)
  – All entries of $Mx$ are at least $\|x\|_1 (1-q)/n$ (here 0.01)
• Fact: For all vectors $x \ge 0$: $(Mx)_i \ge \|x\|_1 (1-q)/n$
  – sum over all rows
• Fact: For all vectors $x \le 0$: $(Mx)_i \le -\|x\|_1 (1-q)/n$
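A filled-up matrix of this kind can be built as in the following sketch. It assumes the usual reading of the algorithm, $M = q\,S + \frac{1-q}{n}\,J$, where $S$ is the simplified matrix with sink columns replaced by uniform columns and $J$ is the all-ones matrix. The function names and the 4-cycle example are hypothetical and need not match the slide's graph.

```python
def pagerank_matrix(links, q):
    """Build M with M[i][j] = q * S[i][j] + (1 - q) / n.

    S is the simplified Pagerank matrix after every sink has received
    links to all pages (an assumption spelled out in the text above).
    """
    pages = sorted(links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    matrix = [[(1.0 - q) / n for _ in range(n)] for _ in range(n)]
    for v in pages:
        targets = links[v] if links[v] else pages  # a sink links to all pages
        for u in targets:
            matrix[index[u]][index[v]] += q / len(targets)
    return pages, matrix


def pagerank(links, q, rounds=100):
    """Iterate x <- M x starting from the uniform distribution."""
    pages, m = pagerank_matrix(links, q)
    n = len(pages)
    rank = [1.0 / n] * n
    for _ in range(rounds):
        rank = [sum(m[i][j] * rank[j] for j in range(n)) for i in range(n)]
    return dict(zip(pages, rank))


# Hypothetical graph with n = 4: the directed cycle 1 -> 2 -> 3 -> 4 -> 1.
cycle = {1: [2], 2: [3], 3: [4], 4: [1]}
pages, m = pagerank_matrix(cycle, q=0.96)
for row in m:
    print([round(entry, 2) for entry in row])
print(pagerank(cycle, q=0.96))
```

With q = 0.96 and n = 4 every entry of this matrix is either (1-q)/n = 0.01 or q + (1-q)/n = 0.97, and every column sums to 1, so no normalization step is needed in the iteration.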

17 What Happens to Mixed Vectors
[Figure: the example matrix $M$ (entries 0.01 and 0.97) applied to a mixed vector $z = p + m$ with $p \ge 0$ and $m \le 0$, computed as $Mz = Mp + Mm$]
• $\|z\|_1 = 2$, $\|p\|_1 = 1$, $\|m\|_1 = 1$
• For all $i$: $(Mp)_i \ge 0.01$
• For all $i$: $(Mm)_i \le -0.01$
• $\|Mz\|_1 \le 2 - 4 \cdot 0.01 = 1.96$

18 Converging to the First Eigenvector
Observation:
  – Entries of $M$ are at least $(1-q)/n$
  – Entries of $Mx$ are at least $\|x\|_1 (1-q)/n$
• Fact:
  – For all vectors $z \ge 0$: $(Mz)_i \ge \|z\|_1 (1-q)/n$
  – For all vectors $z \le 0$: $(Mz)_i \le -\|z\|_1 (1-q)/n$
• Lemma
  – For $x \ge 0$ or $x \le 0$: $\|Mx\|_1 = \|x\|_1$
• Let $x$ be an eigenvector with eigenvalue 1
• For an arbitrary vector $y \ge 0$ let $z = x - y$
  – Decompose $z = m + p$,
  – where $p \ge 0$ and $m \le 0$ and $\|p\|_1 + \|m\|_1 = \|z\|_1$
$$\begin{aligned}
\|x - My\|_1 &= \|M(x-y)\|_1 = \|Mz\|_1 = \|M(p+m)\|_1 = \|Mp + Mm\|_1 \\
&= \sum_i |(Mp)_i + (Mm)_i| \\
&\le \sum_i \left( \max\{|(Mp)_i|, |(Mm)_i|\} - \|z\|_1 (1-q)/n \right) \\
&\le \sum_i |(Mp)_i| + \sum_i |(Mm)_i| - \|z\|_1 (1-q) \\
&= \|p\|_1 + \|m\|_1 - \|z\|_1 (1-q) \\
&= \|z\|_1 - \|z\|_1 (1-q) = q\,\|x-y\|_1
\end{aligned}$$
  (the first step uses $Mx = x$)
• After each iteration the distance between y and x decreases by a factor of q
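The contraction by the factor q can be observed numerically. The sketch below is an illustration under assumptions made here: it uses a hypothetical 4-cycle with q = 0.96 and compares two probability vectors, so their difference z has entries summing to zero. The helper names `l1` and `apply` are not from the lecture.

```python
def l1(vector):
    """L1 norm: the sum of the absolute values of the entries."""
    return sum(abs(entry) for entry in vector)


def apply(matrix, vector):
    """Matrix-vector product M v for a matrix given as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


q, n = 0.96, 4
successor = [1, 2, 3, 0]  # hypothetical graph: node j links to successor[j]
m = [[(1 - q) / n + (q if successor[j] == i else 0.0) for j in range(n)]
     for i in range(n)]

y1 = [1.0, 0.0, 0.0, 0.0]      # two probability vectors
y2 = [0.25, 0.25, 0.25, 0.25]
z = [a - b for a, b in zip(y1, y2)]

before = l1(z)
after = l1(apply(m, z))
ratio = after / before
print(before, after, ratio)
```

In this example the ratio comes out at about q, so the bound is tight here; other graphs can contract faster.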

19 Properties of the Pagerank Algorithm
Lemma: There is a unique (real) eigenvector with eigenvalue 1 for the matrix of the Pagerank algorithm.
Proof: Let x be an eigenvector with eigenvalue 1. Then the lemma follows from: for all $y \ne x$: $\|x - My\|_1 \le q\,\|x-y\|_1 < \|x-y\|_1$
Lemma: Let x be the (unique real) eigenvector, M the matrix of the Pagerank algorithm, and q the probability parameter of Pagerank. Then for all real vectors y: $\|M(x-y)\|_1 \le q\,\|x-y\|_1$
Theorem: PageRank converges to a $(1+\epsilon)$-approximation of the unique eigenvector in at most $(\ln \epsilon - \ln n) / \ln q$ iterations.
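To get a feeling for the bound in the theorem, it can be evaluated directly. The helper name `iteration_bound` and the sample values of epsilon, n and q are assumptions chosen for illustration; the result is rounded up because iterations are counted in whole steps.

```python
import math


def iteration_bound(epsilon, n, q):
    """Evaluate (ln epsilon - ln n) / ln q, rounded up to a whole iteration."""
    return math.ceil((math.log(epsilon) - math.log(n)) / math.log(q))


# Hypothetical values: the lecture's q = 0.96 and n = 4 with epsilon = 0.01.
print(iteration_bound(0.01, 4, 0.96))  # 147
# A smaller q contracts faster and needs fewer iterations.
print(iteration_bound(0.01, 4, 0.5))
```

Both numerator and denominator are negative for epsilon < 1 and q < 1, so the bound is positive and grows as q approaches 1.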

20 Discussion
• q = probability of using the simplified Pagerank step
• If q is small
  – Pagerank converges faster
  – Smaller hops are considered
  – Less structural information is available
  – The page ranks become more uniform
• If q is large
  – Pagerank (possibly) converges more slowly
  – Longer hops play a role
  – Web sinks collect more weight
      Therefore Google deletes web sinks from the Web graph
• Problem:
  – How to choose q
  – Is it reasonable to give every web page a page rank independent of the search pattern?

21 Kleinberg's HITS Algorithm (HyperText Induced Search)
• Jon Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46(5): 604-632 (1999)
• Idea of the algorithm
  – Pages can serve as authorities (as in Pagerank) or as hubs
  – Hub pages collect links to authorities, i.e. to relevant pages
      E.g. railway fans collect links to railway companies
  – Authorities are targets of hub pages
• Mutually reinforcing relationship
[Figure: hubs pointing to authorities]

22 Constructing a Focused Subgraph
• For a search pattern $\sigma$ choose $S_\sigma$ such that
  1. $S_\sigma$ is relatively small.
  2. $S_\sigma$ is rich in relevant pages.
  3. $S_\sigma$ contains most (or many) of the strongest authorities.
• Start with the output of a standard text-based search engine
• Enhance the set of pages by the predecessors and the successors of these pages (w.r.t. links)

23 Edge Selection
• Offset the effect of links that serve a purely navigational function
• Types of links
  – transverse if it is between pages with different domain names
  – intrinsic if it is between pages with the same domain name
• Intrinsic links very often exist purely for navigation
  – they give much less information than transverse links about the authority of the pages they point to
  – therefore delete all intrinsic links from the focused subgraph
• Other simple heuristics
  – Suppose a large number of pages from a single domain all point to a single page p.
  – This often corresponds to a mass advertisement
      for example, the phrase "This site designed by..." and a corresponding link at the bottom of each page in a given domain.
  – To eliminate this phenomenon, allow only a maximum number of links from a domain pointing to a page
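Both heuristics can be sketched as a single filter over the edge list. This is an illustration under assumptions made here: "domain name" is taken to be the host name returned by `urllib.parse.urlparse`, and the function name `select_edges`, the parameter `max_per_domain`, and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_edges(edges, max_per_domain):
    """Drop intrinsic links and cap the links from one domain to one page."""
    kept = []
    used = defaultdict(int)  # (source domain, target page) -> links kept so far
    for source, target in edges:
        source_domain = urlparse(source).hostname
        if source_domain == urlparse(target).hostname:
            continue  # intrinsic link: same domain name on both ends
        if used[(source_domain, target)] >= max_per_domain:
            continue  # too many links from this domain to this page
        used[(source_domain, target)] += 1
        kept.append((source, target))
    return kept


# Hypothetical edge list of (source URL, target URL) pairs.
edges = [
    ("http://a.example/1", "http://a.example/2"),  # intrinsic
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/"),
    ("http://a.example/3", "http://b.example/"),   # exceeds the cap of 2
    ("http://c.example/", "http://b.example/"),
]
kept = select_edges(edges, max_per_domain=2)
print(kept)
```

Comparing full host names is the simplest choice; treating `www.a.example` and `a.example` as one domain would need extra normalization that is left out here.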

24 Mutually Reinforcing Relationship
• Weights
  – Authority weight of a web page i: $x_i$
  – Hub weight of a web page i: $y_i$
• Authority indicated by hub pages (I-operation): $x_i \leftarrow c_1 \sum_{j:\, j \to i} y_j$
• Hub pages indicated by authority pages (O-operation): $y_i \leftarrow c_2 \sum_{j:\, i \to j} x_j$
  – $c_1$, $c_2$ are normalization factors w.r.t. the L2 norm

25 The HITS Algorithm
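The alternation of I-operation and O-operation can be sketched as follows. This is a minimal illustration, not the slide's pseudocode: the function names, the fixed number of rounds, the all-ones start weights, and the four-page example graph are assumptions made here.

```python
import math


def normalize(weights):
    """Scale the weights so that their L2 norm is 1 (all zeros stay zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: (w / norm if norm > 0 else 0.0) for p, w in weights.items()}


def hits(links, rounds=50):
    """links maps every page of the focused subgraph to the pages it links to."""
    pages = sorted(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        # I-operation: authority weight = sum of hub weights of predecessors.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for target in links[p]:
                auth[target] += hub[p]
        auth = normalize(auth)
        # O-operation: hub weight = sum of authority weights of successors.
        hub = normalize({p: sum(auth[t] for t in links[p]) for p in pages})
    return auth, hub


# Hypothetical subgraph: h1 links to a1 and a2, h2 links to a1 only.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(links)
print(auth)
print(hub)
```

In this example `a1` ends up with the largest authority weight and `h1` with the largest hub weight, since `h1` points to both authorities.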

26 Computing the Output
• Does the algorithm converge?
• How good is the output?

27 Matrix Representation
• Adjacency matrix A:
• Authorities:
• Hub weights:
• After t iterations:

28 When Does HITS Converge?
• $M = A A^T$ is a symmetric matrix
• For all symmetric matrices
  – all eigenvalues are real
  – eigenvectors belonging to different eigenvalues are orthogonal
• There exists a representation $M = S \Lambda S^T$ such that for the columns $S_i$ of $S$: $M S_i = \lambda_i S_i$
• If the largest eigenvalue $\lambda_1$ is larger than $\lambda_2$, the second eigenvalue, then HITS converges

29 The Web-Graph (1999) (next time)

30 Thanks for your attention
End of 8th lecture
Next lecture: Mo 13 Dec 2004, 11.15 am, FU 116
Midterm exam: We 8 Dec 2004, 1pm, F1.110
Next exercise class: Mo 13 Dec 2004, 1.15 pm, F0.530 or We 16 Dec 2004, 1.00 pm, E2.316

