
1 HEINZ NIXDORF INSTITUTE, University of Paderborn, Algorithms and Complexity
Search Algorithms, Winter Semester 2004/2005
13 Dec 2004, 9th Lecture
Christian Schindelhauer, schindel@upb.de

2 Chapter III: Searching the Web

3 Searching the Web
• Introduction
• The Anatomy of a Search Engine
• Google's PageRank algorithm
  – The Simple Algorithm
  – Periodicity and convergence
• Kleinberg's HITS algorithm
  – The algorithm
  – Convergence
• The Structure of the Web
  – Pareto distributions
  – Search in Pareto-distributed graphs

4 Kleinberg's HITS Algorithm (HyperText Induced Topic Search)
• Jon Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM 46(5): 604-632 (1999)
• Idea of the algorithm
  – Pages can serve as authorities (as in PageRank) or as hubs
  – Hub pages collect links to authorities, i.e. to relevant pages
    E.g. railway fans collect links to railway companies
  – Authorities are the targets of hub pages
• Mutually reinforcing relationship
[Figure: hubs pointing to authorities]

5 Constructing a Focused Subgraph
• For a search pattern σ choose a set of pages $S_\sigma$ such that
  1. $S_\sigma$ is relatively small.
  2. $S_\sigma$ is rich in relevant pages.
  3. $S_\sigma$ contains most (or many) of the strongest authorities.
• Start with the output of a standard text-based search engine
• Extend this set of pages by the predecessors and the successors of these pages (w.r.t. links)
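
The construction on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions: the link structure is given as a dictionary `links` (page to set of linked pages), `root_set` stands in for the output of the text-based search engine, and the cap `max_predecessors` is a hypothetical parameter added here to keep the set small. None of these names come from the lecture.

```python
def focused_subgraph(root_set, links, max_predecessors=50):
    """Extend a root set by the successors and predecessors of its pages.

    root_set: pages returned by a text-based search (assumed input).
    links: dict mapping page -> set of pages it links to (assumed input).
    max_predecessors: hypothetical cap on predecessors added per root page.
    """
    # Invert the link relation once so predecessors can be looked up quickly.
    predecessors = {}
    for source, targets in links.items():
        for target in targets:
            predecessors.setdefault(target, set()).add(source)

    base = set(root_set)
    for page in root_set:
        # Successors: pages the root page links to.
        base |= links.get(page, set())
        # Predecessors: pages linking to the root page, capped in sorted order.
        base |= set(sorted(predecessors.get(page, set()))[:max_predecessors])

    # Keep only the links that run between pages of the extended set.
    edges = {(u, v) for u in base for v in links.get(u, set()) if v in base}
    return base, edges


links = {
    "hub1": {"rail-a", "rail-b"},
    "hub2": {"rail-a"},
    "rail-a": {"timetable"},
    "unrelated": {"elsewhere"},
}
base, edges = focused_subgraph({"rail-a"}, links)
print(sorted(base))
print(sorted(edges))
```

In the toy data, the page `unrelated` never enters the set because it is neither a predecessor nor a successor of a root page.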

6 Edge Selection
• Offset the effect of links that serve a purely navigational function
• Types of links
  – transverse if the link is between pages with different domain names
  – intrinsic if the link is between pages with the same domain name
• Intrinsic links very often exist purely for navigation
  – they give much less information than transverse links about the authority of the pages they point to
  – therefore delete all intrinsic links from the focused subgraph
• Other simple heuristics
  – Suppose a large number of pages from a single domain all point to a single page p.
  – This often corresponds to a mass advertisement, for example the phrase "This site designed by..." and a corresponding link at the bottom of each page in a given domain.
  – To eliminate this phenomenon, allow only a maximum number of links from one domain pointing to a page
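
Both heuristics can be illustrated with a short filter. This is a sketch only: it assumes links are given as pairs of absolute URLs, it treats the host part returned by `urllib.parse.urlparse` as the "domain name" (a simplification, since `www.a.example` and `a.example` would count as different domains), and the cap `max_links_per_domain` is a hypothetical value, since the slide does not fix one.

```python
from urllib.parse import urlparse


def select_edges(edges, max_links_per_domain=4):
    """Drop intrinsic links and cap the links one domain sends to one page.

    edges: iterable of (source_url, target_url) pairs (assumed input).
    max_links_per_domain: hypothetical cap; the lecture does not fix a value.
    """
    kept = []
    per_domain_target = {}
    for source, target in sorted(edges):
        source_domain = urlparse(source).netloc
        target_domain = urlparse(target).netloc
        # Intrinsic link: both pages share a domain name, so it is deleted.
        if source_domain == target_domain:
            continue
        # Transverse link: keep it unless this domain already points
        # to the same target page too often.
        key = (source_domain, target)
        count = per_domain_target.get(key, 0)
        if count < max_links_per_domain:
            per_domain_target[key] = count + 1
            kept.append((source, target))
    return kept


example_edges = [
    ("http://a.example/1", "http://a.example/2"),        # intrinsic
    ("http://a.example/1", "http://designer.example/"),  # transverse
    ("http://a.example/2", "http://designer.example/"),  # transverse
    ("http://a.example/3", "http://designer.example/"),  # transverse, over cap
]
kept = select_edges(example_edges, max_links_per_domain=2)
print(kept)
```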

7 Mutually Reinforcing Relationship
• Weights
  – Authority weight of a web page i: $x_i$
  – Hub weight of a web page i: $y_i$
• Authority indicated by hub pages (I-operation): $x_i = c_1 \sum_{j:\,(j,i) \in E} y_j$
• Hub pages indicated by authority pages (O-operation): $y_i = c_2 \sum_{j:\,(i,j) \in E} x_j$
  – $E$ is the set of links of the focused subgraph
  – $c_1, c_2$ are normalization factors w.r.t. the L2 norm, i.e. $\sum_i x_i^2 = \sum_i y_i^2 = 1$

8 The HITS Algorithm
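
The body of this slide did not survive in the transcript. As a hedged illustration, the following sketch alternates the I-operation and the O-operation from the previous slide and normalizes with the L2 norm after each step. The function names, the uniform start vector and the fixed number of `iterations` are assumptions made here, not the lecture's pseudocode.

```python
import math


def normalize(weights):
    """Scale a weight dict so that the sum of squares is 1 (L2 norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(nodes, edges, iterations=50):
    """Alternate I- and O-operations for a fixed number of rounds.

    nodes: iterable of pages; edges: iterable of (source, target) links.
    iterations: hypothetical fixed round count instead of a convergence test.
    """
    nodes = list(nodes)
    edges = list(edges)
    hub = normalize({page: 1.0 for page in nodes})
    authority = dict(hub)
    for _ in range(iterations):
        # I-operation: authority weight = sum of hub weights pointing in.
        new_authority = {page: 0.0 for page in nodes}
        for source, target in edges:
            new_authority[target] += hub[source]
        authority = normalize(new_authority)
        # O-operation: hub weight = sum of authority weights pointed to.
        new_hub = {page: 0.0 for page in nodes}
        for source, target in edges:
            new_hub[source] += authority[target]
        hub = normalize(new_hub)
    return authority, hub


nodes = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
authority, hub = hits(nodes, edges)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```

In this toy graph the page linked by both hubs receives the largest authority weight, and the hub linking to both authorities receives the largest hub weight.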

9 Computing the Output
• Does the algorithm converge?
• How good is the output?

10 Matrix Representation
• Adjacency matrix A: $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise
• Authorities: $x = c_1 A^T y$
• Hub weights: $y = c_2 A x$
• After t iterations: $y^{(t)} = c\,(A A^T)^t\, y^{(0)}$ and $x^{(t)} = c'\,(A^T A)^{t-1} A^T y^{(0)}$, with normalization factors $c, c'$
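
The matrix view can be checked numerically on a small graph. The sketch below assumes the update rules read $x = c_1 A^T y$ and $y = c_2 A x$ in matrix form, with $A_{ij} = 1$ if page i links to page j, and compares the alternating iteration with repeated application of $A A^T$. The helper names and the 4-page example matrix are illustrative assumptions.

```python
import math


def mat_vec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def unit(vector):
    """Normalize a non-zero vector to L2 norm 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Adjacency matrix of a small example graph: A[i][j] = 1 if i links to j.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
A_T = transpose(A)

y = unit([1.0] * len(A))          # initial hub weights
for _ in range(20):
    x = unit(mat_vec(A_T, y))     # authorities: x = c1 * A^T y
    y = unit(mat_vec(A, x))       # hubs:        y = c2 * A x

# The same hub vector obtained by applying M = A A^T repeatedly.
z = unit([1.0] * len(A))
for _ in range(20):
    z = unit(mat_vec(A, mat_vec(A_T, z)))

print([round(v, 4) for v in x])
print([round(v, 4) for v in y])
print([round(v, 4) for v in z])
```

Both loops produce the same hub vector up to rounding, because normalizing in between only changes the scaling factor and not the direction.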

11 When does HITS converge?
• $M = A A^T$ is a symmetric matrix
• For all symmetric matrices
  – all eigenvalues are real
  – the eigenvectors can be chosen to be pairwise orthogonal
• There exists a representation $M = S D S^T$ with a diagonal matrix $D = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ such that for the columns $S_i$ of $S$: $M S_i = \lambda_i S_i$
• If the largest eigenvalue $\lambda_1$ is strictly larger than the second eigenvalue $\lambda_2$, then HITS converges
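
The role of the eigenvalue gap can be made explicit with the usual power-iteration argument. The derivation below is a sketch: it assumes unit-length orthogonal eigenvectors $S_i$ of $M = A A^T$, eigenvalues ordered by size (they are non-negative because $M = A A^T$ is positive semidefinite), and a start vector $y^{(0)}$ with coefficients $a_i$ in the eigenbasis, where $a_1 \neq 0$. The symbols $a_i$ are introduced here for illustration.

```latex
\begin{align*}
y^{(0)} &= \sum_{i=1}^{n} a_i S_i, \qquad M S_i = \lambda_i S_i, \qquad
  \lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge 0,\\
M^t y^{(0)} &= \sum_{i=1}^{n} a_i \lambda_i^t S_i
  = \lambda_1^t \Bigl( a_1 S_1 + \sum_{i=2}^{n} a_i
    \Bigl(\tfrac{\lambda_i}{\lambda_1}\Bigr)^{t} S_i \Bigr),\\
\frac{M^t y^{(0)}}{\lVert M^t y^{(0)} \rVert_2}
  &\xrightarrow[t \to \infty]{} \frac{a_1}{|a_1|}\, S_1
  \quad \text{since } (\lambda_i/\lambda_1)^t \to 0 \text{ for } i \ge 2 .
\end{align*}
```

The closer $\lambda_2$ is to $\lambda_1$, the slower the terms $(\lambda_i/\lambda_1)^t$ vanish, so the gap also governs the speed of convergence.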

12 The Webgraph
• $G_{WWW}$:
  – Static HTML pages are nodes
  – Links are directed edges
• Outdegree of a node: number of links on a web page
• Indegree of a node: number of links to a web page
• Directed path from node u to v
  – series of web pages, where one follows links from page u to page v
• Undirected path $(u = w_0, w_1, \dots, w_{m-1}, w_m = v)$ from page u to page v
  – For all i: there is a link from $w_i$ to $w_{i+1}$ or from $w_{i+1}$ to $w_i$
• Strongly (weakly) connected subgraph
  – minimal node set including all nodes which have a directed (undirected) path from and to a reference node
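
These definitions translate directly into code. The sketch below computes in- and out-degrees and the strongly and weakly connected node sets of a reference node with breadth-first search. The edge list and all function names are assumptions chosen for illustration.

```python
from collections import deque


def reachable(start, neighbours):
    """Breadth-first search: all nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def components_of(reference, edges):
    """Return the (strong, weak) connected node sets of a reference node."""
    forward, backward, undirected = {}, {}, {}
    for u, v in edges:
        forward.setdefault(u, set()).add(v)
        backward.setdefault(v, set()).add(u)
        undirected.setdefault(u, set()).add(v)
        undirected.setdefault(v, set()).add(u)
    # Strong: a directed path from the reference node and one back to it.
    strong = reachable(reference, forward) & reachable(reference, backward)
    # Weak: connected when link directions are ignored.
    weak = reachable(reference, undirected)
    return strong, weak


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "d")]
outdegree, indegree = {}, {}
for u, v in edges:
    outdegree[u] = outdegree.get(u, 0) + 1   # links on page u
    indegree[v] = indegree.get(v, 0) + 1     # links to page v
strong, weak = components_of("a", edges)
print(sorted(strong), sorted(weak))
```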

13 The Web-Graph (1999)

14 Distributions of Indegree/Outdegree
• In- and out-degree obey a power law
  – i.e. in- and out-degree $i$ appear with probability $\sim 1/i^{\alpha}$
• According to experiments of
  – Kumar et al 97: 40 million web pages
  – Barabasi et al 99: domain *.nd.edu + web pages with distance 3
  – Broder et al 00: 204 million web pages (scans May and Oct 1999)
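
A power law $\sim 1/i^{\alpha}$ appears as a straight line with slope $-\alpha$ in a log-log plot, which suggests a simple way to estimate $\alpha$ from a degree histogram. The following sketch uses an ordinary least-squares fit on the logarithms. This estimator is an illustrative choice made here and is not claimed to be the method of the cited experiments; the input counts are synthetic, not measured data.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Least-squares slope of log(count) against log(degree).

    degree_counts: dict mapping degree i >= 1 to the number of pages
    with that degree (assumed input). Returns the estimated alpha.
    """
    points = [(math.log(i), math.log(c))
              for i, c in degree_counts.items() if i >= 1 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    # count ~ C / i^alpha means log(count) = log(C) - alpha * log(i).
    return -sxy / sxx


# Synthetic counts that follow 1 / i^2.1 exactly (not measured data).
counts = {i: 1e6 / i ** 2.1 for i in range(1, 200)}
alpha = fit_power_law_exponent(counts)
print(round(alpha, 3))
```

On real histograms the sparse tail is noisy, so a plain log-log fit like this one should be read as a rough estimate only.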

15 Is the Web-Graph a Random Graph?
• Random graph $G_{n,p}$:
  – n nodes
  – Every directed edge occurs independently with probability p
• Is the Web-graph a random graph $G_{n,p}$?
• Expected in/out-degree of $G_{n,p}$ = $(n-1)p$
  – Average degree of $G_{WWW}$ is constant, so choose $p = c/(n-1)$ for a constant $c$
  – Consider a web page w
    Let X be the number of links pointing from w
    Let $X_i = 1$ if link (w,i) exists, and $X_i = 0$ otherwise
    Then $X = \sum_i X_i$ with $P[X_i = 1] = p$ and $P[X_i = 0] = 1-p$

16 The In/Out-Degree Distribution of the Random Graph
• What is the probability that at least k links appear?
1. Markov's inequality: $P[X \ge k] \le \frac{E[X]}{k}$
  – This implies $P[X \ge k] \le \frac{(n-1)p}{k}$

17 The In/Out-Degree Distribution of the Random Graph
• What is the probability that at least k links appear?
2. Chebyshev's inequality: $P\bigl[\,|X - E[X]| \ge k\,\bigr] \le \frac{V[X]}{k^2}$
  – Since the $X_i$ are independent: $V[X] = \sum_i V[X_i] = (n-1)\,p\,(1-p)$
  – This implies, for $k > (n-1)p$: $P[X \ge k] \le \frac{(n-1)\,p\,(1-p)}{(k - (n-1)p)^2}$

18 The In/Out-Degree Distribution of the Random Graph
3. Chernoff bound
  – Applies to independent Bernoulli variables $X_i$ and their sum $X = \sum_i X_i$
  – It bounds $P[X \ge k]$ for $k$ above the expected value $E[X]$
  – So the probability decreases exponentially in k
  – Therefore: the degree distribution of a random graph does not obey a power law
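
The three tail bounds can be compared numerically with the exact binomial tail of the out-degree. The formulas on the slides did not survive in the transcript, so this sketch uses standard textbook forms: Markov $P[X \ge k] \le \mu/k$, Chebyshev $P[X \ge k] \le V[X]/(k-\mu)^2$ for $k > \mu$, and the multiplicative Chernoff bound $P[X \ge (1+\delta)\mu] \le \bigl(e^{\delta}/(1+\delta)^{1+\delta}\bigr)^{\mu}$ with $\mu = E[X]$. The graph size `n` and mean degree `c` are hypothetical values.

```python
import math


def binomial_tail(m, p, k):
    """Exact P[X >= k] for X ~ Binomial(m, p)."""
    return sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j)
               for j in range(k, m + 1))


def markov(mu, k):
    return mu / k


def chebyshev(mu, var, k):
    # P[X >= k] <= P[|X - mu| >= k - mu] <= var / (k - mu)^2 for k > mu.
    return var / (k - mu) ** 2


def chernoff(mu, k):
    # (e^delta / (1 + delta)^(1 + delta))^mu with k = (1 + delta) * mu.
    delta = k / mu - 1
    return math.exp(mu * (delta - (1 + delta) * math.log(1 + delta)))


n, c = 1001, 5.0                 # hypothetical graph size and mean degree
p = c / (n - 1)
mu, var = (n - 1) * p, (n - 1) * p * (1 - p)
for k in (10, 20, 40):
    print(k, binomial_tail(n - 1, p, k), markov(mu, k),
          chebyshev(mu, var, k), chernoff(mu, k))
```

Markov and Chebyshev only decay like $1/k$ and $1/k^2$ and would not rule out a power law, while the Chernoff bound shrinks exponentially, which is what separates $G_{n,p}$ from a power-law degree distribution.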

19 Pareto Distribution
• Discrete Pareto (power law) distribution for $x \in \{1,2,3,\dots\}$: $P[X = x] = \frac{1}{\zeta(\alpha)\, x^{\alpha}}$ with constant factor $\zeta(\alpha) = \sum_{i=1}^{\infty} \frac{1}{i^{\alpha}}$ (also known as the Riemann zeta function)
• Heavy tail property
  – not all moments $E[X^k]$ are defined
  – Expected value exists if and only if α > 2
  – Variance and $E[X^2]$ exist if and only if α > 3
  – $E[X^k]$ defined if and only if α > k+1
• Density function of the continuous distribution for $x > x_0$: $f(x) = \frac{\alpha - 1}{x_0} \left(\frac{x}{x_0}\right)^{-\alpha}$
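
The moment conditions follow from a one-line computation. The derivation below assumes the discrete Pareto distribution in the form $P[X = x] = 1/(\zeta(\alpha)\, x^{\alpha})$ for $x \in \{1, 2, 3, \dots\}$; it is a sketch of the reasoning, not a transcription of the slide.

```latex
\begin{align*}
E[X^k] &= \sum_{x=1}^{\infty} x^k \, P[X = x]
        = \frac{1}{\zeta(\alpha)} \sum_{x=1}^{\infty} \frac{1}{x^{\alpha - k}}
        = \frac{\zeta(\alpha - k)}{\zeta(\alpha)} .
\end{align*}
```

The series $\sum_x x^{-(\alpha-k)}$ converges if and only if $\alpha - k > 1$, which gives the condition $\alpha > k+1$: $\alpha > 2$ for the expected value ($k = 1$) and $\alpha > 3$ for $E[X^2]$ and the variance ($k = 2$).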

20 Special Case: Zipf Distribution
• George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that f(n) · n = c
• Zipf probability distribution for $x \in \{1,2,3,\dots,n\}$: $P[X = x] = \frac{c}{x}$ with constant factor $c = 1 / \sum_{i=1}^{n} \frac{1}{i}$
  – only defined for finite sets, since $\sum_{i=1}^{n} \frac{1}{i}$ tends to infinity for growing n
• Zipf distributions refer to ranks
  – The Zipf exponent α can be larger than 1, i.e. $f(n) = c/n^{\alpha}$
• Pareto distributions refer to absolute size
  – e.g. number of inhabitants
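
A small numerical sketch makes the normalization issue concrete. It assumes the Zipf distribution over ranks $1, \dots, n$ with probabilities $c/x^{\alpha}$ and $c$ chosen so that the probabilities sum to 1; the function names are illustrative.

```python
def zipf_distribution(n, alpha=1.0):
    """Probabilities c / x^alpha for ranks x = 1..n, with c = 1 / sum."""
    weights = [1.0 / x ** alpha for x in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def harmonic(n):
    """Normalization sum 1 + 1/2 + ... + 1/n for exponent 1."""
    return sum(1.0 / i for i in range(1, n + 1))


probs = zipf_distribution(1000)
# Zipf's relation f(x) * x = c holds for every rank when alpha = 1.
products = [prob * rank for rank, prob in enumerate(probs, start=1)]
print(min(products), max(products))

# The normalization sum keeps growing with n, so no constant c works
# for an infinite set of ranks.
print([round(harmonic(n), 3) for n in (10, 1000, 100000)])
```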

21 Pareto Distribution (I)
• Examples of power laws (= Pareto distributions)
  – Pareto 1897: Wealth/income in population
  – Yule 1944: Word frequency in languages
  – Zipf 1949: Size of towns
  – Length of molecule chains
  – File length of UNIX files
  – …
  – Access density of web pages
  – Access density of a web surfer at a particular web page
  – …

22 City Size Distribution
[Figure: Zipf distribution. Source: Scaling Laws and Urban Distributions, Denise Pumain, 2003]

23 Zipf's Law and the Internet
[Figure: Pareto distribution. Source: Lada A. Adamic, Bernardo A. Huberman, 2002]

24 Zipf's Law and the Internet
[Figure. Source: Lada A. Adamic, Bernardo A. Huberman, 2002]

25 Zipf's Law and the Internet
[Figure. Source: Lada A. Adamic, Bernardo A. Huberman, 2002]

26 Heavy-Tailed Probability Distributions in the World Wide Web
[Figure. Source: Mark Crovella, Murad Taqqu, Azer Bestavros, 1996]

27 Size of Connected Components
• The sizes of strongly and weakly connected components obey a power law
• A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. "Graph Structure in the Web: Experiments and Models." In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science, 2000.
• Large weakly connected component with 91% of all web pages
• Largest strongly connected component contains 28% of all web pages
  – Diameter ≥ 28

28 The Web-Graph (1999)

29 Thanks for your attention
End of 9th lecture
Next lecture: Mo 20 Dec 2004, 11.15 am, FU 116
Results and solutions of exam: Mo 13 Dec 2004, 1.15 pm, F0.530 or We 16 Dec 2004, 1.00 pm, E2.316

