Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang.


1 Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London The slides are adapted from Prof. Mark Levene’s at http://www.dcs.bbk.ac.uk/~mark/download/lec2_the_structure_of_the_web.ppt

2 The Size of the Web: Lawrence and Giles 1999: 800 million pages. Over 11.5 billion in 2005 (Google indexes over 8 billion). Coverage: about 40% in 1999. Overlap between search engines: low. The deep (or hidden or invisible) web contains 400-550 times more information.

3 Capture Recapture (a.k.a. Mark and Recapture): SE1: the reported size of search engine 1. QSE1 and QSE2: the pages returned for a set of queries Q by the two engines. OVR: the overlap of QSE1 and QSE2. Assuming OVR / QSE2 = SE1 / Web, the estimate of Web size is (QSE2 x SE1) / OVR.
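
The estimate on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_web_size`, the toy page identifiers and the reported index size of 1000 are assumptions made up for the example, not data from the lecture.

```python
def estimate_web_size(se1_size, qse1, qse2):
    """Capture-recapture estimate: Web = (|QSE2| * SE1) / OVR.

    se1_size: reported size of search engine 1 (SE1)
    qse1, qse2: sets of pages returned by the two engines for the query set Q
    """
    overlap = len(qse1 & qse2)  # OVR: pages returned by both engines
    if overlap == 0:
        raise ValueError("no overlap between the samples; the estimate is undefined")
    return len(qse2) * se1_size / overlap


# Toy, made-up samples (hypothetical page identifiers).
qse1 = {"a", "b", "c", "d"}
qse2 = {"c", "d", "e", "f", "g", "h"}
estimate = estimate_web_size(1000, qse1, qse2)
```

Here two of the six pages sampled from engine 2 also appear in engine 1's sample, so engine 1 is taken to cover one third of the Web, and the estimate is three times its reported size.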

4 Diameter of the Web: Compute the average shortest path between pairs of pages that have a path from one to the other. Broder 99: directed 16.2, undirected 6.8. Barabasi 99: directed for nd.edu 19. A small diameter is a characteristic of a small-world network. Choose a random source and destination: 75% of the time there is no directed path between them.
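
A minimal sketch of the measurement described on this slide, assuming the graph fits in memory as an adjacency dictionary. The function name `average_shortest_path` and the four-page toy graph are hypothetical. It runs a breadth-first search from every page, averages the distances over ordered pairs that are connected by a directed path, and also reports what fraction of ordered pairs have such a path.

```python
from collections import deque


def average_shortest_path(graph):
    """Return (average distance over connected ordered pairs,
    fraction of ordered pairs that have a directed path)."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    total_distance = 0
    connected_pairs = 0
    for source in nodes:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total_distance += sum(dist.values())
        connected_pairs += len(dist) - 1  # exclude the source itself

    all_pairs = len(nodes) * (len(nodes) - 1)
    avg = total_distance / connected_pairs if connected_pairs else float("inf")
    frac = connected_pairs / all_pairs if all_pairs else 0.0
    return avg, frac


# Toy graph: a -> b -> c -> a forms a cycle, and d links into it.
toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
avg, frac = average_shortest_path(toy)
```

In the toy graph no page links to `d`, so the three ordered pairs ending at `d` have no directed path and are left out of the average. This is the same effect as the 75% figure above, at a much smaller scale.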

5 Bowtie Model of the Web: Broder et al. 1999: crawl of over 200 million pages and 1.5 billion links. SCC: 27.5%. IN and OUT: 21.5% each. Tendrils and tubes: 21.5%. Disconnected: 8%.
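
The bowtie regions can be computed from reachability alone. The sketch below is an illustration under a simplifying assumption: the caller supplies a `seed` page already known to lie in the central strongly connected component (a full analysis would first find the largest SCC). The names `reachable` and `bowtie` and the toy graph are hypothetical.

```python
from collections import deque


def reachable(adjacency, start):
    """Set of nodes reachable from start by following links in adjacency."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bowtie(graph, seed):
    """Split nodes into SCC, IN, OUT and OTHER relative to the seed's component."""
    nodes = set(graph)
    reverse = {}
    for u, targets in graph.items():
        for v in targets:
            nodes.add(v)
            reverse.setdefault(v, []).append(u)

    forward = reachable(graph, seed)     # pages the seed can reach
    backward = reachable(reverse, seed)  # pages that can reach the seed
    core = forward & backward
    return {
        "SCC": core,
        "IN": backward - core,
        "OUT": forward - core,
        "OTHER": nodes - forward - backward,  # tendrils, tubes, disconnected
    }


# Toy graph: i -> a <-> b -> o, t -> o (a tendril), x is isolated.
toy = {"i": ["a"], "a": ["b"], "b": ["a", "o"], "o": [], "t": ["o"], "x": []}
regions = bowtie(toy, "a")
```

Separating tendrils and tubes from truly disconnected pages would need an extra undirected reachability pass, which is omitted here to keep the sketch short.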

6 Link Degree Distributions: How many pages have n = 1, 2, … links, plotted separately for indegree and outdegree. The log-log plots are linear!

7 What is a Power Law: f(i) is the proportion of objects having property i. E.g. f(i) = # pages, i = # inlinks. E.g. f(i) = # sites, i = # pages. E.g. f(i) = # sites, i = # users. E.g. f(i) = frequency of word, i = rank of word, from most frequent to least frequent. The log-log plot shows a linear relationship (a straight line).
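
A straight line on a log-log plot corresponds to the usual power-law form f(i) = C * i^(-beta), since log f(i) = log C - beta * log i. The slide does not write this form out, so treat it as the standard definition assumed for this sketch. The code recovers beta as the negated least-squares slope in log-log space; the function name `fit_power_law_exponent` and the synthetic data are made up for illustration.

```python
import math


def fit_power_law_exponent(points):
    """Estimate beta in f(i) = C * i**(-beta) from (i, f_i) pairs
    by ordinary least squares on (log i, log f_i)."""
    points = list(points)
    xs = [math.log(i) for i, _ in points]
    ys = [math.log(f) for _, f in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # the log-log line has slope -beta


# Synthetic, noise-free data generated with beta = 2.1.
data = [(i, 100.0 * i ** -2.1) for i in range(1, 51)]
beta = fit_power_law_exponent(data)
```

On real, noisy degree counts a plain least-squares fit of this kind is known to be biased, especially in the sparse tail, so it serves here only to show why the log-log plot is linear.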

8 Power Laws on the Web (exponents in parentheses): inlinks (2.1); outlinks (2.72); strongly connected components (2.54); no. of web pages in a site (2.2); no. of visitors to a site during a day (2.07); no. of links clicked by web surfers (1.5); PageRank (2.1).

9 Preferential Attachment, or The Rich Get Richer: How Power Laws Arise
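
The rich-get-richer mechanism can be simulated in a few lines. This sketch assumes a simple growth model in the style of Barabási and Albert in which each new page adds exactly one link and picks its target with probability proportional to the target's current degree. The slide itself does not specify the model's details, and the function name `preferential_attachment` is hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, seed=0):
    """Grow a graph to n nodes; return a Counter mapping node -> degree."""
    rng = random.Random(seed)
    # Start with a single link between nodes 0 and 1. Every link contributes
    # both of its endpoints to this list, so a uniform draw from the list
    # selects a node with probability proportional to its degree.
    endpoints = [0, 1]
    for new_node in range(2, n):
        target = rng.choice(endpoints)
        endpoints.extend((new_node, target))
    return Counter(endpoints)


degrees = preferential_attachment(2000)
# Number of nodes having each degree, ready for a log-log plot.
degree_histogram = Counter(degrees.values())
```

Plotting `degree_histogram` on log-log axes should show most nodes with degree 1 and a few early nodes with much higher degree, which is the heavy tail the slide title refers to.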

10 Scale-Free Networks and Classic Random Graphs

11 Take Home Messages: The Web graph is large and sparse (capture-recapture); a small-world network (19 degrees of separation); a scale-free network (the power law, rich get richer).

