1 Our Web Part 0: Overview COMP630L Topics in DB Systems: Managing Web Data Fall, 2007 Dr Wilfred Ng.

2 Outline
  Two important issues:
    Web dynamics
    Search engines
  The Web is related to: Tim Berners-Lee? Bill Gates? Dik? Frederick? Wilfred? (March 11, 1890 – June 30, 1974)

3 Introduction
  The Web: the largest collection of (linked) resources (cf. Memex machine in 1945, Xanadu in 1965, Internet in 1990)
  Web search engines: locating and retrieving Web information
    Crawler-based (Google, MSN Search, …)
    Human-powered (Yahoo directory, Open Directory)
  The Web is very dynamic:
    Dynamics of Web size
    Dynamics of Web pages
    Dynamics of Web link structure

4 Introduction (cont'd)
  Dynamics of Web size:
    Almost anyone can publish almost anything on the Web at almost zero cost
    Web size grows at an exponential rate
  Challenge for search engines:
    Scalability to cover a large part of the Web

5 Introduction (cont'd)
  Dynamics of Web pages:
    Creation: new pages come into existence
      New information needs to be captured by search engines
    Updates: content changes on a page (minor? major?)
      Search engines should keep their local copies of pages fresh
    Deletion: existing pages can no longer be found
      Search engines should detect deletions to avoid broken links

6 Introduction (cont'd)
  Dynamics of Web link structure:
    Links are being established and removed constantly
  Important for search engines:
    Use the link structure to rank search results
    E.g., authoritative hubs

7 Introduction (cont'd)
  Relationship between the three dimensions:
    Dynamics of Web size
    Dynamics of Web pages
    Dynamics of Web link structure
  [Figure labels: Web; page P; +1 page]

8 Preliminary
  Search engine basic architecture:
  [Figure components: Web; search engine (crawler, indexer, searcher); end users]

9 Dynamics of Web Size
  Two categories of the Web:
    Indexable Web (shallow Web):
      Indexed by major engines
      More than four billion pages by late 2003 [Google]
      8 billion in 2004, 20 billion in 2005, ??? now [Google]
    Non-indexable Web (deep Web):
      Pages hidden behind search forms, or with authorization requirements, etc.
      At least 400 times larger than the indexable Web [Bergman00]

10 Web Size Study
  The Web is growing at an exponential rate
  [Figure: Netcraft Web Server Survey Report (August 1995 – November 2004)]

11 Search Engine Coverage Studies
  Bharat and Broder [1997]:
    Generate random URLs from a search engine
    Check whether these pages are in other engines
    Test on four search engines:
      AltaVista, Excite, Infoseek, HotBot
    Estimated Web size: 200 million pages
    The overlap between engines was very small
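
The sampling idea on this slide can be sketched in a few lines. This is only an illustration under strong assumptions: each index is modelled as an in-memory set of URLs that can be sampled uniformly, which real engines do not allow (the study generates random URLs through queries instead). The function names and the toy indexes are hypothetical. The sketch relies on the identity |A ∩ B| = Pr(in B | in A)·|A| = Pr(in A | in B)·|B|, so the ratio of the two sampled fractions estimates |A| / |B|.

```python
import random


def estimate_overlap(index_a, index_b, sample_size, rng):
    """Fraction of a uniform sample of A's pages that B also indexes."""
    population = sorted(index_a)  # fixed order so the sample is reproducible
    sample = rng.sample(population, min(sample_size, len(population)))
    hits = sum(1 for url in sample if url in index_b)
    return hits / len(sample)


def relative_size(index_a, index_b, sample_size=1000, seed=0):
    """Estimate |A| / |B| from the two conditional overlap fractions."""
    rng = random.Random(seed)
    b_given_a = estimate_overlap(index_a, index_b, sample_size, rng)
    a_given_b = estimate_overlap(index_b, index_a, sample_size, rng)
    if b_given_a == 0:
        raise ValueError("no overlap observed; the ratio cannot be estimated")
    return a_given_b / b_given_a


if __name__ == "__main__":
    # Hypothetical toy indexes: A holds 1000 URLs, B holds 2000, 500 shared.
    engine_a = {f"http://example.org/{i}" for i in range(1000)}
    engine_b = {f"http://example.org/{i}" for i in range(500, 2500)}
    print(relative_size(engine_a, engine_b, sample_size=200))
```

With small samples the estimate is noisy; it becomes exact only when the sample covers the whole index, which is never the case in practice.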

12 Search Engine Coverage Studies
  Lawrence and Giles [1997]:
    Query-based sampling by scientists
    Test on six major search engines:
      AltaVista, Excite, Infoseek, HotBot, Lycos, and Northern Light
    Estimated Web size: 320 million pages
    Single-engine coverage is limited: 34%
    Joint coverage increases significantly: 60%
  Lawrence and Giles [1999]:
    Test on 11 search engines
    Estimated Web size: 320 million → 800 million
    Single-engine coverage: 34% → 16%

13 Search Engine Coverage Studies
  Summary:

  Study                       Web size      Largest engine         Joint coverage
  Bharat and Broder (1997)    200 million   AltaVista (50%)        80%
  Lawrence and Giles (1997)   320 million   HotBot (34%)           60%
  Lawrence and Giles (1999)   800 million   Northern Light (16%)   42%

14 Impact on Search Engines – Scalable Architecture
  Google [Brin and Page 98]:
    Data structure:
      Compact encoding and compression
    Distributed crawling system:
      Crawlers run in parallel
      Each crawler keeps hundreds of connections

15 Impact on Search Engines – Metasearch Engines
  Combine results of multiple engines to increase Web coverage
  Metasearch engine:
  [Figure labels: query; metasearch engine; search engine 1 … search engine n (each with crawler, indexer, searcher); results; final results]
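
A minimal sketch of the combining step follows. The slide does not say how results are merged, so the rule used here (summing reciprocal ranks across engines) is an assumption chosen for brevity, and the engines are hypothetical callables that return a ranked list of URLs.

```python
from collections import defaultdict


def metasearch(query, engines, k=10):
    """Send the query to every engine and merge the ranked result lists.

    Merging rule (an assumption, not from the slides): each URL scores the
    sum of 1/rank over the engines that returned it, so a page ranked
    highly by several engines rises to the top.
    """
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical engines returning fixed rankings.
    def engine_1(query):
        return ["a.example", "b.example", "c.example"]

    def engine_2(query):
        return ["b.example", "d.example"]

    print(metasearch("web dynamics", [engine_1, engine_2]))
```

Because each engine covers a different part of the Web, the merged list can contain pages that no single engine returns, which is the coverage gain the slide refers to.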

16 Impact on Search Engines – Special-purpose Search Engines
  Not necessary to search the entire Web
  Special-purpose search engines:
    Focus on restricted domains
    Use a focused crawler:
      Start with relevant seed pages
      Score the extracted URLs according to relevance
      Pick the URL with the highest score to crawl next
  [Figure: pages P1 to P7 and the crawler's priority queue]
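
The focused-crawler loop described on this slide can be sketched with a priority queue. The two callables are placeholders: `fetch_links` stands in for downloading a page and extracting its URLs, and `relevance` stands in for whatever scoring function the engine uses. Both names, and the toy graph in the usage example, are hypothetical.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, max_pages=100):
    """Crawl best-first: always visit the highest-scoring known URL next."""
    # heapq is a min-heap, so scores are negated to pop the maximum first.
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # score and enqueue each URL only once
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited


if __name__ == "__main__":
    # Hypothetical link graph and relevance scores.
    graph = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": ["P5"]}
    scores = {"P1": 1.0, "P2": 0.2, "P3": 0.9, "P4": 0.1, "P5": 0.8}
    order = focused_crawl(["P1"], lambda u: graph.get(u, []), scores.get)
    print(order)
```

In this toy run the crawler follows P3 and then P5 before returning to the less relevant P2, which is the behaviour that distinguishes it from a breadth-first crawl.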

17 Dynamics of Web Pages – Characterize Updates
  Two measures [Lim02]:
    A Web page is treated as an ordered sequence of words
    Distance measure:
      The degree of change, in [0, 1]
    Clusteredness measure:
      How changes are spread out within a page, in [0, 1]
  Changes are generally small and clustered
    An incremental update is more efficient for search engines

18 Impact of Web Page Dynamics on Search Engines
  A typical way to study page dynamics from a search engine perspective:
    1. Develop a model of Web page change
    2. Propose update strategies to maximize freshness for search engines
       Develop metrics to measure freshness

19 Web Page Changing Model Studies – Poisson Process Model
  Each page P_i is updated at an average rate λ_i
  Poisson process:
    X(t): the number of changes of page P_i in (0, t]
    The random variable X(s+t) − X(s) has a Poisson probability distribution:
      $\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda_i t)^k}{k!} e^{-\lambda_i t}$ for k = 0, 1, 2, …
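
As a small worked example, the Poisson probability of exactly k changes in an interval of length t can be evaluated directly. This is a sketch of the standard Poisson formula only; the function names and the sample rate are illustrative and do not come from the slides.

```python
import math


def poisson_change_prob(k, rate, t):
    """Probability of exactly k changes in an interval of length t,
    for a page whose changes follow a Poisson process with the given rate."""
    if k < 0 or rate < 0 or t < 0:
        raise ValueError("k, rate and t must be non-negative")
    mean = rate * t  # expected number of changes in the interval
    return mean ** k * math.exp(-mean) / math.factorial(k)


def prob_changed(rate, t):
    """Probability that the page changed at least once within t."""
    return 1.0 - math.exp(-rate * t)


if __name__ == "__main__":
    # Illustrative rate: one change every two days on average, observed for 2 days.
    print(poisson_change_prob(0, 0.5, 2.0))
    print(prob_changed(0.5, 2.0))
```

The second function is the quantity a crawler cares about most: the chance that a stored copy is already stale after t time units.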

20 Poisson Process – Brewington and Cybenko Study
  Combine the effects of page creation and updates into the Poisson Web model
  (α, β)-currency: characterizes how up-to-date a search engine is
    A page is β-current (β is a time unit)
    A search engine is (α, β)-current if Pr(P is β-current) ≥ α
    T = f(α, β, λ)
  (0.95, 1 week)-currency: T = 18 days (800 million pages per day)
  [Figure: timeline with the last observation at t_0, the next at t_0 + T (re-indexing period T), the point t − β, and now (t); β is the grace period]
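
One way to make T = f(α, β, λ) concrete is the single-page calculation below. It is a sketch under assumptions that go beyond the slide: the page changes as a Poisson process with one fixed rate λ, it is re-indexed exactly every T time units, the query time is uniform within a re-indexing cycle, and "β-current" is read as "no change occurred between the last observation and β time units ago". Under those assumptions the probability of being β-current is β/T + (1 − e^{−λ(T−β)})/(λT) for T > β, and 1 otherwise. The slide's value T = 18 days comes from the study itself and is not reproduced by this single-rate sketch.

```python
import math


def prob_beta_current(T, beta, rate):
    """Probability that a page re-indexed every T time units is beta-current
    (single-page Poisson assumption; see the lead-in for the derivation)."""
    if T <= beta:
        return 1.0  # every query falls inside the grace period
    return beta / T + (1.0 - math.exp(-rate * (T - beta))) / (rate * T)


def reindex_period(alpha, beta, rate, tol=1e-9):
    """Largest T with prob_beta_current(T) >= alpha, found by bisection.

    The probability decreases monotonically in T for T > beta, so
    bisection on [beta, hi] converges once hi violates the target.
    """
    if not 0.0 < alpha < 1.0 or rate <= 0.0 or beta < 0.0:
        raise ValueError("need 0 < alpha < 1, rate > 0 and beta >= 0")
    lo, hi = beta, max(2.0 * beta, 1.0)
    while prob_beta_current(hi, beta, rate) >= alpha:
        hi *= 2.0  # grow the bracket until the target is violated
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_beta_current(mid, beta, rate) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Illustrative values: beta = 7 days, one change per 100 days on average.
    print(reindex_period(alpha=0.95, beta=7.0, rate=0.01))
```

A faster-changing page (larger rate) or a stricter target (larger α) yields a shorter re-indexing period, which is the trade-off the (α, β) notation is meant to expose.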

21 Impact of Web Page Dynamics on Search Engines – Summary

  Study                     Creation   Updates   Deletion   Freshness metric
  Brewington and Cybenko    √          √         X          (α, β)-currency
  Cho and Garcia-Molina     X          √         X          freshness, age
  Edwards et al.            √          √         X          -
  Ntoulas, Cho and Olston   √          √         √          -

22 Dynamics of Web Link Structure – Web Link Structure Modeling
  Web link structure [Broder et al. 00]
  Four components:
    SCC (27.5%)
    IN (21.5%)
    OUT (21.5%)
    Tendrils and Tubes (21.5%)
    Others (8%)

23 Dynamics of Web Link Structure Study
  Only one existing study
  Ntoulas, Cho and Olston [04], over one year:
    Only 24% of the initial links were still available
    25% new links were created every week
    Link structure is more dynamic than pages (8% new pages and 5% new content in the same year!)
    Search engines should update link-based ranking metrics frequently

24 Link-based Ranking Metric – PageRank
  PageRank [WWW98]: main ranking metric of Google
  Definition:
    Page A has pages T_1 … T_n (authoritative sites) pointing to it
    C(A): the number of links going out of page A
    d: damping factor in (0, 1)
    $PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$
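
The definition above can be evaluated by repeated substitution until the values settle. The sketch below is a plain fixed-point iteration on a small in-memory graph, not a description of how Google computes PageRank at scale. The default d = 0.85, the iteration count, the handling of pages without out-links (they simply pass on no rank), and the toy graph are all assumptions made for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in out_links.items():
            if not targets:
                continue  # assumption: a page with no out-links passes on nothing
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


if __name__ == "__main__":
    # Hypothetical four-link graph.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

Every link change alters C(T) for its source page and the incoming sum for its target, which is why a fast-changing link structure forces the scores to be recomputed or approximated incrementally.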

25 Incremental Update on PageRank
  Full recomputation is too expensive
  Incrementally compute approximations to PageRank [Chien02]
  Basic ideas:
    Construct a subgraph of the Web
      It contains a small neighborhood of the link changes
      The rest of the Web graph is modeled as a single node
    Compute PageRank on this subgraph

26 Conclusions
  The Web is dynamic in three dimensions
    Serious challenges to search engines
  Search engines must cope with high dynamics
    Scalable architecture, intelligent scheduling strategies, efficient update algorithms for ranking metrics, etc.
  Interesting to database people:
    Data representation dynamics: XML
    User dynamics: adaptive search
    Deep Web dynamics: searchable? how?
  You should study COMP630L well
