Web Search – Summer Term 2006 IV. Web Search - Crawling (part 2) (c) Wolfgang Hürst, Albert-Ludwigs-University

Crawling - Recap from last time

General procedure: continuously process a list of URLs and collect the web pages and the links that come along.
Two problems: size and frequent changes.

Page selection: based on metrics, i.e.
- Importance metric (goal)
- Ordering metric (selection)
- Quality metric (evaluation)
Experimental verification with a representative test collection.

Page refresh: estimating the rate of change: see last lecture (note: other studies exist, e.g. [5]).
Observations:
- Frequent changes
- Significant differences, e.g. among domains
Hence: an update rule is necessary.

3. Page Refresh (Update Rules)

Problem: the web is changing continuously.
Goal: index and update pages in a way that keeps the index as fresh and as young as possible, given the limited resources.

Distinguish between:
- Periodic crawlers: download K pages, then stop; repeat this after some time t and replace the old collection with the new one.
- Incremental crawlers: continuously crawl the web and incrementally update the collection.
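The two types differ mainly in their outer loop. A minimal sketch of the contrast (not from the slides; crawl_page is a hypothetical helper returning a page and its outgoing links):

import time

def periodic_crawl(seed_urls, crawl_page, K, pause):
    """Periodic crawler: build a fresh snapshot of K pages, hand it
    over, then sleep and start the next full crawl from scratch."""
    while True:
        snapshot, frontier = {}, list(seed_urls)
        while frontier and len(snapshot) < K:
            url = frontier.pop(0)
            page, links = crawl_page(url)
            snapshot[url] = page
            frontier.extend(links)
        yield snapshot      # the old collection is replaced wholesale
        time.sleep(pause)   # idle until the next periodic crawl

def incremental_crawl(seed_urls, crawl_page):
    """Incremental crawler: one endless loop that keeps refining a
    single, permanently maintained collection."""
    collection, frontier = {}, list(seed_urls)
    while frontier:
        url = frontier.pop(0)       # a selection policy would go here
        page, links = crawl_page(url)
        collection[url] = page      # update the collection in place
        frontier.extend(links)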

3.2 Incremental Crawlers

Main goal: keep the local collection up-to-date. Two measures: freshness and age (definitions from [6]).

Freshness of a page p_i at time t:
F(p_i; t) = 1 if p_i is up-to-date at time t, and 0 otherwise

Freshness of a local collection P = {p_1, ..., p_N} at time t:
F(P; t) = (1/N) * sum_{i=1..N} F(p_i; t)

Age of a page p_i at time t:
A(p_i; t) = 0 if p_i is up-to-date at time t, and t - (time of the first change not yet reflected in the local copy) otherwise

Age of a local collection P at time t:
A(P; t) = (1/N) * sum_{i=1..N} A(p_i; t)

Time average of the freshness of page p_i:
F_bar(p_i) = lim_{t -> inf} (1/t) * integral_0^t F(p_i; tau) d tau

Time average of the freshness of a local collection P: analogous, with F(P; tau) in place of F(p_i; tau). (Time average of age: analogous.)
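A minimal sketch of how these measures can be computed for a recorded trace (my own illustration, not from the slides; sync and change timestamps are assumed inputs):

from typing import Sequence

def freshness(last_sync: float, changes: Sequence[float], t: float) -> int:
    """F(p_i; t): 1 if no change occurred after the last sync, else 0."""
    return 0 if any(last_sync < c <= t for c in changes) else 1

def age(last_sync: float, changes: Sequence[float], t: float) -> float:
    """A(p_i; t): 0 if up-to-date, else the time since the first
    change that the local copy does not yet reflect."""
    pending = [c for c in changes if last_sync < c <= t]
    return t - min(pending) if pending else 0.0

def collection_freshness(pages, t):
    """F(P; t) for pages given as (last_sync, changes) tuples."""
    return sum(freshness(s, c, t) for s, c in pages) / len(pages)

# Example: synced at t=2, changed at t=3 and t=5; at t=6 the copy is
# stale (freshness 0) and its age is 6 - 3 = 3 time units.
print(freshness(2.0, [3.0, 5.0], 6.0), age(2.0, [3.0, 5.0], 6.0))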

Example for Freshness and Age
[Figure from [6]: a timeline of a single element, marking the points where the element is changed and where it is synchronized, together with the resulting freshness and age curves.]

Design alternative 1: Batch mode vs. steady crawler

Batch-mode crawler: periodically updates all pages of the collection.
Steady crawler: updates continuously.

[Figure: freshness over time (in months) for a batch-mode crawler and for a steady crawler.]

Note: assuming that page changes follow a Poisson process, one can prove that the average freshness over time is identical in both cases (for the same average crawling speed!).
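A small Monte-Carlo sketch of this claim (my own illustration, not from the slides): every page changes according to a Poisson process with rate lam; the batch crawler syncs all pages at the start of each period T, while the steady crawler spreads the same syncs evenly over the period. Both match the closed-form time-averaged freshness (1 - e^(-lam*T)) / (lam*T) known for the Poisson model (cf. [6]):

import math
import random

def avg_freshness(sync_times, lam, horizon):
    """Time-averaged freshness of one page whose changes form a
    Poisson process with rate lam and that is synced at sync_times."""
    fresh, syncs = 0.0, [0.0] + sorted(sync_times) + [horizon]
    for s, nxt in zip(syncs, syncs[1:]):
        # After a sync the page stays fresh until the next change
        # (waiting time Exp(lam)) or until the next sync.
        fresh += min(random.expovariate(lam), nxt - s)
    return fresh / horizon

random.seed(0)
N, T, lam, horizon = 200, 1.0, 2.0, 500.0
periods = range(int(horizon / T))

# Batch mode: every page is synced at the start of each period.
batch = sum(avg_freshness([k * T for k in periods], lam, horizon)
            for _ in range(N)) / N
# Steady: same number of syncs, but page i is offset by i*T/N.
steady = sum(avg_freshness([k * T + i * T / N for k in periods], lam, horizon)
             for i in range(N)) / N

print(batch, steady, (1 - math.exp(-lam * T)) / (lam * T))
# All three values agree up to sampling noise (about 0.432 here).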

Design alternative 2: In-place vs. shadowing

Replace the old version of a page with the new one either in-place or via shadowing, i.e. only after all pages of one crawl have been downloaded.
Shadowing keeps two collections: the crawler's collection and the current collection.

Design alternative 3: Fixed vs. variable frequency

Fixed frequency / uniform refresh policy: the same access rate for all pages, independent of their actual rate of change.
Variable frequency: access pages depending on their rate of change. Example: proportional refresh policy.

Variable frequency update

Obvious assumption for a good strategy: visit a page that changes frequently more often. Wrong!!! The optimal update strategy (if we assume Poisson-distributed changes) looks like this:

[Figure: the optimal update time plotted against a page's rate of change; counter-intuitively, pages that change extremely often are not worth frequent revisits.]

Variable frequency update (cont.)

Why is this a better strategy? Illustration with a simple example: consider two pages p_1 and p_2, where p_1 changes far more often than p_2. A freshly downloaded copy of p_1 becomes stale again almost immediately, so a crawl of p_1 buys hardly any freshness; the same download spent on p_2 keeps a page fresh for much longer. With a limited crawl budget it therefore pays to favor the rarely changing page.
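Under the Poisson model this can be made concrete with the closed-form freshness F_bar = (1 - e^(-lam/f)) / (lam/f) for a page with change rate lam that is synced f times per time unit (cf. [6]). A small sketch (my own illustration; the change rates and the budget are made-up numbers) that searches for the best split of a fixed crawl budget between the two pages:

import math

def freshness_bar(lam: float, f: float) -> float:
    """Expected time-averaged freshness of a page with Poisson change
    rate lam that is synced f times per time unit."""
    if f <= 0:
        return 0.0
    x = lam / f
    return (1 - math.exp(-x)) / x

lam1, lam2 = 9.0, 1.0   # assumed: p_1 changes 9x per day, p_2 once
budget = 1.0            # assumed: one download per day in total

best = max(
    (f1 / 100 for f1 in range(101)),   # candidate crawl rates for p_1
    key=lambda f1: freshness_bar(lam1, f1) + freshness_bar(lam2, budget - f1),
)
print("crawls per day spent on p_1:", best)
# The optimum spends the whole budget on the slowly changing p_2:
# a copy of p_1 goes stale again within hours, so refreshing it
# barely improves the collection's average freshness.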

Summary of different design alternatives

- Steady vs. batch-mode
- In-place update vs. shadowing
- Variable frequency vs. fixed frequency

3.3 Example of an Incremental Crawler

Two main goals:
- Keep the local collection fresh: regular, best-possible updates of the pages in the index.
- Continuously improve the quality of the collection: replace existing low-quality pages with new pages of higher quality.

3.3 Example of an Incremental Crawler (cont.)

WHILE (TRUE)
    URL = SELECT_TO_CRAWL(ALL_URLS);
    PAGE = CRAWL(URL);
    IF (URL IN COLL_URLS) THEN
        UPDATE(URL, PAGE)
    ELSE
        TMP_URL = SELECT_TO_DISCARD(COLL_URLS);
        DISCARD(TMP_URL);
        SAVE(URL, PAGE);
        COLL_URLS = (COLL_URLS - {TMP_URL}) U {URL}
    NEW_URLS = EXTRACT_URLS(PAGE);
    ALL_URLS = ALL_URLS U NEW_URLS;
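A runnable Python version of the same loop (my own translation; the crawl and policy callbacks are hypothetical stubs to be filled in):

def incremental_crawler(all_urls, coll_urls, collection, max_size,
                        crawl, extract_urls,
                        select_to_crawl, select_to_discard):
    """Endless refinement loop over a bounded local collection.
    all_urls and coll_urls are sets, collection maps URL -> page;
    the four callbacks implement fetching and the selection policies."""
    while True:
        url = select_to_crawl(all_urls)     # ordering/importance metric
        page = crawl(url)
        if url in coll_urls:
            collection[url] = page          # refresh an existing page
        else:
            if len(coll_urls) >= max_size:  # make room for the new page
                victim = select_to_discard(coll_urls)
                coll_urls.discard(victim)
                collection.pop(victim, None)
            coll_urls.add(url)              # admit the new page
            collection[url] = page
        all_urls |= set(extract_urls(page)) # grow the set of known URLs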

3.3 Example of an Incremental Crawler (cont.)

[Architecture diagram, cf. [4]: a ranking module scans ALL_URLS and decides which URLs to add to or remove from COLL_URLS; an update module pops URLs from COLL_URLS, asks the crawl module to fetch them, detects changes via checksums, and pushes the URLs back; the crawl module saves/updates pages in the collection and adds newly extracted URLs to ALL_URLS.]

References - Web Crawler

[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001 - Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998 - Chapter 4.3 (Crawling the web)
[3] J. Cho, H. Garcia-Molina, L. Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] J. Cho, H. Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)
[5] D. Fetterly, M. Manasse, M. Najork, J. Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003
[6] J. Cho, H. Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000