Distributed Web Crawlers

Presentation transcript:

distributed web crawlers1 Implementation All following experiments were conducted with 40M web pages downloaded with Stanford’s WebBase crawler in Dec. 1999, over a period of two weeks. The web image projected from this crawl might be biased, but it represents the pages a parallel crawler would fetch.

distributed web crawlers2 Firewall Mode & Coverage Firewall: –every c-proc collects pages only from its predetermined partition, and follows only intra-partition links. –This mode has minimal communication overhead, but may have quality and coverage problems.
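The partition-and-filter behavior described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `site_hash` and `firewall_filter` are hypothetical names, and MD5-of-hostname is an assumed (but stable) choice of site-hash function.

```python
import hashlib

def site_hash(url, n):
    # Site-hash partitioning: all URLs of one site map to one partition,
    # so most links (which are intra-site) stay local.
    site = url.split("/")[2] if "//" in url else url.split("/")[0]
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % n

def firewall_filter(my_partition, discovered_links, n):
    """In firewall mode a c-proc keeps only intra-partition links;
    links pointing to other partitions are silently dropped, which is
    why coverage suffers as the number of c-proc's grows."""
    return [u for u in discovered_links if site_hash(u, n) == my_partition]
```

Because every URL hashes to exactly one partition, the union of all c-procs' filtered link sets is the full link set; coverage is lost only for pages reachable solely through dropped inter-partition links.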

distributed web crawlers3 Firewall Mode & Coverage Considering the 40M pages as the entire web. Using site-hash based partitioning. Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler).

distributed web crawlers4 Results

distributed web crawlers5 Results (2)

distributed web crawlers6 Conclusions When a small number of c-proc’s run in parallel, this mode provides good coverage, and the crawler may start with a relatively small number of seed URLs. This mode is not a good choice when coverage is important, especially when many c-proc’s run in parallel.

distributed web crawlers7 Example Assume we want to download 1B pages over one month, with a 10 Mbps link to the Internet per c-proc machine: –we need to download 10^9 × 10^4 bytes. –The required download rate is about 34 Mbps, therefore we need 4 c-proc’s. From Fig. 4 we conclude that coverage will be about 80%. –With only a week, we need a download rate of 140 Mbps = 14 c-proc’s, which will cover only 50%.
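The slide's arithmetic can be reproduced with a small calculation; `c_procs_needed` and `PAGE_BYTES` are illustrative names. With these exact constants (10 KB/page, a 30-day month) the formula gives roughly 31 Mbps aggregate where the slide quotes 34 Mbps, but the resulting c-proc counts (4 for a month, 14 for a week) match.

```python
import math

PAGE_BYTES = 10_000  # ~10 KB per page, as assumed on this slide

def c_procs_needed(pages, seconds, link_mbps=10.0):
    """Aggregate download rate required, and the number of c-proc
    machines (each with `link_mbps` of bandwidth) needed to sustain it."""
    mbps = pages * PAGE_BYTES * 8 / seconds / 1e6
    return mbps, math.ceil(mbps / link_mbps)

MONTH = 30 * 24 * 3600
WEEK = 7 * 24 * 3600
# 1B pages in a month: ~31 Mbps aggregate -> 4 c-proc's
# 1B pages in a week:  ~132 Mbps aggregate -> 14 c-proc's
```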

distributed web crawlers8 Cross-over & Overlap This mode may yield improved coverage, since a c-proc follows inter-partition links when it runs out of links in its own partition. It also incurs overlap, because a page can be downloaded by several c-proc’s. => the crawler increases coverage at the expense of overlap.

distributed web crawlers9 Cross-over & Overlap Considering the 40M pages as the entire web. Using site-hash based partitioning. Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler). Overlap was measured at various coverage points.

distributed web crawlers10 Results

distributed web crawlers11 Conclusions While this mode is much better than an independent crawl, it still incurs quite significant overlap. For example: with 4 c-proc’s running, the overlap reaches almost 2.5 in order to obtain coverage close to 1. For this reason this mode is not recommended unless coverage is important and no communication between c-proc’s is available.

distributed web crawlers12 Exchange Mode & Communication In this section we study the communication overhead of an exchange-mode crawler and how to reduce it by replication. We split the 40M pages into n partitions based on site-hash values, and run n c-proc’s in exchange mode.
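Exchange-mode link handling can be sketched as below. This is a hypothetical illustration, not the paper's code: `ExchangeCProc` queues intra-partition links locally and buffers inter-partition links per destination, flushing them in batches (batching is what amortizes the per-message TCP/IP and kernel-crossing costs discussed a few slides later).

```python
import hashlib
from collections import defaultdict

def site_hash(url, n):
    # Stable site-based partitioning (MD5 of hostname, an assumed choice).
    site = url.split("/")[2] if "//" in url else url
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % n

class ExchangeCProc:
    """Sketch of exchange-mode link routing: intra-partition links are
    crawled locally, inter-partition links are sent to the owning
    c-proc in batches rather than one message per URL."""
    def __init__(self, pid, n, batch_size=100):
        self.pid, self.n, self.batch_size = pid, n, batch_size
        self.local_queue = []                 # URLs this c-proc will crawl
        self.out_buffers = defaultdict(list)  # dest pid -> URLs awaiting transfer
        self.sent = []                        # (dest_pid, batch) "messages"

    def handle_link(self, url):
        owner = site_hash(url, self.n)
        if owner == self.pid:
            self.local_queue.append(url)
        else:
            buf = self.out_buffers[owner]
            buf.append(url)
            if len(buf) >= self.batch_size:   # flush a full batch
                self.sent.append((owner, list(buf)))
                buf.clear()
```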

distributed web crawlers13 Results

distributed web crawlers14 Conclusions The site-hash based partitioning scheme significantly reduces communication overhead compared with the URL-hash based scheme. On average we need to transfer less than 10% of the discovered links (or up to 1 per page).

distributed web crawlers15 Conclusions (2) The network bandwidth used for URL exchange is relatively small. The average URL is about 40 bytes long, while an average page is about 10 KB, so this transfer consumes about 0.4% of total network bandwidth.

distributed web crawlers16 Conclusions (3) Nevertheless, the overhead of this exchange is quite significant, because each transmission goes through the TCP/IP network stack on both sides and incurs 2 switches between kernel and user mode.

distributed web crawlers17 Reducing Overhead by Replication

distributed web crawlers18 Conclusions Based on this result, replicating between 10,000 and 100,000 popular URLs in each c-proc gives the best results (it minimizes communication overhead while keeping replication overhead low).
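The replication idea can be sketched in a few lines (hypothetical names; MD5-of-hostname is an assumed partitioning hash): since popular pages attract a disproportionate share of links, replicating a small set of popular URLs at every c-proc lets each c-proc skip the exchange for links pointing into that set.

```python
import hashlib

def site_hash(url, n):
    site = url.split("/")[2] if "//" in url else url
    return int(hashlib.md5(site.encode()).hexdigest(), 16) % n

def links_to_transfer(discovered, my_pid, n, replicated):
    """Inter-partition links whose target URL is replicated at every
    c-proc need not be exchanged: the owner already knows about them."""
    return [u for u in discovered
            if site_hash(u, n) != my_pid and u not in replicated]
```

Replicating more URLs can only shrink (never grow) the transfer list, but past the most popular 10,000 to 100,000 URLs the marginal saving no longer justifies the replication cost, per the slide above.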

distributed web crawlers19 Quality & Batch Communication In this section we study the quality issue: –as mentioned, a parallel crawler can be worse than a single-process crawler if every c-proc decides solely based on its local information.

distributed web crawlers20 Quality & Batch Communication (2) Throughout this section we regard a page’s importance I(p) as the number of backlinks it has. –This is the most common metric. –Quality obviously depends on how often the c-proc’s exchange backlink information.
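The backlink-count metric is simple to state as code; this is a minimal sketch over an in-memory adjacency list (`importance` is an illustrative name). The reason quality degrades in parallel crawling is visible here: a c-proc that sees only its own partition's links computes these counts over a subgraph, so its estimate of I(p) is too low unless backlink counts are exchanged.

```python
from collections import Counter

def importance(link_graph):
    """I(p) = number of backlinks of p, computed from an adjacency
    list mapping each page to its outlinks."""
    return Counter(dst for outlinks in link_graph.values() for dst in outlinks)

graph = {"a": ["b", "c"], "b": ["c"], "d": ["c", "b"]}
# I(c) = 3 (from a, b, d), I(b) = 2 (from a, d), I(a) = 0
```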

distributed web crawlers21 Quality at Different Exchange Rates

distributed web crawlers22 Conclusions As the number of c-proc’s increases, quality becomes worse unless they exchange backlink messages often. The quality of firewall mode is worse than that of a single-process crawler when downloading a small fraction of the pages. However, there is no difference when downloading bigger fractions.

distributed web crawlers23 Quality and Communication Overhead

distributed web crawlers24 Conclusions Communication overhead doesn’t increase linearly with quality. A large number of URL exchanges is not necessary for achieving high quality, especially when downloading a large portion of the web (Fig. 9).

distributed web crawlers25 Final Example Say we plan to operate a medium-scale search engine and want to obtain 20% of the web (240M pages). We plan to refresh the index once a month, and our machines have a 1 Mbps connection to the Internet. –We need about 7.44 Mbps of download bandwidth, so we have to run at least 8 c-proc’s in parallel.
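This example's arithmetic follows from the same ~10 KB/page assumption used earlier (variable names here are illustrative); note that 240M pages being 20% of the web implies a 1.2B-page web.

```python
import math

pages = 240_000_000        # 20% of the web, per the slide
page_bytes = 10_000        # same ~10 KB/page assumption as earlier slides
month_s = 30 * 24 * 3600
per_machine_mbps = 1.0

agg_mbps = pages * page_bytes * 8 / month_s / 1e6  # ~7.4 Mbps aggregate
machines = math.ceil(agg_mbps / per_machine_mbps)  # at least 8 c-proc's
```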

distributed web crawlers26 Related charts

distributed web crawlers27 Final Conclusions When a small number of c-proc’s run in parallel, firewall mode provides good coverage. Given the simplicity of this mode, it is a good option to consider unless: –More than 4 c-proc’s are required (Fig. 4). –A small subset of the web is required and quality is important (Fig. 9).

distributed web crawlers28 Final Conclusions (2) An exchange-mode crawler consumes little network bandwidth, and minimizes overhead if batch communication is used. Quality is maximized even if fewer than 100 URL exchanges occur. Replicating the 10,000–100,000 most popular URLs reduces communication overhead by roughly 40%; further replication contributes little (Fig. 8).

distributed web crawlers29 References Junghoo Cho, Hector Garcia-Molina. Parallel Crawlers, October. Mike Burner. Crawling Towards Eternity. Web Techniques Magazine, May 1998.