CS728: Internet Studies and Web Algorithms Lecture 2: Web Crawlers and DNS Algorithms April 3, 2008.

2 Web Crawler basic algorithm
1. Maintain a list of unvisited URLs
2. Remove a URL from the unvisited URL list
3. Use DNS lookup to determine the IP address of its host name
4. Download the corresponding document
5. Parse the document and extract any links contained in it
6. Check whether each extracted URL is new; if yes, add it to the list of unvisited URLs
7. Post-process the downloaded document
8. Back to step 1
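A minimal sketch of this loop in Python; the seed URL, page limit, fetch timeout, and the regex-based link extraction are illustrative assumptions rather than part of the slide:

```python
# Minimal single-loop crawler sketch (illustrative; seed URL and limits are assumptions).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seeds, max_pages=50):
    unvisited = deque(seeds)          # step 1: list of unvisited URLs
    seen = set(seeds)
    pages = {}
    while unvisited and len(pages) < max_pages:
        url = unvisited.popleft()     # step 2: remove a URL from the list
        try:
            # steps 3-4: DNS resolution and download both happen inside urlopen
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                  # skip unreachable hosts / fetch errors
        pages[url] = html             # step 7: "post-processing" here is just storing the page
        # step 5: crude link extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href=["\'](.*?)["\']', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:   # step 6: new URL?
                seen.add(absolute)
                unvisited.append(absolute)
    return pages

# pages = crawl(["http://example.com/"])
```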

3 Single-threaded Crawler
[Flowchart: initialize the URL list with starting URLs; crawling loop: check for termination (done, or no more URLs, means stop), pick a URL from the URL list, parse the page, add extracted URLs to the URL list, repeat.]

4 Multithreaded Crawler
[Flowchart: several identical crawler threads share one URL list. Each thread loops: check for termination; lock the URL list, pick a URL from the list, unlock the URL list; fetch the page; parse the page; lock the list again to add newly discovered URLs. The lock serializes all access to the shared URL list.]
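A sketch of this structure with Python threads; the thread count, seed URL, and the fetch/parse helpers are stand-in assumptions, and the point is simply that every touch of the shared URL list goes through one lock:

```python
# Multithreaded crawler skeleton (illustrative): a shared URL list guarded by one lock.
import threading
from collections import deque

url_list = deque(["http://example.com/"])   # assumed seed
seen = {"http://example.com/"}
list_lock = threading.Lock()
MAX_PAGES = 100
pages = {}

def fetch(url):          # placeholder for the real HTTP fetch
    return "<html></html>"

def parse(html, base):   # placeholder returning extracted absolute links
    return []

def crawler_thread():
    while True:
        with list_lock:                       # lock URL list
            if not url_list or len(pages) >= MAX_PAGES:
                return                        # check for termination
            url = url_list.popleft()          # pick URL from list
        html = fetch(url)                     # fetch page (outside the lock)
        links = parse(html, url)              # parse page
        with list_lock:                       # lock again to add new URLs
            pages[url] = html
            for link in links:
                if link not in seen:
                    seen.add(link)
                    url_list.append(link)
        # (simplification: a real crawler would wait for in-flight work
        #  rather than exit the moment the list is momentarily empty)

threads = [threading.Thread(target=crawler_thread) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```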

5 Parallel Crawler
[Diagram: multiple crawling processes (C-procs), each with its own queue of URLs to visit and its own local connection to the Internet; the pages collected by all processes feed a shared repository.]

6 Modified Breadth-first search and URL frontier
The crawler maintains URLs in the frontier and hands them out in some order whenever a crawler thread asks for a URL. Two important considerations govern that order:
Quality: pages that change frequently should be prioritized.
Politeness: avoid repeated fetch requests to a host within a short time span. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host (see the sketch below).
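A minimal sketch of the politeness rule; the 10x multiplier mirrors the order-of-magnitude heuristic above, and the class name and per-host bookkeeping are illustrative simplifications of a real frontier:

```python
# Politeness-aware frontier sketch: per-host "earliest next fetch" times.
import time
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, gap_factor=10.0):
        self.queue = deque()          # URLs waiting to be crawled (FIFO = breadth-first flavor)
        self.next_ok = {}             # host -> earliest time we may contact it again
        self.gap_factor = gap_factor  # gap = gap_factor * duration of the last fetch

    def add(self, url):
        self.queue.append(url)

    def get(self):
        """Return a URL whose host is currently allowed, rotating blocked ones to the back."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlparse(url).netloc
            if time.time() >= self.next_ok.get(host, 0.0):
                return url
            self.queue.append(url)    # host is still inside its politeness gap; try later
        return None

    def record_fetch(self, url, fetch_seconds):
        host = urlparse(url).netloc
        self.next_ok[host] = time.time() + self.gap_factor * fetch_seconds
```

Typical use: `frontier.add(url)` for new links, `url = frontier.get()` before fetching, then `frontier.record_fetch(url, elapsed)` so the host's next-allowed time reflects how slow the last fetch was.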

7 Robot Exclusion Protocol
Specifies the parts of a web site that a crawler should not visit.
Placed at the root of a web site as robots.txt.
An example that tells all crawlers not to enter three directories of a web site:
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration/
Disallow: /login/
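Python's standard library already parses this format; a small sketch, where the site URL is hypothetical and the paths are the ones from the example above:

```python
# Checking robots.txt with the standard-library parser (illustrative URLs).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                        # fetches and parses robots.txt

print(rp.can_fetch("*", "http://www.example.com/index.html"))       # True if not disallowed
print(rp.can_fetch("*", "http://www.example.com/cgi-bin/query"))    # False under the example rules
```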

8 Search Engine: architecture
[Diagram: crawler(s) fetch pages from the WWW into a page repository; an indexer module and a collection analysis module build the indexes (text, structure, utility); a query engine with a ranking component answers client queries and returns results.]

9 Search Engine: major components
Crawlers
Collect documents by recursively fetching links from a set of starting pages. Each crawler has different policies, so the pages indexed by various search engines differ.
The Indexer
Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representations differ among search engines. It might also build additional structures (e.g., LSI). A small inverted-index sketch follows below.
The Query Processor
Processes user queries and returns matching answers in an order determined by a ranking algorithm.
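To make the indexer's central data structure concrete, here is a toy inverted index; the tokenization and the three sample documents are assumptions, and a real indexer would also keep positions, term weights, and link structure:

```python
# Toy inverted index: term -> sorted list of document ids containing it.
import re
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "web crawlers fetch pages",
    2: "DNS resolution is a bottleneck for crawlers",
    3: "search engines rank pages",
}
index = build_inverted_index(docs)
print(index["crawlers"])   # [1, 2]
print(index["pages"])      # [1, 3]
```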

10 Issues for crawlers
1. General software architecture
2. What pages should the crawler download?
3. How should the crawler refresh pages?
4. How should the load on the visited web sites be minimized?
5. How should the crawling process be parallelized?
6. What is the impact of DNS system performance?
Later, we will look at search engine design and Web page analysis.

11 Web Crawlers and DNS Resolution
DNS resolution is a well-known bottleneck in web crawling: it entails multiple requests and round-trips across the Internet, and puts in jeopardy the goal of fetching several hundred documents a second.
Caching: URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache, avoiding the need to go to the DNS servers on the Internet. The cache hit rate is limited, however, when obeying politeness constraints.
Most web crawlers implement their own DNS resolver as a component of the crawler, since standard DNS methods are synchronous (only one request at a time). Basic design:
–A crawler thread sends a message to the DNS server and then performs a timed wait.
–A separate DNS-listen thread listens on the standard DNS port (port 53) for incoming response packets from the name service. Upon receiving a response, it signals the appropriate crawler thread.
–A crawler thread that resumes operation because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing a timed wait each time. Various time-out strategies are possible: Mercator recommends five attempts. The time quantum increases exponentially with each attempt; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that there are host names that take tens of seconds to resolve.
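A rough stand-in for the timed-wait-and-retry idea using only the Python standard library; the attempt count and the one-second starting quantum follow the Mercator numbers quoted above, while a real crawler would talk to port 53 directly with its own resolver rather than going through the OS stub as this sketch does:

```python
# Timed-wait DNS lookup with exponentially growing retry quanta (Mercator-style sketch).
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

dns_pool = ThreadPoolExecutor(max_workers=16)   # stands in for the DNS-listen machinery

def resolve(hostname, attempts=5, first_quantum=1.0):
    future = dns_pool.submit(socket.gethostbyname, hostname)  # lookup proceeds in the background
    quantum = first_quantum
    for _ in range(attempts):
        try:
            return future.result(timeout=quantum)   # timed wait for the answer
        except FutureTimeout:
            quantum *= 3.0        # grow the wait: roughly 1 s up to ~80 s across five attempts
    raise OSError(f"DNS resolution of {hostname} timed out")

# print(resolve("example.com"))
```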

12 DNS Algorithms
The objective of any DNS algorithm is two-fold:
–find the quickest server (smallest latency)
–find the most accurate answers (highest reliability)
BIND (Berkeley Internet Name Domain) is the most commonly used DNS software - originally created by four graduate students!
BIND's algorithms are designed to attempt to pick the fastest server, but not to use one server exclusively. Why?
BIND uses a ranking score on servers that combines:
–the average latency of a query
–a penalty applied to a server each time it is picked

13 BIND Algorithm
Example: say there are two name servers for the domain uc.edu. BIND may start by querying one of them, penalizing it in proportion to the time the query takes. The other server will then eventually be queried and penalized in the same fashion. The process continues by always choosing the server with the least total penalty.
Questions:
–How should we penalize?
–How well do different strategies work?
–Can we compare it to an optimal strategy?

14 Analysis of BIND
Let us use the following notation:
N = number of potential name servers
T[i](t) = time taken by server i to complete a request at step t
S(t) = server chosen to query at step t
The goal is to minimize the average query time AQT:
AQT = lim (1/t) ∑ T[S(t)](t) as t goes to infinity
The simplest BIND algorithm works as follows. Let R[i](t) = total query time used for server i before step t. On each step:
1. S(t) is picked to be the server i such that R[i](t) is a minimum (with ties broken arbitrarily).
2. If S(t) = i, then update R[i](t+1) = R[i](t) + T[i](t); all other R[j] are unchanged.
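A small simulation of this rule, assuming constant latencies as in the analysis on the next slide; the 1/2/3 ms latencies are the worked example used there:

```python
# Simulate the simplest BIND rule: always query the server with the smallest
# accumulated query time R[i], then add the observed latency to R[i].
def simulate_simple_bind(latencies, steps=100_000):
    n = len(latencies)
    R = [0.0] * n               # total query time charged to each server so far
    total = 0.0
    for _ in range(steps):
        i = min(range(n), key=lambda j: R[j])   # pick server with minimum R (ties arbitrary)
        R[i] += latencies[i]
        total += latencies[i]
    return total / steps        # empirical average query time (AQT)

print(simulate_simple_bind([1.0, 2.0, 3.0]))    # approaches 3 / (1/1 + 1/2 + 1/3) = 18/11 ~ 1.64 ms
print(min([1.0, 2.0, 3.0]))                     # the optimal algorithm would achieve 1.0 ms
```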

15 Analysis of BIND continued
Simplifying assumption: the T[i](t) are constant over all time. Clearly, in this case, the optimal algorithm always picks the server with the smallest T[i] value.
Theorem: Assuming the T[i] are constant, the simple BIND algorithm has average query time AQT = N / ∑ 1/T[i].
Example: 3 servers with T[1] = 1ms, T[2] = 2ms, T[3] = 3ms give AQT = 3 / (1/1 + 1/2 + 1/3) = 18/11 ≈ 1.64ms. Better than random?? (Picking a server uniformly at random would average 2ms.)
Proof: Let n[i](t) be the number of times server i is picked before step t. Then, since the T[i] are constant, R[i](t) = T[i] n[i](t). Consider the case of 2 servers a and b. After t steps, server a will be chosen if R[a](t) is less than R[b](t), which implies T[a] n[a](t) < T[b] n[b](t).

16 In the limit these values will roughly equalize, so the ratio tends to equalize as well: n[a](t) / n[b](t) → T[b] / T[a]. So server a is chosen a number of times inversely proportional to T[a], and this holds for all servers. Thus the probability that any server a is chosen is c/T[a] for some constant c. Since these probabilities must sum to 1 (the "law of total probability"), we get c = 1 / ∑ 1/T[i]. Now to calculate AQT: server a is picked with probability c/T[a] and takes T[a] time to run, so the expected query time is ∑ (c/T[a]) T[a] = Nc = N / ∑ 1/T[i].

17 So AQT is N / ∑ 1/T[i]. When is this good, and when is it bad?
Suppose N = 2. Let T[1] = 1 and let T[2] = x for some large number x. How bad can AQT be? AQT = 2 / (1 + 1/x), which approaches 2 for large x, i.e., twice the optimal.
Can you think of a really bad case? A case which makes AQT N times as bad as the optimal?
Theorem: For any small ε > 0, there is a case where the simple BIND algorithm is at least N(1-ε) times as bad as the optimal algorithm.
Proof: Set T[1] = 1 and T[i] = (N-1)/ε for all i > 1. Then the average query time is AQT = N/(1+ε) by the previous theorem, while the optimal algorithm would always choose server 1 and achieve query time 1.
Hence, the performance gap grows with the number of servers! Is this unlikely in practice? Maybe not, if there is one nearby server and several far away.

18 BIND8 is the successor to the simple version of BIND.
BIND8 accounts for the accuracy of results in the scoring update.
BIND8 uses a running average of latencies for selected servers and an exponential decay on unselected servers. It uses three factors as follows:
R[i](t+1) = a T[i](t) + (1-a) R[i](t)   if i is selected and the answer is correct
          = b R[i](t)                   if i is selected and the answer is incorrect
          = c R[i](t)                   if server i is not selected
Typical values are a = 0.3, b = 1.2, and c = 0.98.
Surprisingly, this enhanced algorithm actually has a worse theoretical bound on performance.
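A sketch of the BIND8-style score update and server choice; the constants are the typical a/b/c values quoted above, the update rule is from the slide, and the surrounding example loop (including the starting scores) is an illustrative assumption:

```python
# BIND8-style server scoring (sketch): a lower score R[i] means the server is preferred.
A, B, C = 0.3, 1.2, 0.98     # typical values: latency smoothing, wrong-answer penalty, idle decay

def choose_server(R):
    """Pick the server with the lowest current score."""
    return min(range(len(R)), key=lambda i: R[i])

def update_scores(R, picked, latency, correct):
    """Apply the BIND8 update to every server's score after one query."""
    for i in range(len(R)):
        if i == picked:
            R[i] = A * latency + (1 - A) * R[i] if correct else B * R[i]
        else:
            R[i] = C * R[i]   # unselected servers decay back toward contention

# Example: server 0 has latency 1, server 1 has latency x = 1000 (the bad case on the next slide).
latencies = [1.0, 1000.0]
R = [1.0, 1000.0]            # assumed starting scores, as if each server had been probed once
for step in range(20):
    i = choose_server(R)
    update_scores(R, i, latencies[i], correct=True)
# R[1] decays as 1000, 0.98*1000, 0.98^2*1000, ...; only once it eventually drops below R[0]
# is the slow server tried again, at a cost of x (see the theorem on the next slide).
```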

19 Theorem: BIND8 is arbitrarily bad compared with the optimal algorithm, even with only 2 servers. (Recall that simple BIND gave a 2-approximation on 2 servers.)
Proof: Let T[1] = 1 and let T[2] = x for some large number x. Then server 1 is selected repeatedly, and we get the resulting sequence of scores on server 2: R[2] = x, cx, c^2 x, …, c^n x, which stays > 1 as long as n < log_{1/c}(x). So server 2 is retried only about once every log_{1/c}(x) steps, at a cost of x each time. Hence the average query time is AQT ≈ 1 + x / log_{1/c}(x). So for large x the AQT is arbitrarily bad.

20 Manipulating DNS to address Locality Problems in Applications
Web service providers can manipulate the DNS system to better serve their clients.
Problem: we wish for clients to choose the "best" web server for themselves based on their locality. This requires changes to the usual DNS lookup: a domain name must not translate directly to a fixed IP address, but rather must be aliased to an intermediate address. DNS allows such aliases, called CNAMEs; for example, a CNAME for xyz.com could be a212.g.akamai.net. The CNAME is then resolved by DNS to the IP address of an optimal server, or possibly to another CNAME.
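One small way to observe such aliasing from Python, using only the standard library; the host name below is a placeholder, and gethostbyname_ex simply reports the canonical name the alias chain ends at together with the addresses the DNS infrastructure chose for this client:

```python
# Observe CNAME aliasing: the canonical name often differs from the name we asked for
# when a CDN answers the query (illustrative host name).
import socket

canonical, aliases, addresses = socket.gethostbyname_ex("www.example.com")
print("asked for :", "www.example.com")
print("canonical :", canonical)     # e.g. a CDN name if www.example.com is a CNAME
print("aliases   :", aliases)
print("addresses :", addresses)     # the server(s) the CDN's DNS picked for this client
```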