Crawling the Web: problems and techniques. Claudio Scordino, Ph.D. Student. May 2004

Outline
- Introduction
- Crawler architectures
  - Increasing the throughput
- What pages we do not want to fetch
  - Spider traps
  - Duplicates
  - Mirrors

Introduction
Job of a crawler (or spider): fetching Web pages to a computer where they will be analyzed.
The algorithm is conceptually simple, but... it is a complex and often underestimated activity.

Famous Crawlers
Mercator (Compaq, Altavista)
- Written in Java
- Modular (components loaded dynamically)
- Priority-based scheduling of URL downloads
  - The scheduling algorithm is a pluggable component
- Different processing modules for different content types
- Checkpointing
  - Allows the crawler to recover its state after a failure
  - In a distributed crawler, checkpointing is coordinated by the Queen

Famous Crawlers
- GoogleBot (Stanford, Google): C/C++
- WebBase (Stanford)
- HiWE: Hidden Web Exposer (Stanford)
- Heritrix (Internet Archive)

Famous Crawlers
Sphinx
- Written in Java
- Visual and interactive environment
- Relocatable: capable of executing on a remote host
- Site-specific
  - Customizable crawling
  - Classifiers: site-specific content analyzers deciding 1. which links to follow and 2. which parts to process
- Not scalable

Crawler Architecture (diagram): seed URLs feed the URL frontier; the scheduler hands URLs to the retrievers (DNS, HTTP), which fetch pages from the Internet; the parser and the HREF extractor and normalizer produce citations and new URLs, which pass through a URL filter and a duplicate URL eliminator before re-entering the frontier; crawl metadata, host information and a load monitor support the scheduler.

Web masters annoyed
Web server administrators could be annoyed by:
1. Server overload
   - Solution: per-server queues
2. Fetching of private pages
   - Solution: Robot Exclusion Protocol (file /robots.txt; a usage sketch follows)
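As an aside, the robots.txt check is straightforward to implement with Python's standard library. The sketch below uses urllib.robotparser; the host, user-agent string and page URL are illustrative.

```python
# Hedged sketch: honouring the Robot Exclusion Protocol before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # hypothetical host
rp.read()                                      # fetches and parses /robots.txt

if rp.can_fetch("MyCrawler/1.0", "http://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```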

Crawler Architecture (diagram revisited): per-server queues and a robots.txt module added to the architecture.

Mercator's scheduler
- FRONT-END: prioritizes URLs with a value between 1 and k
- BACK-END: ensures politeness (no server overload)
  - Queues containing URLs of only a single host
  - Specifies when a server may be contacted again
(A sketch of such a two-stage frontier follows.)
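The sketch below is a minimal interpretation of this two-stage frontier, not Mercator's actual API: k front-end FIFO queues by priority, per-host back-end queues, and a heap recording when each host may be contacted again. Class names, the priority indexing and the fixed delay are illustrative assumptions.

```python
# Minimal sketch of a Mercator-style frontier (illustrative names and parameters).
import time
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    def __init__(self, k=3, delay=2.0):
        self.front = [deque() for _ in range(k)]  # front-end: one FIFO per priority level
        self.back = defaultdict(deque)            # back-end: one FIFO per host
        self.ready = []                           # heap of (earliest contact time, host)
        self.next_contact = defaultdict(float)    # host -> earliest time it may be contacted
        self.delay = delay                        # politeness interval per host

    def add(self, url, priority=0):
        self.front[priority].append(url)          # priority 0 = highest here (0-indexed)

    def _promote(self):
        # Move URLs from the front-end (highest priority first) to the back-end.
        for q in self.front:
            while q:
                url = q.popleft()
                host = urlparse(url).netloc
                if not self.back[host]:           # host was idle: schedule it
                    heapq.heappush(self.ready, (self.next_contact[host], host))
                self.back[host].append(url)

    def next_url(self):
        self._promote()
        if not self.ready:
            return None
        t, host = heapq.heappop(self.ready)
        if t > time.time():
            time.sleep(t - time.time())           # politeness: wait before re-contacting host
        url = self.back[host].popleft()
        self.next_contact[host] = time.time() + self.delay
        if self.back[host]:                       # more URLs for this host: re-schedule it
            heapq.heappush(self.ready, (self.next_contact[host], host))
        return url
```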

Increasing the throughput
Parallelize the process to fetch many pages at the same time (~thousands per second).
Possible levels of parallelization:
- DNS
- HTTP
- Parsing

Domain Name resolution Problem: DNS requires time to resolve the server hostname

Domain Name resolution
1. Asynchronous DNS resolver
   - Concurrent handling of multiple outstanding requests
   - Not provided by most UNIX implementations of gethostbyname
   - GNU ADNS library
   - Mercator reduced the share of each thread's elapsed time spent on DNS from 87% to 25%

Domain Name resolution
2. Customized DNS component (sketched below)
   - Caching server with a persistent cache largely residing in memory
   - Prefetching: hostnames are extracted from HREFs and requests are made to the caching server without waiting for resolution to be completed
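A rough sketch of such a component is shown below, assuming an in-memory cache plus a thread pool that resolves prefetched hostnames in the background; a production crawler would use a dedicated resolver (e.g. GNU ADNS) and a persistent cache. All names are illustrative.

```python
# Sketch of a caching DNS component with prefetching (illustrative interface).
import socket
from concurrent.futures import ThreadPoolExecutor

class DNSCache:
    def __init__(self, workers=10):
        self.cache = {}                       # hostname -> IP address, kept in memory
        self.pending = {}                     # hostname -> in-flight resolution
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def prefetch(self, hostname):
        # Called when an HREF is extracted; does not wait for the resolution.
        if hostname not in self.cache and hostname not in self.pending:
            self.pending[hostname] = self.pool.submit(socket.gethostbyname, hostname)

    def resolve(self, hostname):
        # Called when the page is actually scheduled for download.
        if hostname in self.cache:
            return self.cache[hostname]
        future = self.pending.pop(hostname, None)
        ip = future.result() if future else socket.gethostbyname(hostname)
        self.cache[hostname] = ip
        return ip
```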

Crawler Architecture (diagram revisited): per-server queues, robots.txt module, asynchronous DNS prefetching, DNS cache and DNS resolver client added.

Page retrieval
Problem: HTTP requires time to fetch a page
1. Multithreading
   - Blocking system calls (synchronous I/O)
   - pthreads multithreading library
   - Used in Mercator, Sphinx, WebRace
   - Sphinx uses a monitor to determine the optimal number of threads at runtime
   - Drawback: mutual exclusion overhead

Page retrieval
2. Asynchronous sockets (sketched below)
   - Do not block the process/thread
   - select monitors several sockets at the same time
   - No mutual exclusion needed, since completions are handled serially (the code that completes processing a page is not interrupted by other completions)
   - Used in IXE (1024 connections at once)
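The sketch below illustrates the idea with Python's asyncio, which multiplexes non-blocking sockets much like a select loop; it issues plain HTTP/1.0 requests so the response ends when the server closes the connection. The hosts are illustrative.

```python
# Sketch of asynchronous page retrieval over non-blocking sockets.
import asyncio

async def fetch(host, path="/"):
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    body = await reader.read()            # yields to other fetches while waiting
    writer.close()
    await writer.wait_closed()
    return body

async def main():
    hosts = ["example.com", "example.org"]            # hypothetical hosts
    pages = await asyncio.gather(*(fetch(h) for h in hosts))
    for host, page in zip(hosts, pages):
        print(host, len(page), "bytes")

asyncio.run(main())
```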

Page retrieval
3. Persistent connections (sketched below)
   - Multiple documents requested over a single connection
   - Feature of HTTP 1.1
   - Reduces the number of HTTP connection setups
   - Used in IXE
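A small sketch of the same idea with the standard http.client module, which keeps the TCP connection open across HTTP/1.1 requests when the server supports keep-alive; host and paths are illustrative.

```python
# Sketch: several documents over one persistent HTTP/1.1 connection.
import http.client

conn = http.client.HTTPConnection("example.com")      # one TCP connection
for path in ["/", "/a.html", "/b.html"]:               # hypothetical paths
    conn.request("GET", path)
    resp = conn.getresponse()
    data = resp.read()                                 # must be read fully before reuse
    print(path, resp.status, len(data), "bytes")
conn.close()
```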

IXE Crawler

IXE Parser
Problem: parsing requires 30% of the execution time
Possible solution: distributed parsing

IXE Parser (diagram): the parser sends extracted URLs to the URL table manager (crawler), which looks them up in a cache and in the URL table, maps them to DocIDs, and records the citations.

A distributed parser (diagram): a scheduler hashes each extracted URL to one of N table managers (e.g. Hash(URL1) -> Manager2, Hash(URL2) -> Manager1); each manager checks its cache and its table, returning the existing DocID on a hit or assigning a new DocID on a miss, and the citations are collected on the parser side.

A distributed parser
Does this solution scale? The risk is high traffic on the main link.
Suppose that:
- Average page size = 10 KB
- Average out-links per page = 10
- URL size = 40 characters (40 bytes)
- DocID size = 5 bytes
- X = throughput (pages per second)
- N = number of parsers

A distributed parser
Bandwidth for web pages: X * 10 * 1024 * 8 = 81920 * X bps
Bandwidth for messages (hit): (X / N) * 10 * (40 + 5) * 8 * N = 3600 * X bps
(X / N = pages per parser, 10 = out-links per page, 40 bytes = URL in the DocID request, 5 bytes = DocID in the reply, 8 = bits per byte, N = number of parsers)
Using 100 Mbps: X = 1226 pages per second
(The calculation is reproduced below.)
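The snippet below simply reproduces this back-of-the-envelope calculation with the constants from the previous slide; note that the figure of 1226 pages per second comes out when "100 Mbps" is taken as 100 * 2^20 bps.

```python
# Reproducing the slide's bandwidth estimate for the distributed parser.
page_bits = 10 * 1024 * 8        # 10 KB per page, in bits (81920)
outlinks  = 10                   # out-links per page
msg_bits  = (40 + 5) * 8         # URL request (40 B) + DocID reply (5 B), in bits

bits_per_page = page_bits + outlinks * msg_bits      # 81920 + 3600 = 85520 bits

link_capacity = 100 * 1024 * 1024   # "100 Mbps" taken as 100 * 2^20 bps, matching the slide
max_throughput = link_capacity / bits_per_page
print(f"{max_throughput:.0f} pages per second")      # ~1226
```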

What we don't want to fetch
1. Spider traps
2. Duplicates
   2.1 Different URLs for the same page
   2.2 Already visited URLs
   2.3 Same document on different sites
   2.4 Mirrors (at least 10% of the hosts are mirrored)

Spider traps
Spider trap: a hyperlink graph constructed unintentionally or malevolently to keep a crawler trapped
1. Infinitely deep Web sites
   - Problem: using CGI it is possible to generate an infinite number of pages
   - Solution: check the URL length (sketched below)
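A minimal sketch of such a guard is shown below; the 256-character threshold is an illustrative choice, not a value from the slides.

```python
# Sketch of a URL-length guard against infinitely deep (CGI-generated) sites.
MAX_URL_LENGTH = 256   # illustrative threshold

def accept_url(url: str) -> bool:
    """Reject suspiciously long URLs before they enter the frontier."""
    return len(url) <= MAX_URL_LENGTH
```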

Spider traps
2. Large number of dummy pages
   - Example: e/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/
   - Solution: disable crawling of the offending site; a guard removes from consideration any URL from a site which dominates the collection

Avoid duplicates
The problem is almost nonexistent in classic IR.
Duplicate content:
- wastes resources (index space)
- annoys users

Virtual Hosting
Problem: virtual hosting (a feature of HTTP 1.1) allows mapping different sites to a single IP address and could be used to create duplicates.
Solution: rely on the canonical hostnames (CNAMEs) provided by DNS.

Already visited URLs
Problem: how to recognize an already visited URL?
The same page is reachable by many paths, so we need an efficient Duplicate URL Eliminator.

Already visited URLs
1. Bloom Filter (sketched below)
   - Probabilistic data structure for set-membership testing
   - Each URL is hashed by n hash functions and the corresponding bits of a bit vector are set
   - Problem: false positives (new URLs marked as already seen)
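Below is a minimal Bloom filter sketch for URL-seen testing; the vector size and number of hash functions are illustrative, and false positives mean some genuinely new URLs will be skipped.

```python
# Minimal Bloom filter sketch (illustrative sizes).
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)          # the bit vector

    def _positions(self, url):
        # Derive k bit positions from salted MD5 digests of the URL.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
url = "http://example.com/index.html"   # illustrative URL
if url not in seen:
    seen.add(url)
```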

Already visited URLs
2. URL hashing: MD5 maps each URL to a 128-bit digest (sketched below)
   - Using a 64-bit hash function, a billion URLs require 8 GB
   - This does not fit in memory
   - Using the disk limits the crawling rate to about 75 downloads per second
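A small sketch of the idea: keep a fixed-width fingerprint per URL, here the first 64 bits of an MD5 digest (the truncation width and the in-memory set are illustrative choices).

```python
# Sketch of URL fingerprinting with MD5.
import hashlib

def url_fingerprint(url: str) -> int:
    digest = hashlib.md5(url.encode()).digest()   # 128-bit digest
    return int.from_bytes(digest[:8], "big")      # keep 64 bits, i.e. 8 bytes per URL

seen = set()
fp = url_fingerprint("http://example.com/index.html")   # illustrative URL
is_new = fp not in seen
seen.add(fp)
```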

Already visited URLs
3. Two-level hash function: hostname+port hashed to 24 bits, path hashed to 40 bits (sketched below)
   - The crawler is likely to explore URLs within the same site
   - Relative URLs create a spatiotemporal locality of access
   - Exploit this kind of locality using a cache
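The sketch below assumes the 24+40-bit split described on the slide, so URLs from the same site share a key prefix, and pairs it with a small in-memory LRU cache of recent keys that exploits the locality of relative links; sizes and names are illustrative.

```python
# Sketch of the two-level URL key plus a recent-key cache.
import hashlib
from collections import OrderedDict
from urllib.parse import urlparse

def two_level_key(url: str) -> int:
    parts = urlparse(url)
    host_bits = int.from_bytes(hashlib.md5(parts.netloc.encode()).digest()[:3], "big")  # 24 bits
    path_bits = int.from_bytes(hashlib.md5(parts.path.encode()).digest()[:5], "big")    # 40 bits
    return (host_bits << 40) | path_bits

class RecentURLCache:
    """LRU cache of recently seen keys, consulted before the on-disk structure."""
    def __init__(self, capacity=1 << 16):
        self.capacity = capacity
        self.keys = OrderedDict()

    def seen(self, key: int) -> bool:
        if key in self.keys:
            self.keys.move_to_end(key)
            return True
        self.keys[key] = True                     # record the key as seen
        if len(self.keys) > self.capacity:
            self.keys.popitem(last=False)          # evict the least recently used key
        return False
```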

Content based techniques
Problem: how to recognize duplicates based on the page contents?
1. Edit distance
   - Number of replacements required to transform one document into the other
   - Cost: l1 * l2, where l1 and l2 are the lengths of the documents: impractical!

Content based techniques
Problem: pages could have minor syntactic differences!
- site maintainer's name, latest update
- modified anchors
- different formatting
2. Hashing
   - A digest associated with each crawled page
   - Used in Mercator
   - Cost: one seek in the index for each new crawled page

Content based techniques
3. Shingling
   - Shingle (or q-gram): contiguous subsequence of tokens taken from document d, representable by a fixed-length integer
   - w-shingle: shingle of width w
   - S(d,w): w-shingling of document d, i.e. the unordered set of distinct w-shingles contained in document d

Content based techniques
Sentence: "a rose is a rose is a rose"
Tokens: a, rose, is, a, rose, is, a, rose
4-shingles: (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)
S(d,4): {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}

Content based techniques
- Each token = 32 bits
- w = 10 (a suitable value), so each w-shingle = 320 bits
- S(d,10) = set of 320-bit numbers
- We can hash the w-shingles and keep 500 bytes of digests for each document
(A shingling sketch follows.)
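The sketch below computes w-shingles by sliding a window of width w over whitespace-separated tokens and hashes each shingle to a compact integer; the tokenizer and the 64-bit truncation are illustrative simplifications.

```python
# Sketch of w-shingling with hashed shingle digests.
import hashlib

def shingles(text: str, w: int = 10) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_digests(text: str, w: int = 10) -> set:
    # Hash each shingle so that only compact digests need to be stored.
    return {int.from_bytes(hashlib.md5(" ".join(s).encode()).digest()[:8], "big")
            for s in shingles(text, w)}
```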

Content based techniques
Resemblance of documents d1 and d2 (Jaccard coefficient):
r(d1,d2) = |S(d1,w) ∩ S(d2,w)| / |S(d1,w) ∪ S(d2,w)|
Eliminate pages that are too similar (pages whose resemblance value is close to 1).
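Using the shingle_digests sketch from the previous slide, the resemblance is the ratio of shared to total shingle digests:

```python
# Jaccard resemblance of two documents from their shingle sets
# (shingle_digests is the sketch defined above).
def resemblance(d1: str, d2: str, w: int = 10) -> float:
    s1, s2 = shingle_digests(d1, w), shingle_digests(d2, w)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

# Pages with resemblance close to 1 are treated as near-duplicates.
```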

Mirrors
A URL consists of an access method, a hostname and a path.
Precision = relevant retrieved docs / retrieved docs

Mirrors
1. URL string based
   - Vector space model: term vector matching to compute the likelihood that a pair of hosts are mirrors
   - Only terms with df(t) < 100 are considered

Mirrors
a) Hostname matching
   - Terms: substrings of the hostname
   - Term weighting based on len(t), the number of segments obtained by breaking the term at '.' characters
   - This weighting favours substrings composed of many segments, which are very specific
   - Precision: 27%

Mirrors
b) Full path matching
   - Terms: entire paths
   - Term weighting based on mdf, the maximum of df(t) over all terms t in the collection
   - Precision: 59%
   - Connectivity based filtering stage (+19%)
     - Idea: mirrors share many common paths
     - Test, for each common path, whether it has the same set of out-links on both hosts
     - Remove hostnames from local URLs

Mirrors
c) Positional word bigram matching
   - Term creation:
     - Break the path into a list of words by treating '/' and '.' as breaks
     - Eliminate non-alphanumeric characters
     - Replace digits with * (effect similar to stemming)
     - Combine successive pairs of words in the list
     - Append the ordinal position of the first word
   - Precision: 72%

Mirrors
Positional word bigrams, example (a sketch of the extraction follows):
Path: conferences/d299/advanceprogram.html
Words: conferences, d*, advanceprogram, html
Bigrams: conferences_d*_0, d*_advanceprogram_1, advanceprogram_html_2
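The sketch below follows the term-creation steps from the previous slide and reproduces the example above; the exact regular expressions are illustrative choices.

```python
# Sketch of positional word bigram extraction for a URL path.
import re

def positional_bigrams(path: str) -> list:
    words = [w for w in re.split(r"[/.]", path) if w]               # break at '/' and '.'
    words = [re.sub(r"[^0-9A-Za-z*]", "",                            # drop non-alphanumerics
                    re.sub(r"\d+", "*", w))                          # replace digit runs with *
             for w in words]
    # Combine successive word pairs and append the position of the first word.
    return [f"{a}_{b}_{i}" for i, (a, b) in enumerate(zip(words, words[1:]))]

print(positional_bigrams("conferences/d299/advanceprogram.html"))
# ['conferences_d*_0', 'd*_advanceprogram_1', 'advanceprogram_html_2']
```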

Mirrors
2. Host connectivity based
   - Consider all documents on a host as a single large document
   - Graph: each host is a node; a document on host A pointing to a document on host B yields a directed edge from A to B
   - Idea: two hosts are likely to be mirrors if their nodes point to the same nodes
   - Term vector matching, where the terms are the set of nodes that a host's node points to
   - Precision: 45%

References
- S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data, Morgan Kaufmann, pages 17-43.
- S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
- A. Heydon and M. Najork, Mercator: A scalable, extensible Web crawler, World Wide Web, 2(4), 1999.
- K. Bharat, A. Broder, J. Dean, M. R. Henzinger, A comparison of techniques to find mirrored hosts on the WWW, Journal of the American Society for Information Science, 2000.

References
- A. Heydon and M. Najork, High-performance Web crawling, SRC Research Report 173, Compaq Systems Research Center, 26 September 2001.
- R. C. Miller and K. Bharat, SPHINX: a framework for creating personal, site-specific Web crawlers, Proceedings of the 7th World-Wide Web Conference (WWW7), 1998.
- D. Zeinalipour-Yazti and M. Dikaiakos, Design and implementation of a distributed crawler and filtering processor, Proceedings of the 5th Workshop on Next Generation Information Technologies and Systems (NGITS 2002), June 2002.