
1 Crawlers and Crawling Strategies CSCI 572: Information Retrieval and Search Engines Summer 2010

2 Outline
Crawlers
–Web
–File-based
Characteristics
Challenges

3 Why Crawling?
Origins were in the web
–The web is a big "spiderweb", so a "spider" crawls it
Focused approach to navigating the web
–It's not about visiting all pages at once
–…or randomly
–There needs to be a sense of purpose
  Some pages are more important or different than others
Content-driven
–Different crawlers for different purposes

4 Different classifications of Crawlers
Whole-web crawlers
–Must deal with different concerns than more focused vertical crawlers or content-based crawlers
–Politeness; the ability to handle any and all protocols defined in the URL space
–Deal with URL filtering, freshness, and recrawling strategies
–Examples: Heritrix, Nutch, Bixo, crawler-commons, clever uses of wget and curl, etc.

5 Different classifications of Crawlers
File-based crawlers
–Don't require an understanding of protocol negotiation – that's a hard problem in its own right!
–Assume that the content is already local
–Uniqueness is in the methodology for
  File identification and selection
  Ingestion methodology
–Examples: OODT CAS, scripting (ls/grep/UNIX), internal appliances (Google), Spotlight

6 Web-scale Crawling
What do you have to deal with?
–Protocol negotiation
  How do you get data from FTP, HTTP, SMTP, HDFS, RMI, CORBA, SOAP, BitTorrent, ed2k URLs?
  Build a flexible protocol layer like Nutch did?
–Determination of which URLs are important or not (see the sketch below)
  Whitelists
  Blacklists
  Regular expressions
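
A minimal sketch of whitelist/blacklist URL filtering with regular expressions, in the spirit of (but not identical to) Nutch's URL filters; the patterns and the should_fetch helper are illustrative assumptions:

```python
import re

# Illustrative rules; a real crawler loads these from configuration.
WHITELIST = [re.compile(r"^https?://([a-z0-9-]+\.)*usc\.edu/")]
BLACKLIST = [
    re.compile(r"\.(jpg|jpeg|png|gif|zip|exe)(\?|$)", re.IGNORECASE),
    re.compile(r"[?&]sessionid="),
]

def should_fetch(url: str) -> bool:
    """Keep a URL only if it matches some whitelist rule and no blacklist rule."""
    if not any(p.search(url) for p in WHITELIST):
        return False
    return not any(p.search(url) for p in BLACKLIST)

print(should_fetch("https://www.usc.edu/courses/csci572/index.html"))  # True
print(should_fetch("https://www.usc.edu/images/logo.png"))             # False
```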

7 Politeness
How do you take into account that web servers and Internet providers can and will
–Block you after a certain number of concurrent attempts
–Block you if you ignore their crawling preferences codified in, e.g., a robots.txt file (see the sketch below)
–Block you if you don't specify a User-Agent
–Identify you based on
  Your IP
  Your User-Agent
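
A minimal sketch of honoring robots.txt and declaring who you are before fetching, using only Python's standard library; the User-Agent string and URLs are placeholders:

```python
from typing import Optional
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

USER_AGENT = "cs572-demo-crawler/0.1 (contact: student@example.edu)"  # placeholder identity

def polite_fetch(url: str) -> Optional[bytes]:
    """Fetch a URL only if the host's robots.txt allows our User-Agent."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site asked us not to crawl this path
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=30) as resp:
        return resp.read()
```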

8 Politeness
Queuing is very important (see the sketch below)
Maintain host-specific crawl patterns and policies
–Sub-collection based, using regex
Threading and brute force are your enemies
Respect robots.txt
Declare who you are
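
A minimal sketch of per-host queuing with a fixed delay between requests to the same host; the one-second delay and the single-threaded dispatch loop are illustrative assumptions, not a prescribed policy:

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlparse

CRAWL_DELAY = 1.0  # seconds between hits to the same host (illustrative policy)

host_queues = defaultdict(deque)   # host -> URLs waiting to be fetched
next_allowed = defaultdict(float)  # host -> earliest time we may hit it again

def enqueue(url: str) -> None:
    host_queues[urlparse(url).netloc].append(url)

def next_url() -> Optional[str]:
    """Return a URL whose host is out of its cool-down window, if any."""
    now = time.monotonic()
    for host, queue in host_queues.items():
        if queue and now >= next_allowed[host]:
            next_allowed[host] = now + CRAWL_DELAY
            return queue.popleft()
    return None  # every queued host is still in its cool-down window
```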

9 Crawl Scheduling
When and where should you crawl?
–Based on URL freshness within some N-day cycle? (see the sketch below)
  Relies on unique identification of URLs and approaches for that
–Based on per-site policies?
  Some sites are less busy at certain times of the day
  Some sites are on higher-bandwidth connections than others
  Profile this?
Adaptive fetching/scheduling
–Deciding the above on the fly while crawling
Regular fetching/scheduling
–Profiling the above and storing it away in policy/config
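
A minimal sketch of freshness-driven scheduling: a URL is due for a re-crawl once its last fetch falls outside an N-day cycle. The 30-day cycle and the in-memory record are illustrative; a real crawler keeps this state in its crawl database:

```python
from datetime import datetime, timedelta

RECRAWL_CYCLE = timedelta(days=30)  # illustrative N-day freshness cycle

last_fetched = {}  # url -> datetime of the last successful fetch

def is_due(url: str, now: datetime) -> bool:
    """A URL is due if it has never been fetched or its copy has gone stale."""
    fetched = last_fetched.get(url)
    return fetched is None or now - fetched >= RECRAWL_CYCLE

def record_fetch(url: str, when: datetime) -> None:
    last_fetched[url] = when

# Schedule only the stale subset of the frontier on this cycle.
now = datetime.utcnow()
frontier = ["http://example.com/a", "http://example.com/b"]
due_now = [u for u in frontier if is_due(u, now)]
```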

10 Data Transfer
Download in parallel? Download sequentially?
What do you do with the data once you've crawled it: is it cached temporarily or persisted somewhere? (see the sketch below)
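
A minimal sketch of downloading in parallel with a bounded worker pool and persisting each response to a local staging directory; the pool size, staging path, and content-addressed file names are illustrative choices:

```python
import hashlib
import pathlib
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

STAGING = pathlib.Path("staging")  # illustrative persistence location
STAGING.mkdir(exist_ok=True)

def fetch_and_persist(url: str) -> pathlib.Path:
    """Download one URL and persist it under a content-addressed file name."""
    data = urlopen(url, timeout=30).read()
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = STAGING / name
    path.write_bytes(data)
    return path

urls = ["http://example.com/", "http://example.org/"]
with ThreadPoolExecutor(max_workers=4) as pool:  # parallel, but bounded
    saved_paths = list(pool.map(fetch_and_persist, urls))
```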

11 Identification of Crawl Path
Uniform Resource Locators (URLs)
Inlinks
Outlinks
Parsed data
–Source of inlinks, outlinks
Identification of URL protocol scheme/path
–Deduplication (see the sketch below)
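
A minimal sketch of pulling outlinks out of parsed HTML and deduplicating them by a normalized URL; the crude href regex and the normalization rules are simplifications of what a real parser and URL normalizer do:

```python
import re
from urllib.parse import urldefrag, urljoin, urlparse

HREF = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)  # crude extractor for the sketch

def normalize(url: str) -> str:
    """Drop fragments and lowercase the host so duplicate URLs collapse to one key."""
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()

def outlinks(base_url: str, html: str) -> set:
    return {normalize(urljoin(base_url, href)) for href in HREF.findall(html)}

seen = set()  # crawl-wide dedup set

def new_links(base_url: str, html: str) -> set:
    fresh = outlinks(base_url, html) - seen
    seen.update(fresh)
    return fresh
```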

12 File-based Crawlers
Crawling remote content, getting politeness down, dealing with protocols, and scheduling is hard!
Let some other component do that for you
–CAS PushPull is a great example
–Staging areas, delivery protocols
Once you have the content, there is still an interesting crawling strategy to work out

13 What's hard? The file is already here
Identification of which files are important, and which aren't
–Content detection and analysis
  MIME type, URL/filename regex, magic-byte detection, XML root-element detection, and combinations of them
  Apache Tika
Mapping of identified file types to mechanisms for extracting content and ingesting it (see the sketch below)
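
A minimal sketch of mapping detected types to extraction mechanisms, as a stand-in for what Apache Tika's parser registry does; the two handlers and the dispatch table are hypothetical:

```python
import mimetypes
import re

def extract_text(path: str) -> str:  # hypothetical plain-text handler
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

def extract_html(path: str) -> str:  # hypothetical HTML handler: naive tag stripping
    return re.sub(r"<[^>]+>", " ", extract_text(path))

# Dispatch table: detected MIME type -> extraction mechanism.
EXTRACTORS = {
    "text/plain": extract_text,
    "text/html": extract_html,
}

def ingest(path: str):
    mime, _ = mimetypes.guess_type(path)  # detection by file name alone, for the sketch
    handler = EXTRACTORS.get(mime)
    if handler is None:
        return None  # unknown type: skip, or route to a fallback parser
    return handler(path)
```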

14 Quick intro to content detection
By URL, or file name
–People codified classification into URLs or file names
–Think file extensions
By MIME magic
–Think digital signatures
By XML schemas, classifications
–Not all XML is created equal
By combinations of the above (see the sketch below)
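
A minimal sketch that combines the three signals: a few well-known magic-byte prefixes, the file-name extension, and the root element of an XML document. The magic table is a tiny illustrative subset; a real detector such as Tika ships hundreds of signatures:

```python
import mimetypes
import xml.etree.ElementTree as ET

MAGIC = {  # illustrative subset of well-known signatures
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def detect(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(8)
    # 1. Magic bytes win: they survive renamed or extension-less files.
    for signature, mime in MAGIC.items():
        if head.startswith(signature):
            return mime
    # 2. Fall back to the file-name extension.
    guess, _ = mimetypes.guess_type(path)
    if guess:
        return guess
    # 3. For XML, look at the root element to refine the type.
    try:
        root = ET.parse(path).getroot()
        return "application/xml; root=" + root.tag
    except ET.ParseError:
        return "application/octet-stream"
```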

15 Case Study: OODT CAS
Set of components for science data processing
Deals with file-based crawling

16 File-based Crawler Types
Auto-detect
Met Extractor
Std Product Crawler

17 Other Examples of File Crawlers
Spotlight
–Indexes your hard drive on a Mac and makes it readily available for fast free-text search
–Involves CAS/Tika-like interactions
Scripting with ls and grep (see the sketch below)
–You may find yourself doing this to run batch processing quickly
–Don't encode the data transfer into the script! That mixes concerns
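
A minimal sketch of the ls/grep style of file crawling, written in Python to keep the examples in one language; the staging directory and file-name pattern are placeholders, and the point is that selection stays separate from transfer and processing:

```python
import os
import re

STAGING_DIR = "/data/staging"         # placeholder: wherever content was delivered
PATTERN = re.compile(r"\.(h5|dat)$")  # placeholder "grep" for the files we care about

def crawl_files(root: str, pattern):
    """Yield paths under root whose names match the pattern.
    Selection only -- no transfer or processing, so concerns stay separated."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if pattern.search(name):
                yield os.path.join(dirpath, name)

for path in crawl_files(STAGING_DIR, PATTERN):
    print(path)  # hand off to a separate ingestion/processing step
```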

18 Challenges
Reliability
–If a web-scale crawl fails partway through, how do you recover?
Scalability
–Web-based vs. file-based
Commodity versus appliance
–Buy a Google appliance or build your own
Separation of concerns
–Separate processing from ingestion from acquisition

19 Wrapup
Crawling is a canonical piece of a search engine
Its utility is seen in data systems across the board
Determine your acquisition strategy vis-à-vis your processing and ingestion strategies
Separate and insulate
Identify content flexibly

