Slide 1: (Web) Crawlers Domain
Crawlers – Presentation 2 – April 2008
Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch

Slide 2: Crawlers – Outline
1. Crawlers: Background
2. Unified Domain Model
3. Individual Applications
   3.1 WebSphinx
   3.2 WebLech
   3.3 Grub
   3.4 Aperture
4. Summary and Conclusions

Slide 3: Crawlers – Background
What is a crawler?
- Collects information about internet pages
- There is a near-infinite number of web pages and no central directory
- Uses the links contained within pages to discover new pages to visit
How do crawlers work?
- Pick a starting-page URL (the seed)
- Load the starting page from the internet
- Find all links in the page and enqueue them
- Extract any desired information from the page
- Loop
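The seed/fetch/extract/enqueue loop described on this slide is language-agnostic. Below is a minimal sketch in Python (the applications in this domain are Java, and the fetch step is stubbed with an in-memory "web", so no real network access or library API is assumed):

```python
from collections import deque

# A toy "web": page URL -> list of links on that page (stand-in for real fetching).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/"],
}

def crawl(seed, fetch):
    """Basic crawler loop: seed, fetch, extract links, enqueue, repeat."""
    queue = deque([seed])           # frontier of URLs to visit
    visited = set()                 # avoid re-fetching pages
    order = []                      # record of crawl order
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = fetch(url)          # "load page" + "find all links"
        order.append(url)
        for link in links:
            if link not in visited:
                queue.append(link)  # enqueue newly discovered URLs
    return order

order = crawl("http://a.example/", lambda u: FAKE_WEB.get(u, []))
```

Using a FIFO deque gives breadth-first order; the visited set keeps the loop from refetching pages that are linked from several places.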

Slide 4: Crawlers – Background
Rules that apply across the domain:
- All crawlers have a URL fetcher
- All crawlers have a parser (extractor)
- Crawlers are multi-threaded processes
- All crawlers have a crawler manager
- All crawlers have a queue structure
- The domain is strongly related to the search-engine domain
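The domain rules above (a fetcher, a parser/extractor, a queue, a manager, multiple threads) can be put together as a hypothetical skeleton. This is a minimal illustration in Python, not the structure of any of the actual applications; the fetcher is stubbed and the class names simply mirror the slide:

```python
import queue
import re
import threading

class Fetcher:
    """Every crawler has a URL fetcher; stubbed to return canned HTML."""
    def fetch(self, url):
        return f"<html><a href='{url}/child'>link</a></html>"

class Extractor:
    """Every crawler has a parser/extractor; here a crude regex link grabber."""
    def extract_links(self, html):
        return re.findall(r"href='([^']+)'", html)

class CrawlerManager:
    """Every crawler has a manager; this one coordinates worker threads
    over a shared queue (the domain's Queue structure)."""
    def __init__(self, seeds, max_pages=5):
        self.frontier = queue.Queue()
        for s in seeds:
            self.frontier.put(s)
        self.fetcher, self.extractor = Fetcher(), Extractor()
        self.seen = set()
        self.lock = threading.Lock()
        self.max_pages = max_pages

    def worker(self):
        while True:
            try:
                url = self.frontier.get(timeout=0.2)
            except queue.Empty:
                return                      # frontier drained: thread exits
            with self.lock:                 # seen-set is shared across threads
                if url in self.seen or len(self.seen) >= self.max_pages:
                    continue
                self.seen.add(url)
            html = self.fetcher.fetch(url)
            for link in self.extractor.extract_links(html):
                self.frontier.put(link)

    def run(self, n_threads=2):
        threads = [threading.Thread(target=self.worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.seen

pages = CrawlerManager(["http://seed.example"], max_pages=3).run()
```

The lock around the seen-set is what makes the multi-threaded rule workable: `queue.Queue` is already thread-safe, but the duplicate check and page cap are not.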

Slide 5: Unified Domain Class Diagram [figure]
Classes: Spider, SpiderConfig, Queue, Thread, Extractor, Fetcher, Robots, Scheduler, StorageManager, DB, PageData, CrawlerHelper, Filter, ExternalDB, Merger
Legend: * common features; * added by code modeling

Slide 6: Unified Domain Sequence Diagram [figure]
Phases: pre-crawling; pre-fetching (start of main loop); fetching and extracting (optional objects); post-processing; finish crawling (end of main loop)

Slide 7: Unified Domain – Applications
- For the User Modeling group, the applications were the first chance to see things in practice
- For the entire group, the applications provided a fresh view of the domain, which led to many changes (Assignment 2)
- With everyone viewing the applications in the domain context, most differences could be explained as application-specific
- An interesting experiment: let a new Code Modeling group use the applications as the basis for the domain?

Slide 8: WebSphinx
- WebSphinx: Website-Specific Processors for HTML INformation eXtraction (2002)
- The WebSphinx class library provides support for writing web crawlers in Java
- Designation: small-scope crawls for mirroring, offline viewing, and hyperlink trees
- Extensible to saving information about page elements

Slide 9: WebSphinx Hyperlink Tree [figure]

Slide 10: WebSphinx – Domain Mapping [class diagram]
Domain classes: Extractor, Scheduler, Settings (Configuration), Link, Spider, Queue, Fetcher, PageData, StorageManager, Mirror, Element, Thread, Robots, Filters
- Mirror: a collection of files (pages) intended to provide a perfect copy of another website
- Element: web pages are composed of many elements (…); elements can be nested (for example, … will have many child elements)
- Link: a link is a type of element (usually …) which points to a specific page or file; storing information about each link relative to our seeds can help us analyze results
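The idea of storing each link relative to the seed, which is the basis of WebSphinx's hyperlink tree, can be sketched as a breadth-first crawl that records a parent and a depth for every page. A minimal Python sketch (the site is a toy stand-in and the function name is illustrative, not WebSphinx's API):

```python
from collections import deque

PAGES = {  # toy site: URL -> links on that page (stand-in for fetched pages)
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": [],
}

def link_tree(seed, fetch):
    """BFS crawl recording, for every page, its parent link and depth from the seed."""
    parent = {seed: None}
    depth = {seed: 0}
    q = deque([seed])
    while q:
        url = q.popleft()
        for link in fetch(url):
            if link not in parent:      # first discovery wins: a tree, not a graph
                parent[link] = url
                depth[link] = depth[url] + 1
                q.append(link)
    return parent, depth

parent, depth = link_tree("seed", lambda u: PAGES.get(u, []))
```

Because BFS discovers each page first along a shortest path, the recorded depth is the minimum number of links from the seed, which is the quantity a hyperlink-tree view displays.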

Slide 11: WebSphinx [figure]

Slide 12: WebLech
- WebLech allows you to "spider" a website and recursively download all the pages on it.

Slide 13: WebLech
WebLech is a fully featured website download/mirror tool written in Java. It supports:
- downloading websites
- emulating standard web-browser behavior
WebLech is multithreaded and will feature a GUI console.

Slide 14: WebLech
- Open-source MIT license: it's totally free and you can do what you want with it
- Pure Java code: you can run it on any Java-enabled computer
- Multi-threaded operation for downloading lots of files at once
- Supports basic HTTP authentication for accessing password-protected sites
- HTTP referrer support maintains link information between pages (needed to spider some websites)

Slide 15: WebLech
Lots of configuration options:
- Depth-first or breadth-first traversal of the site
- Candidate URL filtering, so you can stick to one web server, one directory, or just spider the whole web
- Configurable caching of downloaded files allows restart without needing to download everything again
- URL prioritization, so you can get interesting files first and leave boring files till last (or ignore them completely)
- Checkpointing, so you can snapshot the spider's state in the middle of a run and restart without repeating lots of processing
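Depth-first versus breadth-first traversal and URL prioritization, as listed above, differ only in the data structure used for the frontier: a queue gives BFS, a stack gives DFS, and a priority heap gives prioritized crawling. A sketch in Python (WebLech itself is Java; the site map and priority function here are made up for illustration):

```python
import heapq
from collections import deque

SITE = {  # toy site map: URL -> links
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post1"],
    "/docs/api": [],
    "/blog/post1": [],
}

def crawl_order(seed, links, strategy="bfs", priority=None):
    """Visit order under a frontier strategy:
    'bfs' (FIFO queue), 'dfs' (LIFO stack), 'priority' (heap keyed by priority())."""
    if strategy == "priority":
        frontier = [(priority(seed), seed)]
    else:
        frontier = deque([seed])
    visited, order = set(), []
    while frontier:
        if strategy == "bfs":
            url = frontier.popleft()
        elif strategy == "dfs":
            url = frontier.pop()
        else:
            _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in links(url):
            if link not in visited:
                if strategy == "priority":
                    heapq.heappush(frontier, (priority(link), link))
                else:
                    frontier.append(link)
    return order

links = lambda u: SITE.get(u, [])
bfs = crawl_order("/", links, "bfs")
dfs = crawl_order("/", links, "dfs")
# "Interesting files first": give /docs URLs a lower (better) priority value.
docs_first = crawl_order("/", links, "priority",
                         priority=lambda u: 0 if u.startswith("/docs") else 1)
```

All three strategies eventually visit the same reachable set; only the order differs, which is exactly what the configuration options on this slide trade off.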

Slide 16: WebLech Class Diagram [figure]
Slides 17–18: WebLech Class Diagram, continued [figures]
Slide 19: WebLech Sequence Diagram [figure]
Slide 20: WebLech Sequence Diagram, continued [figure]
Slide 21: WebLech Common Features [figure]
Slide 22: WebLech Common Features, continued [figure]
Slide 23: WebLech Unique Features [figure]

Slide 24: Grub Crawler
- A little bit about the SETI@home distributed-computing project
- What are distributed crawlers?
- Why distributed crawlers?
- Pros & cons of distributed crawlers
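One common way distributed crawlers split work, in the same spirit as the SETI-style distribution this slide alludes to, is to partition the URL space among clients, for example by hashing the host name. The sketch below is an illustrative scheme, not Grub's actual protocol:

```python
import hashlib

def assign_worker(url, n_workers):
    """Deterministically assign a URL's host to one of n_workers clients,
    so all pages of one site go to the same client (helps politeness and
    duplicate detection without cross-client coordination)."""
    host = url.split("//", 1)[-1].split("/", 1)[0]
    digest = hashlib.sha256(host.encode()).hexdigest()
    return int(digest, 16) % n_workers

urls = ["http://a.example/x", "http://a.example/y", "http://b.example/"]
workers = [assign_worker(u, 4) for u in urls]
```

Because the assignment depends only on the host name, any client can decide locally which peer owns a newly discovered URL, which is one of the "pros" of the distributed design; a "con" is that a hot host cannot be split across clients.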

Slide 25: Grub Class Diagram [figure]
Slide 26: Grub Class Diagram (2) – Spider & Thread, Config & Robot [figures]
Slide 27: Grub Class Diagram (3) – Fetcher, Extractor, Queue & StorageManager [figures]
Slide 28: Grub Sequence Diagram [figure]
Slide 29: Grub Sequence Diagram, continued [figure]
Slide 30: Grub Use Case [figure]

Slide 31: Aperture
- Development year: 2005
- Designation: crawling and indexing
- Crawls different information systems
- Handles many common file formats
- Flexible architecture
Main process phases:
1. Fetch information from a chosen source
2. Identify the source type (MIME type detection)
3. Extract full text and metadata
4. Store and index the information
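Phases 2–3 above (identify the source type, then extract) amount to a MIME-type dispatch table: detect the type, then look up the matching extractor. A sketch in Python with hypothetical extractor functions (Aperture's real extractor registry is a Java API and works differently in detail):

```python
import mimetypes

# Hypothetical extractors: each turns one format into plain "full text".
def extract_html(data):
    return data.replace("<p>", "").replace("</p>", "")

def extract_plain(data):
    return data

EXTRACTORS = {
    "text/html": extract_html,
    "text/plain": extract_plain,
}

def identify_and_extract(filename, data):
    """Phase 2: guess the MIME type; phase 3: run the matching extractor."""
    mime, _ = mimetypes.guess_type(filename)
    extractor = EXTRACTORS.get(mime)
    if extractor is None:
        raise ValueError(f"no extractor registered for {mime!r}")
    return mime, extractor(data)

mime, text = identify_and_extract("page.html", "<p>hello</p>")
```

Keeping extractors behind a type-keyed table is what makes the architecture "flexible": supporting a new format means registering one more entry, without touching the crawl or storage phases.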

Slide 32: Aperture Web Demo
Go to: http://www.dfki.unikl.de/ApertureWebProject/

Slide 33: Aperture Class Diagram [figure]
- Aperture offers a crawler for each data source; our domain focuses on web crawling.
- Aperture offers many extractors, able to extract data and metadata from files, email, sites, calendars, etc.
- Domain classes mapped: Spider, SpiderConfig, Queue, Thread, Scheduler, Robots, Fetcher, CrawlerHelper, Extractor, StorageManager, DB (plus crawler types and extractor types)
Unique to Aperture:
- DataObject / RDFContainer: represent a source object after it is fetched; the object includes the source's data and metadata in RDF format
- Mime: identifies the source type in order to choose the correct extractor
- CrawlReport (interface): helps the crawler keep track of crawling status changes, failures, and successes

Slide 34: Aperture Sequence Diagram [figure]

Slide 35: Summary – ADOM
- ADOM was helpful in establishing the domain requirements
- As our understanding of ADOM improved, abstraction became easier; the level of abstraction increased with each assignment
- Using XOR and OR constraints on relations was helpful in creating the domain class diagram
- It was difficult not to get carried away with "it's optional, no harm in adding it" decisions

Slide 36: Summary – Domain Modeling
- Difficulty modeling functional entities: functions are often contained within another class
- Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
- Vast differences in application scale
- Next time, we'll pick a different domain…

Slide 37: Crawlers
Thank you. Any questions?

