
1 Crawlers - March 2008 (Web) Crawlers Domain. Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch

2 Crawlers  Introduction (3)  Crawler Basics (4)  Domain Terminology (5)  In-Depth Domain Elaboration (6)  Application Examples (7)  UM Domain Analysis (8-33)  CM Domain Analysis (34-44)  Lessons Learned (45-46)  Conclusion (47-49)

3 Introduction  A little bit about search engines  How do search engines work?  Why are crawlers needed?  Many names, same meaning: crawler, spider, robot, bot, Grub, spy  The Google phenomenon: founded by Larry Page and Sergey Brin, September 1998

4 Crawler Basics  What is a crawler?  How do crawlers work?  Crawling web pages  What pages should the crawler download?  How should the crawler refresh pages?  How should the load on the visited Web sites be minimized?  How do crawlers index web pages?  Link indexing  Text indexing  How do crawlers save data?  Scalability: distribute the repository across a cluster of computers and disks  Large bulk updates: the repository needs to handle a high rate of modifications  Obsolete pages: there must be a mechanism for detecting and removing obsolete pages.
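The fetch/index/queue cycle described above can be sketched as follows. This is a minimal sketch only: the in-memory page table and the example URLs are invented stand-ins for real HTTP fetching and HTML parsing, and page refreshing is omitted.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links (stand-in for fetch + parse).
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: fetch a page, index its links, queue new links."""
    queue = deque(seeds)            # the frontier of URLs still to visit
    visited = set()                 # pages already downloaded (no refresh here)
    link_index = {}                 # URL -> extracted links (text indexing omitted)
    while queue and len(link_index) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = PAGES.get(url, [])  # a real crawler fetches over HTTP and parses HTML here
        link_index[url] = links
        queue.extend(links)
    return link_index

result = crawl(["http://a.example/"])
```

The `max_pages` bound stands in for the "what pages should the crawler download?" decision: real crawlers replace it with ranking and scheduling policies.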

5 Domain Terminology  Link – an HTML element which redirects the user to a different web page  URL – Uniform Resource Locator; an Internet World Wide Web address  Seeds – a set of URLs which are the crawler's starting point  Parser – the element responsible for link extraction from pages  Thread – an independent execution instance (with its own stack) within the same process  Queue – the element which holds the retrieved URLs  Politeness Policy – a common set of rules intended to protect sites from being overloaded while they are crawled  Repository – the resource which stores all the crawler's retrieved data
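As an illustration of the politeness-policy term above, here is a minimal per-host delay sketch. The class name and the delay value are illustrative, not from the slides; real crawlers also honor robots.txt and crawl-delay directives.

```python
import time
from urllib.parse import urlparse

class PolitenessPolicy:
    """Track the last request time per host and report how long the
    crawler must still wait before hitting that host again."""
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_hit = {}          # host -> time of the last permitted request

    def wait_time(self, url, now=None):
        """Return 0.0 if the URL may be fetched now (and record the hit),
        otherwise the remaining wait in seconds."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_hit.get(host)
        remaining = 0.0 if last is None else max(0.0, self.delay - (now - last))
        if remaining == 0.0:
            self.last_hit[host] = now
        return remaining

policy = PolitenessPolicy(delay_seconds=2.0)
```

Different hosts are throttled independently, so a multi-threaded crawler can stay busy on one site while waiting out the delay on another.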

6 Domain Elaboration  Rules which apply to the Domain:  All crawlers have a URL Fetcher  All crawlers have a Parser (Extractor)  Crawlers are multi-threaded processes  All crawlers have a Crawler Manager  Strongly related to the search engine domain
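The multi-threading rule above can be illustrated with a small sketch in which a manager hands a fixed frontier of URLs to worker threads. The in-memory page table is a hypothetical stand-in for the URL Fetcher and Parser; a real manager would also grow the frontier with newly discovered links.

```python
import queue
import threading

# Hypothetical page table standing in for real fetching/parsing.
PAGES = {
    "http://a.example/": ["http://b.example/"],
    "http://b.example/": [],
    "http://c.example/": ["http://a.example/"],
}

def worker(frontier, results, lock):
    """Each thread: take a URL, 'fetch' and 'parse' it, record its links."""
    while True:
        try:
            url = frontier.get_nowait()
        except queue.Empty:
            return                      # frontier exhausted, thread exits
        links = PAGES.get(url, [])      # URL Fetcher + Parser stand-in
        with lock:                      # results dict is shared by all threads
            results[url] = links

def crawl_threaded(urls, n_threads=4):
    """Crawler Manager stand-in: spawn worker threads over a fixed frontier."""
    frontier = queue.Queue()
    for u in urls:
        frontier.put(u)
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(frontier, results, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

fetched = crawl_threaded(sorted(PAGES))
```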

7 Application Examples  Many different crawlers doing different things  WebCrawler  Google Crawler  Heritrix  Mirroring applications

8 User Modeling

9 User Modeling: Class Diagram Main Classes: Spider: The Spider is the base component of the crawler; while each spider has its own unique way of performing, most spiders contain the same basic features:

10 User Modeling: Class Diagram Features:  Run/Kill: activation and deactivation of the spider.  Update: updating running parameters. In order to get the requested URLs, the SPIDER uses:

11 User Modeling: Class Diagram In order to get the requested URLs, the SPIDER uses URL FETCH NOW: the basic class that actually fetches the URLs. The basic features are:  URLFetchNow: activation of the class.  Get/Fetch URL: gets the URL.

12 User Modeling: Class Diagram To configure the SPIDER's parameters: SPIDER CONFIG: the basic class that sets the SPIDER's configuration and lets the SPIDER update itself. Features: Set/Get Configuration.

13 User Modeling: Class Diagram To sort results we are going to need some kind of data structure; the most common is a queue: URL QUEUE HANDLER: a class containing a queue (or any other data structure) which sorts results. Features: Queue/Dequeue.

14 User Modeling: Class Diagram In order to make search and result handling more efficient we are going to use an INDEXER: a class that sets the most effective index and lets the spider use and set it. Features: SET/GET INDEX().

15 User Modeling: Class Diagram In order to control the SPIDER, an entity has to have the access to create and kill it; this entity will be updated by the queue or the SCHEDULER. We are going to use a CRAWLER MANAGER: a class that makes the calls whether a spider is created or killed. Features: Update By Scheduler/Queue: enables the queue/scheduler to inform the manager about ongoing activity.

16 User Modeling: Class Diagram In most cases we are going to use a database to store our results; for this we're going to use a class that communicates with the DB: STORAGE MANAGER: a class that writes the crawl results to the DB. Features:  Sort Info(): the MANAGER sorts info prior to writing it to the DB.  Write To DB(): writes crawl results to the DB.
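The classes on slides 9-16 can be sketched together as a minimal skeleton. Method names follow the slides; all bodies are illustrative stand-ins (the fetch is simulated and the DB is a plain list), not a definitive implementation.

```python
class SpiderConfig:
    """Set/Get Configuration for the spider."""
    def __init__(self, **params):
        self._params = params
    def set_configuration(self, **params):
        self._params.update(params)
    def get_configuration(self):
        return dict(self._params)

class URLFetchNow:
    """The class that actually fetches URLs (simulated here)."""
    def fetch_url(self, url):
        return f"<html>content of {url}</html>"   # stand-in for an HTTP GET

class URLQueueHandler:
    """Holds retrieved URLs; Queue/Dequeue keep the list sorted."""
    def __init__(self):
        self._items = []
    def queue(self, url):
        self._items.append(url)
        self._items.sort()
    def dequeue(self):
        return self._items.pop(0)

class Indexer:
    """SET/GET INDEX for more efficient result handling."""
    def __init__(self):
        self._index = {}
    def set_index(self, url, keywords):
        self._index[url] = keywords
    def get_index(self, url):
        return self._index.get(url, [])

class StorageManager:
    """Sorts info and writes crawl results to the 'DB' (a list here)."""
    def __init__(self):
        self.db = []
    def write_to_db(self, results):
        self.db.extend(sorted(results))   # Sort Info() before Write To DB()

class Spider:
    """Base crawler component, wired to its helper classes."""
    def __init__(self, config):
        self.config = config
        self.fetcher = URLFetchNow()
        self.queue_handler = URLQueueHandler()
        self.running = False
    def run(self):
        self.running = True
    def kill(self):
        self.running = False

class CrawlerManager:
    """Creates and kills spiders."""
    def create_spider(self, config):
        spider = Spider(config)
        spider.run()
        return spider
    def kill_spider(self, spider):
        spider.kill()
```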


18 User Modeling: Sequence (1) Getting Schedule: The MANAGER gets the next schedule. (Crawler Manager, Scheduler)

19 User Modeling: Sequence (2) Creating a new Spider: The MANAGER creates a new SPIDER. (Crawler Manager, SPIDER)

20 User Modeling: Sequence (3) Creating a new search: The MANAGER tells the SPIDER to start searching. (Crawler Manager, SPIDER)

21 User Modeling: Sequence (4) Getting an index: The SPIDER gets the index for the next crawl. (Crawler Manager, SPIDER)

22 User Modeling: Sequence (5) Actually fetching URLs: The SPIDER activates URL fetching. (SPIDER, URL FETCH NOW)

23 User Modeling: Sequence (6) Queuing results: The SPIDER sends results to the queue. (SPIDER, URL QUEUE HANDLER)

24 User Modeling: Sequence (7) Dequeuing results: The SPIDER dequeues sorted results. (SPIDER, URL QUEUE HANDLER)

25 User Modeling: Sequence (8) Writing to DB: The SPIDER sends sorted results to the DB. (SPIDER, STORAGE HANDLER)

26 User Modeling: Sequence (9) Update Scheduler: The QUEUE HANDLER updates the SCHEDULER. (QUEUE HANDLER, SCHEDULER)

27 User Modeling: Sequence (10) Update Manager: The SCHEDULER updates the MANAGER. (SCHEDULER, CRAWLER MANAGER)

28 User Modeling: Sequence (11) Kill SPIDER: The MANAGER kills the SPIDER at the end of the process. (CRAWLER MANAGER, SPIDER)
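The sequence steps above can be walked through in code. This sketch records each message in a shared trace so the ordering is explicit; all class and method names are illustrative stand-ins for the diagram participants, and the index-lookup step (sequence 4) is omitted for brevity.

```python
trace = []  # ordered record of the messages exchanged

class Scheduler:
    def get_next_schedule(self):
        trace.append("manager->scheduler: get schedule")
        return ["http://a.example/"]
    def update_manager(self, manager):
        trace.append("scheduler->manager: update")
        manager.on_schedule_done()

class QueueHandler:
    def __init__(self):
        self.items = []
    def queue(self, url):
        trace.append("spider->queue: enqueue")
        self.items.append(url)
    def dequeue(self):
        trace.append("spider->queue: dequeue")
        return self.items.pop(0)
    def update_scheduler(self):
        trace.append("queue->scheduler: update")

class Spider:
    def __init__(self, queue_handler):
        self.queue_handler = queue_handler
        self.alive = True
    def start_search(self, urls):
        trace.append("manager->spider: start search")
        for url in urls:
            trace.append("spider->fetcher: fetch " + url)   # URL FETCH NOW
            self.queue_handler.queue(url)
        self.queue_handler.dequeue()
        trace.append("spider->storage: write to DB")        # STORAGE HANDLER
        self.queue_handler.update_scheduler()

class CrawlerManager:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.done = False
    def run_once(self):
        urls = self.scheduler.get_next_schedule()   # sequence 1
        spider = Spider(QueueHandler())             # sequence 2
        trace.append("manager->spider: create")
        spider.start_search(urls)                   # sequences 3, 5-8
        self.scheduler.update_manager(self)         # sequences 9-10 (manager side)
        spider.alive = False                        # sequence 11: kill
        trace.append("manager->spider: kill")
        return spider
    def on_schedule_done(self):
        self.done = True

spider = CrawlerManager(Scheduler()).run_once()
```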


30 Domain Patterns  How can a crawler cope with new page standards and conventions?  Fetch new-standard pages  Index new-standard pages  Factory Design Pattern

31 Domain Patterns (2)  The Parser class as a Factory design  Parses different page types: HTML, PDF, Word, etc.  The URL Fetcher class as a Factory design  Fetches pages over different protocols and conventions: UDP, TCP/IP, FTP, IPv6  How do we ensure we have only one Crawler Manager, Queue and Repository?  Singleton Design Pattern
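The two patterns above can be sketched briefly. Parser names and return values are illustrative placeholders for real link extraction.

```python
class Parser:
    """Factory product: each concrete parser handles one page type."""
    def parse(self, content):
        raise NotImplementedError

class HTMLParser(Parser):
    def parse(self, content):
        return "html-links"          # stand-in for real HTML link extraction

class PDFParser(Parser):
    def parse(self, content):
        return "pdf-links"

class ParserFactory:
    """Factory: supporting a new page standard only requires
    registering a new Parser subclass."""
    _registry = {"html": HTMLParser, "pdf": PDFParser}

    @classmethod
    def register(cls, page_type, parser_cls):
        cls._registry[page_type] = parser_cls

    @classmethod
    def create(cls, page_type):
        return cls._registry[page_type]()

class CrawlerManager:
    """Singleton: every instantiation returns the same single manager."""
    _instance = None
    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```

The same Singleton shape would apply to the single Queue and Repository mentioned on the slide.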

32 User Modeling: Lessons  A problem: too little info or too much info?  Scoping: where does a crawler begin and where does it end?  What is a general feature and what is a specific feature?  Code varies more than the Domain.  Automatic reverse engineering or manual?


34 Code Modeling

35 Code Modeling – Reverse Engineering – Applications (1)  Applications which were R.E.'d:  Arale, WebEater – basic web crawlers for file downloading (for offline viewing)  JoBo – advanced web crawler for file downloading (for offline viewing)  Heritrix – advanced distributed crawler for file downloading (to archives)  HyperSpider – basic crawler for displaying hyperlink trees

36 Code Modeling – Reverse Engineering – Applications (2)  Nutch (Lucene) – advanced distributed crawler / search engine for indexing  WebSphinx – crawler framework for mirroring and hyperlink tree display  Aperture – advanced crawler able to read HTTP, FTP, local files, for indexing

37 Code Modeling – Reverse Engineering – CASE Tool  Reverse engineering using Visual Paradigm for UML  Used only for class diagrams – use case + sequence diagrams were modeled by hand based on classes, usage and documentation  Good results for small applications, poor results for large applications (too much noise made the signal hard to find)

38 Code Modeling – Classes  Application class: a single class containing the main application elements; starts the crawling sequence based on parameters  Page Manager (Page): class holding all data relevant to a web (or local) page; may save the entire page or only a summary / relevant parts  Parameters: class holding parameters required for the application to run  Robots: class containing information on pages the crawler may not visit  Queue: class containing a list of links (pages) the crawler should visit  Thread: class containing information required for each crawler thread  Listener: class responsible for receiving pages from the internet  Extractor: class responsible for parsing pages and extracting links for the queue  Filters: classes responsible for deciding if a link should be queued or visited  Helpers: classes responsible for helping the crawler deal with forms, cookies, etc.  DB / Merger / External DB: classes required for saving data into databases for local / distributed applications with DBs
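As an illustration of how the Robots and Filters classes above interact when deciding whether a link should be queued: the sketch below uses simple URL-prefix matching, which is a deliberate simplification of real robots.txt handling, and the class names are illustrative.

```python
class Robots:
    """Pages the crawler may not visit, as disallowed URL prefixes."""
    def __init__(self, disallowed_prefixes):
        self.disallowed = list(disallowed_prefixes)
    def allowed(self, url):
        return not any(url.startswith(p) for p in self.disallowed)

class LinkFilter:
    """Decide whether a link should be queued: obey the robots rules
    and skip links that were already seen."""
    def __init__(self, robots):
        self.robots = robots
        self.seen = set()
    def should_queue(self, url):
        if url in self.seen or not self.robots.allowed(url):
            return False
        self.seen.add(url)           # marked so each URL is queued at most once
        return True

robots = Robots(["http://site.example/private/"])
link_filter = LinkFilter(robots)
```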

39 Code Modeling – Sequence (1)

40 Code Modeling – Sequence (2)

41 Code Modeling – Sequence (3)

42 Code Modeling – Sequence (4)

43 Code Modeling – Results Example

44 Code Modeling – Conclusions  Very difficult to reach domain-level abstraction based on code modeling  VP not very helpful in dealing with large applications (clutter)  Difficult to understand sequences and use cases correctly (no R.E. at all)  Documentation was often the most helpful tool for code modeling, rather than R.E.

45 Domain Modeling with ADOM  ADOM was helpful in establishing domain requirements  Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences  ADOM was not very helpful with abstraction, but that may be a function of the domain itself (functional)  End results difficult to read, but seem to provide a good domain framework for applications

46 Domain Problems and Issues  The crawler domain contains many functional entities which do not necessarily store information (difficult to model)  Many optional controller / manager entities (clutter with relations)  Vast difference in application scale  Entity / function containment

47 Future Work (1) Merging Code Modeling and User Modeling will be difficult:  User modeling focused mostly on large-scale crawlers (research focuses on these)  Mostly from a search engine perspective  Schedule-oriented  High level of abstraction

48 Future Work (2)  Code modeling focused mostly on smaller applications (easier to model, available)  Focus mostly on archival / mirroring  User-oriented  Medium level of abstraction

49 Future Work (3)  Merged product entities should be closer to User Modeling than Code Modeling (higher level of abstraction)  User vs. schedule  Indexing vs. archiving  Importance of optional entities

50 Web Crawlers Domain  Thank you  Any questions?

