1 Web Categorization Crawler – Part I
Mohammed Agabaria, Adam Shobash
Supervisor: Victor Kulikov
Winter 2009/10, Final Presentation, Sep. 2010

2 Contents
 Crawler Overview
   Introduction and Basic Flow
   Crawling Problems
 Project Technologies
 Project Main Goals
 System High Level Design
 System Design
   Crawler Application Design
   Frontier Structure
   Worker Structure
   Database Design – ERD of DB
   Storage System Design
 Web Application GUI
 Summary

3 Crawler Overview – Intro.
 A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
 The Crawler starts with a list of URLs to visit, called the seeds list
 The Crawler visits these URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the frontier
 URLs from the frontier are recursively visited according to a predefined set of policies

4 Crawler Overview – Basic Flow
 The basic flow of a standard crawler is as seen in the illustration and as follows:
 The Frontier, which contains the URLs to visit, is initialized with the seed URLs
 A URL is picked from the frontier and the page at that URL is fetched from the internet
 The fetched page is parsed in order to:
 Extract hyperlinks from the page
 Process the page
 Add the extracted URLs to the Frontier
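As a rough illustration of this flow in C# (the project's implementation language), a single-threaded crawl loop might look like the sketch below. All names here are illustrative rather than the project's actual API, and the regex link extraction is deliberately naive.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// A minimal, single-threaded sketch of the basic crawl flow.
// Illustrative only: the real project uses its own Frontier and Worker classes.
class BasicCrawlLoop
{
    static void Crawl(IEnumerable<string> seeds)
    {
        var frontier = new Queue<string>(seeds);   // frontier initialized with seed URLs
        var seen = new HashSet<string>(seeds);     // pages we already know about

        using (var web = new WebClient())
        {
            while (frontier.Count > 0)
            {
                string url = frontier.Dequeue();           // pick a URL from the frontier
                string page;
                try { page = web.DownloadString(url); }    // fetch the page
                catch (WebException) { continue; }         // skip unreachable pages

                // extract hyperlinks (naive regex, for illustration only)
                foreach (Match m in Regex.Matches(page, "href=\"(https?://[^\"]+)\""))
                {
                    string link = m.Groups[1].Value;
                    if (seen.Add(link))            // true only for unseen URLs
                        frontier.Enqueue(link);    // add the extracted URL to the frontier
                }
                // ... "process the page" (categorization etc.) would happen here
            }
        }
    }
}
```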

5 Crawling Problems
 The World Wide Web contains a large volume of data
 A crawler can only download a fraction of the Web pages
 Thus there is a need to prioritize and speed up downloads, and to crawl only the relevant pages
 Dynamic page generation
 May cause duplication in the content retrieved by the crawler
 Also causes crawler traps
 Endless combinations of HTTP requests to the same page
 Fast rate of change
 Pages that were downloaded may have changed since the last time they were visited
 Some crawlers may need to revisit pages in order to keep their data up to date

6 Project Technologies
 C# (C Sharp), a simple, modern, general-purpose, object-oriented programming language
 ASP.NET, a web application framework
 Relational database
 SQL, a database computer language for managing data
 SVN, a revision control system to maintain current and historical versions of files

7 Project Main Goals
 Design and implement a scalable and extensible crawler
 Multi-threaded design in order to utilize all the system resources
 Increase the crawler's performance by implementing efficient algorithms and data structures
 The Crawler will be designed in a modular way, with the expectation that new functionality will be added by others
 Build a user-friendly web application GUI exposing all the supported features and the crawl progress

8 System High Level Design
[Diagram: the Main GUI stores and loads configurations and views results through the Storage System and Data Base; the Crawler (a Frontier feeding worker1, worker2, worker3, …) loads its configurations and stores its results the same way.]
 There are 3 major parts in the System
 Crawler (Server Application)
 StorageSystem
 Web Application GUI (User)

9 Crawler Application Design
 Maintains and activates both the Frontier and the Workers
 The Frontier is the data structure that holds the URLs to visit
 A Worker's role is to fetch and process pages
 Multi-threaded
 There is a predefined number of Worker threads
 There is a single Frontier thread
 The shared resources must be protected from simultaneous access
 The shared resource between the Workers and the Frontier is the queue that holds the URLs to visit (a sketch of this protection follows below)
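The slides don't say which synchronization primitive protects that shared queue, so the sketch below simply wraps a Queue<string> in a lock; the class and method names are illustrative.

```csharp
using System.Collections.Generic;

// Sketch: one way to protect the shared URL queue from simultaneous
// access by the Frontier thread and the Worker threads.
class SharedUrlQueue
{
    private readonly Queue<string> queue = new Queue<string>();
    private readonly object gate = new object();

    // Called by the Frontier thread to hand a URL to a worker.
    public void Enqueue(string url)
    {
        lock (gate) { queue.Enqueue(url); }
    }

    // Called by a Worker thread; returns false when the queue is empty.
    public bool TryDequeue(out string url)
    {
        lock (gate)
        {
            if (queue.Count > 0)
            {
                url = queue.Dequeue();
                return true;
            }
            url = null;
            return false;
        }
    }
}
```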

10 Frontier Structure
 Maintains the data structure that contains all the URLs that have not been visited yet
 FIFO Queue *
 Distributes the URLs uniformly between the workers
[Diagram: requests from the Frontier Queue go through an Is-Seen test; already-seen requests are deleted, unseen requests are routed to the Worker Queues.]
(*) first implementation
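A minimal sketch of that routing behavior, assuming a HashSet for the is-seen test and round-robin distribution (the slide only says the URLs are spread uniformly between the workers):

```csharp
using System.Collections.Generic;

// Sketch of the first (FIFO) frontier: an is-seen test followed by
// uniform, round-robin routing to the worker queues. Names illustrative.
class FifoFrontier
{
    private readonly HashSet<string> seen = new HashSet<string>();
    private readonly List<Queue<string>> workerQueues = new List<Queue<string>>();
    private int next; // round-robin index

    public FifoFrontier(int workers)
    {
        for (int i = 0; i < workers; i++)
            workerQueues.Add(new Queue<string>());
    }

    public void RouteRequest(string url)
    {
        if (!seen.Add(url))
            return; // already seen: the request is deleted

        workerQueues[next].Enqueue(url);         // route to a worker queue
        next = (next + 1) % workerQueues.Count;  // distribute uniformly
    }
}
```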

11 Worker Structure
 The Worker fetches a page from the Web and processes the fetched page with the following steps:
 Extracting all the hyperlinks from the page
 Filtering out part of the extracted URLs
 Ranking the URL*
 Categorizing the page*
 Writing the results to the database
 Writing the extracted URLs back to the frontier
[Diagram: Worker Queue → Fetcher → Extractor → URL filter → Page Ranker → Categorizer → DB, with extracted URLs returned to the Frontier Queue.]
(*) will be implemented in part II
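Put together, one Worker iteration might look like the following sketch, reusing the FifoFrontier sketch above. The private helpers stand in for the project's Fetcher, Extractor, URL filter, Categorizer, and DB components, whose real signatures aren't given on the slides.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Sketch of one Worker iteration, mirroring the slide's steps.
class Worker
{
    private readonly Queue<string> workerQueue;
    private readonly FifoFrontier frontier;

    public Worker(Queue<string> workerQueue, FifoFrontier frontier)
    {
        this.workerQueue = workerQueue;
        this.frontier = frontier;
    }

    public void ProcessNext()
    {
        string url = workerQueue.Dequeue();        // take the next request
        string page = Fetch(url);                  // fetch the page from the Web

        foreach (string link in Filter(ExtractLinks(page)))
            frontier.RouteRequest(link);           // write URLs back to the frontier

        string category = Categorize(page);        // part II: categorize the page
        SaveResult(url, category);                 // write the results to the DB
    }

    // Stand-ins for the real components:
    private string Fetch(string url)
    {
        using (var web = new WebClient()) return web.DownloadString(url);
    }

    private List<string> ExtractLinks(string page)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(page, "href=\"(https?://[^\"]+)\""))
            links.Add(m.Groups[1].Value);
        return links;
    }

    private List<string> Filter(List<string> urls) { return urls; }    // e.g. allowed networks
    private string Categorize(string page) { return "uncategorized"; } // see part II
    private void SaveResult(string url, string category) { /* INSERT into Results */ }
}
```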

12 Class Diagram of Worker

13 Class Diagram of Worker – Cont.

14 Class Diagram of Worker – Cont.

15 ERD of Data Base
 Tables in the Data Base:
 Task, contains basic details about the task
 TaskProperties, contains the following properties of a task: seed list, allowed networks, restricted networks*
 Results, contains details about the results that the crawler has reached
 Category, contains details about all the categories that have been defined
 Users, contains details about the users of the system**
(*) Any other properties can be added and used easily
(**) Not used in the current GUI
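The slide gives only the tables and their purposes, so the column layout below is an assumption; it sketches how those tables might map onto C# data classes.

```csharp
using System.Collections.Generic;

// Hypothetical mapping of the DB tables onto C# data classes.
// Only the table names and their described contents come from the slide;
// every field name here is an assumption.
class CrawlTask      { public int TaskId; public string Name; }
class TaskProperties { public int TaskId;
                       public List<string> SeedList;
                       public List<string> AllowedNetworks;
                       public List<string> RestrictedNetworks; }
class Result         { public int TaskId; public string Url; public int CategoryId; }
class Category       { public int CategoryId; public string Name; public List<string> Keywords; }
class User           { public int UserId; public string UserName; } // not used in the current GUI
```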

16 Storage System
 The Storage System is the connector class between the GUI and the Crawler on one side and the DB on the other
 Using the Storage System you can save data into the database or extract data from it
 The Crawler uses the Storage System to extract the configurations of a task from the DB, and to save the results to the DB
 The GUI uses the Storage System to save the configurations of a task into the DB, and to extract the results from the DB
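Expressed as a C# interface, the Storage System's two-sided role might look like this sketch; the method names are illustrative, not the project's actual API, and the types reuse the hypothetical data classes above.

```csharp
using System.Collections.Generic;

// Sketch of the Storage System as the single access point to the DB.
interface IStorageSystem
{
    // Used by the GUI:
    void SaveTaskConfiguration(CrawlTask task, TaskProperties properties);
    List<Result> LoadResults(int taskId);

    // Used by the Crawler:
    TaskProperties LoadTaskConfiguration(int taskId);
    void SaveResult(Result result);
}
```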

17 Class Diagram of Storage System

18 Web Application GUI
 Simple and convenient to use
 User friendly
 The user can do the following:
 Edit and create a task
 Launch the Crawler
 View the results that the crawler has reached
 Stop the Crawler

19 Web Categorization Crawler – Part II
Mohammed Agabaria, Adam Shobash
Supervisor: Victor Kulikov
Spring 2009/10, Final Presentation, Dec. 2010

20 Contents
 Reminder From Part I
   Crawler Overview
   System High Level Design
   Worker Structure
   Frontier Structure
 Project Technologies
 Project Main Goals
 Categorizing Algorithm
 Ranking Algorithm
   Motivation
   Background
   Ranking Algorithm
 Frontier Structure – Enhanced
   Ranking Trie
   Basic Flow
 Summary

21 Reminder: Crawler Overview
 A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
 The Crawler starts with a list of URLs to visit, called the seeds list
 The Crawler visits these URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the frontier
 URLs from the frontier are recursively visited according to a predefined set of policies

22 Reminder: System High Level Design
[Diagram: the Main GUI stores and loads configurations and views results through the Storage System and Data Base; the Crawler (a Frontier feeding worker1, worker2, worker3, …) loads its configurations and stores its results the same way.]
 There are 3 major parts in the System
 Crawler (Server Application)
 StorageSystem
 Web Application GUI (User)

23 Reminder: Worker Structure
 The Worker fetches a page from the Web and processes the fetched page with the following steps:
 Extracting all the hyperlinks from the page
 Filtering out part of the extracted URLs
 Ranking the URL
 Categorizing the page
 Writing the results to the database
 Writing the extracted URLs back to the frontier
[Diagram: Worker Queue → Fetcher → Extractor → URL filter → Page Ranker → Categorizer → DB, with extracted URLs returned to the Frontier Queue.]

24 Reminder: Frontier Structure
 Maintains the data structure that contains all the URLs that have not been visited yet
 FIFO Queue *
 Distributes the URLs uniformly between the workers
[Diagram: requests from the Frontier Queue go through an Is-Seen test; already-seen requests are deleted, unseen requests are routed to the Worker Queues.]
(*) first implementation

25 Project Technologies
 C# (C Sharp), a simple, modern, general-purpose, object-oriented programming language
 ASP.NET, a web application framework
 Relational database
 SQL, a database computer language for managing data
 SVN, a revision control system to maintain current and historical versions of files

26 Project Main Goals
 Support categorization of web pages: try to match the given content to predefined categories
 Support ranking of web pages: build a ranking algorithm that evaluates the relevance (rank) of an extracted link based on the content of the parent page
 A new implementation of the frontier that passes on requests according to their rank; it should be a fast and memory-efficient data structure

27 Categorization Algorithm
 Tries to match the given content to predefined categories
 Every category is described by a list of keywords
 The final match result has two factors:
 Match Percent, which describes the match between the category keywords and the given content*
 Non-Zero Match, which describes how many different keywords appeared in the content at all
 The total match level of the content to a category is obtained from the sum of the two factors above
(*) each keyword has a maximum limit on how many times it can be counted; any additional appearances won't be counted
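The exact formulas appeared as images on the original slide, so the arithmetic below is an assumption; it sketches the two factors in C# as a capped keyword count normalized into a match percent, plus a bonus for how many distinct keywords appeared.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the two-factor categorization match. The precise weighting
// is an assumption: the slide's formulas were shown as images.
static class CategoryMatcher
{
    public static double MatchLevel(IList<string> contentWords,
                                    IList<string> keywords,
                                    int maxPerKeyword)
    {
        int cappedHits = 0;  // keyword hits, each keyword capped at maxPerKeyword
        int distinct = 0;    // how many different keywords appeared at all

        foreach (string kw in keywords)
        {
            int count = 0;
            foreach (string w in contentWords)
                if (string.Equals(w, kw, StringComparison.OrdinalIgnoreCase))
                    count++;

            cappedHits += Math.Min(count, maxPerKeyword); // extra appearances not counted
            if (count > 0) distinct++;
        }

        double matchPercent = (double)cappedHits / Math.Max(1, contentWords.Count);
        double nonZeroBonus = (double)distinct / keywords.Count;
        return matchPercent + nonZeroBonus;  // total match level: sum of both factors
    }
}
```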

28 Categorization Algorithm – Cont.
 Overall categorization progress when matching a certain page to a specific category:
[Diagram: the Page Content becomes a WordList, which together with the Category Keywords (Keyword1 … Keyword n) feeds a NonZero Calculator and a Matcher Calculator; the resulting NonZero Bonus and Match Percent are summed into the Total Match Level.]

29 Ranking Algorithm – Motivation
 The World Wide Web contains a large volume of data
 A crawler can only download a fraction of the Web pages
 Thus there is a need to prioritize downloads and to crawl only the relevant pages
 Solution:
 Give every extracted URL a rank according to its relevance to the categories defined by the user
 The frontier will pass on the URLs with the higher ranks
 Relevant pages will be visited first
 The quality of the Crawler depends on the correctness of the ranker

30 Ranking Algorithm – Background
 Ranking is a kind of prediction
 The rank must be given to the URL when it is extracted from a page
 It is meaningless to rank a page after we have already downloaded it
 The content of the URL is unavailable when it is extracted
 The crawler hasn't downloaded it yet
 The only information we can make use of when the URL is extracted is the page from which it was extracted (aka the parent page)
 Ranking is done according to the following factors*:
 The rank given to the parent page
 The relevance of the parent page content
 The relevance of the nearby text content of the extracted URL
 The relevance of the anchor of the extracted URL
 Anchor is the text that appears on the link
(*) Based on the Shark-Search algorithm

31 Ranking Algorithm – The Formula*
 Predicts the relevance of the content of the page behind the extracted URL
 The final rank of the URL depends on the following factors (the formulas themselves appeared as figures on the original slide):
 Inherited, which describes the relevance of the parent page to the categories
 Neighborhood, which describes the relevance of the nearby text and of the anchor of the URL; its ContextRank term measures the relevance of the text surrounding the link
 The total rank given to the extracted URL is obtained from the aforementioned factors
(*) Based on the Shark-Search algorithm
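Since the slide's formulas are not recoverable from the transcript, the following is only a hedged reconstruction for orientation: the published Shark-Search scoring that the slide cites has roughly this shape, where gamma, delta, and beta are tunable weights in [0, 1].

```latex
% Shape of the Shark-Search-style scoring (reconstruction, not the
% slide's exact formulas; gamma, delta, beta are tunable weights).
\begin{align*}
\mathrm{Rank}(url) &= \gamma \cdot \mathrm{Inherited}(url)
                    + (1-\gamma)\cdot \mathrm{Neighborhood}(url)\\
\mathrm{Inherited}(url) &= \delta \cdot \mathrm{relevance}(\text{parent page})\\
\mathrm{Neighborhood}(url) &= \beta \cdot \mathrm{AnchorRank}(url)
                    + (1-\beta)\cdot \mathrm{ContextRank}(url)\\
\mathrm{ContextRank}(url) &= \mathrm{relevance}(\text{text near the link in the parent page})
\end{align*}
```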

32 Frontier Structure – Ranking Trie
 A customized data structure that saves the URL requests efficiently
 Holds two sub data structures:
 Trie, a data structure that holds URL strings efficiently, used for the already-seen test
 RankTable, an array of entries; each entry holds a list of all the URL requests that share the rank level given by the array index
 Supports the URL-seen test in O(|urlString|)
 Every seen URL is saved in the trie
 Supports passing on the highest-ranked URLs first in O(1)
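A condensed C# sketch of such a RankingTrie, assuming ranks are discretized into small integer levels that index the table (the slide doesn't specify how ranks are discretized):

```csharp
using System.Collections.Generic;

// Sketch of the RankingTrie: a trie for the O(|url|) already-seen test
// plus a rank-indexed table of request lists for best-first routing.
class RankingTrie
{
    private class TrieNode
    {
        public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
        public bool IsUrlEnd;
    }

    private readonly TrieNode root = new TrieNode();
    private readonly List<string>[] rankTable;
    private int highest = -1; // highest non-empty rank level

    public RankingTrie(int rankLevels)
    {
        rankTable = new List<string>[rankLevels];
        for (int i = 0; i < rankLevels; i++) rankTable[i] = new List<string>();
    }

    // O(|url|): walk/extend the trie; returns false if the URL was seen before.
    public bool Add(string url, int rank)
    {
        TrieNode node = root;
        foreach (char c in url)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        if (node.IsUrlEnd) return false; // URL-seen test: already stored
        node.IsUrlEnd = true;

        rankTable[rank].Add(url);        // file the request under its rank level
        if (rank > highest) highest = rank;
        return true;
    }

    // Amortized O(1): pop a request from the highest non-empty rank level.
    public string TakeHighestRanked()
    {
        while (highest >= 0 && rankTable[highest].Count == 0) highest--;
        if (highest < 0) return null;

        List<string> bucket = rankTable[highest];
        string url = bucket[bucket.Count - 1];
        bucket.RemoveAt(bucket.Count - 1);
        return url;
    }
}
```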

33 Frontier Structure – Overall
 The Frontier is based on the RankingTrie data structure
 Saves/updates all newly forwarded requests in the ranking trie
 When a new URL request arrives, the frontier simply adds it to the RankingTrie
 When the frontier needs to route a request, it takes the highest-ranked request saved in the RankingTrie and routes it to the suitable worker queue (see the usage sketch below)
[Diagram: the Frontier Queue feeds the Ranking Trie, which routes requests to the Worker Queues.]
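A short usage sketch of that flow, reusing the RankingTrie sketch above; the number of rank levels, the example rank, and the round-robin worker selection are all assumptions.

```csharp
using System.Collections.Generic;

// Usage sketch of the enhanced frontier loop (illustrative only).
class FrontierDemo
{
    static void Main()
    {
        var rankingTrie = new RankingTrie(100);  // assume 100 discrete rank levels
        var workerQueues = new List<Queue<string>>
            { new Queue<string>(), new Queue<string>() };
        int nextWorker = 0;

        // a new URL request arrives from a worker (rank 42 is just an example):
        rankingTrie.Add("http://example.com/page", 42);

        // routing: forward the highest-ranked pending request to a worker queue
        string next = rankingTrie.TakeHighestRanked();
        if (next != null)
        {
            workerQueues[nextWorker].Enqueue(next);
            nextWorker = (nextWorker + 1) % workerQueues.Count;
        }
    }
}
```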

34 Summary
 Goals achieved:
 Understanding ranking methods
 Especially the Shark-Search algorithm
 Implementing the categorizing algorithm
 Implementing an efficient frontier which supports ranking
 Implementing a multithreaded Web Categorization Crawler with full functionality

