Using the Web Efficiently: Mobile Crawlers August 7, 1999 Joachim Hammer Database Center University of Florida

1 Using the Web Efficiently: Mobile Crawlers
August 7, 1999
Joachim Hammer
Database Center, University of Florida
jhammer@cise.ufl.edu

2 Presentation Outline
17th Annual AoM/IAoM, Joachim Hammer - UF
- Problem Statement and Goal
- Web Crawling Techniques
  - Traditional Web Crawling
  - Mobile Web Crawling
- Mobile Crawling Architecture
  - Distributed Runtime Environment
  - Application Framework
- Performance Evaluation
- Summary and Future Work

3 What’s Wrong with the Web?
- Web represents large distributed hypertext system
  - 1.6 million Web sites
  - 320 million Web documents
  - 40% of the Web content changes within a month
  - Exponential growth rate
- Lacks structure (i.e., no strict hierarchy)

4 Web Indices and Search Engines
- Search engine statistics:
  - Index size: 30-110 million pages (approx. 700GB)
  - Web coverage: 10%-35% and decreasing!
  - Daily crawl: 3-10 million pages (approx. 60GB)
- Year 2000 estimates:
  - Index size: 880 million pages (approx. 5.6TB)
  - Daily crawl: 80 million pages (approx. 480GB)
- Traditional Web crawling will experience severe scaling problems in the near future
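As a rough sanity check on the figures above, the implied average page size can be worked out by dividing each data volume by its page count. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and pairs each volume with the upper end of the quoted page range, neither of which is stated on the slide.

```python
# Implied average page size from the slide's figures.
# Assumptions (not from the slide): decimal units, and each volume is
# paired with the upper end of the quoted page range.
GB = 10**9
TB = 10**12

figures = {
    "1999 index": (700 * GB, 110e6),
    "1999 daily crawl": (60 * GB, 10e6),
    "2000 index (est.)": (5.6 * TB, 880e6),
    "2000 daily crawl (est.)": (480 * GB, 80e6),
}

def avg_page_kb(total_bytes, pages):
    """Average bytes per page, expressed in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / pages / 1000

page_sizes = {name: avg_page_kb(b, p) for name, (b, p) in figures.items()}

for name, kb in page_sizes.items():
    print(f"{name}: about {kb:.1f} KB per page")
```

Under these assumptions all four figures work out to roughly 6 KB per page, so the year 2000 estimates appear to scale the page count rather than the page size.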

5 Goals
Long-term
- Overlay the distributed Web structure with a centralized information system which allows efficient resource discovery
  - Turn Web into an effectively organized and cataloged “digital library”
  - Topic-specific search engines, e.g., self-health care, consumer electronics, etc.
Project
- Find an alternative to the current “brute-force” approach to Web crawling/indexing

6 Traditional Web Crawling Approach
Based on the Google search engine (www.google.com)

7 Traditional Web Crawling
- Characteristics of traditional Web crawling:
  - Remote data access
  - Focus on rapid data retrieval
  - Centralized, database oriented architecture
  - Resource intensive
- Traditional Web crawling techniques do not exploit information about the pages being crawled
- “Download first–process later” approach

8 (Our) Mobile Crawling Approach

9 Mobile Web Crawling
- Crawler code migrates to host sites where pages are located
- Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission
- Characteristics:
  - Focus on effective data retrieval
  - Distributed, data source oriented architecture: access data where it is stored
  - Intelligent downloading of only relevant Web content
  - Resource preserving approach
- “Process first–download later” approach
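The contrast between "download first, process later" and "process first, download later" can be sketched with a toy model. The pages, the keyword test and all function names below are illustrative assumptions, and the model ignores the cost of shipping the crawler code to the remote host.

```python
# Toy comparison of network load for the two approaches.
# The pages, the keyword test and the function names are illustrative
# assumptions, not part of the original system.

PAGES = {
    "/a.html": "mobile crawlers reduce network load",
    "/b.html": "campus parking regulations",
    "/c.html": "web crawlers and search engines",
}

def is_relevant(text, keywords):
    """Keep a page if it mentions at least one keyword."""
    return any(k in text for k in keywords)

def traditional_crawl(pages, keywords):
    """Download first, process later: every page crosses the network."""
    transferred = sum(len(text.encode()) for text in pages.values())
    kept = [url for url, text in pages.items() if is_relevant(text, keywords)]
    return kept, transferred

def mobile_crawl(pages, keywords):
    """Process first, download later: only relevant pages are sent back."""
    kept = [url for url, text in pages.items() if is_relevant(text, keywords)]
    transferred = sum(len(pages[url].encode()) for url in kept)
    return kept, transferred

kept_t, bytes_t = traditional_crawl(PAGES, ["crawlers"])
kept_m, bytes_m = mobile_crawl(PAGES, ["crawlers"])
print(kept_t == kept_m, bytes_t, bytes_m)
```

Both crawlers end up with the same set of pages; the only difference in this model is how many bytes had to cross the network to get there.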

10 State-of-the-Art in Mobile Crawling
- Many search engines, little information
  - e.g., WWW Worm, WebCrawler, Lycos, Altavista, Infoseek, Excite, HotBot, Google
- Distributed search engines
  - e.g., Harvest
- Mobile code
  - Simplest form: downloadable Java applets
  - Code migration, software agents: very active research area
  - e.g., IBM Aglet infrastructure
- Crawling algorithms

11 Mobile Crawling Architecture

12 Architecture Highlights
- Distributed Crawler Runtime Environment
  - Platform independent execution environment for crawlers
  - Virtual machine for remote crawler execution
  - Communication layer to provide crawler transport service
- Application Framework
  - Communication layer to provide crawler transport service
  - Crawler manager to support crawler creation and configuration, controls crawler migration
  - Web site selection
  - Query engine as crawler/application (database) interface
  - Archive manager as database connectivity framework

13 A Word About Mobile Crawlers
- Crawler is a user-defined set of rules that executes on a virtual machine and collects facts (about Web pages)
  - Use CLIPS to represent crawler data (e.g., page-facts) and user-defined crawling strategies (as rules)
  - CLIPS - C Language Integrated Production System
- Advantages of rule-based approach
  - Easier to specify crawling rules than to devise a crawling algorithm
  - No need to model control flow
  - Rule-based programs have simple runtime states

14 Crawling Strategies
- General-purpose search engines use simple strategies
  - Crawl and index all pages, e.g., depth-first
- For subject-specific crawling, strategy is important
- Find as many of the important pages as possible while crawling the fewest pages overall
- Page importance [Cho et al., Stanford University]
  - Keyword frequency and location
  - Backlink count
  - PageRank
- Figure out when to return to crawler manager
  - Memory management issue
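One way to picture an importance-driven strategy is a crawl frontier ordered by backlink count, visiting the most-linked-to page next under a fixed page budget. This is a simplified sketch, not the strategy from the cited work: the link graph is hypothetical, only backlinks seen so far are counted, and ties are broken alphabetically.

```python
import heapq

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}

def crawl_by_backlinks(links, seed, budget):
    """Visit up to `budget` pages, preferring pages with more known backlinks."""
    backlinks = {seed: 0}
    visited = []
    frontier = [(0, seed)]  # min-heap of (-backlink count, url)
    while frontier and len(visited) < budget:
        neg_count, url = heapq.heappop(frontier)
        if url in visited or -neg_count != backlinks[url]:
            continue  # already crawled, or a stale queue entry
        visited.append(url)
        for target in links.get(url, []):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in visited:
                heapq.heappush(frontier, (-backlinks[target], target))
    return visited

order = crawl_by_backlinks(LINKS, "seed", budget=4)
print(order)
```

With a budget of four pages, page "d" (two known backlinks) is crawled ahead of page "c" (one), whereas a breadth-first crawl would reach "c" first.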

15 Crawler Virtual Machine
- How to execute a rule-based crawler specification?
  - Crawler execution = rule application upon fact base
  - Use inference engine (JESS) for the rule application process
  - JESS is platform independent and extensible
1. Initialization
  - Insert rules and facts into inference engine
2. Rule application
  - Start rule application process within inference engine
3. Finalization
  - Extract rules and facts once rule application has stopped
  - Store back into crawler
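The three phases above can be sketched in a few lines. The slides name JESS as the inference engine; the tiny forward-chaining loop below is only a stand-in for it, and every class, function and rule name is a hypothetical illustration.

```python
# Minimal stand-in for the three phases on the slide. The real system uses
# the JESS inference engine; this tiny forward-chaining loop, and every
# class and rule name in it, is a hypothetical illustration.

class TinyEngine:
    def __init__(self):
        self.rules, self.facts = [], set()

    def run(self):
        """Apply rules until no rule adds a new fact."""
        changed = True
        while changed:
            changed = False
            for rule in self.rules:
                new_facts = rule(self.facts) - self.facts
                if new_facts:
                    self.facts |= new_facts
                    changed = True

class Crawler:
    def __init__(self, rules, facts):
        self.rules, self.facts = list(rules), set(facts)

def execute(crawler):
    engine = TinyEngine()
    # 1. Initialization: insert rules and facts into the engine.
    engine.rules, engine.facts = list(crawler.rules), set(crawler.facts)
    # 2. Rule application: run until the engine stops.
    engine.run()
    # 3. Finalization: extract rules and facts, store them back.
    crawler.rules, crawler.facts = engine.rules, engine.facts
    return crawler

def mark_hot(facts):
    """Rule: a page whose text mentions 'crawler' becomes a hot page."""
    return {("hot", f[1]) for f in facts if f[0] == "page" and "crawler" in f[2]}

crawler = execute(Crawler([mark_hot], {
    ("page", "/a.html", "mobile crawler design"),
    ("page", "/b.html", "parking rules"),
}))
print(sorted(crawler.facts))
```

Because the crawler's whole state is its rules plus its facts, it can be paused, moved to another host and resumed by repeating the same three steps there.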

16 Crawler Virtual Machine

17 Crawler Manager
- Centralized control unit for mobile crawlers
- Create new crawlers based on user-defined crawler specification
  - Provide initial destination using seed URLs
- Determine subsequent itinerary for crawlers
  - Requires knowledge about mobile-crawler-enabled sites
  - Estimate the importance of Web sites, e.g., backlink count, hot page count, ...
- Coordinate many crawlers running in parallel

18 Crawler Query Engine
- Used to access crawler contents after returning
  - “Hot pages”
- Characteristics
  - Provide a query facility to query the crawler fact base
  - Implement an SQL subset as query language
  - Represent query result as data tuples, not as facts
  - Allows the user to reason about crawling results
  - Query engine implementation uses inference engine
  - Query engine serves as the primary interface between the user application and the mobile crawler
  - SQL Database which holds Web index
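The idea of querying a returned crawler's fact base with SQL and getting plain tuples back can be illustrated as follows. Note the swap: the slides say the query engine is implemented on top of the inference engine, whereas this sketch loads the facts into an in-memory SQLite table instead. The table layout and the sample facts are assumptions.

```python
import sqlite3

# Hypothetical page facts as a crawler might carry them home.
facts = [
    ("/a.html", "Mobile Crawlers", 5120, 1),
    ("/b.html", "Parking Rules", 2048, 0),
    ("/c.html", "Search Engines", 4096, 1),
]

def query_facts(facts, sql, params=()):
    """Load page facts into an in-memory table and return result tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE page (url TEXT, title TEXT, size INTEGER, hot INTEGER)"
        )
        conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", facts)
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

hot_pages = query_facts(
    facts, "SELECT url, size FROM page WHERE hot = ? ORDER BY url", (1,)
)
print(hot_pages)  # [('/a.html', 5120), ('/c.html', 4096)]
```

Returning tuples rather than facts means the calling application needs no knowledge of the rule language the crawler was written in.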

19 Crawler Query Engine
[Diagram; labels: Crawler Object, Query Engine, User Query, Compiler, Query Rule, Inference Engine, Result Tuples, Crawler Facts, Crawler Rules]

20 Mobile Crawling Advantages
- Remote page selection
  - Determine significance of a page prior to transmission
  - Applicable for topic-specific search engines
- Remote page filtering
  - Control the granularity of the retrieved data
  - Applicable for non-fulltext search engines
- Remote page compression
  - Compress page data prior to transmission
  - Applicable for all search engines
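Remote filtering and remote compression can be sketched together: the crawler reduces a page to the data the index needs, then compresses it before anything is sent. The sample page, the choice of filter (strip markup, keep visible text) and the function names are assumptions made for illustration, and the sample page is far more repetitive than real Web content, so the size reduction it shows is not representative.

```python
import re
import zlib

# Hypothetical page; the filter (strip markup, keep text) is one possible
# choice of granularity, not the one used in the original system.
HTML = (
    "<html><body>"
    + "<p>Mobile crawlers filter pages remotely.</p>" * 50
    + "</body></html>"
)

def filter_page(html):
    """Remote page filtering: drop tags and keep only the visible words."""
    return re.sub(r"<[^>]+>", " ", html).split()

def pack(html):
    """Filter, then compress, before the data leaves the remote host."""
    text = " ".join(filter_page(html))
    return zlib.compress(text.encode("utf-8"))

def unpack(payload):
    """Run at the search engine after transmission."""
    return zlib.decompress(payload).decode("utf-8")

payload = pack(HTML)
print(len(HTML.encode("utf-8")), "bytes raw ->", len(payload), "bytes sent")
```

Filtering is lossy and depends on what the search engine indexes, while compression is lossless, which is why the slide lists compression as applicable to all search engines.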

21 Performance Evaluation Setup
- Two virtual machines (local and remote) plus crawler management system
  - Set up for mobile as well as traditional Web crawling

22 Performance Evaluation
- Focus on proving viability of mobile crawling approach
  - Not focusing on analyzing crawling strategies [Cho97]
- Controlled environment setup
  - Static HTML data set with known properties - subset of University of Florida intranet
  - Apache HTTP server, unshared communication channel
  - Breadth-first crawling strategy - predictable crawler behavior
- Measurements
  1. Network load for traditional (stationary) crawler
  2. Network load for mobile crawler without page compression
  3. Network load for mobile crawler with page compression

23 Benefit of Remote Page Selection
- Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection

24 Benefit of Remote Page Filtering
- Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)

25 Benefit of Page Compression
- Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages

26 Cost Benefit Analysis
- Overhead
  - Overhead due to crawler migration (<5K)
  - Overhead due to fact-based data representation (6%)
- Benefits without page compression
  - As soon as less than 85% of each page needs to be preserved
  - As long as less than 90% of all pages are transmitted
- Benefits with page compression
  - Reduction in network load by a factor of 4.5

27 Summary and Conclusion
- Mobile crawling advantages:
  - Natural fit for distributed web environment
  - Well suited for topic-specific search engines
  - Small network overhead due to crawler mobility
- Solves scaling problems of traditional crawling approach by allowing filtering operations to be performed remotely
- Approach provides a base for smart Web crawling
- Currently improving crawler memory management
- Completing more realistic testbed consisting of ~10 mobile crawler-enabled Web sites within UF intranet

28 Ongoing/Future Work
- Security
  - Crawler identification based on digital signatures
  - Restrict crawler execution to positively identified crawlers
  - Implement virtual machine as a secure sandbox
- Crawler mobility support
  - Integrate virtual machine into web servers
  - Comparison with other infrastructures, e.g., IBM Aglet infrastructure (currently ongoing)
- Mobile crawling algorithms (currently ongoing)
  - Optimize crawling algorithms, site relocation algorithms
  - Carry out analysis

