Master Thesis Defense Jan Fiedler 04/17/98

Mobile Web Crawling Master Thesis Defense Jan Fiedler 04/17/98

Presentation Outline
- Resource Discovery Problem
- Web Crawling Techniques
  - Traditional Web Crawling
  - Mobile Web Crawling
- Mobile Crawling Architecture
  - Distributed Runtime Environment
  - Application Framework
- Performance Evaluation
- Summary and Conclusion

Resource Discovery Problem
- The Web establishes a large distributed hypertext system
  - 1.6 million Web sites
  - 320 million Web documents
  - 40% of the Web content changes within a month
  - exponential growth rate
  - lack of structure (i.e. no strict hierarchy)
- Goal: overlay the distributed Web structure with a centralized information system which allows resource discovery

Web Indices and Search Engines
Search engine statistics:
- index size million pages (approx. 700GB)
- web coverage 10%-35%
- daily crawl 3-10 million pages (approx. 60GB)
Year 2000 estimates:
- index size 880 million pages (approx. 5.6TB)
- daily crawl 80 million pages (approx. 480GB)
Traditional Web crawling will experience severe scaling problems in the near future.
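
As a quick sanity check on the estimates above, the quoted volumes can be divided by the quoted page counts to get the implied average page size. This is only a back-of-the-envelope sketch: decimal units (1 GB = 10**9 bytes) and the use of the upper bound of the 3-10 million range are assumptions, and the helper name is made up for illustration.

```python
# Figures are the ones quoted on the slide.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
GB = 10**9
TB = 10**12


def bytes_per_page(total_bytes, pages):
    """Average page size implied by a data volume and a page count."""
    return total_bytes / pages


# Daily crawl today: upper bound of the 3-10 million page range (assumption).
daily_now = bytes_per_page(60 * GB, 10_000_000)
# Year 2000 estimates for the index and for the daily crawl.
index_2000 = bytes_per_page(5.6 * TB, 880_000_000)
daily_2000 = bytes_per_page(480 * GB, 80_000_000)
# Growth of the daily crawl, again relative to the 10 million upper bound.
daily_growth = 80_000_000 / 10_000_000

print(f"daily crawl now:  {daily_now:.0f} bytes/page")
print(f"index in 2000:    {index_2000:.0f} bytes/page")
print(f"daily crawl 2000: {daily_2000:.0f} bytes/page")
print(f"daily crawl grows by a factor of {daily_growth:.0f}")
```

Under these assumptions the figures are consistent with an average page size of roughly 6 KB, so the projected growth in network load comes from the number of pages rather than from larger pages.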

Traditional Crawling Overview

Traditional Web Crawling
Characteristics of traditional Web crawling:
- remote data access
- focus on rapid data retrieval
- centralized, database oriented architecture
- brute force download of Web content
- resource intensive approach
Traditional Web crawling techniques do not exploit information about the pages being crawled in order to reduce the crawling costs.

Mobile Crawling Overview

Mobile Web Crawling
Characteristics of mobile Web crawling:
- local data access
- focus on effective data retrieval
- distributed, data source oriented architecture
- intelligent download of significant Web content
- resource preserving approach
Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission.

Mobile Crawling Advantages
Remote page selection
- determine significance of a page prior to transmission
- applicable for specialized search engines
Remote page filtering
- use effective page representation model
- applicable for non-fulltext search engines
Remote page compression
- compress page data prior to transmission
- applicable for all search engines
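
The three remote operations can be illustrated with a small Python sketch that runs them in sequence on the data source side. This is not the crawler from the thesis: the function names, the keyword test used for selection, the tag stripping used as the "page representation", and the example URLs are all assumptions chosen for illustration, with zlib standing in for whatever compression the actual system used.

```python
import re
import zlib


def select_page(html, keywords):
    """Remote page selection: keep a page only if it mentions a keyword."""
    text = html.lower()
    return any(keyword.lower() in text for keyword in keywords)


def filter_page(html):
    """Remote page filtering: crude stand-in that keeps only the visible text."""
    text = re.sub(r"<[^>]+>", " ", html)  # drop markup (not a real HTML parser)
    return " ".join(text.split())         # normalize whitespace


def compress_page(text):
    """Remote page compression: shrink the data before it is transmitted."""
    return zlib.compress(text.encode("utf-8"))


def crawl_remotely(pages, keywords):
    """Apply selection, filtering, and compression; return what would be sent."""
    payload = []
    for url, html in pages.items():
        if select_page(html, keywords):
            payload.append((url, compress_page(filter_page(html))))
    return payload


# Hypothetical pages held at the remote site.
pages = {
    "http://example.org/a": "<html><body><h1>Mobile agents</h1>"
                            + "<p>crawler text</p>" * 50 + "</body></html>",
    "http://example.org/b": "<html><body><p>cooking recipes</p></body></html>",
}
payload = crawl_remotely(pages, ["crawler"])
raw_bytes = sum(len(html.encode("utf-8")) for html in pages.values())
sent_bytes = sum(len(blob) for _, blob in payload)
print(f"{sent_bytes} bytes sent instead of {raw_bytes}")
```

A stationary crawler would have to download all of `raw_bytes` before it could make any of these decisions; here only the selected, filtered, and compressed data crosses the network.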

Crawler Specification
Rule based programming paradigm
- represent crawler data as facts (e.g. page-facts)
- describe crawler behavior as a set of rules which operate upon facts
Advantages
- it is easier to specify crawling rules than to devise a crawling algorithm
- no need to model control flow
- rule based programs have very simple runtime states
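
To make the facts-and-rules idea concrete, here is a minimal Python sketch. The slides do not show the actual specification language, so the fact types (`PageFact`, `RelevantFact`), the rule, and the keyword it tests for are hypothetical; a rule is modeled simply as a function that looks at the current facts and returns the new facts it derives.

```python
from collections import namedtuple

# Hypothetical fact types; the real crawler's fact schema is not shown in the slides.
PageFact = namedtuple("PageFact", ["url", "content"])
RelevantFact = namedtuple("RelevantFact", ["url"])


def rule_mark_relevant(facts):
    """IF a page-fact mentions 'crawler' THEN assert a relevant-fact for its URL."""
    return {
        RelevantFact(fact.url)
        for fact in facts
        if isinstance(fact, PageFact) and "crawler" in fact.content.lower()
    }


# The crawler's data: a set of facts.
facts = {
    PageFact("http://example.org/a", "Notes on mobile crawler design"),
    PageFact("http://example.org/b", "Cooking recipes"),
}
# The crawler's behavior: a set of rules, with no explicit control flow.
rules = [rule_mark_relevant]

# One pass of rule application over the fact base.
derived = set()
for rule in rules:
    derived |= rule(facts)
print(derived)
```

The specification consists only of the fact definitions and the rule; the loop at the end is generic and would be the same for any crawler.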

Mobile Crawling Architecture

Mobile Crawling Architecture
Distributed Crawler Runtime Environment
- provide platform independent execution environment
- virtual machine for remote crawler execution
- communication layer for crawler migration
Application Framework
- support for crawler specification and configuration
- crawler manager for crawler specification
- query engine as crawler/application interface
- archive manager as database connectivity framework

Crawler Virtual Machine
How to execute a rule based crawler specification?
- crawler execution = rule application upon fact base
- use inference engine for the rule application process
1. Initialization: insert rules and facts into inference engine
2. Rule application: start rule application process within inference engine
3. Finalization: extract rules and facts once rule application has stopped
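
The three phases can be sketched with a toy forward-chaining engine in Python. This is an illustration under assumptions, not the inference engine used in the thesis: the class `InferenceEngine`, its method names, the tuple encoding of facts, and the `follow_links` rule are all invented for the example.

```python
class InferenceEngine:
    """Toy forward-chaining engine: applies rules until no new facts appear."""

    def __init__(self):
        self.rules = []
        self.facts = set()

    def insert(self, rules, facts):
        """Phase 1, initialization: load rules and facts."""
        self.rules.extend(rules)
        self.facts.update(facts)

    def run(self, max_rounds=1000):
        """Phase 2, rule application: repeat until a fixed point is reached."""
        for _ in range(max_rounds):
            new_facts = set()
            for rule in self.rules:
                new_facts |= rule(self.facts) - self.facts
            if not new_facts:
                break
            self.facts |= new_facts

    def extract(self):
        """Phase 3, finalization: hand back rules and facts for migration."""
        return list(self.rules), set(self.facts)


def follow_links(facts):
    """IF visited(u) AND link(u, v) THEN visited(v)."""
    visited = {fact[1] for fact in facts if fact[0] == "visited"}
    return {
        ("visited", fact[2])
        for fact in facts
        if fact[0] == "link" and fact[1] in visited
    }


engine = InferenceEngine()
engine.insert(
    [follow_links],
    {
        ("visited", "/index.html"),
        ("link", "/index.html", "/a.html"),
        ("link", "/a.html", "/b.html"),
        ("link", "/x.html", "/y.html"),  # not reachable from /index.html
    },
)
engine.run()
rules, facts = engine.extract()
visited = sorted(fact[1] for fact in facts if fact[0] == "visited")
print(visited)
```

Because the whole runtime state is just the rule list and the fact set, extracting it after the run is enough to ship the crawler and its results back to its origin.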

Crawler Virtual Machine

Crawler Query Engine
How to access the crawler knowledge?
- provide a query facility to query the crawler fact base
- implement a SQL subset as query language
- represent query result as data tuples, not as facts
- allows the user to reason about crawling results
- query engine implementation uses inference engine
Query engine serves as the primary interface between the user application and the mobile crawler.
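
The following Python sketch shows the general shape of such a query: a selection and projection over the fact base that returns plain tuples. It is a simplification under assumptions. The slides do not give the query syntax, so a function call stands in for the SQL subset, the dictionary encoding of facts is hypothetical, and unlike the implementation described above this sketch does not go through an inference engine.

```python
def select(facts, fact_type, columns, where=None):
    """Rough analogue of: SELECT columns FROM fact_type WHERE where(fact)."""
    rows = []
    for fact in facts:
        if fact["type"] != fact_type:
            continue
        if where is not None and not where(fact):
            continue
        # Results are plain data tuples, not facts.
        rows.append(tuple(fact[column] for column in columns))
    return rows


# Hypothetical fact base brought back by a crawler.
facts = [
    {"type": "page", "url": "/a.html", "size": 1200},
    {"type": "page", "url": "/b.html", "size": 300},
    {"type": "link", "source": "/a.html", "target": "/b.html"},
]

# Roughly: SELECT url, size FROM page WHERE size > 500
rows = select(facts, "page", ["url", "size"], where=lambda fact: fact["size"] > 500)
print(rows)
```

Returning tuples keeps the application independent of how the crawler represents its knowledge internally.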

Crawler Query Engine

Performance Evaluation Setup
Use distributed virtual machines to support mobile as well as traditional Web crawling.

Performance Evaluation
Controlled environment setup
- static HTML data set with known properties
- personal HTTP server
- unshared communication channel (dialup line)
Measurements
1. network load for traditional (stationary) crawler
2. network load for mobile crawler without page compression
3. network load for mobile crawler with page compression

Benefit of Remote Page Selection
Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection

Benefit of Remote Page Filtering
Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)

Benefit of Page Compression
Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages

Costs and Benefits
Overhead
- overhead due to crawler migration (<5K)
- overhead due to fact based data representation (6%)
Benefits without page compression
- as soon as less than 85% of each page needs to be preserved
- as soon as less than 90% of all pages are transmitted
Benefits with page compression
- reduction in network load by a factor of 4.5
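
The trade-off between the two overheads and the data saved can be explored with a simple linear cost model. This is a hypothetical model for illustration, not the one behind the measurements: it assumes the migration overhead is a fixed 5 KB (the upper bound of "<5K", with K taken as 1024 bytes), that the 6% representation overhead applies to all transmitted data, and that the data set is 1 MB. The 85% and 90% thresholds above are measured results and are not derived from this sketch.

```python
# Assumptions: upper bound of the quoted "<5K" migration overhead, K = 1024 bytes.
MIGRATION_OVERHEAD = 5 * 1024
# Overhead of the fact based data representation quoted on the slide.
FACT_OVERHEAD = 0.06


def traditional_load(total_bytes):
    """A stationary crawler downloads everything."""
    return total_bytes


def mobile_load(total_bytes, kept_fraction):
    """A mobile crawler pays for migration, then sends only the kept data as facts."""
    return MIGRATION_OVERHEAD + (1 + FACT_OVERHEAD) * kept_fraction * total_bytes


def break_even_fraction(total_bytes):
    """Kept fraction at which both approaches cause the same network load."""
    return (total_bytes - MIGRATION_OVERHEAD) / ((1 + FACT_OVERHEAD) * total_bytes)


total = 1_000_000  # hypothetical size of the crawled data set in bytes
for kept in (0.5, 0.9, 1.0):
    print(f"kept {kept:.0%}: mobile {mobile_load(total, kept):.0f} "
          f"vs traditional {traditional_load(total)}")
print(f"break-even kept fraction in this model: {break_even_fraction(total):.3f}")
```

In this model the mobile crawler wins whenever the kept fraction stays below the break-even value, and the fixed migration cost matters less as the data set grows.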

Summary and Conclusion
Mobile crawling advantages:
- approach fits better in distributed web environment
- approach beneficial for all types of search engines
- better support for specialized search engines
- network overhead due to crawler mobility is small
Mobile crawling solves the scaling problems of the traditional crawling approach by allowing remote operations to be performed on the crawled data.
Approach provides a base for smart Web crawling.

Future Work
Security
- crawler identification based on digital signatures
- restrict crawler execution to positively identified crawlers
- implement virtual machine as a secure sandbox
Crawler mobility support
- integrate virtual machine into web servers
Mobile crawling algorithms
- optimize crawling algorithms with crawler mobility in mind (e.g. crawler communication)
