Master Thesis Defense Jan Fiedler 04/17/98

Presentation transcript:

Mobile Web Crawling: Master Thesis Defense, Jan Fiedler, 04/17/98

Presentation Outline
- Resource Discovery Problem
- Web Crawling Techniques
  - Traditional Web Crawling
  - Mobile Web Crawling
- Mobile Crawling Architecture
  - Distributed Runtime Environment
  - Application Framework
- Performance Evaluation
- Summary and Conclusion
04/17/98 jfiedler@cise.ufl.edu

Resource Discovery Problem
The Web establishes a large distributed hypertext system:
- 1.6 million Web sites
- 320 million Web documents
- 40% of the Web content changes within a month
- exponential growth rate
- lack of structure (i.e. no strict hierarchy)
Goal: overlay the distributed Web structure with a centralized information system that allows resource discovery.

Web Indices and Search Engines
Search engine statistics:
- index size: 30-110 million pages (approx. 700GB)
- web coverage: 10%-35%
- daily crawl: 3-10 million pages (approx. 60GB)
Year 2000 estimates:
- index size: 880 million pages (approx. 5.6TB)
- daily crawl: 80 million pages (approx. 480GB)
Traditional Web crawling will experience severe scaling problems in the near future.
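As a rough plausibility check on these figures, the quoted data volumes can be divided by the page counts to get the implied average page size. The sketch below assumes decimal units (1 GB = 10^9 bytes) and pairs each volume with the upper end of its page range; neither assumption is stated on the slide.

```python
# Implied average page size from the slide's figures.
# Assumptions (not from the slide): decimal units, and each volume
# corresponds to the upper end of the quoted page range.
GB = 1e9
TB = 1e12

def avg_page_kb(total_bytes: float, pages: float) -> float:
    """Average size per page in kilobytes (1 KB = 1e3 bytes)."""
    return total_bytes / pages / 1e3

figures = {
    "1998 index (upper bound)": (700 * GB, 110e6),
    "1998 daily crawl (upper bound)": (60 * GB, 10e6),
    "2000 index (estimate)": (5.6 * TB, 880e6),
    "2000 daily crawl (estimate)": (480 * GB, 80e6),
}

implied_kb = {name: avg_page_kb(b, p) for name, (b, p) in figures.items()}

for name, kb in implied_kb.items():
    print(f"{name}: about {kb:.1f} KB per page")
```

Under these assumptions all four figures work out to roughly 6 KB per page, which suggests the year 2000 estimates scale the page counts while holding page size constant.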

Traditional Crawling Overview

Traditional Web Crawling
Characteristics of traditional Web crawling:
- remote data access
- focus on rapid data retrieval
- centralized, database-oriented architecture
- brute-force download of Web content
- resource-intensive approach
Traditional Web crawling techniques do not exploit information about the pages being crawled in order to reduce the crawling costs.

Mobile Crawling Overview

Mobile Web Crawling
Characteristics of mobile Web crawling:
- local data access
- focus on effective data retrieval
- distributed, data-source-oriented architecture
- intelligent download of significant Web content
- resource-preserving approach
Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission.

Mobile Crawling Advantages
- Remote page selection
  - determine significance of a page prior to transmission
  - applicable for specialized search engines
- Remote page filtering
  - use effective page representation model
  - applicable for non-fulltext search engines
- Remote page compression
  - compress page data prior to transmission
  - applicable for all search engines
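The three remote operations can be illustrated with a small Python sketch. It is not the thesis implementation: the function names, the keyword test used for selection, and the "keep the first N words" stand-in for a reduced page representation are all assumptions made for illustration. Compression uses the standard `zlib` module.

```python
import zlib

def select_page(text: str, keywords: set[str]) -> bool:
    """Remote page selection: keep a page only if it mentions a keyword."""
    return bool(set(text.lower().split()) & keywords)

def filter_page(text: str, max_words: int) -> str:
    """Remote page filtering: hypothetical reduced representation
    that keeps only the first max_words words of the page."""
    return " ".join(text.split()[:max_words])

def compress_page(text: str) -> bytes:
    """Remote page compression: compress page data before transmission."""
    return zlib.compress(text.encode("utf-8"))

def crawl_remotely(pages: dict[str, str], keywords: set[str], max_words: int):
    """Run all three steps at the data source; return what would be sent."""
    payload = []
    for url, text in pages.items():
        if not select_page(text, keywords):
            continue  # insignificant pages are never transmitted
        payload.append((url, compress_page(filter_page(text, max_words))))
    return payload

# Toy data set standing in for pages stored on a remote Web server.
pages = {
    "/a.html": "mobile crawler agents migrate to the web server " * 20,
    "/b.html": "recipes for apple pie and other desserts " * 20,
}
payload = crawl_remotely(pages, {"crawler"}, max_words=50)

raw_bytes = sum(len(t.encode("utf-8")) for t in pages.values())
sent_bytes = sum(len(data) for _, data in payload)
print(f"raw: {raw_bytes} bytes, transmitted: {sent_bytes} bytes")
```

Only the page that matches the keyword set is transmitted, and only in filtered, compressed form, which is the source of the network savings claimed for the mobile approach.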

Crawler Specification
Rule-based programming paradigm:
- represent crawler data as facts (e.g. page-facts)
- describe crawler behavior as a set of rules which operate upon facts
Advantages:
- it is easier to specify crawling rules than to devise a crawling algorithm
- no need to model control flow
- rule-based programs have very simple runtime states
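A minimal sketch of what a rule-based crawler specification could look like, written in Python rather than in the rule language used in the thesis (which the slides do not show). Facts are plain tuples, rules are functions that derive new facts from existing ones, and the fact names `page`, `link`, and `relevant` are hypothetical.

```python
from typing import Callable, Iterable

Fact = tuple  # e.g. ("page", url, text) or ("link", source_url, target_url)
Rule = Callable[[set], Iterable[Fact]]

def rule_extract_links(facts: set) -> Iterable[Fact]:
    """For every page-fact, derive a link-fact per 'href=' token (toy parser)."""
    for fact in facts:
        if fact[0] == "page":
            _, url, text = fact
            for token in text.split():
                if token.startswith("href="):
                    yield ("link", url, token[len("href="):])

def rule_mark_relevant(facts: set) -> Iterable[Fact]:
    """Mark pages that mention the word 'crawler' as relevant."""
    for fact in facts:
        if fact[0] == "page" and "crawler" in fact[2].lower():
            yield ("relevant", fact[1])

def apply_rules(facts: set, rules: list[Rule]) -> set:
    """Naive forward chaining: fire all rules until no new fact appears."""
    facts = set(facts)
    while True:
        derived = {f for rule in rules for f in rule(facts)} - facts
        if not derived:
            return facts
        facts |= derived

start = {("page", "/index.html", "Mobile crawler demo href=/about.html")}
result = apply_rules(start, [rule_extract_links, rule_mark_relevant])
for fact in sorted(result):
    print(fact)
```

Note that the specification contains no control flow of its own: the order in which rules fire is left to the engine, and the whole runtime state is the set of facts.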

Mobile Crawling Architecture

Mobile Crawling Architecture
Distributed Crawler Runtime Environment:
- provide platform independent execution environment
- virtual machine for remote crawler execution
- communication layer for crawler migration
Application Framework:
- support for crawler specification and configuration
- crawler manager for crawler specification
- query engine as crawler/application interface
- archive manager as database connectivity framework

Crawler Virtual Machine
How to execute a rule-based crawler specification?
- crawler execution = rule application upon fact base
- use inference engine for the rule application process
1. Initialization: insert rules and facts into inference engine
2. Rule application: start rule application process within inference engine
3. Finalization: extract rules and facts once rule application has stopped
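The three phases can be sketched as a tiny Python class. This is an illustration under assumptions, not the thesis's virtual machine: the class name `CrawlerVM`, the naive forward-chaining loop standing in for a real inference engine, and the in-memory `SITE` link table are all hypothetical.

```python
class CrawlerVM:
    """Toy virtual machine wrapping a naive forward-chaining inference engine."""

    def __init__(self):
        self.rules = []
        self.facts = set()

    def initialize(self, rules, facts):
        """Step 1: insert rules and facts into the inference engine."""
        self.rules = list(rules)
        self.facts = set(facts)

    def run(self):
        """Step 2: apply rules until no rule derives a new fact."""
        changed = True
        while changed:
            derived = set()
            for rule in self.rules:
                derived.update(rule(self.facts))
            new = derived - self.facts
            self.facts |= new
            changed = bool(new)

    def finalize(self):
        """Step 3: extract rules and facts after rule application has stopped."""
        return list(self.rules), set(self.facts)

# Hypothetical link structure of the site the crawler is executing on.
SITE = {"/": ["/a", "/b"], "/a": ["/b"], "/b": []}

def visit_rule(facts):
    """Every visited page causes the pages it links to to be visited too."""
    for fact in facts:
        if fact[0] == "visit":
            for target in SITE.get(fact[1], []):
                yield ("visit", target)

vm = CrawlerVM()
vm.initialize([visit_rule], {("visit", "/")})
vm.run()
rules, facts = vm.finalize()
print(sorted(facts))
```

Because the crawler's complete state is the pair of rules and facts, extracting that pair in the finalization step is enough to capture everything the crawler has learned.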

Crawler Virtual Machine

Crawler Query Engine
How to access the crawler knowledge?
- provide a query facility to query the crawler fact base
- implement a SQL subset as query language
- represent query result as data tuples, not as facts
- allows the user to reason about crawling results
- query engine implementation uses inference engine
The query engine serves as the primary interface between the user application and the mobile crawler.
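To make the idea of querying a fact base concrete, here is a minimal Python sketch of a select-style query that returns plain tuples. It does not parse SQL text and does not use an inference engine as the thesis implementation does; the `SCHEMA` mapping, the `select` function, and the example facts are assumptions for illustration.

```python
# Hypothetical schema: fact type -> names of the fields after the type tag.
SCHEMA = {"page": ("url", "title", "size")}

def select(facts, columns, table, where=None):
    """Rough equivalent of: SELECT columns FROM table WHERE where(row).

    Returns a sorted list of plain data tuples rather than facts."""
    names = SCHEMA[table]
    rows = []
    for fact in facts:
        if fact[0] != table:
            continue  # ignore facts of other types
        row = dict(zip(names, fact[1:]))
        if where is None or where(row):
            rows.append(tuple(row[c] for c in columns))
    return sorted(rows)

fact_base = [
    ("page", "/a.html", "Mobile Agents", 5120),
    ("page", "/b.html", "Apple Pie", 2048),
    ("link", "/a.html", "/b.html"),
]

# Roughly: SELECT url, size FROM page WHERE size > 4000
large_pages = select(fact_base, ("url", "size"), "page",
                     where=lambda row: row["size"] > 4000)
print(large_pages)
```

Returning tuples instead of facts means the user application does not need to know how the crawler represents its knowledge internally.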

Crawler Query Engine

Performance Evaluation Setup
Use distributed virtual machines to support mobile as well as traditional Web crawling.

Performance Evaluation
Controlled environment setup:
- static HTML data set with known properties
- personal HTTP server
- unshared communication channel (dialup line)
Measurements:
1. network load for traditional (stationary) crawler
2. network load for mobile crawler without page compression
3. network load for mobile crawler with page compression

Benefit of Remote Page Selection
Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection.

Benefit of Remote Page Filtering
Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved).

Benefit of Page Compression
Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages.

Costs and Benefits
Overhead:
- overhead due to crawler migration (<5K)
- overhead due to fact-based data representation (6%)
Benefits without page compression:
- as soon as less than 85% of each page needs to be preserved
- as soon as fewer than 90% of all pages are transmitted
Benefits with page compression:
- reduction in network load by a factor of 4.5

Summary and Conclusion
Mobile crawling advantages:
- approach fits better in distributed Web environment
- approach beneficial for all types of search engines
- better support for specialized search engines
- network overhead due to crawler mobility is small
Mobile crawling solves the scaling problems of the traditional crawling approach by allowing remote operations to be performed on the crawled data.
The approach provides a base for smart Web crawling.

Future Work
Security:
- crawler identification based on digital signatures
- restrict crawler execution to positively identified crawlers
- implement virtual machine as a secure sandbox
Crawler mobility support:
- integrate virtual machine into Web servers
Mobile crawling algorithms:
- optimize crawling algorithms with crawler mobility in mind (e.g. crawler communication)
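The idea of admitting only identified crawlers can be sketched in Python. Note the substitution: the slide proposes digital signatures, while this sketch uses a keyed hash (HMAC with a shared secret) from the standard library as a simpler stand-in, because the standard library has no public-key signature API. The owner name, the secret, and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical registry kept by the Web server: crawler owner -> shared secret.
TRUSTED_KEYS = {"search-engine-1": b"hypothetical-shared-secret"}

def sign_crawler(owner: str, code: bytes, key: bytes) -> str:
    """Compute an authentication tag over the owner name and crawler code."""
    message = owner.encode("utf-8") + b"\0" + code
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def accept_crawler(owner: str, code: bytes, tag: str) -> bool:
    """Admit a crawler only if its tag verifies under a trusted owner's key."""
    key = TRUSTED_KEYS.get(owner)
    if key is None:
        return False  # unknown owner: not positively identified
    expected = sign_crawler(owner, code, key)
    return hmac.compare_digest(expected, tag)

crawler_code = b"rule: visit all pages mentioning 'crawler'"
tag = sign_crawler("search-engine-1", crawler_code,
                   TRUSTED_KEYS["search-engine-1"])

print(accept_crawler("search-engine-1", crawler_code, tag))
print(accept_crawler("search-engine-1", crawler_code + b" tampered", tag))
```

A real deployment along the lines of the slide would replace the shared secret with a public-key signature scheme, so that servers can verify crawlers without being able to forge them.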