Query Planning for Searching Inter-Dependent Deep-Web Databases
Fan Wang, Gagan Agrawal, Ruoming Jin


Query Planning for Searching Inter-Dependent Deep-Web Databases
Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2
1 Department of Computer Science and Engineering, Ohio State University, Columbus, OH
2 Department of Computer Science, Kent State University, Kent, OH 44242

Introduction
The emergence of the deep web
–The deep web is huge
–It differs from the surface web
–It poses challenges for integration
Not accessible through search engines
Inter-dependences among deep web sources

Motivation Example
Given a gene ERCC6, we want to know the amino acid occurring in the corresponding position in the orthologous genes of non-human mammals
[Figure: query workflow starting from ERCC6; labels: dbSNP, Entrez Gene, Sequence Database, Alignment Database, Nonsynonymous SNP, AA Positions for Nonsynonymous SNP, Encoded Protein, Encoded Orthologous Protein, Protein Sequence]

Observations
Inter-dependences exist between sources
Querying the sources manually is time-consuming
An intelligent order of querying is needed
User queries contain implicit sub-goals

Contributions
Formulate the query planning problem for deep web databases with dependences
Propose a dynamic query planner
Develop cost models and an approximate planning algorithm
Integrate the algorithm with a deep web mining tool

Roadmap
Introduction and Motivation
Problem Formulation
Planning Algorithm
Evaluation
Related Work
Conclusion

Formulation
Universal Term Set
Query Q is composed of two parts
–Query Key Term: the focus of the query (ERCC6)
–Query Target Terms: the attributes of interest (Alignment)
Data sources
–Each data source D covers an output set
–Each data source D requires an input set
Find a query plan, an ordered list of data sources
–Covers the query target terms with maximal benefit
–As short as possible
–NP-complete problem
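
The formulation above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class names, the `is_valid_plan` helper, and the two example sources are all hypothetical, and the key term is modeled as an attribute name (`"gene"`) rather than the value ERCC6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    inputs: frozenset   # terms the source requires before it can be queried
    outputs: frozenset  # terms the source can return


@dataclass(frozen=True)
class Query:
    key_term: str       # focus of the query
    targets: frozenset  # attributes of interest


def is_valid_plan(plan, query):
    """Return True if each source's inputs are available when it is
    queried and the plan ends up covering every query target term."""
    known = {query.key_term}
    for src in plan:
        if not src.inputs <= known:
            return False  # this source cannot be queried yet
        known |= src.outputs
    return query.targets <= known


# Hypothetical sources, loosely modeled on the motivating example.
gene_db = DataSource("gene_db", frozenset({"gene"}), frozenset({"protein"}))
align_db = DataSource("align_db", frozenset({"protein"}), frozenset({"alignment"}))
query = Query("gene", frozenset({"alignment"}))
```

The sketch only checks feasibility of an ordered list of sources. Choosing the plan with maximal benefit and minimal length is the hard part that the planning algorithm addresses.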

Problem Scenario

Production System
[Figure: production system; labels: Working Memory, Target Space, Production Rules (R1, R2, R3), Recognize-Act Control, Database 1, Database 2, Database 3]


Algorithm
Dependency Graph
Planning Algorithm Detail
Benefit Model

Dependency Graph
Dependency relation
–Format:
–Hypergraph
Hyperarc: ordered pair (parents, child)
AND node
Neighbors
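
One plausible reading of the hyperarc definition can be sketched as below. The exact dependency format is not preserved in the slide, so this is an assumption: a child database depends on a set of parent databases that together supply all of its input terms (the AND semantics), and each alternative choice of providers yields a separate hyperarc. The function name `build_hyperarcs` and the example sources are hypothetical.

```python
from itertools import product


def build_hyperarcs(sources):
    """sources maps a database name to (input terms, output terms).
    Returns a set of hyperarcs (frozenset of parents, child)."""
    # Index which databases can provide each term.
    producers = {}
    for name, (_, outputs) in sources.items():
        for term in outputs:
            producers.setdefault(term, set()).add(name)

    arcs = set()
    for child, (inputs, _) in sources.items():
        # Assumption: terms with no producer come from the user query.
        choices = [sorted(producers[t]) for t in sorted(inputs) if t in producers]
        if not choices:
            continue
        # Pick one provider per input term; all chosen parents are required.
        for combo in product(*choices):
            arcs.add((frozenset(combo), child))
    return arcs


sources = {
    "A": ({"gene"}, {"snp", "protein"}),
    "B": ({"gene"}, {"protein"}),
    "C": ({"snp", "protein"}, {"alignment"}),
}
arcs = build_hyperarcs(sources)
```

In this example, database C can be reached either from A alone or from A and B together, so two hyperarcs point at C.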

Concepts
Database Necessity (DN)
–Each term is associated with a DN value
–Measures the extraction priority of a term and the importance of a database scheme
–For a term t, the DN value is determined by k, the number of database schemes that can provide it
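
The DN formula itself is not preserved in the transcript. The sketch below assumes DN(t) = 1/k, which is consistent with the later step that looks for a target term with DN=1 (a term that only one database can provide). Treat both the formula and the function name `database_necessity` as assumptions.

```python
from fractions import Fraction


def database_necessity(sources):
    """sources maps a database name to the set of terms it can provide.
    Returns term -> DN value, ASSUMING DN(t) = 1/k for k providers."""
    counts = {}
    for outputs in sources.values():
        for term in outputs:
            counts[term] = counts.get(term, 0) + 1
    return {term: Fraction(1, k) for term, k in counts.items()}


dn = database_necessity({"D1": {"t1", "t2"}, "D2": {"t2"}})
```

Here `t1` has a single provider and therefore the highest necessity, while `t2` can be obtained from either database.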

Concepts
Hidden Nodes
–Nodes connecting the current working state and the target space
Partially Visualize Hidden Nodes
–Multiple layers of hidden nodes make planning difficult

Visualize Hidden Nodes
Target Space Enlargement
Target Space: {t1}
1. Find a target term t with DN=1
2. Visualize the database D which provides t
3. Add D’s input set to the target space
4. Repeat the above steps until done
Target space growth: {t1} → {t1,t2,t3} → {t1,t2,t3,t4} → {t1,t2,t3,t4,t5}
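
The four enlargement steps can be sketched as a fixed-point loop. This is a rough illustration under the assumption that DN=1 means exactly one database provides the term; the function name and the example databases D1 to D5 are hypothetical and chosen to reproduce the growth from {t1} to {t1,...,t5}.

```python
def enlarge_target_space(targets, sources):
    """sources maps a database name to (input terms, output terms).
    Returns the enlarged target space and the databases made visible."""
    target_space = set(targets)
    visualized = []
    changed = True
    while changed:  # step 4: repeat until nothing new is added
        changed = False
        for term in sorted(target_space):
            providers = [n for n, (_, outs) in sources.items() if term in outs]
            # Step 1: a single provider is taken to mean DN = 1 (assumption).
            if len(providers) == 1 and providers[0] not in visualized:
                name = providers[0]
                visualized.append(name)               # step 2
                target_space |= set(sources[name][0])  # step 3
                changed = True
    return target_space, visualized


sources = {
    "D1": ({"t2", "t3"}, {"t1"}),
    "D2": ({"t4"}, {"t2"}),
    "D3": ({"t5"}, {"t4"}),
    "D4": ({"k"}, {"t3"}),  # t3 has two providers, so neither is necessary
    "D5": ({"k"}, {"t3"}),
}
space, shown = enlarge_target_space({"t1"}, sources)
```

Databases D4 and D5 stay hidden because neither is the only provider of t3, which matches the idea that only necessary databases are visualized.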

Planning Algorithm Detail

Planning Algorithm Detail
The approximation ratio of our greedy algorithm is ½

Benefit Model
Select an appropriate rule at each iteration of the planning algorithm
Four metrics
–Database Availability
–Data Coverage (DC)
–User Preference (UP)
–Potential Importance (PI)

Data Coverage
The number of query target terms covered by the current rule that have not yet been covered by previously selected rules
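
A minimal sketch of greedy rule selection driven by data coverage alone is shown below. It is a simplification, not the planner from the talk: the real benefit model combines four metrics, whereas here ties are broken by the number of new terms a rule contributes, which is an assumption. All names and the example sources are hypothetical.

```python
def data_coverage(outputs, targets, covered):
    """Number of target terms this rule provides that are still uncovered."""
    return len((outputs & targets) - covered)


def greedy_plan(key_terms, targets, sources):
    """sources maps a rule name to (input terms, output terms).
    Returns an ordered list of rule names, or None if targets are unreachable."""
    targets = set(targets)
    known = set(key_terms)
    covered = targets & known
    remaining = dict(sources)
    plan = []
    while covered != targets:
        # Only rules whose inputs are already known can be fired.
        fireable = sorted(n for n, (ins, _) in remaining.items() if set(ins) <= known)
        if not fireable:
            return None

        def score(name):
            outs = set(remaining[name][1])
            # Assumed tie-breaker: prefer rules that add more new terms.
            return (data_coverage(outs, targets, covered), len(outs - known))

        best = max(fireable, key=score)
        if score(best) == (0, 0):
            return None  # no fireable rule makes progress
        outs = set(remaining.pop(best)[1])
        plan.append(best)
        known |= outs
        covered |= outs & targets
    return plan


sources = {
    "gene_db": ({"gene"}, {"protein"}),
    "snp_db": ({"gene"}, {"snp"}),
    "align_db": ({"protein"}, {"alignment"}),
}
plan = greedy_plan({"gene"}, {"alignment"}, sources)
```

With coverage as the only signal, the first choice between `gene_db` and `snp_db` is decided by name order, which illustrates why additional metrics such as potential importance are useful.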

User Preference
Domain users have preferences for certain databases (rules) for particular terms
A collaborating biologist provides the preference values
Term provided by databases
Rule covers the following unfound target terms
–Preference for is

Potential Importance
Some databases are more important because they link to other important databases
Determining whether a database is more important
–Find the necessary databases which provide unfound target terms
–The more such necessary databases can be reached from a database, the more important it is
The potential importance for a rule


Experiment Setup
SNPMiner System
–Integrates 8 deep web databases
–Provides a unified user interface
Experimental Queries

Planning Algorithm Comparison
Naïve Algorithm (NA)
–Selects all rules which can be fired at each iteration until all requested terms are covered
–Uses no rule selection strategy
Optimal Algorithm (OA)
–Searches the entire space
Production Rule Algorithm (PRA)
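
The naïve baseline described above can be sketched as follows. The function name and the example sources are hypothetical, and the representation of a rule as a pair of input and output term sets is an assumption.

```python
def naive_plan(key_terms, targets, sources):
    """Fire every rule whose inputs are known, iteration after iteration,
    until all requested terms are covered. No rule selection is applied.
    sources maps a rule name to (input terms, output terms)."""
    targets = set(targets)
    known = set(key_terms)
    unfired = dict(sources)
    plan = []
    while not targets <= known:
        fireable = sorted(n for n, (ins, _) in unfired.items() if set(ins) <= known)
        if not fireable:
            return None  # targets cannot be reached
        for name in fireable:
            known |= set(unfired.pop(name)[1])
            plan.append(name)
    return plan


sources = {
    "gene_db": ({"gene"}, {"protein"}),
    "snp_db": ({"gene"}, {"snp"}),
    "align_db": ({"protein"}, {"alignment"}),
}
naive = naive_plan({"gene"}, {"alignment"}, sources)
```

In this example the naïve plan also queries `snp_db`, even though its output is never needed for the requested term, which is the kind of wasted work a rule selection strategy is meant to avoid.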

PRA vs. NA
1. All ratio data points are smaller than 1
2. PRA generates query plans that execute much faster than those generated by NA

PRA vs. OA
1. All ratio data points are distributed around 1
2. In terms of query plan execution time, PRA has performance close to OA
3. In most cases, PRA generates exactly the same plan as OA

Enlarge Target Space
1. Query plans generated with enlargement run faster
2. Query plans generated with enlargement are shorter

Scalability
Our system has good scalability


Related Work
Query Planning
–Navigational based query planning
–SQL based query planning
–Bucket Algorithm
Deep Web Mining
–Database selection
–E-commerce oriented, no dependency
Keyword Search on Relational Databases
Select-Project-Join Query Optimization

Conclusion
Formulate and solve the query planning problem for deep web databases with dependencies
Develop a dynamic planning algorithm with an approximation ratio of ½
Our benefit model is effective
Our algorithm outperforms the naïve algorithm, and obtains optimal results for most cases

Questions/Comments?