SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

Presentation transcript:

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
Fan Wang
Advisor: Prof. Gagan Agrawal
Ohio State University

The Deep Web
- The definition of “the deep web” from Wikipedia: the deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.

The Deep Web is Huge
- 500 times larger than the surface web
- 7,500 terabytes of information (19 terabytes in the surface web)
- 550 billion documents (1 billion in the surface web)
- More than 200,000 deep web sites
- Relevant to every domain: scientific, e-commerce, market

The Deep Web is Informative
- Deeper content than the surface web
  - Surface web: text format
  - Deep web: specific and relational information
- More than half of the deep web content is in topic-specific databases
  - Biology, Chemistry, Medical, Travel, Business, Academia, and many more…
- 95 percent of the deep web is publicly accessible

Hard to Use the Deep Web
- Challenges for Integration
  - Self-maintained and created
  - Heterogeneous and hidden metadata
  - Dynamically updated metadata
- Challenges for Searching
  - Standard input format
  - Data redundancy and data source ranking
  - Data source dependency
- Challenges for Performance
  - Network latency and caching mechanism
  - Fault tolerance issue

Motivating Example (1)
Biologists have identified gene X and protein Y as contributors to a disease. They want to examine the SNPs (Single Nucleotide Polymorphisms) located in the genes that share the same functions as either X or Y. In particular, for all SNPs that are located in each such gene and have a heterozygosity value greater than 0.01, the biologists want to know the maximal SNP frequency in the Asian population.

Motivating Example (2)
[Figure annotations: “The gene has the same functions as X”; “The gene has the same functions as Y”; “The frequency information of the SNPs located in these genes and filtered by heterozygosity values”]

Motivating Example (3)
- How do you know NCBI Gene could provide gene function information given the gene name?
- Three data sources, dbSNP, Alfred, and Seattle, could provide SNP frequency data; why do you choose dbSNP?
- What if the SNP500Cancer data source is unavailable?
- Do the NCBI Gene and GO data sources both use “function” to represent the meaning of “gene function”?
- I cannot filter SNPs by heterozygosity values on dbSNP
- A path clearly guides the search

Our Contribution: SEEDEEP System
- Discover data source inter-dependency
- Discover data source metadata
- Generate query plans for search
- Fault tolerance mechanism
- Query caching mechanism

Outline
- Introduction and Motivation
- System Core
  - Query planning problem
  - Query planning algorithms
- Other system components
  - Query caching
  - Fault tolerance
  - Schema mining
- Proposed work

What queries does our system support?
- Example: biologists want to examine the SNPs located in the genes that share the same functions as either X or Y. In particular, for all SNPs that are located in each such gene and have a heterozygosity value greater than 0.01, they want to know the maximal SNP frequency in the Asian population.
- Selection-Projection-Join (SPJ) queries
- Aggregation-Groupby queries
- Nested queries: Condition and Entity

Data Source Model (1): Single Data Source
- Each data source is a virtual relational table
- Virtual relational data elements
  - MI: must fill-in input attributes
  - OI: optional fill-in input attributes
  - O: output attributes
  - C: inherent data source constraints
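
The single-source model above can be sketched as a small Python record. This is only an illustration under stated assumptions: the class name `DataSource`, the method `can_query`, and the attribute names given to dbSNP are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One deep web data source viewed as a virtual relational table."""
    name: str
    must_inputs: frozenset       # MI: attributes that must be filled in
    optional_inputs: frozenset   # OI: attributes that may be filled in
    outputs: frozenset           # O: attributes returned on the result page
    constraints: tuple = ()      # C: inherent constraints, kept as plain strings here

    def can_query(self, bound_attrs):
        """A source is callable once every must fill-in input has a value."""
        return self.must_inputs <= set(bound_attrs)


# Hypothetical attribute names, for illustration only.
dbsnp = DataSource(
    name="dbSNP",
    must_inputs=frozenset({"gene_name"}),
    optional_inputs=frozenset({"population"}),
    outputs=frozenset({"snp_id", "frequency"}),
)
```

With this sketch, `dbsnp.can_query({"gene_name"})` is true, while binding only the optional `population` input is not enough to issue a query.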

Data Source Model (2): Correlated Sources
- Hyper-graph dependency model
  - Multi-source dependency
- Dependency relations for data sources D1 and D2
  - Type 1: D1 provides must fill-in inputs for D2
  - Type 2: D1 provides optional fill-in inputs for D2
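
The two dependency types can be illustrated with a pairwise check. This is a simplified sketch: the function name `dependency_type` is hypothetical, and it only looks at one pair of sources, so it does not model the multi-source (hyper-edge) case in which several sources jointly supply the inputs of another.

```python
def dependency_type(d1_outputs, d2_must_inputs, d2_optional_inputs):
    """Classify the dependency of D2 on D1.

    Returns 1 if D1's outputs supply a must fill-in input of D2,
    2 if they only supply an optional fill-in input, and None otherwise.
    """
    outputs = set(d1_outputs)
    if outputs & set(d2_must_inputs):
        return 1
    if outputs & set(d2_optional_inputs):
        return 2
    return None
```

For example, if D1 outputs a gene name and D2 requires a gene name as input, the call returns 1; if D1 only outputs something D2 accepts optionally, it returns 2.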

Planning Algorithm Overview
1. Tree representation of user query
   - Each node represents a simple query
   - Query types:
     1. Aggregation query
     2. Nested entity sub-query
     3. Ordinary query
2. A divide-and-conquer approach
3. A final combination step generates the final query plan

Query Planning Problem for Ordinary Query
- Ordinary query format
  - Entity keywords, attribute keywords, comparison predicates
  - Standard select-project-join SQL query style
- Formulation
  - Sub-graph set cover problem, NP-hard
  - Starting data source
  - Target data source sub-graph, which can have disconnected components
  - Which nodes cover which terms
  - The size should be minimal (our cost model)
  - Node and edge ranking functions are used
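
To make the set cover view concrete, here is a sketch of the classic greedy approximation for plain set cover. It is not the planning algorithm from the talk: it ignores the dependency graph, the starting data source, and the node and edge ranking functions, and the coverage table below is invented for illustration.

```python
def greedy_cover(query_terms, coverage):
    """Pick sources greedily until every query term is covered.

    coverage maps a source name to the set of terms it covers.
    Returns the chosen source names in selection order.
    """
    uncovered = set(query_terms)
    chosen = []
    while uncovered:
        # Choose the source covering the most still-uncovered terms; ties broken by name.
        best = max(sorted(coverage), key=lambda s: len(coverage[s] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            raise ValueError(f"terms not coverable: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical coverage information, for illustration only.
coverage = {
    "NCBI Gene": {"gene", "function"},
    "GO": {"function"},
    "dbSNP": {"snp", "frequency"},
}
plan = greedy_cover({"gene", "function", "frequency"}, coverage)
```

Here `plan` contains NCBI Gene followed by dbSNP; GO is never selected because its only term is already covered.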

Bidirectional Query Planning Algorithm (1)
- Heuristic algorithm based on the algorithm introduced by Kacholia et al.
- Algorithm overview
  - Starting nodes
  - Target nodes
  - Bidirectional graph traversal

Bidirectional Query Planning Algorithm (2)
- How to find the minimal sub-graph
  - Find the shortest paths from starting nodes to target nodes
  - Dijkstra’s shortest path algorithm
- Benefit function
  - Data source coverage
  - Data source data quality, ontology based
  - User constraints matching
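
The shortest-path step can be sketched with a textbook Dijkstra implementation over a data source graph. The edge costs below are hypothetical; the slides do not say how the benefit function is turned into edge weights, so this only shows the path search itself, not the bidirectional traversal.

```python
import heapq


def shortest_path(graph, start, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative edge costs.

    Returns (total_cost, path), or (float("inf"), []) when target is unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical edge costs between data sources, for illustration only.
graph = {"NCBI Gene": {"GO": 1.0, "dbSNP": 4.0}, "GO": {"dbSNP": 1.0}}
cost, path = shortest_path(graph, "NCBI Gene", "dbSNP")
```

In this toy graph the route through GO (total cost 2.0) is preferred over the direct edge of cost 4.0.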

Query Planning Problem for Aggregation Query
- Node connection property
  - The aggregation data source(s) must be directly or indirectly connected with the grouping data source.
- Formulation
  - Sub-graph set cover problem with node connection property constraint
  - NP-hard

Center-spread Query Planning Algorithm (1)
- Algorithm initialization
  - Starting nodes
  - Target nodes
  - Center nodes: aggregation data source nodes
- Algorithm overview
  - Graph traversal starts from the center nodes
  - Gradually add center nodes’ neighbors, adhering to the node connection property

Center-spread Query Planning Algorithm (2)
[Figure label: grouping data source]

Query Planning Problem for Nested Entity Query (1)
- “SNP_Frequency, Gene {Function, X}”
- Find the genes which have the same functions as X: {Gene, Function, X}
- Find the entities specified by b that have the same value on attribute a as the entities that are specified by e1,…,ek

Query Planning Problem for Nested Entity Query (2)
- Node linking property
  - Example: “Gene {Function, Protein X}”, where b = Gene, a = Function, and e = Protein X
  - The linking data source, which is the data source covering keyword a, must be topologically before the data source covering the entity keyword b
- Formulation
  - Sub-graph set cover problem with node linking property constraint
  - NP-hard

Plan Combination
[Figure labels: receiving nodes, ending nodes]

Plan Merging
- Query plans for sub-queries can be similar
- Reduce the network transmission cost of a query plan
- Two edges can be merged if the used input and output of the paired data sources are the same
- Mergeable edge weights
- Optimal merging
  - Compatibility graph CG
  - Maximal node-weighted clique in CG
  - Modified reactive local search (tabu search) algorithm
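
The maximal node-weighted clique objective can be illustrated on a tiny compatibility graph. This sketch swaps in exhaustive search for the modified reactive local search (tabu search) named on the slide, so it only scales to a handful of nodes; the node names and weights are hypothetical.

```python
from itertools import combinations


def max_weight_clique(weights, compatible):
    """Exhaustively find the clique with the largest total node weight.

    weights maps node -> weight; compatible is a set of frozenset pairs.
    Exponential in the number of nodes, so only suitable for tiny graphs.
    """
    nodes = sorted(weights)
    best, best_weight = (), 0
    for size in range(1, len(nodes) + 1):
        for group in combinations(nodes, size):
            # A group is a clique when every pair inside it is compatible.
            if all(frozenset(p) in compatible for p in combinations(group, 2)):
                total = sum(weights[n] for n in group)
                if total > best_weight:
                    best, best_weight = group, total
    return set(best), best_weight


# Hypothetical mergeable edge pairs m1..m3 with their weights.
weights = {"m1": 3, "m2": 2, "m3": 4}
compatible = {frozenset({"m1", "m2"})}
clique, total = max_weight_clique(weights, compatible)
```

Here m1 and m2 can be merged together for a total weight of 5, which beats taking m3 alone.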

Query Execution Optimization: Pipelined Aggregation
- Performing aggregation in a pipelined manner
- Reduce transmission cost by early pruning
- Grouping-first query plans
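
Pipelined aggregation with early pruning can be sketched as a streaming fold: rows are consumed one at a time, filtered immediately, and reduced to a running per-group maximum. The function name, the row layout, and the sample values are assumptions made for illustration; the heterozygosity threshold of 0.01 mirrors the motivating example.

```python
def pipelined_max(rows, group_key, value_key, predicate=lambda row: True):
    """Fold a stream of rows into per-group maxima without buffering the rows."""
    best = {}
    for row in rows:
        if not predicate(row):
            continue  # early pruning: the row is dropped as soon as it arrives
        g, v = row[group_key], row[value_key]
        if g not in best or v > best[g]:
            best[g] = v
    return best


# Hypothetical SNP rows, for illustration only.
rows = [
    {"gene": "G1", "freq": 0.2, "het": 0.05},
    {"gene": "G1", "freq": 0.7, "het": 0.005},
    {"gene": "G1", "freq": 0.4, "het": 0.02},
    {"gene": "G2", "freq": 0.1, "het": 0.3},
]
result = pipelined_max(iter(rows), "gene", "freq", lambda r: r["het"] > 0.01)
```

The second row is pruned by the heterozygosity filter, so the maximum for G1 is 0.4 rather than 0.7.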

Query Execution Optimization: Moving Partial Grouping Forward
- Aggregation-first query plans
- Conditions
  - Aggregation data source AD covers a term pga
  - 1-to-1 relation between the entity specified by pga and the entity specified by the grouping attribute
  - N-to-1 relation between the entity specified by the aggregation attribute and the entity specified by pga

Query Planning Evaluation (1)
- Cost model evaluation: query plan size

Query Planning Evaluation (2)
- Planning algorithm scalability
- 0.03% query planning overhead

Query Planning Evaluation (3)
- Optimization techniques
  - NO: No optimization technique used
  - Merging: Only perform plan merging
  - Grouping: Only perform two grouping optimizations
  - M+G: Perform both merging and grouping

Outline
- Introduction and Motivation
- System Core
  - Query planning problem
  - Query planning algorithms
- Other system components
  - Query caching
  - Fault tolerance
  - Schema mining
- Proposed work

Query Caching: Motivation
- High response time for deep web queries
- Motivating observations
  - Data source redundancy
  - Data sources return answers in an all-in-one fashion
  - Users issue similar queries in one session
- Query-plan-driven query caching method
  - Cache not only previous data but also query plans
  - Caching query plans increases the possibility of data reuse

Query Caching: Strategy Overview
- We are given a list of n previously issued queries, each of which has a query plan Pi
- Given a new query q, we want to generate a query plan for q in the following way
  - Define a reusability metric to identify the previous query plans that are beneficial to reuse
  - Select a set of reusable previous queries and query plans
  - Use a selection function to obtain the sub-query plans we would like to reuse
  - Use a modified query planning algorithm to generate a query plan for the new query based on reusable plan templates

Query Caching: Evaluation
- Three mechanisms compared
  - NC: No Caching
  - DDC: Data Driven Caching
  - PDC: Plan Driven Caching

Fault Tolerance: Motivation
- Remote data sources are vulnerable to unavailability or inaccessibility
- Data redundancy across multiple data sources, including partial redundancy
  - Use similar data sources to hide unavailable or inaccessible data sources
- Data-redundancy-based incremental query processing
  - Do not generate a new plan from scratch
  - The inaccessible part is suspended
  - Incrementally generate a new part to replace the inaccessible part

Fault Tolerance: Strategy Overview
- System model: data redundancy graph model
  - Nodes: data sources
  - Edges: redundancy usage between a data source pair
- Given a query plan P and a set of unavailable data sources UDS, find the minimal impacted sub-plan MISubP
  - Impacted sub-plan: the sub-plan of the original plan P which is rooted at the unavailable data sources UDS
  - Minimal impacted sub-plan: an impacted sub-plan with no usable data sources
- Generate the maximal fixable sub-query of the minimal impacted sub-plan
  - The maximal fixable sub-query doesn’t contain any dead attributes which are covered by the minimal impacted sub-plan
- Generate a query plan for the maximal fixable sub-query as the new partial query plan
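
The notion of an impacted sub-plan can be sketched as graph reachability: everything downstream of an unavailable data source in the plan is impacted. This is only a rough illustration under assumptions: the plan is represented as a hypothetical adjacency map, and the further trimming to the minimal impacted sub-plan and the maximal fixable sub-query is not modeled.

```python
from collections import deque


def impacted_subplan(plan_edges, unavailable):
    """Return the nodes of plan P reachable from any unavailable data source.

    plan_edges maps a source to the sources that consume its output.
    """
    impacted = set(unavailable)
    queue = deque(unavailable)
    while queue:
        node = queue.popleft()
        for child in plan_edges.get(node, ()):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


# Hypothetical query plan, for illustration only.
plan = {
    "NCBI Gene": ["GO", "SNP500Cancer"],
    "SNP500Cancer": ["dbSNP"],
    "GO": [],
}
broken = impacted_subplan(plan, ["SNP500Cancer"])
```

If SNP500Cancer is unavailable, the sketch marks SNP500Cancer and the downstream dbSNP step as impacted, while NCBI Gene and GO can keep their results.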

Fault Tolerance: Evaluation
- Query plan execution time
  - Generate new plan from scratch
  - Our incremental query processing strategy

Schema Mining: Motivation
- Data source metadata reveals data source coverage information
  - Metadata: input and output attributes
- Data sources only return a partial set of output attributes in response to a query: the ones that have non-NULL values for the input
- Find an approximately complete output attribute set

Schema Mining: Strategy Overview
- Sampling-based method
  - A modest-sized sample could discover most of a deep web data source’s output schema
  - Rejection sampling method to choose the sample
  - A sample size estimator is constructed
- Mixture model method
  - A sample alone is not enough
  - Output attributes could be shared among different data sources
  - Data source: a probabilistic data source model generates output attributes with certain probability
  - Borrowability among data sources: an output attribute is generated from a mixture of different probabilistic data source models

Schema Mining: Evaluation
- Four methods compared
  - SamplePC: Sampling + Perfect label classifier
  - SampleRC: Sampling + Real label classifier
  - Mixture: Mixture model method
  - Mixture + Sample: SampleRC + Mixture

Outline
- Introduction and Motivation
- System Core
  - Query planning problem
  - Query planning algorithms
- Other system components
  - Query caching
  - Fault tolerance
  - Schema mining
- Proposed work

Answering Relationship Search over Deep Web Data Sources: Motivation and Formulation
- Knowledge is only useful when it is related
  - Linked web data
- Deep web data sources are ideal sources for linked data
  - Supported by backend relational databases
  - Data on output pages are related
  - Deep web data sources are correlated through input and output relations
  - Deep web data source output pages are hyperlinked with output pages from other data sources
- Problem formulation
  - A relationship query RQ={ke1,ke2}
  - Find the terms that relate ke1 with ke2

Relationship Query: Proposed Method 1
- Use correlation among data sources
- Q={MSMB, RET}: find the relation between these two genes
  - Connect the data source taking one gene as input and another data source taking the other gene as output
  - Connect the data sources taking the two genes as input
- A modified version of the query planning algorithm introduced in the current work

Relationship Query: Proposed Method 2
- Use hyperlinks among different output pages to build relations
- Two-level source-object graph model
  - Sampled output pages
    - Extract objects (entities), represented as (data source, object name) pairs
    - Extract hyperlinks on output pages, pointing from one object to another object in a different output page
  - Data source nodes and object nodes
  - Data source virtual link edges connect correlated data sources
  - Hyperlink edges connect hyperlinked object nodes, or connect a data source node with its corresponding object nodes
  - Edges are weighted

Relationship Query: Graph Model
[Figure labels: data source node, object node, data source virtual link edge, hyperlink edge, edge weight]

Relationship Query: Method 2 Algorithm
- Shortest paths
  - Identify two nodes in the graph as path ends
  - Path weight: multiplication of edge weights
  - Shortest N paths: NP-hard problem

Quality-Aware Data Source Selection based on Functional Dependency Analysis: Motivation
- Current data source selection methods
  - Coverage
  - Overlap
  - Relevance
- Quality-aware data source selection
  - Data richness
  - Both sources A and B provide information on genes and their encoded proteins
  - A only considers one encoding schema, but B considers two
  - B is better than A, but how to detect this?
- Can we find the information we need? Which one is better?

Quality-Aware Data Source Selection: Proposed Method (1)
- Functional dependency
  - A functional dependency X → Y requires that any two tuples t1 and t2 that have t1[X] = t2[X] must have t1[Y] = t2[Y]
  - The previous example: data source A, data source B
- Extract functional dependencies
  - Sampling: data tuples from deep web data sources
  - Discover functional dependencies
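
Checking whether a functional dependency holds on sampled tuples can be sketched as follows. This uses the textbook definition of a functional dependency, not necessarily the discovery procedure intended in the talk; the function name `holds` and the sample tuples for sources A and B are hypothetical.

```python
def holds(tuples, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds on the sample.

    tuples is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for t in tuples:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # two tuples agree on lhs but differ on rhs
    return True


# Hypothetical samples: A lists one protein per gene, B lists two for gene g1.
source_a = [{"gene": "g1", "protein": "p1"}, {"gene": "g2", "protein": "p2"}]
source_b = [{"gene": "g1", "protein": "p1"}, {"gene": "g1", "protein": "p1b"}]
a_has_fd = holds(source_a, ["gene"], ["protein"])
b_has_fd = holds(source_b, ["gene"], ["protein"])
```

In this toy data the dependency gene → protein holds on the sample from A but not on the sample from B, which is one way the two sources could be told apart. A dependency that holds on a sample may still fail on the full data source.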

Quality-Aware Data Source Selection: Proposed Method (2)
- A set of data sources
- Each has a set of functional dependencies
- Functional dependency lattice
- An attribute set
- Data source has functional dependency set on

Optimized Query Answering over Deep Web Data Sources: Motivation and Formulation
- Current technique: minimize the number of data sources with a benefit function
- A more interesting aspect: minimize the total query plan execution time
- Optimization problem 1: single query
  - Minimize response time (ERT), maximize plan quality (RS)
  - Maximize the plan gain per execution unit
- Optimization problem 2: multiple queries
  - Minimize total response time for multiple queries
  - Scheduling problem; don’t assume similarity among queries

Optimized Query Answering: Proposed Methods
- Optimization for single query
  - Tabu search framework to find the optimal plan
- Optimization for multiple queries
  - Query as a job with a list of tasks
  - Data sources as machines
  - Dependencies among tasks; each task can be performed on a set of machines
  - Data source response time as machine working time
  - Job scheduling problem

Conclusion
- SEEDEEP: A System for Exploring and quErying DEEP web data sources
- Query planning
  - Three query planning algorithms
  - Query planning and execution optimization techniques
- Other components
  - Query caching: query-plan-driven
  - Fault tolerance: redundancy-based incremental query processing
  - Schema mining: sampling and mixture model approach
- Proposed work
  - New query types: relationship query
  - Data source selection: quality-aware
  - New optimization problems: single query and multi-queries