SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
PhD Defense, The Ohio State University, Summer 2010

Presentation transcript:

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
Fan Wang
Advisor: Prof. Gagan Agrawal
Ohio State University

The Deep Web
The definition of "the deep web" from Wikipedia: the deep web refers to World Wide Web content that is not part of the surface web, i.e., content that is not indexed by standard search engines.
Some examples: Expedia, Priceline

The Deep Web is Huge and Informative
500 times larger than the surface web
7,500 terabytes of information (19 terabytes in the surface web)
550 billion documents (1 billion in the surface web)
More than 200,000 deep web sites
Relevant to every domain: scientific, e-commerce, market
95 percent of the deep web is publicly accessible (with access limitations)

How to Access Deep Web Data
1. A user issues a query through the input interface of a deep web data source
2. The query is translated into an SQL-style query
3. The query triggers a search on the backend database
4. Answers are returned through the network
Example: Select price From Expedia Where depart=CMH and arrive=SEA and dedate="7/13/10" and redate="7/16/10"

Drawbacks
Constrained flexibility
– Types of queries: aggregation queries, nested queries, queries with grouping requirements
– User-specified predicates
Inter-dependent data sources
– Users may want data from multiple correlated data sources
Long latency
– Network transmission time
– Denial of service

Goal
Develop a deep web search tool that supports online (real-time), structured, high-level queries (semi-)automatically.

Challenges
Challenges for integration
– Data sources are self-created and self-maintained
– Heterogeneous, hidden, and dynamically updated metadata
Challenges for searching
– Limited data access patterns
– Data redundancy and data quality
– Data source dependency
Challenges for performance
– Network latency
– Fault tolerance

Our Contributions (1)
Support online aggregation over the deep web
– Answer deep web aggregation queries in an OLAP fashion
– Propose novel sampling techniques to find accurate approximate answers in a timely manner
Support low selectivity queries over the deep web
– Answer low selectivity queries in the presence of limited data access
– Propose a novel Bayesian method to find an optimal stratification for a hidden selective attribute

Our Contributions (2)
Support structured SQL queries over the deep web
– Support SPJ, aggregation, and nested queries
Automatic hidden schema mining
– A statistical framework to discover hidden metadata from deep web data sources
Novel query caching mechanism for query optimization
Effective fault tolerance handling mechanism

System Overview
Hidden schema discovery
Data source integration
Structured SQL query
Sampling the deep web
Online aggregation
Low selectivity query

Outline
Introduction
Sampling methods for online aggregation queries
– Motivation
– ANS and TPS methods
– Evaluation
Stratification methods for low selectivity queries
– Motivation
– Harmony search and Bayesian-based adaptation
– Evaluation
Future work and conclusion

Online Aggregation: Motivation
Aggregation queries require data enumeration. Example: I want to know the average airfare from the US to Europe across all major US airlines for flights in the next week.
Select AVG(airfare) From AirTable AT Where AT.depart = any US city and AT.arrive = any European city
A relational database can evaluate this query directly. A deep web data source needs enumeration: issue one query per city pair (e.g., NYC-London, Boston-Paris, LA-Rome) and collect the returned fares (e.g., AA: 500, UA: 550, USAir: 450, Delta: 400, UA: 600, AA: 650).
Where do you get these city names? How long can you wait? What if the data is updated dynamically?

Initial Thoughts
Sampling: approximate answers
Simple random sampling (SRS)
– Every data record has the same probability of being selected
Drawbacks of SRS
– Poor performance on skewed data
– High sampling cost to perform SRS on the deep web (Dasgupta et al., HDSampler)

We Want To Achieve
Handle data with (possibly high) skew
– The top 20 IT companies accounted for 80% of the sales among the top 100 IT companies in 2005
– Hidden data (hard to gather statistical information)
– Skew or no skew? Unknown data distribution
– Pilot sample: how much can you trust your pilot sample?
Lower sampling cost for sampling deep web data

Our Contributions
Two sampling algorithms
– ANS (Adaptive Neighborhood Sampling): handles skewed data by making skew-causing data easier to sample
– TPS (Two Phase adaptive Sampling): lower sampling cost
Performance
– Accurate estimates without prior knowledge
– ANS and TPS outperform HDSampler by a factor of 4 on skewed data
– TPS has one-third of the sampling cost of HDSampler

Background Knowledge
A survey of a rare type of monkey that lives only in a small but dangerous area in southern China
Associated samples

Why is this good, and can we use it?
Samples more rare but significant data records
– Good for handling skewed data
Associated samples have relatively low sampling cost
– Cheaper than SRS with the same sample size
Yes, we can use it, with modification
– Much real-world data is skewed (IT company sales, household income)
– Rare data often form clusters
– Deep web data sources often return multiple records for one input sample

Drawbacks
Performance depends on the initial sample
The initial sample is a simple random sample
No cost limit is explicitly considered
– What is the size of the initial sample?
– How many associated samples should be added?

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer Select a random sample –Stop random sampling if any of the two termination rules applies We have sampled k number of units of interest We have reached the cost limit –Take the sampled data record, add it to our sample –If this data record is a unit of interest Obtain its neighbors (neighborhood sampling) For each data records obtained from neighborhood sampling –Add it to our sample –Perform recursive neighborhood sampling if necessary –If neighborhoods are too large Increase unit of interest threshold –If neighborhoods are too small Decrease unit of interest threshold The ANS Sampling Algorithm Aggressively sample skew causing data Control sampling cost

ANS Example
Estimate the total sales of IT companies in 2005
Each point represents a company's sales record
Color shows the scale of the sales value: the darker, the higher
The neighborhood of a data record is defined according to some rules

1. Select initial random samples sequentially until we have k units of interest (k=3); a unit of interest is a sales value larger than a threshold
2. Explore the neighborhood recursively for all units of interest until the total number of samples reaches a limit
3. If too many neighbors are included, increase the unit-of-interest threshold

Estimators and Analysis for ANS
Estimator for AVG with a fixed unit-of-interest threshold
Lemma 1: the estimator is biased, but when k is small the bias is very small
We also propose an estimator for a variable unit-of-interest threshold using post-stratification; it combines the estimated average value from each stratum h (all samples corresponding to one specific unit-of-interest threshold value)
β is the percentage of units of interest w.r.t. the entire data set, estimated as β = (k-1)/(n-1)
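
For reference, a generic post-stratified mean estimator of the kind the slide alludes to; the stratum weights and the exact ANS-specific bias correction in the thesis may differ from this sketch.

```latex
% Generic post-stratified estimator: \bar{y}_h is the sample mean of
% stratum h and W_h its weight; the ANS-specific weights are not shown.
\[
\hat{\mu} \;=\; \sum_{h=1}^{H} W_h\, \bar{y}_h,
\qquad
\hat{\beta} \;=\; \frac{k-1}{n-1}
\]
```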

Drawbacks of ANS
The initial samples are simple random samples
SRS: one input search only gets one sample from the output page
High cost

The TPS Sampling Algorithm
Partition the data set D into M sub-spaces
– According to combinations of input attribute values
Randomly select m sub-spaces
Select a sample of size n1 from each of the m selected sub-spaces
– First sampling phase
– For a selected sub-space, if any selected data record is a unit of interest, proceed
Select a sample of size n2 from the corresponding sub-spaces
– Second sampling phase
– Sub-spaces containing units of interest in the first phase may give us more units of interest in the second phase
Key ideas: aggressively draw skew-causing data; low sampling cost
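
A minimal Python sketch of the two-phase scheme described above. The sub-space partitioning and the unit-of-interest predicate are placeholder assumptions, not the thesis implementation.

```python
import random

def tps_sample(subspaces, m, n1, n2, is_unit_of_interest):
    """Minimal sketch of Two Phase adaptive Sampling (TPS).

    subspaces           : dict mapping a sub-space key (a combination of
                          input attribute values) to its list of records
    m                   : number of sub-spaces to select at random
    n1, n2              : first- and second-phase sample sizes per sub-space
    is_unit_of_interest : predicate on a record
    """
    chosen = random.sample(list(subspaces), min(m, len(subspaces)))
    sample = []
    for key in chosen:
        records = subspaces[key]
        # Phase 1: a small probe sample from the sub-space.
        phase1 = random.sample(records, min(n1, len(records)))
        sample.extend(phase1)
        # Phase 2: if the probe hit a unit of interest, sample the
        # same sub-space again more heavily.
        if any(is_unit_of_interest(r) for r in phase1):
            remaining = [r for r in records if r not in phase1]
            sample.extend(random.sample(remaining, min(n2, len(remaining))))
    return sample
```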

Unit of interest: sales larger than a threshold
1. Randomly select the sub-spaces (input searches)
2. Randomly select N1 samples in each selected sub-space (N1=4)
3. If any sample selected in a sub-space is a unit of interest, select N2 more samples from that sub-space (N2=3)

Evaluation
Data sets
– Synthetic data sets: generated using MINITAB, varying data skew from 1 to 9
– US Census data: 2002 US economic census data on wholesale trade product lines listed by kind of business (skew = 8)
– Yahoo! Auto: prices of used Ford cars from 2000 to 2009 located within 50 miles of a zip code (skew = 0.7)
Metrics
– AER: absolute error rate
– Sampling cost: number of input samples needed
Methods
– ANS
– TPS
– SRS

ANS Performance w.r.t. Data Skew
1. AER increases moderately as data skew increases
2. When k=8, AER is consistently smaller than 19%
3. A larger k does not help much to improve accuracy

TPS Performance w.r.t. Data Skew
1. AER increases moderately as data skew increases
2. For a sub-space sample size of 30%, AER is always smaller than 17%

AER Comparison on Synthetic Data
1. All methods work well on data with small skew
2. HDSampler performs poorly on data with skew > 2
3. Our two methods outperform HDSampler by a factor of 5

AER Comparison on US Census Data

AER Comparison on Yahoo! Data
1. For AVG, the three methods are comparable in terms of accuracy
2. For MAX, our methods are better (by a factor of 4)

Sampling Cost Comparison on Yahoo! Data
1. To achieve a low AER, TPS has one-third of the sampling cost of HDSampler
2. The total number of samples TPS obtains (with the same cost in time) is twice the number of samples HDSampler obtains

Outline
Introduction
Sampling methods for online aggregation queries
– Motivation
– ANS and TPS methods
– Evaluation
Stratification methods for low selectivity queries
– Motivation
– Harmony search and Bayesian-based adaptation
– Evaluation
Future work and conclusion

Motivating Example: Low Selectivity
Random sampling
– None of the selected records satisfies the low selectivity predicate
Stratified sampling
– Partitioning attribute, selective attribute
– How to perform stratification
– Distance-based methods (clustering, outlier indexing), but the selective attribute is not queriable
– Auxiliary-attribute-based stratification: Dalenius and Hodges's method, Ekman's method, Gunning and Horgan's geometric method; these require strong correlation between the auxiliary and selective attributes

Our Contributions
We focus on aggregation queries with a low selectivity predicate on a hidden selective attribute
Propose a Bayesian adaptive harmony search stratification method to stratify a hidden selective attribute based on an auxiliary attribute
The stratification accurately reflects the distribution of the hidden selective attribute even when the correlation is weak
Estimations from our stratification outperform existing methods by a factor of 5
Estimation accuracy obtained from our method is higher than 95% for 0.01% selectivity queries

Background: Stratification
Purpose
– Within-stratum homogeneity
Partition the data set R into k strata
– The partitioning attribute x has a value range; find k-1 breaking points
Sample allocation
– Neyman allocation
– Bayesian Neyman allocation
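
For reference, the standard Neyman allocation (the textbook form, not specific to this thesis): with stratum sizes N_h, stratum standard deviations S_h, and total sample size n,

```latex
\[
n_h \;=\; n \cdot \frac{N_h S_h}{\sum_{j=1}^{k} N_j S_j}
\]
```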

Background: Harmony Search (1)
A phenomenon-mimicking meta-heuristic algorithm inspired by the improvisation process of musicians
Optimize an objective function
Initialize the harmony memory: M random guesses of the decision variable vector

Background: Harmony Search (2)
Improvise a new harmony from the harmony memory
– A new harmony vector is generated using two parameters, HMCR and PAR
Update the harmony memory
Terminate
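
A minimal Python sketch of the standard harmony search loop (memory consideration with rate HMCR, pitch adjustment with rate PAR, random selection otherwise). The objective function and the bounds are placeholders, not the stratification objective from the thesis.

```python
import random

def harmony_search(objective, dim, bounds, memory_size=10,
                   hmcr=0.9, par=0.3, bandwidth=0.05, iterations=200):
    """Minimal sketch of harmony search (minimization).

    objective : function mapping a candidate vector to a score
    dim       : number of decision variables (e.g., k-1 breaking points)
    bounds    : (low, high) range shared by every decision variable
    """
    low, high = bounds
    # Harmony memory: M random guesses of the decision variable vector.
    memory = [sorted(random.uniform(low, high) for _ in range(dim))
              for _ in range(memory_size)]
    scores = [objective(v) for v in memory]

    for _ in range(iterations):
        new = []
        for i in range(dim):
            if random.random() < hmcr:
                # Memory consideration: reuse a value from the memory.
                value = random.choice(memory)[i]
                if random.random() < par:
                    # Pitch adjustment: perturb the reused value slightly.
                    value += random.uniform(-bandwidth, bandwidth) * (high - low)
            else:
                # Random selection from the allowed range.
                value = random.uniform(low, high)
            new.append(min(max(value, low), high))
        new.sort()  # breaking points must stay ordered
        score = objective(new)
        # Update: replace the worst harmony if the new one is better.
        worst = max(range(memory_size), key=lambda j: scores[j])
        if score < scores[worst]:
            memory[worst], scores[worst] = new, score

    best = min(range(memory_size), key=lambda j: scores[j])
    return memory[best], scores[best]

# Example usage with a toy objective:
# harmony_search(lambda v: sum((x - 0.5) ** 2 for x in v), dim=3, bounds=(0, 1))
```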

Algorithm Overview (1)
Find the best stratification of the selective attribute

Algorithm Overview (2)
Consider an auxiliary attribute as the partitioning attribute
Harmony memory
– A list of breaking-point vectors over the auxiliary attribute

Harmony Objective Function
What is a good stratification?
– Condition 1: homogeneous data within each stratum (small sum of sample variances across strata)
– Condition 2: stratification with high precision, i.e., the low-selectivity region of the distribution is exclusively covered by some strata

Sample Allocation
Determine the number of samples assigned to each stratum
Which strata should receive heavier sampling weight?
– Strata with diversity (more heterogeneous): high sample variance
– Strata that cover a large percentage of the low-selectivity region: high precision

Parameter Adaptation: Overview
Two parameters: HMCR and PAR
X axis: value of the HMCR parameter
Y axis: percentage of cases in which we obtain a better harmony vector

Bayesian Method Overview
Estimate an unknown parameter (posterior distribution) based on prior knowledge or belief (prior distribution)
Observed data y, unknown parameter θ
In our scenario
– Observed data: the harmony parameter values that yield a better harmony vector
– Unknown parameter: the adaptation pattern of the harmony parameters
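
For reference, the standard Bayes rule relating these quantities:

```latex
\[
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
\;\propto\; p(y \mid \theta)\, p(\theta)
\]
```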

Bayesian Adaptation (1)
Assume the adaptation patterns of the parameters HMCR and PAR follow probability functions
Represent our belief about θ as a prior probability distribution
Observe the data: the HMCR and PAR values that yield the best new harmony vector given the current harmony memory
Based on the observed data, compute the posterior distribution of θ
Compute the adapted parameter value from the posterior

Bayesian Adaptation (2)
Probability function and prior distribution of θ (details in the backup slides)

Evaluation
Data sets
– Synthetic data sets: generated using MINITAB, varying the correlation between the auxiliary and selective attributes from 1 to 0.3
– US Census data: correlation between number of settlements and sales is 0.56
– Yahoo! Auto: correlation between year and mileage is 0.7
Metric
– AER: absolute error rate
Methods
– Leaps and Bounds (L&B)
– Dalenius and Hodges (D&H)
– Random sampling (no stratification)
– Ours (HarmonyAdp)

HarmonyAdp Performance (1)
1. Higher accuracy with more iterations
2. For iterations > 40 (2% of the total data size), AER is low for all selectivity values
3. Robust with respect to data correlation

HarmonyAdp Performance (2)

Four Methods Comparison (1)
1. All methods work well on easy queries
2. When queries get harder, D&H, L&B, and Random degrade severely; ours maintains good performance
3. For 0.1% selectivity queries, our method outperforms the others by a factor of 5
4. For extremely low selectivity queries, the error rate of our method is always lower than 18%

Four Methods Comparison (2)
1. Similar pattern
2. The performance degradation of the other three algorithms on the Auto data is small, because the data correlation is high

Conclusion
Propose a system for exploring and querying deep web data sources
Propose novel sampling methods to handle online aggregation and low selectivity queries
Propose novel query planning algorithms to answer structured SQL queries over the deep web
The system also has the following features
– Self-healing from unavailable/inaccessible data sources
– Query optimization
– Automatic hidden schema mining and integration

Future Work
Better understanding of deep web data
– Data quality: reliability, authority, data provenance models, data source correlation
– Data distribution: learning an approximate data distribution with a bounded confidence interval; handling consistency issues
Better querying of the deep web
– Structured queries: existence (is-there) queries, data mining queries
– Semantic queries: incorporate the semantic web with the deep web; automatic ontology generation from the deep web; semantic query answering
– Result ranking: ranking results in heterogeneous formats; combining results from different/correlated data sources

Thank You!

Backup
ANS Estimator
TPS Estimator
SPJ NP-Hard
Aggregation NP-Hard
Bayesian Inference
Other System Modules

Bayesian Adaptation (2)
Probability function
– HMCR ranges from 0 to 1
– HMCR varies from small values to large values
– A Beta distribution best mimics this adaptation pattern

Bayesian Adaptation (3)
Prior distribution of θ
– Our belief about θ before any data is observed, i.e., before the execution of the harmony search algorithm
– What is our belief? For the Beta distribution: if θ < β, the peak occurs before HMCR = 0.5; if θ > β, the peak occurs after HMCR = 0.5
– At the beginning, HMCR is small
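
For reference, the standard Beta density and its mode; the slides appear to use θ and β for the two shape parameters, which are usually written α and β:

```latex
\[
f(x;\alpha,\beta) \;=\; \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},
\quad 0 \le x \le 1,
\qquad
\text{mode} \;=\; \frac{\alpha-1}{\alpha+\beta-2} \ \ (\alpha,\beta>1)
\]
```

When the first shape parameter is smaller than the second, the mode lies below 0.5, which matches the slide's claim that the peak occurs before HMCR = 0.5.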

Lemma on Bias

TPS Estimator

Details on the theorem, sufficient statistics, etc.

Details on MAX

SPJ NP-Hard
Let C = {C1, C2, ..., Cm} be an instance of the set cover problem. The node set V of G is given by V = {C1, ..., Cm}.

Aggregation NP-Hard
Let C = {C1, C2, ..., Cm} be an instance of the set cover problem. The node set V of G is given by V = {C1, ..., Cm, W}. The grouping attribute is s*, and node W covers s*.

ANS Estimator Bias Evaluation
1. With k=8, we achieve the best estimation
2. With k>10, AER increases moderately and bias occurs
3. With k=10, we reach the cost limit and bias occurs

Outline
Introduction
Sampling methods for online aggregation queries
– Motivation
– ANS and TPS methods
– Evaluation
Stratification methods for low selectivity queries
– Motivation
– Harmony search and Bayesian-based adaptation
– Evaluation
Answering complex structured queries
Query optimization and fault tolerance
Future work and conclusion

Motivating Example: Structured Query
Biologists have identified that gene X and protein Y are contributors to a disease. They want to examine the SNPs (Single Nucleotide Polymorphisms) located in the genes that share the same functions as either X or Y. In particular, for all SNPs located in each such gene and having a heterozygosity value greater than 0.01, biologists want to know the maximal SNP frequency in the Asian population.

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer The gene has the same functions as XThe gene has the same functions as YThe frequency information of the SNPs located in these genes and filtered by heterozygosity values

Query Planning Algorithm
We propose query planning algorithms to support
– Select-Project-Join queries
– Aggregation queries
– Uncorrelated nested queries
We also extend our algorithm to the following SQL operators
– Union and OR
– Order by
– Having
The planning problem is an NP-hard sub-graph search problem
– We propose heuristic graph search algorithms

Query Optimization Techniques
Data redundancy across data sources
"All-in-one" result pages of deep web data sources
– Deep web data sources return everything in one page, whether you want it or not
Query-plan-driven caching mechanism to optimize query execution
– Reuse similar previous query plans
– Increase the possibility of reusing previous data
Results
– Our method achieves twice the speed of a baseline method for half the cases

Fault Tolerance Techniques
Inaccessibility and unavailability of deep web data sources
Self-healing approach for deep web query execution
Data-redundancy-based fault tolerance techniques
– Hide unavailable and/or inaccessible data sources
Steps
– Find the minimal impacted subplan
– Find the maximal fixable query
– Generate a replacement plan
Result
– For half the cases, our method achieves a 50% speedup compared to a baseline method

Initial Thoughts
Sampling: approximate answers
Simple random sampling (SRS)
– Every data record has the same probability of being selected
Drawbacks of SRS
– Poor performance on skewed data
– High sampling cost to perform SRS on the deep web (Dasgupta et al., HDSampler)
Example: D = (1,1,1,1,1,1,1,1,10,1000), true average = 101.8
SRS samples of size 2: S1(1,1) = 1, S2(1,10) = 5.5, S3(1,1000) = 500.5, S4(10,1000) = 505
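
A small Python sketch, using the slide's own data set D, illustrating why size-2 SRS estimates swing so widely on skewed data; the exhaustive enumeration of pairs is only for illustration and is not part of the thesis.

```python
import itertools
import statistics

# The slide's example data set: eight 1s, one 10, one 1000 (true mean 101.8).
D = [1, 1, 1, 1, 1, 1, 1, 1, 10, 1000]
true_mean = statistics.mean(D)  # 101.8

# Every possible simple random sample of size 2 and its mean estimate.
estimates = [statistics.mean(pair) for pair in itertools.combinations(D, 2)]

print(f"true mean           : {true_mean}")
print(f"estimate range      : {min(estimates)} .. {max(estimates)}")
print(f"std dev of estimates: {statistics.pstdev(estimates):.1f}")
```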

The Deep Web is Informative
Structured data
– Surface web: text format
– Deep web: relational data in backend relational databases
Topic-specific data
– Biology, chemistry, medicine, travel, business, academia, and many more
Publicly accessible data
– 95 percent of the deep web is publicly accessible (with access limitations)

Sufficient Statistics
No other statistic that can be calculated from the same sample provides any additional information about the value of the parameter.
The concept is most general when defined as follows: a statistic T(X) is sufficient for an underlying parameter θ precisely if the conditional probability distribution of the data X, given the statistic T(X), is not a function of the parameter θ.
This is why the final sample s_i is a sufficient statistic for the estimator (all information must be contained in the final sample).

Rao–Blackwell Theorem
In statistics, the Rao–Blackwell theorem is a result which characterizes the transformation of an arbitrarily crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria.
The Rao–Blackwell theorem states that if g(X) is any kind of estimator of a parameter θ, then the conditional expectation of g(X) given T(X), where T is a sufficient statistic, is typically a better estimator of θ, and is never worse. The mean squared error of the Rao–Blackwell estimator does not exceed that of the original estimator.
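
For reference, the statement in symbols (the standard form, not specific to this thesis): if g(X) is an estimator of θ and T(X) is sufficient for θ, then

```latex
\[
\hat{\theta}^{*} \;=\; \mathbb{E}\!\left[\,g(X)\mid T(X)\,\right],
\qquad
\mathbb{E}\!\left[(\hat{\theta}^{*}-\theta)^{2}\right] \;\le\; \mathbb{E}\!\left[(g(X)-\theta)^{2}\right]
\]
```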

Conditional Expectation
Let X and Y be discrete random variables; then the conditional expectation of X given the event Y = y is a function of y over the domain of Y.
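
The standard definition for the discrete case:

```latex
\[
\mathbb{E}[X \mid Y=y] \;=\; \sum_{x} x \, P(X=x \mid Y=y)
\]
```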

Chebyshev Inequality
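
The slide's formula did not survive extraction; the standard statement of the inequality is:

```latex
\[
P\big(|X-\mu| \ge k\sigma\big) \;\le\; \frac{1}{k^{2}}, \qquad k>0
\]
```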

Bayesian Computation (1)
The posterior estimate for θ can be computed from the prior and the likelihood
Using Monte Carlo sampling
– Generate a sequence of random variables having a common density
– Using the strong law of large numbers, we can approximate the posterior estimate by sample averages

Bayesian Computation (2)
We choose the sampling density to be equal to the prior distribution
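
A minimal Python sketch of this style of Monte Carlo posterior computation, drawing the sample sequence from the prior as the slide describes. The Beta prior shape values and the Bernoulli likelihood are placeholder assumptions, not the exact quantities used in the thesis.

```python
import random

def posterior_mean_via_prior_sampling(observations, n_draws=100_000,
                                      prior_a=2.0, prior_b=5.0):
    """Approximate E[theta | y] by sampling theta from the prior.

    E[theta | y] = E_prior[theta * L(y | theta)] / E_prior[L(y | theta)],
    and both expectations are approximated by sample averages
    (justified by the strong law of large numbers).

    observations : list of 0/1 outcomes (1 = the parameter value produced
                   a better harmony vector), modeled here with a Bernoulli
                   likelihood as a simplifying assumption.
    """
    def likelihood(theta):
        p = 1.0
        for y in observations:
            p *= theta if y == 1 else (1.0 - theta)
        return p

    num = den = 0.0
    for _ in range(n_draws):
        theta = random.betavariate(prior_a, prior_b)  # draw from the prior
        w = likelihood(theta)
        num += theta * w
        den += w
    return num / den

# Example usage with hypothetical observed successes/failures.
print(posterior_mean_via_prior_sampling([1, 0, 1, 1, 0]))
```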

Strong Law of Large Numbers
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
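
In symbols (the standard statement): for i.i.d. X_1, X_2, ... with mean μ,

```latex
\[
\bar{X}_n \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow{\ \text{a.s.}\ }\; \mu
\qquad \text{as } n \to \infty
\]
```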

Neyman Allocation

Leaps and Bounds
Assumptions
– Strong correlation
– Uniform distribution within each stratum (auxiliary attribute)
– Equal coefficients of variation in each stratum (auxiliary attribute)
The breaks follow a geometric progression
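
A sketch of what geometric-progression breaks look like, in the form used by Gunning and Horgan's geometric stratification; the exact parameterization in the thesis may differ. With attribute range [a, b] and k strata, boundary h is

```latex
\[
b_h \;=\; a\left(\frac{b}{a}\right)^{h/k}, \qquad h = 1,\ldots,k-1
\]
```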

Dalenius and Hodges
Assumptions
– Strong correlation
– The auxiliary attribute is uniformly distributed within each stratum
The breaks are then placed at the resulting positions
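
The slide's formula was lost in extraction; the Dalenius and Hodges procedure is usually stated as the cumulative square-root-of-frequency (cum sqrt(f)) rule, which this slide presumably shows. Divide the auxiliary attribute's range into many narrow classes with frequencies f_j, accumulate their square roots, and cut where the cumulative total reaches equally spaced levels:

```latex
\[
F_j \;=\; \sum_{i \le j} \sqrt{f_i},
\qquad
\text{boundaries where } F_j \approx \frac{h}{k}\,F_{\max},\ \ h=1,\ldots,k-1
\]
```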