Query Processing over Incomplete Autonomous Databases Presented By Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati.

Query Processing over Incomplete Autonomous Databases
Presented by Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati (Arizona State University)
2008-02-04
Summarized by Sungchan Park

Copyright 2008 by CEBT
Center for E-Business Technology

Introduction
- More and more data is becoming accessible via web servers that are supported by backend autonomous databases, e.g. Cars.com, Realtor.com, Google Base, etc.
- Figure labels: Mediator; Autonomous Database (x3)

Web DBs are Incomplete!
- Incomplete Entry
- Inaccurate Extraction
- Heterogeneous Schemas
- User-Defined Schemas

Problem
- Current autonomous database systems return only certain answers, namely those that exactly satisfy all of the user's query constraints.
- Although there has been work on handling incompleteness in databases, much of it has focused on single databases over which the query processor has complete control.
  - These approaches modify the database directly by replacing null values with likely values.
  - This is not applicable to autonomous databases.

Possible Naïve Approaches
Query Q: (Body Style = Convt)
- CertainOnly: return only the certain answers.
  - Low recall
- AllReturned: return all answers having Body Style = Convt or Body Style = Null.
  - Low precision, infeasible
- AllRanked: return all answers having Body Style = Convt. Additionally, predict the missing values of all answers whose Body Style is null, rank those answers, and return them to the user.
  - Costly, infeasible
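The difference between the first two approaches can be seen on a toy relation. The sketch below is only an illustration: the rows, the attribute names, and the helper functions `certain_only` and `all_returned` are made up here and do not come from the slides. `None` stands for a null value.

```python
# Toy relation; None stands for a missing (null) value. All rows are invented.
cars = [
    {"make": "Audi", "model": "A4", "body_style": "Convt"},
    {"make": "BMW", "model": "Z4", "body_style": None},
    {"make": "Honda", "model": "Civic", "body_style": None},
    {"make": "Honda", "model": "Accord", "body_style": "Sedan"},
]


def certain_only(rows, attr, value):
    """Return only tuples that exactly satisfy attr = value."""
    return [r for r in rows if r[attr] == value]


def all_returned(rows, attr, value):
    """Return tuples that satisfy attr = value or have a null on attr."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


certain = certain_only(cars, "body_style", "Convt")
everything = all_returned(cars, "body_style", "Convt")

# certain_only misses the Z4 (a plausible convertible): low recall.
# all_returned also brings back the Civic (probably not one): low precision.
print([r["model"] for r in certain])
print([r["model"] for r in everything])
```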

QPIAD
- Solves the problem by generating rewritten queries according to a set of mined attribute correlation rules.
  - Approximate Functional Dependency (AFD)
  - Naïve Bayesian Classifier

QPIAD Solution

QPIAD Architecture

Overall Process
1. Learn
2. Rewrite
3. Rank
4. Explain

#1. Learn - AFD
- Learn attribute correlations
  - Approximate Functional Dependencies (AFDs)
  - Approximate Keys (AKeys), used for pruning
  - Learned with the TANE algorithm: Y. Huhtala, et al. Efficient discovery of functional and approximate dependencies using partitions. 1998.
- Pruning example
  - AFD {A1, A2} ~> A3
  - AKey {A1}
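To make "approximate" concrete, here is a minimal sketch of how the confidence of a candidate AFD could be scored. It assumes the g3 error measure (the smallest fraction of tuples that must be removed for the dependency to hold exactly) and reports confidence as 1 - g3. The function name `afd_confidence` and the sample rows are hypothetical, and this is not the TANE search itself, only the scoring of one candidate.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Return 1 - g3 for the candidate dependency lhs ~> rhs.

    For each distinct value combination of the determining set (lhs), keep
    the tuples that carry the most common rhs value; everything else would
    have to be removed for the dependency to hold exactly.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return kept / len(rows)


# Invented sample: one Honda coupe out of three breaks {make, body} ~> model.
rows = [
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Civic"},
    {"make": "BMW", "body": "Convt", "model": "Z4"},
]
confidence = afd_confidence(rows, ["make", "body"], "model")
print(confidence)
```

In the pruning example above, an AFD whose determining set contains an approximate key such as {A1} would score high under this measure simply because almost every group holds a single tuple, which is presumably why such AFDs are pruned.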

#1. Learn - Naïve Bayesian Classifier
- Learn value distributions with a Naïve Bayesian Classifier (NBC)
  - Use the mined AFDs for feature selection
  - E.g.
    - AFD {Make, Body} ~> Model
    - P(Model = Accord | Make = Honda, Body = Coupe) = ?
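The probability in the example can be estimated with a small count-based Naïve Bayes model whose features are the determining set of the AFD. The sketch below is an assumption-laden illustration: the training rows are invented, the helper names `train_nbc` and `predict_distribution` are hypothetical, and add-one (Laplace) smoothing is a choice made here that the slides do not mention.

```python
from collections import Counter, defaultdict


def train_nbc(rows, features, target):
    """Collect the counts a Naive Bayes classifier needs."""
    class_counts = Counter(r[target] for r in rows)
    feature_counts = defaultdict(Counter)  # (feature, class) -> value counts
    values = defaultdict(set)              # feature -> distinct values seen
    for r in rows:
        for f in features:
            feature_counts[(f, r[target])][r[f]] += 1
            values[f].add(r[f])
    return class_counts, feature_counts, values


def predict_distribution(model, features, evidence):
    """Return P(target = c | evidence) for every class c, normalized."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for f in features:
            # Add-one smoothing (an assumption of this sketch).
            count = feature_counts[(f, c)][evidence[f]]
            score *= (count + 1) / (n_c + len(values[f]))
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


# Features come from the AFD {Make, Body} ~> Model; rows are invented.
rows = [
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Sedan", "model": "Civic"},
    {"make": "Honda", "body": "Coupe", "model": "Civic"},
    {"make": "BMW", "body": "Convt", "model": "Z4"},
]
features = ["make", "body"]
model = train_nbc(rows, features, "model")
dist = predict_distribution(model, features, {"make": "Honda", "body": "Coupe"})
print(dist)
```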

#1. Learn - Selectivity
- Selectivity estimate: SmplSel(Q) * SmplRatio(R) * PerInc(R)
  - SmplSel(Q) = selectivity of the rewritten query issued on the sample
  - SmplRatio(R) = ratio of the original database size to the sample size
  - PerInc(R) = percentage of incomplete tuples observed while creating the sample
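A worked instance of the product may help. The numbers below are invented, and the sketch assumes SmplSel(Q) is counted as the number of sample tuples matching the rewritten query and PerInc(R) is expressed as a fraction; the slides do not fix either convention.

```python
def estimated_selectivity(smpl_sel, smpl_ratio, per_inc):
    """Scale the sample selectivity up to the full database, then keep
    only the share expected to be incomplete tuples."""
    return smpl_sel * smpl_ratio * per_inc


# Invented numbers: 20 matching tuples in the sample, a database 50 times
# larger than the sample, and 10% of tuples incomplete.
est = estimated_selectivity(smpl_sel=20, smpl_ratio=50, per_inc=0.10)
print(round(est))
```

Under these assumptions the rewritten query is expected to bring back on the order of 100 incomplete tuples from the full database.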

#2. Rewrite
1. Get the base result (certain answers)
2. Generate rewritten queries from the base result and the learned AFDs
Figure label: Rewritten Queries
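The two rewriting steps can be sketched as follows. This is a simplified illustration, not the system's implementation: it assumes a learned AFD Model ~> Body Style for the query Body Style = Convt, the base-result rows are invented, and the function name `generate_rewritten_queries` is hypothetical. A value of `None` in a generated query stands for "this attribute is null".

```python
def generate_rewritten_queries(base_result, determining_set, constrained_attr):
    """Build one rewritten query per distinct determining-set value
    combination found in the base result.

    Each query constrains the determining-set attributes and asks for
    tuples whose constrained attribute is missing (None).
    """
    seen = []
    for row in base_result:
        combo = tuple((a, row[a]) for a in determining_set)
        if combo not in seen:  # keep first-seen order, drop duplicates
            seen.append(combo)
    return [{**dict(combo), constrained_attr: None} for combo in seen]


# Step 1: certain answers to Q: body_style = Convt (invented rows).
base_result = [
    {"make": "Audi", "model": "A4", "body_style": "Convt"},
    {"make": "BMW", "model": "Z4", "body_style": "Convt"},
    {"make": "Audi", "model": "A4", "body_style": "Convt"},
]

# Step 2: rewrite using the assumed AFD {model} ~> body_style.
queries = generate_rewritten_queries(base_result, ["model"], "body_style")
print(queries)
```

Each generated query targets tuples that share a model with a known convertible but have no recorded body style, which are the uncertain answers most likely to be relevant.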

#3. Rank
1. Select the top-k queries based on F-measure
2. Reorder the selected queries based on P
3. Retrieve tuples
Here P = learned probability and R = selectivity.
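The selection and reordering steps can be sketched like this. Several details are assumptions of the sketch rather than statements from the slides: the weighted form F = (1 + alpha) * P * R / (alpha * P + R), the normalization of each query's selectivity by the total over all rewritten queries to obtain R, and the names `f_measure` and `rank_queries`. The query records are invented.

```python
def f_measure(p, r, alpha=1.0):
    """Weighted F-measure over precision p and recall r."""
    denom = alpha * p + r
    if denom == 0:
        return 0.0
    return (1 + alpha) * p * r / denom


def rank_queries(queries, k, alpha=1.0):
    """Pick the top-k rewritten queries by F-measure, then order the
    selected queries by their learned probability P (highest first)."""
    total = sum(q["sel"] for q in queries)
    scored = []
    for q in queries:
        r = q["sel"] / total if total else 0.0  # assumed recall estimate
        scored.append((f_measure(q["p"], r, alpha), q))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_k = [q for _, q in scored[:k]]
    return sorted(top_k, key=lambda q: q["p"], reverse=True)


# Invented rewritten queries: p = learned probability, sel = est. selectivity.
queries = [
    {"name": "A4", "p": 0.9, "sel": 10},
    {"name": "Z4", "p": 0.6, "sel": 60},
    {"name": "Civic", "p": 0.1, "sel": 30},
]
order = [q["name"] for q in rank_queries(queries, k=2)]
print(order)
```

With these numbers the Z4 query has the best F-measure and the A4 query the second best, so both are selected; the A4 query is then issued first because its answers are more likely to be relevant.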

#4. Explain

Other Issues: Correlated Source

Other Issues: Handling Aggregation

Empirical Evaluation: Quality
- QPIAD vs. AllReturned
  - AllReturned has low precision because not all tuples with missing values on the constrained attributes are relevant to the query.
  - QPIAD has much higher precision than AllReturned because it aims to retrieve only those tuples with missing values on the constrained attributes that are very likely to be relevant to the query.

Empirical Evaluation: Efficiency
- QPIAD vs. AllRanked
  - The AllRanked approach is often infeasible because direct retrieval of null values is often not allowed.
  - QPIAD achieves the same level of recall as AllRanked while requiring far fewer tuples to be retrieved.

Empirical Evaluation: Robustness
- Robustness w.r.t. sample size
  - QPIAD is robust even when faced with a relatively small data sample.

Empirical Evaluation: Extensions
- Aggregates
  - Prediction of missing values increases the fraction of queries that achieve higher levels of accuracy.
  - Approximately 20% more queries achieve 100% accuracy when prediction is used.
- Join
  - As alpha is increased, recall improves without sacrificing much precision.

Related Work
- Querying Incomplete Databases
  - Possible-world approaches: track the completions of incomplete tuples (Codd Tables, V-Tables, Conditional Tables)
  - Probabilistic approaches: quantify the distribution over completions to distinguish between the likelihoods of various possible answers
- Probabilistic Databases
  - Each tuple is associated with an attribute describing the probability of its existence
  - However, in our work, the mediator does not have the capability to modify the underlying autonomous databases
- Query Reformulation / Relaxation
  - Aims to return similar or approximate answers to the user after returning, or in the absence of, exact answers
  - Our focus is on retrieving tuples with missing values on constrained attributes
- Learning Missing Values
  - Common imputation approaches replace missing values by substituting the mean, most common value, or default value, or by using kNN, association rules, etc.
  - Our work requires schema-level dependencies between attributes as well as distribution information over missing values

Contribution
- Efficiently retrieve relevant uncertain answers from autonomous sources given only limited query access patterns
  - Query Rewriting
- Retrieve answers with missing values on constrained attributes without modifying the underlying databases
  - AFD-Enhanced Classifiers
- Rewriting and ranking consider the natural tension between precision and recall
  - F-Measure based ranking
- AFDs play a major role in:
  - Query Rewriting
  - Feature Selection
  - Explanations
