Query Processing over Incomplete Autonomous Web Databases MS Thesis Defense by Hemal Khatri Committee Members: Prof. Subbarao Kambhampati (chair) Prof.

Presentation transcript:

Query Processing over Incomplete Autonomous Web Databases MS Thesis Defense by Hemal Khatri Committee Members: Prof. Subbarao Kambhampati (chair) Prof. Chitta Baral Prof. Yi Chen Prof. Huan Liu

Introduction to Web databases
• Many websites let users pose queries through a form-based interface and are supported by backend databases
• Consider used-car sales websites such as Cars.com, Yahoo! Autos, etc.

Incompleteness in Web databases
• Web databases are often populated by lay individuals without any curation, e.g. Cars.com, Yahoo! Autos
• Web databases are also populated using automated information extraction techniques, which are inherently imperfect
• The local schema of a data source may not support certain attributes supported by the global schema
• Incomplete/Uncertain tuple: a tuple in which one or more attributes have a missing value

Website          # of attributes   # of tuples   Incomplete tuples   Body style   Engine
autotrader.com   …                 …             …%                  3.6%         8.1%
carsdirect.com   …                 …             …%                  55.7%        55.8%

Problem Statement
• Many entities corresponding to tuples with missing values might be relevant to the user query
• Current query processing techniques return answers that exactly satisfy the user query
  – Such techniques return results with high precision but low recall
• Relevant Uncertain tuple: a tuple which does not exactly satisfy the query predicates, but the entity represented by that tuple might be relevant to the query
• How to support query processing over incomplete autonomous databases in order to retrieve ranked uncertain results?
Example: Q: Make=Honda, with the tuple (Make=null, Model=Accord, Year=2003, Body style=sedan)

Challenges Involved
• How to predict missing values in autonomous databases?
• As autonomous databases are accessible only through form-based interfaces, how to retrieve relevant uncertain answers?
  – How to keep query processing cost manageable in retrieving uncertain tuples?
• How to rank the retrieved uncertain answers?

Related Work
• Probabilistic databases
  – Incomplete databases are similar to probabilistic databases once we assess the probabilities for missing values
  – TRIO: uncertainty with lineage
  – ConQuer: handling inconsistency over databases
  – These systems assume probability distributions are given for uncertain or inconsistent attributes
  – We assess the probability distribution for a missing attribute and use it to rank rewritten queries to retrieve relevant answers, since the probabilities cannot be stored in the databases
  – Our query rewriting framework is general and can be used by these systems if the databases are autonomous
• Handling missing values
  – EM algorithm, Bayes nets, association rules

Possible Approaches
For a query Q: body style=convt
1. Certain Answers Only (CAO): return certain answers only, as in traditional databases. Drawback: low recall
2. All Uncertain Answers (AUA): null matches any concrete value, hence return all answers having body style=convt along with answers having body style as null. Drawback: low precision, infeasible
3. Relevant Uncertain Answers (RUA): rank answers by predicting the values of the missing attribute. Drawback: costly, infeasible
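The three approaches can be contrasted on a toy relation. The following is a minimal sketch, not the system's implementation: the rows, the function names `cao`, `aua`, `rua`, and the `PREDICTED` table are hypothetical, and `None` stands in for a null value.

```python
# Toy relation; None stands for a null value. Rows are illustrative only.
CARS = [
    {"model": "a4", "body_style": "convt"},
    {"model": "boxster", "body_style": None},
    {"model": "civic", "body_style": "sedan"},
    {"model": "z4", "body_style": None},
]

# Hypothetical predicted probabilities P(body_style=convt | model).
PREDICTED = {"z4": 1.0, "boxster": 0.7}


def cao(rows, attr, value):
    """Certain Answers Only: keep tuples that exactly satisfy the predicate."""
    return [r for r in rows if r[attr] == value]


def aua(rows, attr, value):
    """All Uncertain Answers: a null matches any value, so nulls come back too."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def rua(rows, attr, value, score):
    """Relevant Uncertain Answers: certain answers first, then null tuples
    ordered by the predicted probability that they satisfy the predicate."""
    certain = cao(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=score, reverse=True)
    return certain + uncertain


cao_models = [r["model"] for r in cao(CARS, "body_style", "convt")]
aua_models = [r["model"] for r in aua(CARS, "body_style", "convt")]
rua_models = [
    r["model"]
    for r in rua(CARS, "body_style", "convt",
                 lambda r: PREDICTED.get(r["model"], 0.0))
]
```

Note that `aua` and `rua` both need every null tuple from the source, which is what makes them costly or infeasible against a form-based interface.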

Outline
• Introduction
• QPIAD: Query Processing over Incomplete Autonomous Databases
• Data Integration over Incomplete Autonomous Databases
• Other Contributions
• Conclusion

QPIAD System Architecture

RRUA: Generating Rewritten Queries
• The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only relevant incomplete tuples, instead of retrieving all tuples as in AUA and RUA
Consider a query Q: Body style=convt

Base Result Set RS(Q):
Make      Model     Year   Price   Body style
Audi      a4        …      …       convt
BMW       z4        …      …       convt
Porsche   boxster   …      …       convt
…

Rewritten queries are based on the determining attribute set (dtrSet) from the AFD for Body style: Model ~~> Body style (confidence 0.9)
Q1: model='a4'
Q2: model='z4'
Q3: model='boxster'
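The rewriting step can be sketched as a projection of the base result set onto the determining set, with one rewritten query per distinct projected value. This is an illustrative sketch under assumptions: the function name `rewritten_queries`, the dictionary row format, and the sample rows are hypothetical, and queries are represented simply as attribute/value bindings.

```python
def rewritten_queries(base_result_set, determining_set):
    """Project RS(Q) onto the determining set and emit one selection
    query (a dict of attribute bindings) per distinct projected tuple."""
    seen = []
    for row in base_result_set:
        bindings = tuple((attr, row[attr]) for attr in determining_set)
        # A null in the determining set cannot be used as a binding.
        if any(value is None for _, value in bindings):
            continue
        if bindings not in seen:
            seen.append(bindings)
    return [dict(b) for b in seen]


# Base result set for Q: body_style=convt (illustrative rows).
RS_Q = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "body_style": "convt"},
]

queries = rewritten_queries(RS_Q, ["model"])
```

Each resulting binding would be issued to the source as a separate form query; the tuples it returns with a null body style are the candidate uncertain answers.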

Learning Attribute Correlations
• AFD: VIN ~~> Model, where VIN is an Approximate Key (AKey) with high confidence
• VIN is not useful for query rewriting or feature selection, since it cannot retrieve additional new tuples
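The slide does not give the confidence formulas, so the sketch below assumes one common definition: the confidence of an AFD X ~~> A is the largest fraction of tuples that can be kept so that X functionally determines A, and the confidence of an approximate key is the fraction of tuples with distinct values on the key attributes. The function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples kept if, within each group of equal lhs values,
    only the tuples carrying the most frequent rhs value are retained."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)


def akey_confidence(rows, attrs):
    """Fraction of tuples with a distinct value combination on attrs."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)


SAMPLE = [
    {"vin": "v1", "model": "a4", "body_style": "convt"},
    {"vin": "v2", "model": "a4", "body_style": "sedan"},
    {"vin": "v3", "model": "a4", "body_style": "sedan"},
    {"vin": "v4", "model": "z4", "body_style": "convt"},
]

conf_model = afd_confidence(SAMPLE, ["model"], "body_style")  # 3 of 4 kept
conf_vin = afd_confidence(SAMPLE, ["vin"], "model")           # trivially 1.0
key_vin = akey_confidence(SAMPLE, ["vin"])                    # every VIN unique
```

On this sample, VIN ~~> Model has perfect confidence only because VIN is a key: a rewritten query binding a VIN seen in the base set can only return the tuple already retrieved, which is why such AFDs are pruned.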

RRUA: Ranking Rewritten Queries
• Not all queries are equally good at retrieving relevant answers
  – “z4” model cars are more likely to be convertibles than “a4” model cars
• When database or network resources are limited, the mediator can choose to issue only the top K queries to get the most relevant uncertain answers

Learning Value Distributions
• Used to rank queries based on the determining set of attributes from the AFD for the query attribute
• We use a Naïve Bayes classifier with m-estimates, with the AFD as a feature selection step
• Rank of a rewritten query: R(Q_i) = P(A_m = v_m | t_i), where t_i ∈ Π_{dtrSet(A_m)}(RS(Q))
  – Q1: model='a4', R(Q1) = P(bodystyle=convt | model=a4) = 0.4
  – Q2: model='z4', R(Q2) = P(bodystyle=convt | model=z4) = 1.0
  – Q3: model='boxster', R(Q3) = P(bodystyle=convt | model=boxster) = 0.7
  – R(Q2) > R(Q3) > R(Q1)
• Relevant uncertain answers are ranked based on the rank of the rewritten query that retrieved them
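A small sketch of ranking rewritten queries with a naive Bayes estimate smoothed by m-estimates follows. It is written under assumptions: the training rows are invented, `m` and the uniform prior `1/|domain|` are conventional choices rather than values taken from the thesis, and the probabilities it produces differ from the 0.4, 1.0 and 0.7 quoted on the slide.

```python
from collections import Counter


def m_estimate(count, total, prior, m=1.0):
    """Smoothed conditional probability: (count + m * prior) / (total + m)."""
    return (count + m * prior) / (total + m)


def posterior(rows, target, features, evidence, m=1.0):
    """Return {value: P(target=value | evidence)} under naive Bayes,
    learned from the rows whose target attribute is not null."""
    rows = [r for r in rows if r[target] is not None]
    class_counts = Counter(r[target] for r in rows)
    domains = {f: {r[f] for r in rows} for f in features}
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(rows)  # class prior
        for f in features:
            match = sum(1 for r in rows
                        if r[target] == cls and r[f] == evidence[f])
            score *= m_estimate(match, n_cls, 1.0 / len(domains[f]), m)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# Illustrative training sample (not data from the thesis).
TRAIN = (
    [{"model": "a4", "body_style": "convt"}] * 2
    + [{"model": "a4", "body_style": "sedan"}] * 3
    + [{"model": "z4", "body_style": "convt"}] * 4
    + [{"model": "boxster", "body_style": "convt"}] * 3
    + [{"model": "boxster", "body_style": "coupe"}]
)

rank = {
    model: posterior(TRAIN, "body_style", ["model"], {"model": model})["convt"]
    for model in ("a4", "z4", "boxster")
}
ranked = sorted(rank, key=rank.get, reverse=True)
```

With this sample the ordering is z4, boxster, a4, the same ordering as R(Q2) > R(Q3) > R(Q1) on the slide; a mediator limited to K queries would issue the first K entries of `ranked`.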

Combining AFDs and Classifiers
• More than one AFD may exist for some attributes
• Experimented with several approaches:
  – Only the best AFD, having the highest confidence
  – All attributes, ignoring AFDs
  – Hybrid One-AFD
  – Ensemble of classifiers

Empirical Evaluation of QPIAD
• Test databases: AutoTrader database containing 100K tuples and Census database from the UCI Repository containing 50K tuples
• Oracular study: to evaluate the effectiveness of our system against a ground truth, we artificially insert missing values into 10% of the tuples within these databases
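The oracular setup can be sketched as follows: hide an attribute in a fixed fraction of tuples, keep the hidden values as ground truth, and score a ranked list of retrieved uncertain tuples against it. The helper names `mask_values` and `precision_at_k`, the seed, and the synthetic rows are assumptions made for illustration.

```python
import random


def mask_values(rows, attr, fraction=0.10, seed=0):
    """Null out attr in a random fraction of rows; return the masked copy
    and a {row index: original value} ground-truth map."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    hidden = set(rng.sample(range(len(rows)), k))
    masked = [dict(r, **{attr: None}) if i in hidden else dict(r)
              for i, r in enumerate(rows)]
    truth = {i: rows[i][attr] for i in hidden}
    return masked, truth


def precision_at_k(ranked_ids, truth, value, k):
    """Fraction of the top-k retrieved uncertain tuples whose hidden
    value really equals the queried value."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if truth.get(i) == value) / len(top)


# Synthetic table of 100 rows (illustrative only).
ROWS = [{"id": i, "body_style": "convt" if i % 2 == 0 else "sedan"}
        for i in range(100)]
masked_rows, ground_truth = mask_values(ROWS, "body_style")
```

Because the original values are retained, precision and recall of the uncertain answers can be computed exactly, which is not possible on a live source where the missing values are genuinely unknown.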

RRUA vs AUA vs RUA

Precision over Top K Tuples

Ranking the Rewritten Queries (figure panels: Cars database, Census database)

Robustness of QPIAD

User Relevance Issues with QPIAD
• When the query processor presents incomplete tuples, it becomes a recommender system
• How to convince users to trust the system's results?
For a query Q: year=2000

Make    Model   Year   Price   Mileage
Honda   Civic   null   15000   18000

Explanation: We have determined that this car's year is 60% likely to be 2000, based on price=15000 and mileage=18000

Outline
• Introduction
• QPIAD: Query Processing over Incomplete Autonomous Databases
• Data Integration over Incomplete Autonomous Databases
• Other Contributions
• Conclusion

Leveraging Correlations between Data Sources
Mediator: GS(Make, Model, Year, Price, Mileage, Body style)
Q: Body style=coupe

Correlated Source and Maximum Correlated Source
• Consider four sources with schemas:
  – S1(Make, Model, Year, Price)
  – S2(Engine, Drive, Body style), AFD: {Engine, Drive} -> Body style, confidence 0.7
  – S3(Make, Model, Body style), AFD: Model -> Body style, confidence 0.8
  – S4(Make, Price, Body style), AFD: {Make, Price} -> Body style, confidence 0.6
  – Mediator global schema GS(Make, Model, Year, Price, Body style, Engine, Drive)
• S3 and S4 are correlated sources for S1 on the Body style attribute, since the determining sets of their AFDs (Model; Make and Price) are supported by S1, whereas S2's determining set (Engine, Drive) is not
• S3 is the maximum correlated source for S1 on the Body style attribute, since its AFD has the highest confidence
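The selection of a maximum correlated source in this example can be sketched in a few lines. The `SourceAFD` record and the two function names are hypothetical; the schemas and confidences are the ones listed for S1 to S4, and the rule used (the determining set must be contained in the target schema, then take the highest confidence) is an assumption read off the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceAFD:
    source: str
    determining_set: frozenset
    dependent: str
    confidence: float


def correlated_sources(target_schema, attr, afds):
    """Sources holding an AFD for attr whose determining set can be
    bound on the target source (i.e. is a subset of its schema)."""
    return [a for a in afds
            if a.dependent == attr and a.determining_set <= target_schema]


def maximum_correlated_source(target_schema, attr, afds):
    """Correlated source whose AFD has the highest confidence, or None."""
    candidates = correlated_sources(target_schema, attr, afds)
    return max(candidates, key=lambda a: a.confidence, default=None)


S1_SCHEMA = frozenset({"make", "model", "year", "price"})
AFDS = [
    SourceAFD("S2", frozenset({"engine", "drive"}), "body_style", 0.7),
    SourceAFD("S3", frozenset({"model"}), "body_style", 0.8),
    SourceAFD("S4", frozenset({"make", "price"}), "body_style", 0.6),
]

correlated = [a.source for a in correlated_sources(S1_SCHEMA, "body_style", AFDS)]
best = maximum_correlated_source(S1_SCHEMA, "body_style", AFDS)
```

The AFD and value distributions learned from the selected source are then used to rewrite and rank queries against S1, which cannot be queried on Body style directly.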

Retrieving Relevant Uncertain Answers from CarsDirect.com
Consider a query Q: body style=coupe (GS)
• Cars.com has an AFD: Model ~~> Body style (0.9)
• Cars.com is the maximum correlated source for CarsDirect.com, which doesn't support Body style but supports the Model attribute

Make    Model     Year   Price   Body style
Honda   Accord    …      …       coupe
Ford    Mustang   …      …       coupe
Acura   Legend    …      …       coupe
BMW     325       …      …       coupe

Q1: model=Accord
Q2: model=Mustang
Q3: model=Legend
Q4: model=325

Empirical Evaluation of Using Correlation between Data Sources
• We consider a mediator performing data integration over three sources: Cars.com, Yahoo! Autos and CarsDirect.com
• Yahoo! Autos and CarsDirect.com do not allow querying on body style, but once tuples are retrieved we can check their body style attribute to determine whether they have the body style specified in the query
• Evaluation uses attribute correlations and value distributions learned from Cars.com, for 5 test queries on the body style attribute

Retrieving Relevant Answers using Correlations from Cars.com

Handling Joins over Incomplete Autonomous Databases
• Mediator performing data integration across two sources:
  – Source S1 is incomplete
  – Source S2 is complete

Source   Local Schema
S1       Cars(Make, Model, Year, Price)
S2       Review(Model, Ratings)

Mediator view:
UsedCars(Make, Model, Year, Price, Ratings) :- Cars(Make, Model, Year, Price), Review(Model, Ratings)

Issues in Handling Joins
• Performing joins over probabilistic databases will lead to a disjunction in join results
• Consider joining uncertain tuples from the two sources:

Make    Model                             Year   Price
Honda   null [0.6 Civic] [0.4 Accord]     …      …

Model    Ratings
Civic    5
Accord   4

Join result (a disjunction, hence the need for approximation):
Make    Model    Year   Price   Ratings
Honda   Civic    …      …       …
or
Honda   Accord   …      …       …

Handling Join Queries
• Q: σ Make=Honda (UsedCars)
• Assume AFDs: {Make, Year} ~~> Model, Model ~~> Make

Source S1:
Make     Model (FK)   Year   Price
Honda    Odyssey      …      …
Honda    Accord       …      …
Honda    null         …      …
null     Accord       …      …
Toyota   Camry        …      …

Source S2:
Model (PK)   Ratings
Civic        5
Corolla      4
Accord       4
Altima       3
Camry        5
Odyssey      3

Tuples retrieved from S1 (Make, Model): (Honda, Odyssey), (Honda, Accord), (null, Accord), (Honda, null), where the null Model is predicted as Civic with probability 0.6 or Accord with probability 0.4
Q1: Model=Odyssey, R(Q1)=1
Q2: Model=Accord, R(Q2)=1

Queries on source S2 to join:
Q3: Model=Odyssey, R(Q3)=1
Q4: Model=Accord, R(Q4)=1
Q5: Model=Civic, R(Q5)=0.6
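The queries issued on S2 in this example can be derived from the S1 tuples as sketched below. This is an illustration under assumptions: the function `join_queries` is hypothetical, the predicted distribution for the null Model is passed in as a plain dictionary, and taking the maximum rank when a value is both certain and predicted is a guess that happens to reproduce R(Q4)=1 for Accord.

```python
def join_queries(s1_tuples, join_attr, predicted):
    """Return (join value, rank) pairs for querying the complete source.
    A certain join value gets rank 1.0; a null is expanded into its
    predicted values with their probabilities. When a value arises more
    than once, the highest rank is kept (an assumption of this sketch)."""
    ranks = {}
    for t in s1_tuples:
        value = t[join_attr]
        candidates = {value: 1.0} if value is not None else predicted
        for v, p in candidates.items():
            ranks[v] = max(ranks.get(v, 0.0), p)
    # sorted() is stable, so ties keep their first-seen order.
    return sorted(ranks.items(), key=lambda kv: -kv[1])


# Tuples retrieved from S1 for Make=Honda (Year and Price omitted).
S1_TUPLES = [
    {"make": "Honda", "model": "Odyssey"},
    {"make": "Honda", "model": "Accord"},
    {"make": None, "model": "Accord"},
    {"make": "Honda", "model": None},
]

# Predicted Model distribution for the tuple with a null Model.
PREDICTED_MODEL = {"Civic": 0.6, "Accord": 0.4}

s2_queries = join_queries(S1_TUPLES, "model", PREDICTED_MODEL)
```

Issuing one selection per join value avoids materializing the disjunctive join result; each joined answer inherits the rank of the query that produced it.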

Experimental Results Joins

Outline
• Introduction
• QPIAD: Query Processing over Incomplete Autonomous Databases
• Data Integration over Incomplete Autonomous Databases
• Other Contributions
• Conclusion

QUIC: Querying under Imprecision and Incompleteness
Consider a query Q: model like Civic (Cars)
• User might be interested in similar cars like “Accord”, “Camry”, etc.
• Ranking results in the presence of both similar and incomplete tuples

Id   Make     Model     Year   Body style
1    Honda    Civic     2000   Sedan
2    Honda    Accord    2004   Coupe
3    Toyota   Camry     2001   Sedan
4    Honda    null      2004   Coupe
5    Honda    null      2000   Sedan
6    Honda    Civic     2004   Coupe
7    BMW      3series   2001   convt
8    Toyota   null      1999   sedan

Other Contributions [*Collaboration with Garrett Wolf]
• Handling multi-attribute selection queries for incomplete databases*
• QUIC system for query processing under imprecision and incompleteness
• Online learning of value distributions based on the base result set, to avoid sample biases

Conclusion
• The thesis proposed a framework for query processing over incomplete autonomous web databases:
  – QPIAD: query processing over incomplete autonomous databases
  – QPIAD: data integration over multiple incomplete data sources
• Results of empirical evaluation on real-world databases show that our system returns relevant answers with high precision while keeping the query processing cost manageable

Thank You!! Questions??