Probabilistic Data Management

Probabilistic Data Management Chapter 11: Data Quality in Probabilistic Databases (1)

Objectives
In this chapter, you will:
- Learn how to identify inconsistencies in probabilistic databases
- Discover how to clean uncertain data with quality guarantees
- Become familiar with techniques for resolving inconsistencies in probabilistic databases under possible worlds semantics
- Know how to obtain consistent query answers from inconsistent probabilistic databases

Cleaning Uncertain Data with Quality Guarantees Cheng et al., Very Large Data Bases (VLDB), 2008

Background
- In many emerging applications, data are inherently uncertain
  - Sensor data such as temperature, humidity, and wind speed
  - Location values collected by GPS
  - In biometric databases, the attributes of feature vectors are not exact
- The ambiguity of query answers over an uncertain database gives rise to the notion of query quality, i.e., how good a query answer is

Related Work
- Probabilistic database
  - x-tuple, e.g., a
  - alternative, e.g., a1, a2
- Query types
  - Range query
  - MAX query
- Data cleaning in probabilistic databases
[Figure labels: probabilistic database; possible worlds of the probabilistic database; results of MAX query; partially cleaned database]

Problem Definition - Framework
- Upon a user's query request, provide query answers
- With the query answers, evaluate the query quality and derive a set, X, of optimal x-tuples to be cleaned within a budget constraint C

Problem Definition – Framework (cont'd)
- Quality evaluation
  - The number of possible worlds is exponential
  - Evaluate quality over the final query answer instead of over the possible worlds

Definition of Data Quality
- Possible World Semantics-Quality (PWS-quality): essentially entropy
- Let {r1, r2, …, rd} be the set of d distinct PW-results
- Let qj be the probability that rj is the actual answer
- The PWS-quality S(D, Q) of a query Q in probabilistic database D is defined as the negated entropy of the PW-result distribution:
  S(D, Q) = sum over j = 1..d of qj log2 qj
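As a rough illustration of the definition above, the sketch below computes an entropy-style quality score from the PW-result probabilities qj. It assumes the score is the sum of qj log2 qj (so 0 means a certain answer and more negative values mean more ambiguity), which is what "essentially entropy" and Y(x) = x log2 x suggest. The function name `pws_quality` is hypothetical.

```python
import math

def pws_quality(result_probs):
    """Assumed form of the PWS-quality: sum_j q_j * log2(q_j)."""
    total = sum(result_probs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("PW-result probabilities must sum to 1")
    # Terms with q_j = 0 contribute nothing (limit of x * log2(x) as x -> 0).
    return sum(q * math.log2(q) for q in result_probs if q > 0)

# One certain PW-result versus four equally likely PW-results.
certain = pws_quality([1.0])
uniform = pws_quality([0.25, 0.25, 0.25, 0.25])
print(certain, uniform)
```

Under this assumed form, the score is highest (zero) when only one PW-result is possible and decreases as the probability mass spreads over more results.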

Computation of PWS-Quality
- Straightforward method: consider every possible world
- The x-form of the PWS-quality
  - Distribute qj into the m x-tuples in the database D
  - PRQ query: expressed with Y(x) = x log2 x, the existence probability of each alternative ti of x-tuple tk, and the qualification probability of ti for Q

Computation of PWS-Quality (cont.)
- The x-form of the PWS-quality (cont.)
  - Distribute qj into the m x-tuples in the database D
  - MAX query

Data Cleaning Algorithm
- To improve the PWS-quality, we only need to consider cleaning x-tuples in the query answer set
- Optimization problem: choose the bits bk to maximize the sum of bk gk, subject to the sum of bk ck being at most C, where
  - gk: -g(k, D, Q)
  - ck: cost of cleaning x-tuple tk
  - C: budget constraint
  - bk: 0 or 1 bit
- This is a variant of the 0/1 knapsack problem
- The time and space complexities are O(CZ) and O(CZ^2), respectively
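The knapsack formulation above can be sketched with a standard dynamic program. This is a minimal illustration, not the paper's algorithm. It assumes integer cleaning costs and that the objective is to maximize the total gain gk of the selected x-tuples within budget C. The function name `choose_xtuples` and the sample numbers are hypothetical.

```python
def choose_xtuples(gains, costs, budget):
    """0/1 knapsack sketch: maximize total gain with total cost <= budget.

    gains[k] and costs[k] play the roles of gk and ck; costs are assumed
    to be non-negative integers.
    """
    # best[c] = (best total gain, chosen indices) using budget at most c
    best = [(0.0, [])] * (budget + 1)
    for k in range(len(gains)):
        new_best = list(best)
        for c in range(costs[k], budget + 1):
            cand_gain = best[c - costs[k]][0] + gains[k]
            if cand_gain > new_best[c][0]:
                new_best[c] = (cand_gain, best[c - costs[k]][1] + [k])
        best = new_best
    return best[budget]

# Hypothetical gains and costs for three x-tuples, with budget C = 4.
total_gain, chosen = choose_xtuples([0.5, 0.4, 0.3], [3, 2, 2], 4)
print(total_gain, chosen)
```

Here cleaning x-tuples 1 and 2 together beats cleaning x-tuple 0 alone, even though x-tuple 0 has the largest individual gain.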

Heuristics for Data Cleaning
- Random
  - Randomly select x-tuples until budget C is exhausted
- MaxQP
  - Compute the qualification probability Pk for each x-tuple tk
  - Choose x-tuples with the highest qualification probabilities to clean until budget C is exhausted
- Greedy
  - Let fk = gk / ck
  - Choose x-tuples with the highest fk to clean until budget C is exhausted
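The three heuristics differ only in the order in which x-tuples are considered, which the sketch below makes explicit. It is an illustration under assumptions: an x-tuple that no longer fits in the remaining budget is skipped rather than ending the selection, and all function names and sample values are hypothetical.

```python
import random

def pick_until_budget(order, costs, budget):
    """Walk x-tuples in the given order, cleaning each one that still fits."""
    chosen, remaining = [], budget
    for k in order:
        if costs[k] <= remaining:
            chosen.append(k)
            remaining -= costs[k]
    return chosen

def random_heuristic(costs, budget, seed=0):
    order = list(range(len(costs)))
    random.Random(seed).shuffle(order)
    return pick_until_budget(order, costs, budget)

def max_qp_heuristic(qual_probs, costs, budget):
    # Highest qualification probability Pk first.
    order = sorted(range(len(costs)), key=lambda k: qual_probs[k], reverse=True)
    return pick_until_budget(order, costs, budget)

def greedy_heuristic(gains, costs, budget):
    # Highest ratio fk = gk / ck first.
    order = sorted(range(len(costs)), key=lambda k: gains[k] / costs[k], reverse=True)
    return pick_until_budget(order, costs, budget)

# Hypothetical inputs for three x-tuples and budget C = 4.
gains, costs, qual_probs = [0.5, 0.4, 0.3], [3, 2, 2], [0.9, 0.5, 0.7]
greedy_pick = greedy_heuristic(gains, costs, 4)
maxqp_pick = max_qp_heuristic(qual_probs, costs, 4)
random_pick = random_heuristic(costs, 4)
print(greedy_pick, maxqp_pick, random_pick)
```

With these numbers, Greedy selects x-tuples 1 and 2, while MaxQP spends most of the budget on x-tuple 0 because it has the highest qualification probability.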

Consistent Query Answers in Inconsistent Probabilistic Databases ACM Conference on the Management of Data (SIGMOD), 2010

Outline
- Background
  - Probabilistic/Uncertain Databases
  - Inconsistent Databases
- Introduction
- Related Work
- Problem Definition
- Probabilistic Consistent Query Answering
- Experimental Results
- Summary

Background: Probabilistic/Uncertain Databases
- Data uncertainty is ubiquitous in many real-world applications
  - Sensor networks
  - Image data
  - Location-based services
  - Moving object search
  - …
  - Privacy preserving for medical data

Probabilistic Databases – Tuple Uncertainty
- Tuple uncertainty in probabilistic databases
  - x-tuples: a, b
  - alternatives: a1, a2, b1
- Possible worlds semantics, e.g., PW1 = {a1, b1}, PW3 = {a1}
- Probabilistic database:
  Product ID | Tuple ID | Price ($) | Prob.
  a | a1 | 120 | 0.5
  a | a2 | 80 | 0.4
  b | b1 | 90 | 0.8
- The 6 possible worlds of the probabilistic database:
  Possible world, PWi | Prob., Pr{PWi}
  PW1 = {a1, b1} | 0.5 × 0.8 = 0.4
  PW2 = {a2, b1} | 0.4 × 0.8 = 0.32
  PW3 = {a1} | 0.5 × (1 - 0.8) = 0.1
  PW4 = {a2} | 0.4 × (1 - 0.8) = 0.08
  PW5 = {b1} | (1 - 0.5 - 0.4) × 0.8 = 0.08
  PW6 = ∅ | (1 - 0.5 - 0.4) × (1 - 0.8) = 0.02
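The possible-worlds table for this slide can be reproduced mechanically. The sketch below enumerates the worlds of the two x-tuples a and b, assuming that alternatives of the same x-tuple are mutually exclusive and that different x-tuples are independent, as the slide's products imply. The data layout and the function name `possible_worlds` are illustrative choices.

```python
from itertools import product

# x-tuple -> list of (alternative, price, probability), taken from the slide.
xtuples = {
    "a": [("a1", 120, 0.5), ("a2", 80, 0.4)],
    "b": [("b1", 90, 0.8)],
}

def possible_worlds(xtuples):
    """Enumerate (set of alternatives, probability) for every possible world."""
    choices = []
    for alts in xtuples.values():
        options = [(name, p) for name, _, p in alts]
        absent = 1.0 - sum(p for _, _, p in alts)
        if absent > 1e-12:
            # The x-tuple may contribute no alternative at all.
            options.append((None, absent))
        choices.append(options)
    worlds = []
    for combo in product(*choices):
        members = frozenset(name for name, _ in combo if name is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((members, prob))
    return worlds

worlds = possible_worlds(xtuples)
for members, prob in worlds:
    print(sorted(members), round(prob, 2))
```

The enumeration yields six worlds whose probabilities sum to 1, matching PW1 through PW6 on the slide.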

Background: Inconsistent Certain Databases
- Inconsistent databases
  - An inconsistent database contains tuples that may violate a number of integrity constraints
    - Key constraint
    - Functional dependency (FD)
[Figure labels: restaurant table; RID (restaurant id); t1 = t2]

Background: Inconsistent Certain Databases (cont'd)
- Inconsistencies occur when:
  - Integrating data from different data sources
  - Collecting inaccurate data from real-world applications
- Solutions in the literature
  - Data repair
  - Consistent query answering
- Topic in this chapter: consistent query answering in inconsistent probabilistic databases

Motivation Example
- Query: retrieve those restaurants with QoS in [4, 5]
[Figure labels: restaurant t1; restaurant t2]

Motivation Example
- Query: retrieve those restaurants with QoS in [4, 5]
- Data sources: Web data, news or rumors, commercial data
- The data sources are combined into integrated data
[Figure labels: restaurant t1; restaurant t2]
Beijing, Summer, 2012

Motivation Example
- Query: retrieve those restaurants with QoS in [4, 5]
- A restaurant database stored as a probabilistic database; the tuples of one restaurant exhibit an inconsistency
- tjr.p: the percentage of people agreeing on the record
- Restaurant table (empty cells: values not preserved in the transcript):
  restaurant | source ID | area code | zip code | location | status | quality of service | tjr.p
  t1 | t11 | 317 | 46201 | A | Open | 5 | 0.2
  t1 | t12 | | | B | Closed | 4 | 0.4
  t1 | t13 | | 46202 | | | 3 | 0.1
  t2 | t21 | | | | | | 0.8

Motivation Example
- Goal: Consistent Query Answers in an Inconsistent Probabilistic Database!
- Restaurant table (empty cells: values not preserved in the transcript):
  RID | SID | AC | Zip | Loc. | status | QoS | tjr.p
  t1 | t11 | 317 | 46201 | A | Open | 5 | 0.2
  t1 | t12 | | | B | Closed | 4 | 0.4
  t1 | t13 | | 46202 | | | 3 | 0.1
  t2 | t21 | | | | | | 0.8
[Figure labels: probabilistic database; inconsistency; functional dependency; possible worlds; inconsistent probabilistic database]

Outline
- Background
- Introduction
- Related Work
- Problem Definition
- Probabilistic Consistent Query Answering
- Experimental Results
- Summary

Introduction
- Data sources
  - Web data crawled from the Internet
  - News or rumors from personal blogs
  - Data exchanged with or bought from corporations
- Inconsistencies in the collected data
  - Input typos
  - Data expirations
  - Subjective comments made by people

Related Work
- Inconsistent (certain) databases
  - An inconsistent database contains tuples that violate integrity constraints
    - Key constraint
    - Functional dependency (FD)
- A repair of the database manipulates tuples at minimal repair cost so that the resulting data become consistent
  - X-repair: tuple deletions only
  - S-repair: tuple insertions and deletions
  - U-repair: attribute value updates

Related Work – X-Repair
- X-repair (under minimal repair semantics)
  - Delete tuples t11 and t12
  - Delete tuples t11 and t13
[Table: the restaurant relation of the motivation example with columns RID, SID, AC, Zip, Loc., status, QoS and tuples t11, t12, t13 (restaurant t1) and t21 (restaurant t2)]

Related Work – Consistent Query Answering (CQA)
[Figure: the inconsistent certain database is repaired into its minimal repairs; the query is answered on each repair, and the intersection of those answers gives the consistent query answers]

CQA in Inconsistent Probabilistic Databases
- Query: retrieve those restaurants with QoS in [4, 5]
- X-repair (2 minimal repairs)
  - Delete {t11, t12}: query answer ∅
  - Delete {t11, t13}: query answer {t12}
- We consider all-possible-repairs semantics rather than minimal repairs
- Ground truth (empty cell: value not preserved in the transcript):
  AC | Zip | Loc.
  317 | 46201 | A
  | 46203 | B
[Table: the restaurant relation of the motivation example with columns RID, SID, AC, Zip, Loc., status, QoS, tjr.p]

All-Possible-Repairs Semantics in Inconsistent Probabilistic Databases
- Query: retrieve those restaurants with QoS in [4, 5]
- The all-possible-repairs semantics
[Figure labels: ground truth (AC, Zip, Loc.); minimal repairs D1R and D2R]
[Table: the restaurant relation of the motivation example with columns RID, SID, AC, Zip, Loc., status, QoS, tjr.p]

Probabilistic Consistent Query Answering
- CQA in the inconsistent probabilistic database
  - Return consistent query answers satisfying the query predicate, PQ, and the consistent score predicate, PS, on all possible repairs
- Query types
  - PC-Range
  - PC-Join
  - PC-Topk

Probabilistic Consistent Query Answering (cont'd)
- Probabilistic consistent range query (PC-Range)
  - Obtain objects satisfying the range predicates and with consistent score, scoreR, greater than αR
- Probabilistic consistent join (PC-Join)
  - Retrieve object pairs from two databases satisfying the join predicates and with consistent score, scoreJ, greater than αJ
- Probabilistic consistent top-k query (PC-Topk)
  - Obtain the k objects with the highest consistent scores, scoreT

Computation of Consistent Scores – PC-Range and PC-Join
- PC-Range (PC-Join)
  - The consistent score of a tuple (pair) indicates the confidence that the tuple (pair) appears in the repaired possible worlds
  - scoreR(tj) = tj.Prpw = Pr{tj ∈ rpw(D)} = tj.p × (1 - Pr{tj is in some rw(D)})
  - Offline pre-computation

Definition of PC-Topk Queries
- Top-k queries in a consistent probabilistic database
  - Given: a probabilistic database D and a preference function f(.)
  - A probabilistic top-k query retrieves k tuples, tjr, such that their scores, w(D, tjr), are the largest, where the score is defined through a weight function
- Reference: J. Li, B. Saha, and A. Deshpande. A unified approach to ranking in probabilistic databases. In PVLDB, 2009.

Definition of PC-Topk Queries (cont'd)
- PC-Topk queries in an inconsistent probabilistic database
  - A PC-Topk query retrieves k tuples, tjr, such that their consistent scores, scoreT(tjr), are the largest, where Wr(DR) is the repair weight

Probabilistic CQA
[Figure: the inconsistent probabilistic database is repaired into all possible repairs; the query is answered on each repair, and the aggregation of those answers gives the probabilistic consistent query answers]

Challenge
- Time complexity
  - Exponential numbers of repaired databases and possible worlds
  - It is inefficient to materialize all possible repaired databases and all possible worlds for each repaired database
[Figure labels: inconsistent probabilistic database; exponential number of repaired databases; exponential number of possible worlds]

Basic Idea of Solutions
- Model inconsistent tuples by inconsistency graphs
- Transform the CQA problem to the one in repaired possible worlds
- Derive the recursive functions for computing consistent scores
- Design pruning techniques to quickly filter out false alarms (PC-Topk)

Inconsistency Graphs
- Inconsistency graph, Ginc
  - Vertex set, V: each vertex, vjr, corresponds to a tuple, tjr, in the database
  - Edge set, E: two vertices have a connecting edge between them iff their corresponding tuples are inconsistent
- A repair of database D is equivalent to deleting vertices in Ginc such that no edges are left
[Table: the restaurant relation of the motivation example with columns RID, SID, AC, Zip, Loc., status, QoS, tjr.p]
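A small sketch can make the graph view of repairs concrete. Everything in it is hypothetical: the tuples u1 to u4, the functional dependency zip -> loc, and the function names are illustrative assumptions, since the slide's own dependency is not given in the text. Two tuples are joined by an edge when they agree on zip but disagree on loc, and a set of deleted vertices is a repair when no edge survives.

```python
from itertools import combinations

# Hypothetical tuples: id -> (zip, loc). Assumed FD for illustration: zip -> loc.
tuples = {
    "u1": ("46201", "A"),
    "u2": ("46201", "B"),
    "u3": ("46202", "C"),
    "u4": ("46201", "A"),
}

def build_inconsistency_graph(tuples):
    """Return the edge set: pairs of tuples that jointly violate zip -> loc."""
    edges = set()
    for (a, (zip_a, loc_a)), (b, (zip_b, loc_b)) in combinations(sorted(tuples.items()), 2):
        if zip_a == zip_b and loc_a != loc_b:
            edges.add((a, b))
    return edges

def is_repair(edges, deleted):
    """A deletion set is a repair iff every edge loses at least one endpoint."""
    return all(a in deleted or b in deleted for a, b in edges)

edges = build_inconsistency_graph(tuples)
print(sorted(edges))
print(is_repair(edges, {"u2"}), is_repair(edges, {"u1"}))
```

In this toy instance u2 conflicts with both u1 and u4, so deleting u2 alone removes every edge, whereas deleting u1 leaves the conflict between u2 and u4 in place.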

Repair Worlds
[Figure: repair world 1, repair world 2, repair world 3]

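Problem Reduction
- CQA over inconsistent probabilistic databases can be reduced to the problem in the repaired possible worlds
- Repaired possible worlds: rpw(D) = pw(D) - rw(D), where pw(D) are the possible worlds and rw(D) are the repair worlds
- Consistent scores are computed over rpw(D) by recursive functions
[Figure labels: inconsistent probabilistic database; possible worlds pw(D); repair worlds rw(D); repaired possible worlds rpw(D); recursive functions; consistent scores]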

Computation of Consistent Scores for PC-Topk
- Consistent score of PC-Topk
  - The consistent score, scoreT(tj), of a tuple tj indicates the score that tj is in the top-k results of the repaired possible worlds
- Example with four repaired possible worlds:
  - rpw1: tj is in the top-k results (Y), w(tj, rpw1) = 0.3
  - rpw2: tj is not in the top-k results (N)
  - rpw3: tj is not in the top-k results (N)
  - rpw4: tj is in the top-k results (Y), w(tj, rpw4) = 0.3
  - scoreT(tj) = 0.3 + 0.3 = 0.6
- Probabilistic Consistent Top-k (PC-Topk) query: retrieve the k objects, tj, with the highest scoreT(tj)

Consistent Scores for PC-Topk
- scoreT(tj) is rewritten in terms of a recursive function
- The recursive function gives the probability that there are (i-1) out of (j-1) tuples with rank higher than tj in rpw(D)

PC-Topk Pruning
- PC-Topk pruning heuristics
  - Obtain bounds of the consistent scores at a low cost
  - Set a threshold τ to the k-th largest score lower bound among the tuples seen so far
  - Any tuple whose score upper bound is smaller than τ can be safely pruned
- Let Wm be the m-th largest existence probability in Tj-1
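The threshold rule above can be sketched in a few lines. This is a batch simplification under assumptions: the slide describes a threshold maintained over the tuples seen so far, whereas the sketch takes all bounds at once, and the bounds themselves are hypothetical inputs rather than values derived from Wm. The function name `prune_candidates` is illustrative.

```python
def prune_candidates(bounds, k):
    """bounds: tuple id -> (score lower bound, score upper bound).

    Returns (threshold, surviving candidates). The threshold is the k-th
    largest lower bound; tuples whose upper bound is below it are pruned.
    """
    lowers = sorted((lo for lo, _ in bounds.values()), reverse=True)
    if len(lowers) < k:
        # Fewer than k tuples: no threshold can be set yet, keep everything.
        return None, dict(bounds)
    threshold = lowers[k - 1]
    survivors = {tid: b for tid, b in bounds.items() if b[1] >= threshold}
    return threshold, survivors

# Hypothetical score bounds for four tuples, with k = 2.
bounds = {"t1": (0.5, 0.7), "t2": (0.4, 0.6), "t3": (0.1, 0.3), "t4": (0.2, 0.45)}
threshold, survivors = prune_candidates(bounds, 2)
print(threshold, sorted(survivors))
```

Only t3 is pruned here: its upper bound 0.3 is below the threshold 0.4, while t4 survives because its upper bound 0.45 could still place it in the top-k.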

Experimental Evaluation
- Experimental settings
  - Real/synthetic data sets
    - Iceberg sighting data set (IIP): existence probabilities are assigned according to witness probabilities
    - Uniform and skewed data
  - Inconsistencies are injected by randomly selecting tuple pairs
  - Parameters: query range [εmin, εmax], dimensionality d, data size N, join similarity threshold ε, percentage of inconsistent tuples γ, parameter k
- Measures
  - Time cost (PC-Range, PC-Join, and PC-Topk)
  - Speed-up ratio (PC-Topk), compared with a baseline method
  - Baseline: sequentially scan the data and calculate the consistent score

Performance of PC-Topk
[Figures: PC-Topk time cost vs. k; PC-Topk speed-up ratio vs. k]
- Settings: dimensionality d = 2, data size N = 30K, percentage of inconsistent tuples γ = 0.1%

Summary
- Consistent query answering (CQA) in inconsistent probabilistic databases
  - Reduce the problem to the one in the repaired possible worlds
  - Derive recursive functions to compute consistent scores
  - Provide effective filtering methods to reduce the search space