Probabilistic Databases

Probabilistic Databases Garima (MT14006) Nikita Jain (MT14052)

Why? Most real databases contain data whose correctness is uncertain. This imprecision can come from measurement errors (sensor data), inherent ambiguity in natural-language text (information extraction), or the high cost of data cleaning (business intelligence). To work with such data, we need to quantify how far it can be trusted. Probabilistic databases do this by modeling imprecise data as probabilistic data.

What? A probabilistic database management system, or PROBDMS, is a system that stores large volumes of probabilistic data and supports complex queries over it. “Diamonds in the dirt”

Challenges: scaling to large data volumes, and performing probabilistic inference. Tuples of uncertain data may be correlated; the data is annotated with confidence scores, which are interpreted as probabilities.

Applications:
In sensors: a probabilistic model can answer many queries with sufficient confidence without acquiring additional readings, which saves the sensor's battery life.
Information extraction: the collected data is noisy, and a PROBDMS is well suited to storing and processing such data.

Facet 1: Semantics & Representation

Semantics: A probabilistic database is a discrete probability space over the possible contents of the database: PDB = (W, P), where W = {I1, I2, ..., In} is the set of possible instances, called possible worlds, and P: W -> [0,1] gives the probability of each world, with the probabilities summing to 1. Equivalently, there is one random variable for each possible tuple, with value 0 if the record is not present and 1 if it is present, so a probabilistic database is a joint probability distribution over the values of these random variables.

Representation formalisms:
BID (block-independent-disjoint): a table is BID if the set of all possible tuples can be partitioned into blocks such that tuples from the same block are disjoint events, and tuples from distinct blocks are independent. Representation formalisms need to be concise. All of them are, at their core, an instance of database normalization: they decompose a probabilistic database with correlated tuples into several BID tables.
Lineage: the lineage of a tuple is an annotation that defines its derivation. Lineage is used both to represent probabilistic data and to represent query results. With lineage, user feedback on the correctness of results can be traced back to the sources of the relevant data, allowing unreliable sources to be identified.
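The possible-worlds semantics of a BID table can be illustrated with a small sketch. The table contents, the `blocks` layout, and the function name below are assumptions made for illustration only; they are not taken from the slides. The sketch enumerates every possible world, which is only feasible for tiny tables.

```python
from itertools import product

# Hypothetical BID table: each block lists mutually exclusive alternatives
# as (tuple, probability); distinct blocks are assumed independent.
blocks = {
    "sensor1": [(("sensor1", "hot"), 0.6), (("sensor1", "cold"), 0.3)],
    "sensor2": [(("sensor2", "hot"), 0.5)],
}

def possible_worlds(blocks):
    """Yield (world, probability) pairs for a BID table."""
    per_block = []
    for alternatives in blocks.values():
        choices = [(frozenset([t]), p) for t, p in alternatives]
        # Leftover probability mass: no tuple of this block is present.
        absent = 1.0 - sum(p for _, p in alternatives)
        if absent > 1e-12:
            choices.append((frozenset(), absent))
        per_block.append(choices)
    # Independence across blocks: pick one choice per block, multiply.
    for combo in product(*per_block):
        world = frozenset().union(*(tuples for tuples, _ in combo))
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

worlds = list(possible_worlds(blocks))
total = sum(p for _, p in worlds)
# Marginal probability of a tuple = total mass of the worlds containing it.
marginal_hot1 = sum(p for w, p in worlds if ("sensor1", "hot") in w)
```

Here `total` comes out as 1 up to rounding, and the marginal of a tuple equals the probability stored for it in its block, which is what makes the BID table a compact encoding of the full distribution P over W.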

Facet 2: Query Evaluation

Safety
Safe queries: no separate probabilistic inference step is needed; output probabilities are computed inside the database engine during normal query processing, which gives large performance improvements.
Safe plan: a relational plan that computes the output probabilities correctly. A safe plan allows probabilities to be computed in the relational algebra by extending its operators to manipulate probabilities. There are multiple ways to extend them; the simplest is to assume all tuples are independent: a join that combines two tuples computes the new probability as p1 · p2, and a duplicate elimination that replaces n tuples with one tuple computes the output probability as 1 − (1 − p1)(1 − p2)...(1 − pn). A safe plan is by definition a plan in which all these operations are provably correct. The correctness proof (or safety property) must be carried out by the query optimizer, through a static analysis of the plan.

Dichotomy of query evaluation
The query expression (which is small) and the database (which is large) are treated as two different inputs to the query evaluation problem. This gives three complexity measures: the data complexity (when the query is fixed), the expression complexity (when the database is fixed), and the combined complexity (when both are part of the input). Some queries have PTIME data complexity (all safe queries), while others have #P-hard data complexity. This means that query optimizers need to make special efforts to identify and use safe queries.

Materialized views
In the simplest formulation, there are a number of materialized views, for example answers to previous queries, and the query is rewritten in terms of these views to improve performance. A query may be unsafe, yet become safe, and thus PTIME, after it is rewritten in terms of views. There is no magic here: we don't avoid the #P-hard problem, we simply take advantage of the fact that the main cost was already paid when the view was materialized.
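The two extended operators can be sketched for one query that is commonly used as an example of a safe query, "is there an x and y with R(x) and S(x, y)?", over tuple-independent tables. The tables `R` and `S`, their probabilities, and all function names are hypothetical. The brute-force check sums over all possible worlds and is included only to confirm that the plan gives the same answer on this input.

```python
from itertools import product

# Hypothetical tuple-independent tables: tuple -> probability.
R = {("a",): 0.5, ("b",): 0.8}
S = {("a", 1): 0.4, ("a", 2): 0.5, ("b", 1): 0.9}

def independent_project(probs):
    """Duplicate elimination over independent tuples: 1 - prod(1 - p_i)."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss

def safe_plan(R, S):
    """P(exists x, y: R(x) and S(x, y)): project S on x, join R, project."""
    per_x = []
    for (x,), p_r in R.items():
        p_s = independent_project(p for (sx, _), p in S.items() if sx == x)
        per_x.append(p_r * p_s)  # independent join: p1 * p2
    return independent_project(per_x)

def brute_force(R, S):
    """Same probability by enumerating every possible world (exponential)."""
    facts = [("R", t, p) for t, p in R.items()]
    facts += [("S", t, p) for t, p in S.items()]
    total = 0.0
    for bits in product([0, 1], repeat=len(facts)):
        prob = 1.0
        xs_r, xs_s = set(), set()
        for bit, (rel, t, p) in zip(bits, facts):
            prob *= p if bit else 1.0 - p
            if bit:
                (xs_r if rel == "R" else xs_s).add(t[0])
        if xs_r & xs_s:  # the query is true in this world
            total += prob
    return total

p_plan = safe_plan(R, S)
p_exact = brute_force(R, S)
```

The order of operations matters: S is projected onto x before the join, so that each operator only ever combines independent events. Joining first and projecting afterwards would combine tuples that share the same R tuple, and the independence assumption behind 1 − (1 − p1)...(1 − pn) would no longer hold.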

Facet 3: User Interface

How best to present the set of possible query answers to the user.

Ranking and top-k query answering: the system computes all possible answer tuples and their probabilities, ranks these tuples, and restricts them to the top k. One way to rank tuples is in decreasing order of their output probabilities. Often, however, there is a user-specified ordering criterion, and then the system needs to combine the user's ranking scores with the output probabilities.

Aggregates over imprecise data: there are value aggregates, as in "for each company return the sum of the profits in all its units", and predicate aggregates, as in "return those companies having the sum of profits greater than 1M".
Value aggregates: interpreted as an expected value. For instance, the complexities of computing sum and count aggregates over a column are the same as the complexities of answering the same query without the aggregate, that is, where all possible values of the column are returned along with their probabilities.
Predicate aggregates: one needs to compute the entire density function of the random variable represented by the aggregate, which is more difficult than computing the expected value.
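The difference between the two kinds of aggregate can be seen in a short sketch. The `units` rows (company, profit in thousands of dollars, tuple probability) and the function names are made up for illustration, and the sum distribution assumes the tuples are independent, which the slides do not require in general.

```python
import heapq

# Hypothetical rows: (company, profit in $K, tuple probability).
units = [("acme", 700, 0.9), ("acme", 600, 0.5), ("bolt", 1200, 0.4)]

def top_k(answers, k):
    """Rank answer tuples by decreasing probability and keep the top k."""
    return heapq.nlargest(k, answers, key=lambda a: a[-1])

def expected_sum(rows):
    """Value aggregate: E[SUM] = sum of value * probability (linearity)."""
    return sum(v * p for v, p in rows)

def sum_distribution(rows):
    """Predicate aggregate: full distribution of SUM, assuming independence."""
    dist = {0: 1.0}
    for v, p in rows:
        nxt = {}
        for s, q in dist.items():
            nxt[s] = nxt.get(s, 0.0) + q * (1.0 - p)    # tuple absent
            nxt[s + v] = nxt.get(s + v, 0.0) + q * p    # tuple present
        dist = nxt
    return dist

acme = [(v, p) for c, v, p in units if c == "acme"]
e_sum = expected_sum(acme)
# Probability that acme's total profit exceeds 1M (1000 in $K).
p_over_1m = sum(q for s, q in sum_distribution(acme).items() if s > 1000)
best = top_k(units, 1)
```

In this example the expected sum for "acme" is below 1M, yet the sum exceeds 1M in a substantial fraction of the possible worlds. That is why a predicate aggregate cannot be answered from the expected value alone and needs the whole distribution.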

THANK YOU!!

References
https://en.wikipedia.org/wiki/Probabilistic_database
http://cacm.acm.org/magazines/2009/7/32095-probabilistic-databases-diamonds-in-the-dirt/fulltext