Probabilistic Databases

Garima (MT14006) Nikita Jain (MT14052)

Need to handle imprecise data by modeling it as probabilistic data!

Why? Most real databases contain data whose correctness is uncertain. This imprecision may come from measurement errors (sensor data), inherent ambiguity in natural-language text (information extraction), or the high cost of data cleaning (business intelligence). To work with such data, we need to quantify how reliable it is. Probabilistic databases do this.

What? A probabilistic database management system, or PROBDMS, is a system that stores large volumes of probabilistic data and supports complex queries. “Diamonds in the dirt”

The tuples of the uncertain data can be correlated. The data is annotated with a confidence score, which is interpreted as a probability.

Challenges: scaling to large data volumes, and performing probabilistic inference.

Applications: sensors and information extraction.
In sensors: a probabilistic model could answer many queries with sufficient confidence without needing to acquire additional readings, which saves the sensor's battery life.
In information extraction: data is collected with noise, and a PROBDMS is well suited to storing and processing such data.

Facet 1: Semantics & Representation
Semantics: A probabilistic database is a discrete probability space over the possible contents of the database: PDB = (W, P), where W = {I1, I2, ..., In} is the set of possible instances, called possible worlds, and P: W -> [0,1] gives the probability of each world's occurrence, with P(I1) + P(I2) + ... + P(In) = 1.

There is one random variable for each possible tuple, with value 0 or 1: 0 means the record isn’t present and 1 means that it is present. A probabilistic database is therefore a joint probability distribution over the values of these random variables.

Representation formalisms: a representation should be concise.

BID (block-independent-disjoint): A probabilistic database is BID if the set of all possible tuples can be partitioned into blocks such that tuples from the same block are disjoint events, and tuples from distinct blocks are independent. All representation formalisms are, at their core, an instance of database normalization: they decompose a probabilistic database with correlated tuples into several BID tables.

Lineage: The lineage of a tuple is an annotation that defines its derivation. Lineage is used both to represent probabilistic data and to represent query results. With lineage, user feedback on the correctness of results can be traced back to the sources of the relevant data, allowing unreliable sources to be identified.
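
As a small illustration of the possible-worlds semantics and the BID definition, the sketch below enumerates the worlds of a toy BID table and recovers a tuple's marginal probability from them. The table contents, the names `blocks`, `possible_worlds`, and `marginal`, and the choice to let a block's probabilities sum to less than 1 (so the block may contribute no tuple) are assumptions made for this example, not part of the slides. Explicit enumeration is exponential in the number of blocks and is only meant to make the definition concrete.

```python
from itertools import product

# Hypothetical BID table: each block holds mutually exclusive alternatives
# as (value, probability). If a block's probabilities sum to less than 1,
# the remaining mass is the chance that the block contributes no tuple.
blocks = {
    "sensor_1": [("20C", 0.6), ("21C", 0.3)],
    "sensor_2": [("18C", 0.5)],
}

def possible_worlds(blocks):
    """Return a list of (world, probability) pairs for a BID table.

    A world is a frozenset of (block_key, value) tuples. Tuples inside one
    block are disjoint (at most one is chosen); blocks are independent
    (probabilities multiply across blocks).
    """
    per_block = []
    for key, alternatives in blocks.items():
        choices = [((key, value), p) for value, p in alternatives]
        missing = 1.0 - sum(p for _, p in alternatives)
        if missing > 1e-12:
            choices.append((None, missing))  # block contributes no tuple
        per_block.append(choices)

    worlds = []
    for combo in product(*per_block):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((world, prob))
    return worlds

def marginal(worlds, tup):
    """Probability that `tup` is present: sum over the worlds containing it."""
    return sum(p for world, p in worlds if tup in world)

worlds = possible_worlds(blocks)
total_probability = sum(p for _, p in worlds)
p_20c = marginal(worlds, ("sensor_1", "20C"))
```

Here the world probabilities sum to 1, and the marginal of each tuple equals the probability stored in its block, which is what the BID representation promises.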

Facet 2: Query Evaluation

Safety
Safe queries: no separate probabilistic inference step is needed. Output probabilities are computed inside the database engine during normal query processing, which gives large performance improvements.
Safe plan: a relational plan that computes the output probabilities correctly. A safe plan allows probabilities to be computed in the relational algebra by extending its operators to manipulate probabilities. There are multiple ways to extend them; the simplest is to assume all tuples are independent: a join that combines two tuples computes the new probability as p1 · p2, and a duplicate elimination that replaces n tuples with one tuple computes the output probability as 1 − (1 − p1)(1 − p2)...(1 − pn). A safe plan is by definition a plan in which all these operations are provably correct. The correctness proof (or safety property) needs to be done by the query optimizer, through a static analysis of the plan.

Dichotomy of Query Evaluation
The query expression (which is small) and the database (which is large) are treated as two different inputs to the query evaluation problem. This gives three complexity measures: the data complexity (when the query is fixed), the expression complexity (when the database is fixed), and the combined complexity (when both are part of the input). For some queries the data complexity is in PTIME (all safe queries), while others have #P-hard data complexity. This means that query optimizers need to make special efforts to identify and use safe queries.

Materialized Views
In the simplest formulation, there are a number of materialized views, for example answers to previous queries, and the query is rewritten in terms of these views to improve performance. A query may be unsafe, but after rewriting it in terms of views it may become a safe query, and thus be in PTIME. There is no magic here: we don’t avoid the #P-hard problem, we simply take advantage of the fact that the main cost has already been paid when the view was materialized.
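
The two independent-tuple operators described for safe plans can be sketched as follows. The example assumes a tuple-independent database with two hypothetical relations `R(x)` and `S(x, y)` and the Boolean query "is there an x, y with R(x) and S(x, y)?"; the relation contents and all function names are illustrative only. The plan projects away y within each x, joins with R, and then projects away x. A brute-force pass over all possible worlds is included purely as a check that this plan gives the right answer on this input.

```python
from itertools import product
from math import prod

def independent_join(p1, p2):
    """Join of two independent tuples: both must be present."""
    return p1 * p2

def independent_project(probs):
    """Duplicate elimination over independent tuples: at least one present."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical tuple-independent relations: key -> probability of presence.
R = {"a": 0.5, "b": 0.8}
S = {("a", 1): 0.4, ("a", 2): 0.5, ("b", 1): 0.25}

def safe_plan(R, S):
    """Evaluate the Boolean query with extended relational operators."""
    by_x = {}
    for (x, _y), p in S.items():
        by_x.setdefault(x, []).append(p)
    joined = []
    for x, p_r in R.items():
        if x in by_x:
            p_s = independent_project(by_x[x])        # project away y
            joined.append(independent_join(p_r, p_s)) # join R with S on x
    return independent_project(joined)                # project away x

def brute_force(R, S):
    """Sum the probabilities of all possible worlds where the query holds."""
    tuples = [("R", k, p) for k, p in R.items()]
    tuples += [("S", k, p) for k, p in S.items()]
    total = 0.0
    for bits in product([0, 1], repeat=len(tuples)):
        weight = 1.0
        r_present, s_present = set(), set()
        for bit, (rel, key, p) in zip(bits, tuples):
            weight *= p if bit else 1.0 - p
            if bit:
                (r_present if rel == "R" else s_present).add(key)
        if any(x in r_present for x, _y in s_present):
            total += weight
    return total

p_safe = safe_plan(R, S)
p_exact = brute_force(R, S)
```

Both computations agree on this input (about 0.48). The safe plan touches each tuple once, while the brute-force check visits 2^5 worlds, which hints at why evaluation inside the engine scales and naive inference does not.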

Facet 3: User Interface
How best to present the set of possible query answers to the user.

Ranking and Top-k Query Answering
The system returns all possible answer tuples and their probabilities, ranks these tuples, and restricts them to the top k. One way to rank tuples is in decreasing order of their output probabilities. Often, however, there may be a user-specified ordering criterion, and then the system needs to combine the user’s ranking scores with the output probabilities.

Aggregates over Imprecise Data
There are value aggregates, as in "for each company return the sum of the profits in all its units", and predicate aggregates, as in "return those companies having the sum of profits greater than 1M".
Value aggregates: interpreted as expected value. For instance, the complexities of computing sum and count aggregates over a column are the same as the complexities of answering the same query without the aggregate, that is, where all possible values of the column are returned along with their probabilities.
Predicate aggregates: one needs to compute the entire density function of the random variable represented by the aggregate, which is more difficult than computing the expected value.
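
The difference between value and predicate aggregates, and ranking by output probability, can be made concrete with a small sketch. It assumes a hypothetical tuple-independent table `units` of (company, unit profit, probability) rows; the data and the function names are invented for illustration. The expected sum uses linearity of expectation, while the predicate "sum of profits greater than 1M" requires the full distribution of the sum, which can grow exponentially with the number of rows in the worst case.

```python
from collections import defaultdict
import heapq

# Hypothetical data: (company, unit_profit, probability the row is present).
units = [
    ("Acme", 600_000, 0.9),
    ("Acme", 700_000, 0.5),
    ("Bolt", 400_000, 0.8),
]

def expected_sum(rows):
    """Value aggregate: expected SUM, by linearity of expectation."""
    return sum(value * p for value, p in rows)

def sum_distribution(rows):
    """Distribution of SUM over independent rows, as {total: probability}."""
    dist = {0: 1.0}
    for value, p in rows:
        nxt = defaultdict(float)
        for total, q in dist.items():
            nxt[total] += q * (1.0 - p)   # row absent
            nxt[total + value] += q * p   # row present
        dist = dict(nxt)
    return dist

def prob_sum_exceeds(rows, threshold):
    """Predicate aggregate: P(SUM > threshold), from the full distribution."""
    dist = sum_distribution(rows)
    return sum(q for total, q in dist.items() if total > threshold)

def top_k(answers, k):
    """Rank (answer, probability) pairs by decreasing probability, keep k."""
    return heapq.nlargest(k, answers, key=lambda a: a[1])

by_company = defaultdict(list)
for company, profit, p in units:
    by_company[company].append((profit, p))

expected = {c: expected_sum(rows) for c, rows in by_company.items()}
answers = [(c, prob_sum_exceeds(rows, 1_000_000)) for c, rows in by_company.items()]
best = top_k(answers, 1)
```

With this data only Acme can exceed 1M, and only when both of its units are present, so it is the single top-1 answer. Combining a user-specified score with the probability would need a different `key` in `top_k`; the slides do not say how the two should be combined, so that is left out.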

THANK YOU!!

References https://en.wikipedia.org/wiki/Probabilistic_database
