Presentation on theme: "Representing and Querying Correlated Tuples in Probabilistic Databases"— Presentation transcript:
1Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen Amol Deshpande
2Independent tuples model Tuple correlations Representing Dependencies outlineGeneral InfoIntroductionIndependent tuples modelTuple correlationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
3Issues with the use of probabilistic databases General infoHigh demand for storing uncertain dataA framework that can represent not only probabilistic tuples but also correlations among them to tackle these limitationsIssues with the use of probabilistic databases1) existent probabilistic databases make simplistic assumptions about the data that make it difficult to use them in applications that naturally produce correlated data2) Most probabilistic databases can only answer a restricted subset of the queries that can be expressed using traditional query languagesYparxei megalh zhthsh gia thn apo8ukefsh abaibewn dedomenwnOI iparxouses baseis dedomenwn k;anoyn uperaplousteumenes upo8eseis gia ta dedomena (oti ta einai ane3arthta meta3h tous ) me apotelesma na mhn mporoume na tis xrisimopoihsoue ean ta dedomena dn einai ane3arthta meta3h tousEpiseis oi perissoteres pi8anotikes baseis dedomenwn mporoun na apantoun mono stis poio sunxes glwsses querry.Sto paper afto proteinoun ena framework to opoio mporei na anaparhsta kai pi8anotikes pleiades alla kai sxeseis meta3i tous.
4outline General Info Introduction Independent tuples model Tuple correlationsProbabilistic graphical models & factored representationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
5Introduction (1/2)Database research has primarily concentrated on how to store and query exact dataMany real-world applications produce large amounts of uncertain dataDatabases need to do more than simply store and retrieve; they have to help the user sift through the uncertainty and find the results most likely to be the answer.1)H erebna stis baseis dedomenwn exei sigentrw8ei sthn epo8hkeush bebaiwn dedomenwn2)Polles efarmoges paragoun abebaia dedomena kai se tetoies periptwseis oi baseis dedomenwn ektos tou na anaktoun kai na paragoun dedomena prepei na boi8oun ton xrhsth na briskei ta apotelesmata pou einai poio pi8ano na apanth8oun
6Introduction (2/2)Numerous approaches (models) proposed to handle uncertainty.However, most models make assumptions about data uncertainty that restricts applicability (they cannot easily model or handle dependencies and correlations among tuples)
7outline General Info Introduction Independent tuples model Tuple correlationsProbabilistic graphical models & factored representationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
8Independent tuples model(1/2) One of the most commonly used tuple-level uncertainty models, associates existence probabilities with individual tuples and assumes that the tuples are independent of each otherEstw Dp mia bash dedomenwn me sxeseis Sp (pou periexei thn pleiada s1 me pi8aotita 0.6 kai thn pleiada s2 me pi8anotita 0.5) kai Tp (pou periexei thn pleiada t1 me pi8anotita 0.4)Blepoume ston pinaka afto olous tous pi8anous kosmous kai h pi8anotita gia ka8e kosmo bgenei (me enwsh olwn twn pi8anotitwn) pollaplasiazontas apla tis pi8anotites ka8e pleiadas efoson einai ane3arthtes.Ean twra ektelesoume to querry panw stous pia8ous kosmous
9Independent tuples model (2/2) Evaluating a query via the set of possible worlds is clearly intractable as the number of possible worlds is very bigIntensional semantics guarantee results in accordancewith possible words semantics but are computationallyexpensive.Extensional semantics are computationally cheaper but do not guarantee results in accordance with the possible worlds semantics.Base tuples are independent of each other, the intermediate tuples that are generated during query evaluation are typically correlated
10outline General Info Introduction Independent tuples model Tuple correlationsProbabilistic graphical models & factored representationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
11Tuple correlations (1/2) As 8ewrisoume twra 4 set apo pi8anous kosmous pou proerxontai apo thn idia bash dedomenwn opws proigoumenos ala me diaforetikes e3arthseis pou endexomenos 8eloume na anaparasthsoume
12Tuple correlations (2/2) Although the tuple probabilities associated with s1, s2 and t1 are identical, the query results are drastically different across these four databases.Since both intensional and extensional semantics assume base tuple independence neither can be directly used to do query evaluation in such cases.->Parolou pou oi pi8anotites ton pleiadwn einai s1 s2 einai idies ta apotelesmata twn query einai teleiws diaforetika.-> oi me8odoi extensional kai intentional pou eipame parapanw einai mono gia ane3artites pleiades kai giafto ton logo den boroun na xrhshmopoihs8oun gia thn a3iologhsh tou querry.
13Independent tuples model Tuple correlations Query evaluation outlineGeneral InfoIntroductionIndependent tuples modelTuple correlationsRepresenting correlationsQuery evaluationExperimentsConclusions & Work to be done
14Representing correlations(1/3) Associate every tuple t with a Boolean valued random variable Xtf (X) is a function of a (small) set of random variables X, where <= f (X) <=1Associate with each tuple in the probabilistic database a random variableDefine factors on (sub)sets of tuple-based random variables toencode correlations.5) The probability of an instantiation of the database is given by the product of all the factors.
15Representing correlations(2/3) Suppose we want to represent mutual exclusivity between tuples s1 and t1. In particular, let us try to represent the possible worlds:
16Representing correlations(3/3) Suppose we want to represent positive correlation between t1 and s1. In particular, let us try to represent the possible worlds:
17Probabilistic graphical model representation A probabilistic graphical model is graph whose nodes represent random variables and edges represent correlations Complete Ind. Mutual Exclusivity Positive CorrelationXt1Xs1Xt1Xs1Xt1Xs1Xs2Xs2Xs2
18Probabilistic graphical model representation X1X2X3
19outline General Info Introduction Independent tuples model Tuple correlationsProbabilistic graphical models & factored representationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
20Query evaluation: basic idea Treat intermediate tuples as regular tuples.Carefully represent correlations between intermediate tuples, base tuples and result tuples to construct a probabilistic graphical model.Cast the probability computations resulting from query evaluation to inference in probabilistic graphical models.1)Oi endiameses pleiades prepei na xeiristoun san kanonikes2)Prosektika prepei na anaparastountai oi sxeseis meta3i tous gia na dimiourgi8ei to pi8anotiko grafiko mondelo
22Opou to Fq einai to set twn paragontwn pou xreiazontai gia thn dimiourgeia ths pleiadas t epagogika sto querry
23Query evaluation :example Probabilistic graphical model Query evaluation problem in Prob. Databases: Compute the probability of the result tuple summed over all possible worlds of the databaseEquivalent problem in prob. graph. models: marginal probability computation.use inference algorithmsXs2Xs1Xt1Xi1Xi2Xr1
25Representing probabilistic relations Mia pleida borei na iparxei se polles e3arthseis. Sthn dikia tous ilopoihsh oi pleiades apo8ikebontai san komati mia sxeshs. Gia thn apo8ukeush ths abebaibeotitas xrhshmopoioun partitionEna partition apoteleite apo ena factor kai ena set apo anafores stis pleiades stis opoies oi tuxaies times einai ta orismata gia to sigekrimeno factorEktos apo ka8e pleiada t se mia sxesh iparxei episeis mia lista apo pointers sta partitions pou exoun anafores stis pleiades aftesSto sxhma: blepoume pou organonontai oi sxesies kai ta partitions gia thn bash me thn nxor e3arthshOi diakekomenes grames einai oi pointesapo ths pleiades sta partitions kai oi kanonikes einai oi anafores
26outline General Info Introduction Independent tuples model Tuple correlationsProbabilistic graphical models & factored representationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
27Experiments (1/3)Database contains 860 publications from CiteSeer [GBL98].Searched for publications for given (misspelt) author name.Naturally involves mutual exclusivity correlations
28Experiments (2/3)Ran experiments on randomly generated TPC-H dataset of size 10MB.The first bar on each query indicates the time it took to run the full query including all the database operations and the probabilistic computations.The second one indicates the time it took to run only the database operations using our Java implementation.
29Experiments(3/3)The result of running an average query over a synthetically generated dataset containing tuples
30outline General Info Introduction Independent tuples model Tuple correlationsProbabilistic graphical models & factored representationsRepresenting DependenciesQuery evaluationExperimentsConclusions & Work to be done
31conclusionsThere is an increasing need for database solutions for efficiently managing and querying uncertain data exhibiting complex correlation patterns.A simple and intuitive framework is presented, based on probabilistic graphical models, for explicitly modeling correlations among tuples in a probabilistic database
32Work to be doneProblem: Although conceptually the approach presented allows for capturing arbitrary tuple correlations, exact query evaluation over large datasets exhibiting complex correlations may not always be feasible.Future Considerations:Development of approximate query evaluation techniques that can be used in such casesDevelop disk-based query evaluation algorithms so that their techniques can scale to very large datasets.