Indexing Correlated Probabilistic Databases Bhargav Kanagal & Amol Deshpande University of Maryland.

Introduction

Correlated probabilistic data is generated in many scenarios:
- Data Integration [AFM06]: conflicting information is best captured using “mutual exclusivity”.
- Information Extraction [JK+06]: annotations on consecutive text segments are strongly correlated. Example: in the text “Bill can be reached at ...”, the annotation “first name” (prob: 80%) and the annotation “phone number” (prob: 90%) have high positive correlation.
- Sensor Networks [KD08]: very strong spatio-temporal correlations.

TIME     USER   MODE (INFERRED)          Location (INFERRED)
5pm      John   Walking: 0.9, Car: 0.1
5pm      Jane   Walking: 0.9, Car: 0.1
5:05pm   John   Walking: 0, Car: 1

Existing systems (Mystiq/Lahar, PrDB, MayBMS, MCDB, BayeStore, ...) represent complex correlations and query them, but they do not scale to large numbers of arbitrary correlations.

Challenges with arbitrary correlations

The problem with chains of correlations:
- Simple queries involving few variables might need to access and manipulate the entire database.
- Searching through the vast number of correlations is inefficient.

Example queries:
1. What is the likelihood that the sensor is working at time t100 given that it was working at time t0? (WHAT-IF/INFERENCE query)
2. What is the average temperature measured by the sensor? (AGGREGATION query)

TIME   TEMPERATURE   HUMIDITY   STATUS (inferred)
t0                              WORKING: 0.9, FAILED: 0.1
t1                              WORKING: 0.9, FAILED: 0.1
...    ...           ...        ...
t100                            WORKING: 0.4, FAILED: 0.6

Problem

Given: a probabilistic database with millions of tuples and a large number of arbitrary correlations.

Need: scalable execution of inference queries and aggregation queries,
- accessing as few data tuples from disk as possible
- doing as few probability computations as possible

We build a novel data structure, INDSEP, that provides an index and a caching mechanism for efficient query processing over probabilistic databases.

Outline
1. Background
   1. Probabilistic Databases as Junction Trees
   2. Querying Junction Trees
2. INDSEP
   1. Shortcut Potentials
   2. Overview of Data structure
3. Querying using INDSEP
   1. Inference queries
   2. Aggregation queries
4. Building INDSEP
5. Handling Updates
6. Results

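Background: ProbDBs as Junction Trees [SD07]

A probabilistic database can carry tuple uncertainty (an uncertain "exists?" column), attribute uncertainty (uncertain attribute values) and correlations (for example, perfectly correlated tuples). In the running example, letters denote random variables and 1 denotes a certain value:

id   X   exists?
1    b   a
2    c   a
3    k   1
4    d   e
5    n   1
6    o   1

id   Y   Z
1    h   i
2    h   f
3    g   j
4    l   j
5    l   m

Junction tree: a concise encoding of the joint probability distribution.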

Background: Junction trees

Example tree, alternating cliques and separators:

p(a,b) - p(a) - p(a,c) - p(c) - p(c,d)

Cliques: p(a,b), p(a,c), p(c,d). Separators: p(a), p(c).

- Each clique and separator stores a joint pdf.
- The tree structure reflects the Markov property:
  - given a, b is independent of c
  - given c, a is independent of d

CAVEAT: Our approach inherits the disadvantages of the junction tree technique. It is restricted to ProbDBs that can be efficiently represented using a junction tree, i.e., low tree width models.
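The two bullet points about cliques, separators and the Markov property can be checked numerically. The sketch below is illustrative only: the probability values are made up, and it relies on the standard junction tree identity that the joint pdf equals the product of the clique pdfs divided by the product of the separator pdfs.

```python
from itertools import product

VALS = (0, 1)

# Hypothetical numbers for binary variables a, b, c, d (not from the slides).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}
p_d_given_c = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.6, 1: 0.4}}

# Clique pdfs p(a,b), p(a,c), p(c,d) and separator pdfs p(a), p(c).
p_ab = {(a, b): p_a[a] * p_b_given_a[a][b] for a, b in product(VALS, repeat=2)}
p_ac = {(a, c): p_a[a] * p_c_given_a[a][c] for a, c in product(VALS, repeat=2)}
p_c = {c: sum(p_ac[(a, c)] for a in VALS) for c in VALS}
p_cd = {(c, d): p_c[c] * p_d_given_c[c][d] for c, d in product(VALS, repeat=2)}


def joint(a, b, c, d):
    """Joint pdf = product of clique pdfs / product of separator pdfs."""
    return p_ab[(a, b)] * p_ac[(a, c)] * p_cd[(c, d)] / (p_a[a] * p_c[c])


def b_independent_of_c_given_a(tol=1e-12):
    """Check p(a,b,c) * p(a) == p(a,b) * p(a,c) for every assignment."""
    for a, b, c in product(VALS, repeat=3):
        p_abc = sum(joint(a, b, c, d) for d in VALS)
        if abs(p_abc * p_a[a] - p_ab[(a, b)] * p_ac[(a, c)]) > tol:
            return False
    return True


# The joint defined by the tree is a proper distribution.
total = sum(joint(*x) for x in product(VALS, repeat=4))
```

Only five small tables are stored, yet they determine the full joint over four variables; that is the sense in which the tree is a concise encoding.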

Query Processing on Junction trees

Inference queries: determine the probability distribution of the set of query variables.

Example tree: p(a,b) - p(a) - p(a,c) - p(c) - p(c,d)

Approach: push the summation inside and eliminate variables in a smart order.
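As a worked illustration of pushing the summation inside, assume the query is {b, d} on the example tree (the query choice is an assumption; the slide's own equation is not part of the transcript). Using the identity that the joint is the product of clique pdfs over the product of separator pdfs:

```latex
\begin{align*}
p(b,d) &= \sum_{a}\sum_{c} \frac{p(a,b)\,p(a,c)\,p(c,d)}{p(a)\,p(c)} \\
       &= \sum_{a} \frac{p(a,b)}{p(a)} \left( \sum_{c} \frac{p(a,c)\,p(c,d)}{p(c)} \right)
\end{align*}
```

The inner sum eliminates c and yields a table over (a, d) only, so the full four-variable joint is never materialized.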

(Naïve) Hugin’s Algorithm

Query: {b, e, o}
1. Build the Steiner tree over the query variables: keep the query variables, keep the correlations, remove the others.
2. Send messages toward a given pivot node.

For ProbDBs with about 1 million tuples, this is not scalable:
(1) Searching for cliques is expensive: a linear scan over all the nodes is inefficient.
(2) The span of the query can be very large: almost the complete database is accessed even for a 3-variable query.

INDSEP (Shortcut potentials)

How can we make the inference query computation scalable?

Example: junction tree on the variable set {a, b, d, e, l, n, o}.

Shortcut potential of a partition:
(1) Boundary separators
(2) Distribution required to completely shortcut the partition
(3) Which to build?
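A small numerical sketch may help with points (1) and (2). It assumes a chain-shaped junction tree over hypothetical binary variables a to e with cliques (a,b), (b,c), (c,d), (d,e), and treats the two middle cliques as one partition. The boundary separators are then p(b) and p(d), and the shortcut potential is taken to be the joint p(b,d). All numbers and names are invented for illustration.

```python
from itertools import product

VALS = (0, 1)

# Hypothetical Markov chain a -> b -> c -> d -> e (variable indices 0..4).
prior = {0: 0.3, 1: 0.7}
cpts = [
    {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}},
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}},
    {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},
    {0: {0: 0.6, 1: 0.4}, 1: {0: 0.3, 1: 0.7}},
]

# Brute-force joint, used only to set up the clique tables and to check results.
full = {}
for xs in product(VALS, repeat=5):
    p = prior[xs[0]]
    for i, cpt in enumerate(cpts):
        p *= cpt[xs[i]][xs[i + 1]]
    full[xs] = p


def marginal(keep):
    """Marginal of the full joint over the variable indices in `keep`."""
    out = {}
    for xs, p in full.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


p_ab, p_bc, p_cd, p_de = (marginal((i, i + 1)) for i in range(4))
p_b, p_c, p_d = marginal((1,)), marginal((2,)), marginal((3,))

# Shortcut potential of the partition {(b,c), (c,d)}: eliminate c once, up front.
p_bd = {
    (b, d): sum(p_bc[(b, c)] * p_cd[(c, d)] / p_c[(c,)] for c in VALS)
    for b, d in product(VALS, repeat=2)
}

# Query {a, e}: the partition is bypassed, only its shortcut potential is read.
p_ae = {
    (a, e): sum(
        p_ab[(a, b)] * p_bd[(b, d)] * p_de[(d, e)] / (p_b[(b,)] * p_d[(d,)])
        for b, d in product(VALS, repeat=2)
    )
    for a, e in product(VALS, repeat=2)
}

expected_ae = marginal((0, 4))
```

The query never touches variable c, which is the point of the shortcut: the cost of crossing a partition depends on the size of its boundary, not on how many cliques it contains.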

INDSEP - Overview

Obtained by hierarchical partitioning of the junction tree.

Index tree: Root; internal nodes I1, I2, I3; leaf partitions P1 to P6.

Contents of the root:
1. Variables: {a,b,..} {c,f,..} {j,n..o}
2. Child separators: p(c), p(j)
3. Tree induced on the children
4. Shortcut potentials of children: {p(c), p(c,j), p(j)}

Contents of an internal node:
1. Variables: {c, f, g, j, k}, {f, h, i}
2. Child separators: p(f)
3. Tree induced on the children
4. Shortcut potentials of children: {p(c,j,f), p(f)}

Query Processing using INDSEP

Index tree: Root; internal nodes I1, I2, I3; leaf partitions P1 to P6.

Query: {g}
1. Search for 'g' using the variable sets. Isn't this expensive?
2. Extract the partition P3 from disk.
3. Determine p(g) from the junction tree stored in P3.

The search is a recursion on the hierarchical index structure. The number of disk accesses is only 3: query processing needs a logarithmic-like number of disk accesses.
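The descent for a single-variable query can be sketched as follows. The `Node` class, the helper `internal`, and the grouping of P1 to P6 under I1 to I3 are assumptions made for illustration, as are the leaf variable sets other than {c, f, g, j, k} and {f, h, i}; each node visited stands for one block loaded from disk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    variables: frozenset
    children: list = field(default_factory=list)  # empty for leaf partitions


def internal(name, children):
    """Internal node whose variable set is the union of its children's sets."""
    return Node(name, frozenset().union(*(c.variables for c in children)), children)


def find_partition(root, var):
    """Descend from the root to a leaf partition containing `var`.

    Returns the leaf name and the list of nodes loaded along the way.
    """
    if var not in root.variables:
        raise KeyError(var)
    node, loaded = root, [root.name]
    while node.children:
        node = next(c for c in node.children if var in c.variables)
        loaded.append(node.name)
    return node.name, loaded


# Hypothetical index contents, loosely following the slides.
leaves = {
    "P1": "abc", "P2": "ade", "P3": "cfgjk",
    "P4": "fhi", "P5": "jln", "P6": "lo",
}
P = {name: Node(name, frozenset(vs)) for name, vs in leaves.items()}
root = internal("Root", [
    internal("I1", [P["P1"], P["P2"]]),
    internal("I2", [P["P3"], P["P4"]]),
    internal("I3", [P["P5"], P["P6"]]),
])

leaf, loaded = find_partition(root, "g")
```

With a balanced index the number of loaded nodes grows with the height of the tree, which is where the logarithmic-like behavior comes from. The membership test on explicit sets is the part the variable renaming step later replaces with range checks.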

Query Processing using INDSEP

Index tree: Root; internal nodes I1, I2, I3; leaf partitions P1 to P6.

Query: {b, e, o}
1. Construct the Steiner tree on the query random variables.
2. Is the shortcut potential sufficient?
3. Recursively proceed along its children.
4. Assemble the probabilities obtained from the children.

Variable sets shown in the figure: {b, e, o}; {b}, {b, e}, {o}; {b, e, c}, {c, j}, {j, o}.

Aggregation

Index tree: Root; internal nodes I1, I2, I3; leaf partitions P1 to P6.

Query: Q = b + c + d + k + n + o

“Push” the aggregate inside the index and exploit decomposability:
Y1 = b + c + d
Y2 = k
Y3 = n + o

Variable sets shown in the figure: {Q}; {b, c, d}, {k}, {n, o}; {Y1, c}, {c, Y2, j}, {j, Y3}; {b+c, c, a}, {d, a}; {j, n, l}, {l, o}.

Currently working on supporting more complex SQL-style queries.
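The decomposition of Y1 can be sketched on a toy example. The code assumes a two-clique tree p(a,b,c) - p(a) - p(a,d) over binary variables with invented numbers, and a hypothetical helper `sum_out_into` that replaces a group of variables by their sum. It first forms {b+c, c, a}, then folds in d to reach {Y1, c}, and compares the result with a brute-force computation.

```python
from collections import defaultdict
from itertools import product

VALS = (0, 1)


def sum_out_into(table, names, sum_vars, keep_vars):
    """Map a joint table over `names` to one over (sum of sum_vars, *keep_vars)."""
    out = defaultdict(float)
    for assignment, p in table.items():
        row = dict(zip(names, assignment))
        total = sum(row[v] for v in sum_vars)
        out[(total,) + tuple(row[v] for v in keep_vars)] += p
    return dict(out)


# Hypothetical clique p(a,b,c): arbitrary positive weights, normalized.
weights = {(a, b, c): 1 + a + 2 * b + 3 * c + a * b
           for a, b, c in product(VALS, repeat=3)}
norm = sum(weights.values())
p_abc = {k: w / norm for k, w in weights.items()}
p_a = {a: sum(p_abc[(a, b, c)] for b, c in product(VALS, repeat=2)) for a in VALS}

# Hypothetical clique p(a,d), consistent with the separator p(a).
p_d_given_a = {0: {0: 0.25, 1: 0.75}, 1: {0: 0.6, 1: 0.4}}
p_ad = {(a, d): p_a[a] * p_d_given_a[a][d] for a, d in product(VALS, repeat=2)}

# Step 1: inside the first clique, replace b by S = b + c, keeping c and a.
p_S_c_a = sum_out_into(p_abc, ("a", "b", "c"), ("b", "c"), ("c", "a"))

# Step 2: attach d through the separator p(a), then form Y1 = S + d, keeping c.
p_S_c_a_d = {
    (s, c, a, d): p * p_ad[(a, d)] / p_a[a]
    for (s, c, a), p in p_S_c_a.items()
    for d in VALS
}
p_Y1_c = sum_out_into(p_S_c_a_d, ("S", "c", "a", "d"), ("S", "d"), ("c",))

# Brute-force check over the full joint p(a,b,c,d).
expected = defaultdict(float)
for a, b, c, d in product(VALS, repeat=4):
    expected[(b + c + d, c)] += p_abc[(a, b, c)] * p_ad[(a, d)] / p_a[a]
```

Because SUM is decomposable, each step only carries the running partial sum plus the separator variables needed by the next clique, instead of all summands at once.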

Building INDSEP: Partitioning

Subroutine: we use the linear tree partitioning algorithm of Kundu, Misra et al. [KM77].
Input: node-weighted tree, block size B
Output: a partitioning into connected components such that
(1) each component has size <= B
(2) the number of components is minimum

Running time: linear in the size of the tree, pseudo-linear in B.

BOTTOM-UP ALGORITHM. Node weights in the example:
weight = size(p(cfg)) = 8
weight = size(p(a)) = 2
weight = size(p(jl)) = 4+
weight = size(p(cjf)) = 8+
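The bottom-up idea can be sketched as a greedy procedure: process nodes children first, and whenever a node together with its still-attached child components exceeds B, cut off the heaviest child component. This is written in the spirit of [KM77] rather than as a faithful reimplementation; it sorts children, so it is not linear time, and the tree, node names and weights below are hypothetical.

```python
def partition_tree(children, weight, root, B):
    """Greedy bottom-up partitioning of a node-weighted tree.

    Returns a list of connected components (sets of nodes), each of total
    weight <= B. `children` maps a node to the list of its children.
    """
    # Pre-order traversal; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    residual, members, components = {}, {}, []
    for v in reversed(order):
        if weight[v] > B:
            raise ValueError(f"node {v!r} alone exceeds the block size")
        kids = sorted(children.get(v, []), key=lambda c: residual[c])
        total = weight[v] + sum(residual[c] for c in kids)
        while total > B:
            heavy = kids.pop()            # heaviest remaining child component
            total -= residual[heavy]
            components.append(members[heavy])
        residual[v] = total
        members[v] = {v}.union(*(members[c] for c in kids))
    components.append(members[root])
    return components


# Hypothetical tree and weights.
children = {"n1": ["n2", "n3"], "n2": ["n4", "n5"], "n3": ["n6"]}
weight = {"n1": 2, "n2": 8, "n3": 4, "n4": 8, "n5": 8, "n6": 4}
components = partition_tree(children, weight, "n1", B=16)
```

In this example the total weight is 34 and B is 16, so at least three blocks are needed, and the sketch produces exactly three.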

Building INDSEP: Variable Renaming

Each node stores all the variables present in its subtree, which is inefficient.
Rename the variables so that each node can store a range instead of a set:
- Starting from the leftmost partition, assign numbers to the variables, starting from 0.
- Increment the counter when new variables are seen.
- Overlapping variables fall into the add-list.

Each partition will have
- (min, max), e.g. [4, 7]
- add-list, e.g. {2}

The mapping is stored in another relation, indexed by a hash table.
Updates? [Paper]
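A minimal sketch of the renaming pass, under the assumption that partitions are given left to right as lists of distinct variable names (the partition contents below are hypothetical). New variables receive consecutive numbers and form the partition's range; variables already numbered by an earlier partition go to its add-list.

```python
def rename_variables(partitions):
    """Assign integer ids left to right.

    Returns (ids, summaries), where summaries[i] is ((min, max), add_list)
    for partition i, or (None, add_list) if it introduces no new variable.
    """
    ids, summaries = {}, []
    for part in partitions:
        add_list = sorted(ids[v] for v in part if v in ids)
        new_vars = [v for v in part if v not in ids]
        start = len(ids)
        for v in new_vars:
            ids[v] = len(ids)             # counter grows with each new variable
        rng = (start, len(ids) - 1) if new_vars else None
        summaries.append((rng, add_list))
    return ids, summaries


def contains(summary, var_id):
    """Membership test using only the range and the add-list."""
    rng, add_list = summary
    in_range = rng is not None and rng[0] <= var_id <= rng[1]
    return in_range or var_id in add_list


# Hypothetical partitions; 'a' and 'c' overlap between partitions.
partitions = [["a", "b", "c"], ["a", "d"], ["c", "f", "g", "j", "k"]]
ids, summaries = rename_variables(partitions)
```

With these inputs the third partition is summarized by the range (4, 7) and the add-list [2], mirroring the example on the slide, and a lookup for 'g' reduces to two integer comparisons.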

Update Processing [very briefly]

We support two kinds of updates:
- Updates to existing probabilities
- Inserting new data [discussed in paper]

On modifying a certain probability:
- Every clique in the junction tree needs to be modified [inefficient].
- We develop a lazy algorithm: future queries take the burden of updating the index.

Index tree: Root; internal nodes I1, I2, I3; leaf partitions P1 to P6.

Example: suppose an update at P6.
1. Load P6 and update it.
2. Load I3 and update shortcuts.
3. Load root and update shortcuts.

Variable set shown in the figure: {k}.

Implementation [PrDB]

- System implemented in Java.
- Disk: simulated in memory as an array of disk blocks.
- Each disk block is a serialized byte array of size BLOCK_SIZE.
- Data is read and written in units of disk blocks.

Language for inserting new tuples and correlations into the probabilistic database:

Tuple uncertainty:
    insert into table1 values ('t1', 1, 2) uncertain('0 0.5; 1 0.5');

Attribute uncertainty:
    insert into table2 values ('r1', uncertain('xy 0.5; ab 0.5'), 't1');

Correlations:
    insert factor ' ; ' in table1 on 't1.e','t2.e';
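The simulated disk can be pictured with a short sketch. The original system is written in Java; the Python class below, including the name `SimulatedDisk`, its counters and the use of `pickle`, is purely an illustration of the idea of fixed-size byte-array blocks, not the authors' code.

```python
import pickle

BLOCK_SIZE = 512  # hypothetical value


class SimulatedDisk:
    """In-memory stand-in for a disk: a list of fixed-size byte blocks."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []
        self.reads = 0    # counters make the I/O cost measurable
        self.writes = 0

    def write_block(self, payload, block_id=None):
        """Store `payload` (bytes) in one block and return the block id."""
        if len(payload) > self.block_size:
            raise ValueError("payload does not fit in one block")
        block = payload.ljust(self.block_size, b"\x00")  # pad to full size
        if block_id is None:
            self.blocks.append(block)
            block_id = len(self.blocks) - 1
        else:
            self.blocks[block_id] = block
        self.writes += 1
        return block_id

    def read_block(self, block_id):
        """Return the full block as bytes and count the access."""
        self.reads += 1
        return self.blocks[block_id]


disk = SimulatedDisk(BLOCK_SIZE)
partition = {"variables": ["c", "f", "g"], "p_g": [0.5, 0.5]}  # made-up content
block_id = disk.write_block(pickle.dumps(partition))
restored = pickle.loads(disk.read_block(block_id))  # trailing padding is ignored
```

Counting block reads and writes in this way is one plausible route to reporting I/O cost even though no physical disk is involved.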

Experimental Setup

Markov sequence DB [LR09, KD09]:
- 2 million time steps ⇒ 4 million nodes
- Periodic update at each time instant

Event Monitoring:
- 500,000 nodes
- Generate arbitrary correlations in the database
- Each node connected to k others, k ∈ [1,5]
- Random updates

Queries & Updates:
- 4 workloads based on span [Short: 20%, 40%, 60%, Long: 80%], 100 queries per workload
- Between 2 and 5 random variables
- Similar workloads for updates

Results

INDSEP makes query processing highly efficient and scalable.

Comparison systems:
(1) No index
(2) With index but no shortcut potentials
(3) INDSEP

[Figure: I/O cost for different workloads (Event)]
[Figure: CPU cost for different workloads (Event)]
NOTE: LOG scale

Notice that having just the index is not sufficient. We also need shortcut potentials.

Results

Query performance results:
[Figure: % inconsistent shortcut potentials (Markov Sequence)]
[Figure: Time to build INDSEP (Event)]

INDSEP construction is efficient.

Note that the disk is currently simulated in memory.

Related Work

DTrees (Darwiche et al.)
- Design index for directed PGMs
- Limited querying support

Markov Chain Index (Letchner et al., [LR+09])
- Design index for a restricted correlation structure: single-attribute Markov chains

Indexing spatial probabilistic data (Tao et al., Singh et al.)
- Based on R-trees
- Attribute uncertainty with independence assumptions, i.e. no correlations

Future work

- Multiple queries: sharing computation across queries
- Supporting approximations for high tree width ProbDBs in the index data structure
- Disconnections: we assume a connected junction tree; we would like to also support disconnected PGMs

Thank you!