Efficient Query Evaluation over Temporally Correlated Probabilistic Streams. Bhargav Kanagal, Amol Deshpande. ΗΥ-562 Advanced Topics on Databases. Αλέκα Σεληνιωτάκη, Heraklion, 22/05/2012.

Slide 2 / 84 Contents
1. Motivation 2. Previous Work 3. Approach 4. Markov Sequence 5. Formal Semantics 6. Operator Design 7. Query Planning Algorithm 8. Results 9. Future Work

Slide 3 / 84 Motivation
Correlated probabilistic streams occur in a large variety of applications, including sensor networks, information extraction, and pattern/event detection. Probabilistic data streams are highly correlated in both time and space, and these correlations impact the final query results.

Slide 4 / 84 Motivation
A habitat monitoring application. Query example: compute the likelihood that a nest was occupied for all 7 days in a week. Other issues:
- Handle high-rate data streams efficiently and produce accurate answers to continuous queries in real time.
- Query semantics become ambiguous, since queries now deal with sequences of tuples and not sets of tuples.

Slide 5 / 84 Previous Work
- Probabilistic databases (MystiQ, Trio, Orion, MayBMS) focus on static data rather than streams and cannot handle complex correlations.
- Lahar and Caldera are applicable to probabilistic streams but focus on pattern identification queries; they are unable to represent the types of correlations that probabilistic streams exhibit and cannot be applied directly because of their complexity.

Slide 6 / 84 Approach
Observation: many naturally occurring probabilistic streams are both structured and Markovian in nature.
- Structured: the same set of correlations and independences repeats across time.
- Markovian: the values of variables at time t+1 are independent of those at time t-1 given their values at time t.
The approach: design compact, lightweight data structures to represent and modify such streams, and enable incremental query processing using the iterator framework. (Figure: a Markov sequence as a DGM.)

Slide 7 / 84 Approach: Probabilistic Databases
Probabilistic databases exhibit two types of uncertainty:
- Tuple existence: captures the uncertainty about whether a tuple exists in the database.
- Attribute value: captures the uncertainty about the value of an attribute.
Equivalence between probabilistic databases and directed graphical models: nodes in the graph denote the random variables; edges correspond to the dependencies between the random variables. The dependencies are quantified by a conditional probability distribution (CPD) for each node that describes how the value of that node depends on the values of its parents.

Slide 8 / 84 Approach: How the Uncertainties Are Expressed
Tuple existence uncertainty can be modeled with a binary random variable for each tuple that takes value 1 if the tuple belongs to the database and 0 otherwise. Attribute value uncertainty can be modeled with a random variable representing the value of the attribute; a pdf on that random variable can capture any form of uncertainty associated with the attribute.
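To make this concrete, here is a minimal Python sketch of one such tuple; the dict layout and the names tuple_t5, exists_prob, and attr_pdf are illustrative assumptions, not the paper's data structures:

    # A probabilistic tuple with tuple-existence uncertainty (a binary random
    # variable) and attribute-value uncertainty (a discrete pdf over values).
    tuple_t5 = {
        "time": 5,
        "exists_prob": 0.9,                           # P(tuple is in the database)
        "attr_pdf": {"occupied": 0.7, "empty": 0.3},  # pdf over the attribute value
    }

    # Probability that the tuple exists AND the nest is occupied at time 5,
    # assuming (only in this toy example) existence and value are independent.
    p = tuple_t5["exists_prob"] * tuple_t5["attr_pdf"]["occupied"]
    print(p)  # 0.63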

Slide 9 / 84 Approach: System Description (figure)

Slide 10 / 84 Approach: System Description
Input: a query over probabilistic streams. Output: the query's results. Steps:
1. Convert the query to a probabilistic sequence algebra expression.
2. Construct the query plan by instantiating each of the operators with the schemas of their input sequences.
3. Each operator executes its schema routine and computes its output schema.
4. Check the input to the projection and determine whether the projection operator is safe.
5. Optimize the query plan.

Slide 11 / 84 Markov Sequence
A probabilistic sequence S^p is defined over a set of named attributes S = {V_1, V_2, ..., V_k}, also called its schema, where each V_i is a discrete random variable. A probabilistic sequence is equivalent to a set of deterministic sequences, where each deterministic sequence is obtained by assigning a value to each random variable. Two operators convert a probabilistic sequence to a deterministic sequence:
- MAP: returns the possible sequence that has the highest probability.
- ML: returns a sequence of items, each of which has the highest probability for its time instant.

Slide 12 / 84 Markov Sequence
A Markov sequence is a special case of a probabilistic sequence: it is completely determined by specifying the successive joint distributions p(V_t, V_{t+1}) for all time instants. Markov sequences are represented efficiently using a combination of the schema graph and the clique list:
- Schema graph: a graphical representation of the two-step joint distribution that repeats continuously.
- Clique list: the set of direct dependencies present between successive sets of random variables.
(Figures: schema graph and clique list.)
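For intuition, a toy Python sketch of a single-attribute Markov sequence specified by a repeating two-step joint p(V_t, V_{t+1}); the numbers and the dict representation are assumptions made for illustration:

    # joint[(a, b)] = p(V_t = a, V_{t+1} = b); the same joint repeats at every step.
    joint = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

    def marginal(joint, axis):
        """Marginalize the two-step joint onto V_t (axis=0) or V_{t+1} (axis=1)."""
        m = {}
        for (a, b), p in joint.items():
            key = a if axis == 0 else b
            m[key] = m.get(key, 0.0) + p
        return m

    def sequence_prob(joint, values):
        """p(v_1, ..., v_n) = p(v_1) * product over t of p(v_{t+1} | v_t)."""
        p_vt = marginal(joint, axis=0)
        prob = p_vt[values[0]]
        for a, b in zip(values, values[1:]):
            prob *= joint[(a, b)] / p_vt[a]  # the conditional p(b | a)
        return prob

    # e.g. the Motivation query: P(nest occupied, encoded as 1, on 3 consecutive days)
    print(sequence_prob(joint, [1, 1, 1]))  # 0.225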

Slide 13 / 84 Formal Semantics
Possible worlds semantics (with a small modification for sequences). Operators: select, project, join, aggregate, windowing, plus MAP and ML, which convert a Markov sequence into a deterministic sequence. The set of Markov sequences is not closed under these operators, i.e., some operators (projection, windowing) can return non-Markovian sequences. An operator Op is safe for input schema I if Op(I) is a Markov sequence.

Slide 14 / 84 Operator Design
Operators treat the input Markov sequence as a data stream, operating on one tuple (an array of numbers) at a time, and produce the output in the same fashion (passed to the next operator). Each operator implements two high-level routines:
- Schema routine: computes the output schema from the input schema.
- get_next() routine: operates on each of the input data tuples to compute the output tuples.
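A minimal sketch of this two-routine interface (the class and method names are illustrative; the system's actual API may differ):

    class StreamOperator:
        def schema(self, input_schema):
            """Compute and return the output schema from the input schema
            (run once, at plan construction time)."""
            raise NotImplementedError

        def get_next(self):
            """Return the next output tuple (one time step's CPDs), pulling
            tuples from the child operator as needed, in classic iterator style."""
            raise NotImplementedError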

Slide 15 / 84 Operator Design: Selection
Schema routine steps:
1. Start with the DGM corresponding to the input schema.
2. Add a new node corresponding to the exists variable to both time steps of the DGM.
3. Connect this node through directed edges to the variables that appear in the selection predicate (e.g., the predicate X > Y).
4. Update the clique list of the schema to include the newly created dependencies.
In general: the schema routine adds a new boolean random variable to each slice. Selection is always safe.

Slide 16 / 84 get_next() routine:  determine the CPD of the newly created node  add it to the input tuple’s CPD list  return the new tuple The algorithm does not change the Markovian property of the sequence Safe operator. Operator Design Selection

Slide 17 / 84 Steps:  Remove the nodes that are not in the projection list - Eliminate operation on the graphical model- Update clique list with eliminations  Determine if new edges need to be added to the schema graph Operator Design Projection Generally: Eliminate all variables not of interest. Unsafe for schemas Schema Routine

Slide 18 / 84 get_next() routine: perform the actual variable elimination procedure to eliminate the nodes that are not required Projection is not safe for all input sequences, and in some cases, even if the input is a Markov Sequence, the output may not be a Markov sequence. Operator Design Projection

Slide 19 / 84 Operator Design: Join
Schema routine: concatenate the schemas of the two sequences to determine the output schema (combine the schema graphs and concatenate the clique lists). get_next() routine: concatenate the CPD lists of the two tuples whenever both tuples have the same time value. A join can be computed incrementally and is always safe.
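A rough sketch of the join get_next() as a merge on time values (the tuple layout is a hypothetical simplification):

    # Advance whichever input is behind; when the time values match, emit the
    # concatenation of the two tuples' CPD lists.
    def join_get_next(left_iter, right_iter):
        try:
            l, r = next(left_iter), next(right_iter)
            while True:
                if l["time"] == r["time"]:
                    yield {"time": l["time"], "cpds": l["cpds"] + r["cpds"]}
                    l, r = next(left_iter), next(right_iter)
                elif l["time"] < r["time"]:
                    l = next(left_iter)
                else:
                    r = next(right_iter)
        except StopIteration:
            return  # one input is exhausted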

Slide 20 / 84 Operator Design: Aggregation
Schema routine: the output schema is a single attribute corresponding to the aggregate. get_next() routine:
- Case 1, no selection predicates (G_i = Agg(X_1, X_2, ..., X_i)): the dotted variables are added to the DGM, the boxed nodes are eliminated, and p(X_i, G_i) is computed as input tuples arrive. At the end of the sequence we have p(X_n, G_n), from which p(G_n) is obtained by eliminating X_n.
- Case 2, with selection predicates: a value X_i contributes to the aggregate only if E_i (the exists attribute) is 1, and not otherwise.
Aggregation is always a safe operator.
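A toy sketch of the incremental idea in Case 1, using MAX over a single-attribute chain; the transition table and numbers are assumptions for illustration:

    # Carry the joint p(X_i, G_i), where G_i = max(X_1, ..., X_i), and fold in
    # one transition p(X_{i+1} | X_i) per get_next() call.
    def agg_step(joint_xg, trans):
        """joint_xg[(x, g)] = p(X_i=x, G_i=g); trans[(x, x2)] = p(X_{i+1}=x2 | X_i=x)."""
        out = {}
        for (x, g), p in joint_xg.items():
            for (x0, x2), pt in trans.items():
                if x0 == x:
                    key = (x2, max(g, x2))
                    out[key] = out.get(key, 0.0) + p * pt
        return out

    joint = {(0, 0): 0.6, (1, 1): 0.4}  # p(X_1, G_1); note G_1 = X_1
    trans = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7}
    for _ in range(3):  # fold in three more time steps
        joint = agg_step(joint, trans)
    # Final answer: eliminate X_n to obtain p(G_n).
    p_g = {}
    for (x, g), p in joint.items():
        p_g[g] = p_g.get(g, 0.0) + p
    print(p_g)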

Slide 21 / 84 Operator Design: Sliding Window Aggregates
A sliding window aggregate query asks to compute the aggregate values over a window that shifts over the stream; it is characterized by the length of the window, the desired shift, and the type of aggregate. In the DGM for a sliding window aggregate, G_i denotes the aggregates to compute; after eliminating the X_i variables, a clique list is formed over the G_i variables. The aggregate value for a sliding window influences the aggregates for all of the future windows in the stream, so this is an unsafe operator.

Slide 22 / 84 Operator Design: Sliding Window Aggregates
Approximation:
1. Ignore the dependencies between the aggregate values produced at different time instants.
2. Compute the distribution over each aggregate value independently, by splitting the sliding window DGM into separate graphical models (one for each window), running inference on each of them separately, and computing the results.
The unmarked nodes are intermediate aggregates.

Slide 23 / 84 Operator Design: Tumbling Window Aggregates
Here the length of the sliding window is equal to its shift, so exact answers can be computed in a few cases. For tumbling window aggregates, only the boxed nodes are eliminated to obtain the Markov sequence; eliminating X_3 and X_5 is postponed to a later projection.

Slide 24 / 84 Operator Design: Pattern
A pattern is a list of predicates on the attribute values, with the ordering of the predicates defining a temporal order on the sequence. Example: (A = 3; B > 5; A < 3) is a pattern that looks for a sequence of time instants such that the value of attribute A is 3 in the first instant, B has value more than 5 in the next, and A has value less than 3 in the one after. To compute the probability of a consecutive pattern, compute the product of the corresponding conditional distribution functions. The user specifies a threshold parameter: if we want a pattern with probability greater than 0.7, then each of the contributing CPDs must be at least 0.7.
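A minimal sketch of the product-with-threshold computation (toy probabilities; the early exit is the point of the threshold):

    # The pattern probability is the product of per-step conditional probabilities.
    # Since every factor is <= 1, a single factor below the threshold already
    # disqualifies the pattern, so we can stop early.
    def pattern_prob(cond_probs, threshold):
        prob = 1.0
        for p in cond_probs:
            if p < threshold:
                return None  # the pattern can no longer reach the threshold
            prob *= p
        return prob if prob >= threshold else None

    # e.g. P(A=3), P(B>5 | A=3), P(A<3 | B>5) at three consecutive instants
    print(pattern_prob([0.9, 0.85, 0.8], threshold=0.7))  # 0.612 < 0.7, so None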

Slide 25 / 84 Operator Design: MAP and ML
The MAP operator takes in a Markov sequence and returns a deterministic sequence. It is usually the last operator in the query plan, and hence it does not have a schema routine; its get_next() routine uses the dynamic programming approach of Viterbi's algorithm (Viterbi-style dynamic programming). ML: at each time step, compute the probability distribution for each time instant from each tuple; based on this, eliminate the variables that are not required and determine the most likely values for the remaining variables, i.e., simply marginalize the joint distribution.
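A toy Viterbi-style sketch of the MAP computation over a single-attribute chain; the initial distribution and transition table are illustrative assumptions:

    # Dynamic programming: for each value v, keep the highest-probability
    # sequence ending in v, then extend it one transition at a time.
    def viterbi(init, trans, n_steps):
        """init: {v: p(V_1=v)}; trans: {(v, v2): p(V_{t+1}=v2 | V_t=v)}."""
        best = {v: (p, [v]) for v, p in init.items()}
        for _ in range(n_steps - 1):
            nxt = {}
            for v, (p, path) in best.items():
                for (v0, v2), pt in trans.items():
                    if v0 == v:
                        cand = (p * pt, path + [v2])
                        if v2 not in nxt or cand[0] > nxt[v2][0]:
                            nxt[v2] = cand
            best = nxt
        return max(best.values())  # (probability, most likely sequence)

    init = {0: 0.6, 1: 0.4}
    trans = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7}
    print(viterbi(init, trans, 4))  # sequence [0, 0, 0, 0] with probability ~0.307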

Slide 26 / 84 Query Evaluation: Query Syntax
1. The user has the choice between the MAP and ML operators for converting the final probabilistic answer to a deterministic answer.
2. Sliding window parameters can be specified.

Slide 27 / 84 Query Planning Algorithm
Unsafe operators: projection; the tumbling window operator reduces to a projection; for the sliding window operator, the approximate method is used. Query planning amounts to determining the correct position for the projection. Strategy: pull the projection operator up until it is safe. If there is no safe position for the projection, check its parent: if it is ML, a safe plan can be determined; if it is MAP, a safe plan cannot be determined.

Slide 28 / 84 Query Planning Algorithm
1. For a given query, convert it to a probabilistic sequence algebra expression. Q0: SELECT_MAP MAX(X) FROM SEQ WHERE Y < 20 becomes MAP(G^p(Π^p_X(σ^p_{Y<20} SEQ))).
2. Each operator then executes its schema routine and computes its output schema, which is used as the input schema for the next operator in the chain.
3. Check the input to the projection and determine whether the projection operator is safe.
4. If a projection-input pair is not safe, pull the projection operator up through the aggregate and windowing operators and continue with the rest of the query plan: MAP(Π^p_{MAX(X)}(G^p(σ^p_{Y<20} SEQ))).
5. If the operator after the projection is ML, determine the exact answer. If it is a MAP operator, replace both the projection and the MAP operator with the approximate-MAP operator.

Slide 29 / 84 Results
A Markov sequence generator produces Markov sequences for a given input schema. Capturing and reasoning about temporal correlations is critical for obtaining accurate query answers. The operators can process up to 500 tuples per second. The get_next() query processing framework, which exploits the structure in Markov sequences, is much more efficient than previous generic approaches.

Slide 30 / 84 Results
(Figure: the % error in query processing for various operators when temporal correlations are ignored.) Q1: SELECT_MAP Agg(A) FROM S. Q3: SELECT_MAP MAX(A) FROM S[size, size].

Slide 31 / 84 Future Work
Improve the scalability of the system by resorting to approximations with guarantees.

Slide 32 / 84 References
B. Kanagal, A. Deshpande: "Efficient Query Evaluation over Temporally Correlated Probabilistic Streams." ICDE 2009.