LAHAR: Extracting Events from Probabilistic Streams Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.

Presentation transcript:

LAHAR: Extracting Events from Probabilistic Streams Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington

What is a Lahar? (Lahar -- SIGMOD, Christopher Re)
This is a lahar: May 18, 1980, ~8:27am… and a few minutes later. It's a massive, fast stream of dirt(y data).
Our system, Lahar, processes queries on massive, dirty streams of data.

Event Queries
• Motivating app: RFID
• Event queries as in Cayuga, SASE and Snoop
• Complex sequences using projections, predicates, …
[Figure: diagram with labels A, B, C, D, E]
Example: Joe entered office 422 at t=8.
Query: “Alert when Joe enters 422”, i.e. Joe is outside 422, then inside 422.

Challenges: Tracking Joe’s Location
[Figure: 6th floor of the CS building; the blue ring is Joe’s location; antennas shown]
Two problems:
1. Missed readings
2. Granularity mismatch
• Proposal: infer location, keep probabilities, and query with Lahar
• Model-based view [Deshpande et al] of an HMM
Lahar retains probabilities, achieves higher quality (precision/recall) and is still efficient.

Outline
• RFID streams to probabilistic streams
• Lahar queries on probabilistic streams
• Query algorithms: Regular and Extended Regular
• Experiments

Tracking Joe’s Location
[Figure: 6th floor of the CS building; the blue ring is ground truth; antennas shown]

Probabilities via particle filter
[Figure: 6th floor of the CS building; the blue ring is ground truth; antennas shown]
Each orange particle is a guess of Joe’s location. Particles guess many locations per timestep, so the data are uncertain.

From particles to a probabilistic stream
At(tag, loc):
Tag  t  Loc    P
Joe  7  Hall3  0.4
Joe  7  Hall4  0.2
Joe  8  Hall3  0.2
Joe  8  Hall4  0.2
Sue  7  …      …
Query the particle filter output via At, a model-based view.
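A minimal sketch of how a particle cloud might be turned into one timestep of the At(tag, loc) stream. It assumes equally weighted particles that have already been mapped to discrete locations; the function name and the sample particles are hypothetical and not taken from the talk.

```python
from collections import Counter

def particles_to_distribution(particle_locations):
    """Estimate P(loc) as the fraction of particles at each location.

    Assumption: particles are equally weighted and already discretized.
    """
    counts = Counter(particle_locations)
    total = len(particle_locations)
    return {loc: n / total for loc, n in counts.items()}

# Hypothetical particle cloud for Joe at one timestep (10 particles).
joe_particles = ["Hall3"] * 4 + ["Hall4"] * 2 + ["422"] * 4
joe_dist = particles_to_distribution(joe_particles)
```

Each key of `joe_dist` would become one row of the stream for that tag and timestep. A real particle filter would typically carry per-particle weights, which this sketch ignores.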

Semantics of the Model
At(tag, loc):
Tag  t  Loc    P
Joe  7  Hall3  0.4
Joe  7  Hall4  0.2
Joe  8  Hall3  0.2
Joe  8  Hall4  0.2
Sue  7  …      …
One possible stream (world):
Tag  t  Loc
Joe  7  Hall4
Joe  8  422
Sue  7  …
Prob = 0.2 * 0.6 * …
A query q returns the probability that q is true at each time t, taken over the possible streams (worlds).
“Joe enters 422” at t=8: the probability that Joe is outside 422 (in Hall3 or Hall4) times the probability that he is then in 422: (0.4 + 0.2) * 0.6 = 0.36
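The possible-worlds semantics can be sketched by brute-force enumeration. This is an illustrative sketch only: it assumes timesteps are independent, the Hall3 and Hall4 values follow the slide, and the probabilities for room 422 (0.4 at t=7, 0.6 at t=8) are assumptions chosen so that each timestep sums to 1. The function name `prob_enters` is hypothetical.

```python
from itertools import product

# Per-timestep location distributions for Joe (assumed independent over time).
# Hall3/Hall4 values follow the slide; the "422" entries are assumptions.
joe = {
    7: {"Hall3": 0.4, "Hall4": 0.2, "422": 0.4},
    8: {"Hall3": 0.2, "Hall4": 0.2, "422": 0.6},
}

def prob_enters(stream, room, t):
    """Sum the probability of every possible world in which the tag is
    outside `room` at t-1 and inside `room` at t."""
    times = sorted(stream)
    total = 0.0
    # Each world picks one (location, probability) pair per timestep.
    for world in product(*(stream[u].items() for u in times)):
        locs = {u: loc for u, (loc, _) in zip(times, world)}
        p_world = 1.0
        for _, p_loc in world:
            p_world *= p_loc
        if locs[t - 1] != room and locs[t] == room:
            total += p_world
    return total

p_enter = prob_enters(joe, "422", 8)
```

The enumeration visits every world, so its cost grows exponentially with the number of timesteps, which is the naive approach the later slides set out to avoid.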

Lahar Queries by Example
Inspired by Cayuga [Demers et al 2006, White et al 2007]
• Alert when Joe is in hallway 4 and later in office 422
  [Query diagram: Joe in Hall4, Joe in 422]
• Alert when Joe is in hallway 4, and immediately in office 422
  [Query diagram: Joe in Hall4, Joe in 422]
Challenge with probabilities: the naïve approach is exponential, and in general this is unavoidable (#P).

A hierarchy of Lahar queries
• Regular queries (efficient, streamable), e.g. “Alert when Joe enters 422”
• Extended regular (efficient, streamable), e.g. “Alert when anyone enters 422”
• Safe (efficient, but not streamable)
• Unsafe (inefficient)

Review: A non-probabilistic example
Query: “Alert me when Joe enters 422”
[Automaton diagram: Joe in Hall4, Joe in …, Final]
Stream 1:
Tag  T  Loc
Joe  7  Hall4
Joe  8  422
States: {} {1} {2}, accept at t = 8
Stream 2:
Tag  T  Loc
Joe  7  Hall4
Joe  8  423
States: {} {1} {}

… now with probabilities
Query: “Alert me when Joe enters 422”
[Automaton diagram: Joe in Hall4, Joe in …, Final]
Input:
Tag  T  Loc    P
Joe  7  Hall4  0.5
Joe  …
Distribution on states:
{} 1.0
{} 0.5, {1} 0.5
{} 0.65, {1} 0.05, {2} 0.3
Accept at t=8 with p = 0.3
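The idea of pushing a distribution over automaton states through the stream can be sketched as follows. This is an illustration under stated assumptions, not the paper's compiled automaton: the three-state automaton, the `step` transition rule, the "other" location and the t=8 input probabilities are all hypothetical, and timesteps are assumed independent. Only the t=7 value (Hall4 with 0.5) and the final accept probability of 0.3 come from the slide; the other state probabilities will not match it.

```python
from collections import defaultdict

# Hypothetical automaton for "Joe in Hall4, then immediately in 422".
START, SAW_HALL4, ACCEPT = 0, 1, 2

def step(state, loc):
    """Deterministic transition on one observed location."""
    if state == SAW_HALL4 and loc == "422":
        return ACCEPT
    if loc == "Hall4":
        return SAW_HALL4
    return START

def advance(dist, loc_probs):
    """Push a distribution over states through one uncertain timestep."""
    nxt = defaultdict(float)
    for state, p_state in dist.items():
        for loc, p_loc in loc_probs.items():
            nxt[step(state, loc)] += p_state * p_loc
    return dict(nxt)

# t=7 follows the slide (Hall4 with 0.5); the t=8 values are assumptions.
stream = [
    {"Hall4": 0.5, "other": 0.5},
    {"422": 0.6, "Hall4": 0.1, "other": 0.3},
]

dist = {START: 1.0}
accept_probs = []
for loc_probs in stream:
    dist = advance(dist, loc_probs)
    accept_probs.append(dist.get(ACCEPT, 0.0))
```

The distribution never holds more entries than the automaton has states, so the memory used per timestep depends on the query and not on the length of the stream.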

Lies in the preceding slides… (technical details)
• Richer predication: “Alert when Joe enters any office”
• Translate query and input into an alphabet
  [Automaton diagram: Joe in Hall4, Joe in …, Final]
• Key technical detail: the alphabet is small in the data
  • Streamable
  • See paper for compilation

Extension to Extended regular
“Alert when anyone enters 422”
(Obs 1) Each per-person query is regular.
(Obs 2) The per-person queries depend on disjoint sets of events, hence they are probabilistically independent.
Algorithm:
• (Obs 1) suggests running one automaton for each person
• (Obs 2) suggests multiplying per-person probabilities to get the probability that any is true
Space = O(# persons), not O(# timesteps), so the query can be streamed.
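A short sketch of the combination step implied by (Obs 2): if the per-person events are independent, the probability that at least one is true is one minus the product of the probabilities that each is false. The function name and the input values are hypothetical.

```python
from math import prod

def prob_any(per_person_probs):
    """P(at least one event is true), assuming the events are independent."""
    return 1.0 - prod(1.0 - p for p in per_person_probs)

# Hypothetical per-person probabilities of "enters 422" at the current timestep.
p_any = prob_any([0.3, 0.5])
```

In a streaming setting, each per-person probability would come from that person's own automaton, so the state kept between timesteps is one small distribution per person.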

Summary of Contributions
• Regular queries (efficient, streamable)
  • Compiled to an automaton, streaming, O(1) space
• Extended regular (efficient, streamable)
  • Streaming with O(m) space, where m is the number of persons
• See paper for Markovian correlations, more sophisticated predication, complete compilation and static analysis algorithms
• Safe (efficient, but not streamable)
• Unsafe (inefficient, most #P-hard)

Experimental Setup
• Quality: how are precision and recall (P/R) affected by keeping probabilities?
• 52 objects, 352 locations, 10k sq. ft.
• 2 x 30 min traces with a 10 min break in between
• Participants marked down true locations
• Query: “Alert when anyone enters a coffee room”
• Baseline: Most Likely Estimate (MLE)
  • For each timestep and each person, take the most likely location
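A small sketch of the MLE baseline as described above: keep only the most likely location per timestep, then evaluate the query on the resulting deterministic trace. The stream values and helper names are hypothetical, and the probabilistic comparison assumes independent timesteps.

```python
def mle_trace(stream):
    """Keep only the most likely location at each timestep."""
    return [max(dist, key=dist.get) for dist in stream]

def enters(trace, room):
    """True if the deterministic trace moves from outside `room` to inside it."""
    return any(a != room and b == room for a, b in zip(trace, trace[1:]))

# Hypothetical two-timestep stream for one person.
stream = [
    {"Hall4": 0.6, "422": 0.4},
    {"Hall4": 0.55, "422": 0.45},
]

trace = mle_trace(stream)
mle_alert = enters(trace, "422")

# Keeping probabilities instead (independence assumed):
# P(outside 422 at step 0) * P(in 422 at step 1).
p_enter = stream[0]["Hall4"] * stream[1]["422"]
```

Here the MLE trace never leaves Hall4, so the baseline reports no event, while the probabilistic evaluation still assigns the event a non-trivial probability that a threshold can act on.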

Quality: Realtime – Improve over MLE?
• Declare an event “true” if its Pr > threshold
• Vary the threshold
[Plots: Precision, Recall, F1]
10% improvement in F1
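The thresholding step and the quality metrics can be sketched as below. The event probabilities and ground-truth labels are made-up illustration data, not the experiment's measurements, and the function name is hypothetical.

```python
def precision_recall_f1(event_probs, truth, threshold):
    """Declare an event true if its probability exceeds `threshold`,
    then score the declarations against ground truth."""
    predicted = [p > threshold for p in event_probs]
    tp = sum(1 for y_hat, y in zip(predicted, truth) if y_hat and y)
    fp = sum(1 for y_hat, y in zip(predicted, truth) if y_hat and not y)
    fn = sum(1 for y_hat, y in zip(predicted, truth) if not y_hat and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical query outputs and ground truth for four candidate events.
event_probs = [0.9, 0.6, 0.4, 0.2]
truth = [True, True, False, True]

precision, recall, f1 = precision_recall_f1(event_probs, truth, threshold=0.5)
```

Sweeping `threshold` over a range of values and recording the three numbers at each setting would give curves of the kind the slide plots.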

Performance: Is the cost too high?
Synthetic data, same query

Related Work
• Event queries (deterministic): Cayuga, SASE, SnoopIB
• Model-based views: BBQ; recently, Kanagal et al, ICDE 08
• Probabilistic databases: Mystiq, Trio, MayBMS, Maryland, Purdue, MCDB
• Particle filters on HMMs: Doucet, Godsill

Conclusion
• Showed Lahar
  • Processed output of several inference tasks (HMMs)
  • Applies more generally than just RFID
• Quality (F1) gains by keeping probabilities
• Performance usable in real time
  • Lots of concurrent tags
  • No indexing!

Overview of Regular Query Algorithm
1. Compile an event query q into:
   1. An automaton (A) over a language L
   2. A mapping (M) from events to subsets of L
2. Runtime: the input is a set of events E
   1. Map E into subsets of L via M
   2. Maintain the set of possible states of A
Deterministic vs. probabilistic: step 1 stays the same; in step 2 the set of states becomes a distribution over states.
The size of the distribution depends only on the query q.
NB: example to follow. For details, see paper.

Why are ER queries hard?
• Regular queries ~ regular expressions
  • Mapping is non-trivial
  • Inspired by Cayuga [Demers et al. 06]
• Queries have #P combined complexity
  • Encode mDNF as a regular expression
  • Intuition: n-sized automaton leads to …
• Extended regular ~ 1 NFA per person
  • k persons implies O(k)-size automaton
  • Exponential cost
When the query is extended regular (ER), the blowup can be avoided.

Regular and Extended Regular
• A query is regular if no variable is shared between subgoals
• A query is extended regular if any variable shared by two subgoals is shared by all subgoals
(In the slide's example query, p is shared between subgoals.)
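The two definitions above can be checked mechanically. The sketch below assumes a query is represented simply as a list of variable sets, one per subgoal; that representation and the function name `classify` are hypothetical, and queries outside both classes are only reported as "neither".

```python
from collections import Counter

def classify(subgoal_vars):
    """Classify a query from the variables used by each of its subgoals."""
    # Count in how many subgoals each variable occurs.
    counts = Counter(v for vars_ in subgoal_vars for v in set(vars_))
    shared = [v for v, c in counts.items() if c >= 2]
    if not shared:
        return "regular"
    if all(counts[v] == len(subgoal_vars) for v in shared):
        return "extended regular"
    return "neither"

# "Joe in Hall4, then Joe in 422": constants only, no shared variable.
q_joe = classify([set(), set()])
# "anyone p in Hall4, then p in 422": p is shared by all subgoals.
q_anyone = classify([{"p"}, {"p"}])
# p shared by two of three subgoals only.
q_other = classify([{"p"}, {"p"}, {"q"}])
```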

Correlations

Sequencing by example
• Sequencing is parameterized [Cayuga]
[Figure: timeline of events]
Semicolon means “the next event among those that match the next goal”. Semicolon is not “after”.

Compilation by example
• Each goal “corresponds” to two letters:
  • move (m): the query should advance
  • accept (a): the next subgoal accepts
Any other event maps to the empty set.
[Automaton diagram with a Final state; labels: “Does not contain”, “Does contain”]

Subtle example..
• What about: …
Any other event maps to the empty set.
[Automaton diagram with a Final state; labels: “Does not contain”, “Does contain”]

CUT II

Motivating Apps
• RFID apps
• Diary and Active Calendar application
  • Alert if I go to a database meeting
• Supply chain
  • Alert if Mach 3 razors are being stolen
  • Many independent HMMs
• Elder care [Intel/UW]
  • Alert if elder takes their medicine with water
  • Activity recognition
• Financial applications on predictive HMM
  • Alert if head-and-shoulders market

Compile Select and Filter
• Intuition: each goal maps to two letters:
  • match (m): matches filter
  • accept (a): accepted by select
[Automaton diagram with a Final state; labels: “Does not contain”, “Does contain”]
The language and automaton are the same for both queries.

Wrinkle in the language: Filter v. Selection
“Alert next time Joe is in 502 after he is in 501”
“Alert if the next place Joe is in after 501 is 502”
[Figure: timeline of At events, marked Yes / No]

Recap of Algorithms
• Regular queries
  • Compiled them to an NFA, then used image
  • Data complexity O(1)
• Extended regular
  • Several regular queries multiplied together
  • Depends on the number of distinct people in the data, not the number of time steps

Quality: Archived – Improve over Viterbi?
• Smoothing v. Viterbi (MAP)
• Lahar tracks Markovian correlations
• Viterbi leverages correlations for the MAP estimate
[Plots: Precision, Recall, F1]
Approx. 30% gain in F1