LAHAR: Extracting Events from Probabilistic Streams. Chris Re, Julie Letchner, Magdalena Balazinska, and Dan Suciu. University of Washington.
What is a Lahar? (Lahar -- SIGMOD Christopher Re) This is a lahar: May 18, 1980, ~8:27am … a few minutes later. It’s a massive, fast stream of dirt(y data). Our system, Lahar, processes queries on massive, dirty streams of data.
Event Queries. Motivating app: RFID. Event queries as in Cayuga, SASE, and Snoop: complex sequences using projections, predicates, … Query: “Alert when Joe enters 422”, i.e., Joe is outside 422 and then inside 422. Result: Joe entered office 422 at t=8.
Challenges: Tracking Joe’s Location. [Figure: 6th floor of the CS building with antennas; the blue ring is Joe’s location.]
Challenges: Tracking Joe’s Location. [Figure: 6th floor of the CS building with antennas; the blue ring is Joe’s location.] Two problems: (1) missed readings; (2) granularity mismatch. Proposal: infer the location, keep the probabilities, and query with Lahar, using a model-based view [Deshpande et al] of an HMM. Lahar retains probabilities, achieves higher quality (precision/recall), and is still efficient.
Outline: (1) RFID streams to probabilistic streams; (2) Lahar queries on probabilistic streams; (3) Query algorithms: regular and extended regular; (4) Experiments.
Tracking Joe’s Location. [Figure: 6th floor of the CS building with antennas; the blue ring is ground truth.]
Probabilities via particle filter. [Figure: 6th floor of the CS building with antennas; the blue ring is ground truth.] Each orange particle is a guess of Joe’s location. Particles guess many locations per timestep, so the data are uncertain.
From particles to a probabilistic stream. Query the particle filter output via At(tag,loc), a model-based view. At(tag,loc) rows (Tag, t, Loc, P): Joe: Hall3 0.4, Hall4 0.2; Joe: Hall3 0.2, Hall4 0.2; Sue, 7, …, …
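A minimal sketch of how particles might become one timestep of the At stream, assuming each particle is just a guessed location label and that a location's probability is the fraction of particles placed there. The function name `particles_to_marginal` and the sample particles are hypothetical; the system's actual view may weight particles differently.

```python
from collections import Counter

def particles_to_marginal(particles):
    """Return {location: probability} as the fraction of particles per location."""
    counts = Counter(particles)
    total = len(particles)
    return {loc: n / total for loc, n in counts.items()}

# Hypothetical particles for one tag at one timestep (not data from the slides).
particles = ["Hall3", "Hall3", "Hall4", "422", "422"]
marginal = particles_to_marginal(particles)
```

Each entry of `marginal` would correspond to one (Tag, t, Loc, P) row for that tag and timestep.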
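Semantics of the Model. The probabilistic stream At(tag,loc) (Tag, t, Loc, P: Joe: Hall3 0.4, Hall4 0.2; Joe: Hall3 0.2, Hall4 0.2; Sue, 7, …, …) defines a distribution over possible streams (worlds). One possible world (Tag, t, Loc): Joe, 7, Hall4; Joe, 8, 422; Sue, 7, …; its probability is 0.2 * 0.6 * … A query q returns the probability that q is true at each time t. For “Joe enters 422” at t=8: the probability that Joe is outside 422 (in Hall3 or Hall4) times the probability that he is then in 422, i.e., (0.4 + 0.2) * 0.6 = 0.36.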
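The possible-worlds semantics can be checked by brute force on this small example. The sketch below enumerates every world for Joe over two timesteps and sums the probability of the worlds where he is outside 422 and then inside it. It assumes the timesteps are independent, that the two groups of Joe rows belong to t=7 and t=8, and that the probability mass not shown on the slide sits on 422; these are assumptions, as are the names `marginals`, `enters_422`, and `query_probability`.

```python
from itertools import product

# Assumed marginals for Joe; the t values and the mass on "422" at t=7 are assumptions.
marginals = {
    7: {"Hall3": 0.4, "Hall4": 0.2, "422": 0.4},
    8: {"Hall3": 0.2, "Hall4": 0.2, "422": 0.6},
}

def enters_422(world):
    """True if Joe is outside 422 at the first timestep and inside at the second."""
    loc_before, loc_after = world
    return loc_before != "422" and loc_after == "422"

def query_probability(marginals, predicate):
    """Sum the probabilities of all possible worlds that satisfy the predicate."""
    times = sorted(marginals)
    total = 0.0
    for world in product(*(marginals[t] for t in times)):
        p = 1.0
        for t, loc in zip(times, world):
            p *= marginals[t][loc]  # independence across timesteps (assumption)
        if predicate(world):
            total += p
    return total

p_enter = query_probability(marginals, enters_422)
```

Enumeration is exponential in the number of timesteps, which is why it only serves as a reference semantics here.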
Outline: (1) RFID streams to probabilistic streams; (2) Lahar queries on probabilistic streams; (3) Query algorithms: regular and extended regular; (4) Experiments.
Lahar Queries by Example. Query: alert when Joe is in hallway 4 and later in office 422. Inspired by Cayuga [Demers et al 2006, White et al 2007].
Lahar Queries by Example. Query: alert when Joe is in hallway 4 and later in office 422. [Figure: automaton with edges labeled “Joe in Hall4” and “Joe in 422”.] Inspired by Cayuga [Demers et al 2006, White et al 2007].
Lahar Queries by Example. Query 1: alert when Joe is in hallway 4 and later in office 422. [Figure: automaton with edges labeled “Joe in Hall4” and “Joe in 422”.] Query 2: alert when Joe is in hallway 4 and immediately afterwards in office 422. Inspired by Cayuga [Demers et al 2006, White et al 2007].
Lahar Queries by Example. Query 1: alert when Joe is in hallway 4 and later in office 422. [Figure: automaton with edges labeled “Joe in Hall4” and “Joe in 422”.] Query 2: alert when Joe is in hallway 4 and immediately afterwards in office 422. [Figure: automaton with edges labeled “Joe in Hall4” and “Joe in 422”.] Inspired by Cayuga [Demers et al 2006, White et al 2007]. Challenge with probabilities: the naïve approach is exponential, and for some queries that cost is unavoidable (#P).
A hierarchy of Lahar queries. Regular queries (efficient, streamable): “Alert when Joe enters 422”. Extended regular queries (efficient, streamable): “Alert when anyone enters 422”.
A hierarchy of Lahar queries. Regular queries (efficient, streamable): “Alert when Joe enters 422”. Extended regular queries (efficient, streamable): “Alert when anyone enters 422”. Safe queries (efficient, but not streamable). Unsafe queries (inefficient).
Outline: (1) RFID streams to probabilistic streams; (2) Lahar queries on probabilistic streams; (3) Query algorithms: regular and extended regular; (4) Experiments.
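Review: A non-probabilistic example. Query: alert me when Joe enters 422. Stream 1 (Tag, T, Loc): Joe, 7, Hall 4; Joe, 8, 422; state sets visited: {}, {1}, {2}; accept at t = 8. Stream 2 (Tag, T, Loc): Joe, 7, Hall 4; Joe, 8, 423; state sets visited: {}, {1}, {}. [Figure: automaton with a final state; edge labels “Joe in Hall4” and “Joe in …”.]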
… now with probabilities. Query: alert me when Joe enters 422. Input (Tag, T, Loc, P): Joe, 7, Hall4, 0.5; Joe, … Distribution on states after each step: {} 1.0; then {} 0.5, {1} 0.5; then {} 0.65, {1} 0.05, {2} 0.3. Accept at t=8 with p = 0.3. [Figure: automaton with a final state; edge labels “Joe in Hall4” and “Joe in …”.]
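The state distributions on this slide can be reproduced with a small forward pass over the automaton. The sketch below is an illustration, not the paper's compiled automaton: the transition function `step`, the symbol names, and the per-timestep probabilities other than P(Hall4 at t=7) = 0.5 are assumptions, chosen so that the result matches the slide's numbers, and timesteps are treated as independent.

```python
from collections import defaultdict

def step(state, symbol):
    """Toy 'Joe enters 422' automaton over state sets '{}', '{1}', '{2}'."""
    if symbol == "hall4":
        return "{1}"              # Joe is in Hall4: ready to enter
    if symbol == "422" and state == "{1}":
        return "{2}"              # Hall4 then 422: accept
    return "{}"                   # anything else resets

def advance(dist, symbol_probs):
    """Push a distribution over states through one uncertain input symbol."""
    new = defaultdict(float)
    for state, p in dist.items():
        for symbol, q in symbol_probs.items():
            new[step(state, symbol)] += p * q
    return dict(new)

dist = {"{}": 1.0}
# Symbol probabilities per timestep; only 0.5 for Hall4 at t=7 is from the slide.
stream = [
    {"hall4": 0.5, "422": 0.0, "other": 0.5},    # t = 7
    {"hall4": 0.05, "422": 0.6, "other": 0.35},  # t = 8 (assumed values)
]
history = []
for symbol_probs in stream:
    dist = advance(dist, symbol_probs)
    history.append(dist)
```

The distribution has at most one entry per automaton state set, so its size depends on the query and not on the length of the stream.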
Lies in the preceding slides… (technical details). Richer predication: “Alert when Joe enters any office”. Translate the query and the input into an alphabet. Key technical detail: the alphabet is small in the data, so evaluation is streamable. See the paper for compilation. [Figure: automaton with a final state; edge labels “Joe in Hall4” and “Joe in …”.]
Extension to Extended Regular. Query: “Alert when anyone enters 422”.
Extension to Extended Regular. Query: “Alert when anyone enters 422”. (Obs 1) For each person, the query is regular. (Obs 2) Different persons involve disjoint sets of events, hence they are probabilistically independent. Algorithm: (Obs 1) suggests running an automaton for each person; (Obs 2) suggests multiplying to get the probability that any is true. Space = O(# persons), not # timesteps, so the query can be streamed.
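The "multiply" step can be made concrete: if the per-person events are independent, the probability that at least one person satisfies the query is one minus the product of the per-person failure probabilities. The sketch below assumes the per-person acceptance probabilities have already been computed by a per-person automaton; the name `prob_any` and the sample values are hypothetical.

```python
from math import prod

def prob_any(per_person_probs):
    """P(at least one person matches), assuming the persons are independent."""
    return 1.0 - prod(1.0 - p for p in per_person_probs)

# Hypothetical per-person probabilities of "enters 422" at the current timestep.
p_anyone = prob_any([0.3, 0.5])
```

Only one number per person is kept at each timestep, which is consistent with space O(# persons).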
Summary of Contributions. Regular queries (efficient, streamable): compiled to an automaton, streaming, O(1) space. Extended regular queries (efficient, streamable): streaming with O(m) space, where m is the number of persons. Safe queries (efficient, but not streamable). Unsafe queries (inefficient, most #P-hard). See the paper for Markovian correlations, more sophisticated predication, and the complete compilation and static analysis algorithms.
Outline: (1) RFID streams to probabilistic streams; (2) Lahar queries on probabilistic streams; (3) Query algorithms: regular and extended regular; (4) Experiments.
Experimental Setup. Quality: how is P/R affected by keeping probabilities? 52 objects, 352 locations, 10k sq. ft. 2 x 30 min trace with a 10 min break in between. Participants marked down true locations.
Experimental Setup. Quality: how is P/R affected by keeping probabilities? 52 objects, 352 locations, 10k sq. ft. 2 x 30 min trace with a 10 min break in between. Participants marked down true locations. Query: “Alert when anyone enters a coffee room”. Baseline: Most Likely Estimate (MLE), i.e., for each timestep and each person, take the most likely location.
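As a sketch of what the MLE baseline amounts to, the snippet below collapses a probabilistic stream to a deterministic one by keeping the single most likely location for each (person, timestep) pair. The dictionary layout, the name `mle_stream`, and the sample distribution are assumptions for illustration.

```python
def mle_stream(prob_stream):
    """Map {(tag, t): {loc: prob}} to {(tag, t): most likely loc}."""
    return {key: max(dist, key=dist.get) for key, dist in prob_stream.items()}

# Hypothetical probabilistic stream with one entry.
prob_stream = {("Joe", 8): {"Hall3": 0.2, "Hall4": 0.2, "422": 0.6}}
deterministic = mle_stream(prob_stream)
```

After this step the probabilities are discarded, so a deterministic event query would see only one location per person and timestep.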
Quality: Realtime – Improve over MLE? Declare an event “true” if its Pr > threshold; vary the threshold. [Plot: precision, recall, F1.] 10% improvement in F1.
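A minimal sketch of the thresholding step and the quality metrics, assuming events are identified by the timestep at which they fire and that ground truth is a set of such timesteps. The function name and the sample numbers are hypothetical and are not the experiment's data.

```python
def precision_recall_f1(event_probs, truth, threshold):
    """Declare an event true if its probability exceeds the threshold, then score it."""
    predicted = {e for e, p in event_probs.items() if p > threshold}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical query output {timestep: probability} and ground-truth timesteps.
event_probs = {8: 0.3, 12: 0.7, 20: 0.1}
truth = {8, 12}
strict = precision_recall_f1(event_probs, truth, threshold=0.5)
loose = precision_recall_f1(event_probs, truth, threshold=0.2)
```

Lowering the threshold from 0.5 to 0.2 in this toy input recovers the event at t=8 without admitting the spurious one at t=20.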
Performance: Is the cost too high? Synthetic data, same query.
Related Work. Event queries (deterministic): Cayuga, SASE, SnoopIB. Model-based views: BBQ; recently, Kanagal et al ICDE 08. Probabilistic databases: Mystiq, Trio, MayBMS, Maryland, Purdue, MCDB. Particle filters on HMMs: Doucet, Godsill.
Conclusion. Showed Lahar: it processes the output of several inference tasks (HMMs) and applies more generally than just RFID. Quality (F1) gains by keeping probabilities. Performance is usable in real time: lots of concurrent tags, no indexing!
Overview of Regular Query Algorithm. 1. Compile an event query q into (a) an automaton A over a language L and (b) a mapping M from events to subsets of L. 2. Runtime, where the input is a set of events E: (a) map E into subsets of L via M; (b) maintain the set of possible states of A. Deterministic to probabilistic: step (a) stays the same, and the set of states in step (b) becomes a distribution. The size of the distribution depends only on the query q. NB: example to follow. For details, see the paper.
Why are ER queries hard? Regular queries ~ regular expressions, but the mapping is non-trivial. Inspired by Cayuga [Demers et al. 06]. Queries have #P combined complexity: encode mDNF as a regular expression. Intuition: an n-sized automaton leads to exponential cost. Extended regular ~ one NFA per person, so k persons imply an O(k)-size automaton. When the query is ER, the blowup can be avoided.
Regular and Extended Regular. A query is regular if no variable is shared between subgoals. A query is extended regular if any variable shared by two subgoals is shared by all subgoals. [Example query: p is shared between subgoals.]
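The two definitions translate directly into a check on which variables the subgoals share. The sketch below assumes a query is represented simply as a list of variable sets, one per subgoal, with constants such as Joe or 422 omitted; that representation, the name `classify`, and the label "other" are hypothetical and not taken from the paper.

```python
def classify(subgoals):
    """Classify a query given the set of variable names used by each subgoal."""
    shared = set()
    for i, first in enumerate(subgoals):
        for second in subgoals[i + 1:]:
            shared |= first & second          # variables shared by some pair
    if not shared:
        return "regular"
    in_all = set.intersection(*subgoals)      # variables present in every subgoal
    if shared <= in_all:
        return "extended regular"
    return "other"                            # not covered by these two definitions

joe_enters = classify([set(), set()])               # only constants
anyone_enters = classify([{"p"}, {"p"}])            # p shared by all subgoals
mixed = classify([{"p", "x"}, {"p"}, {"x"}])        # x and p not shared by all
```

Here `joe_enters` is "regular", `anyone_enters` is "extended regular", and `mixed` falls outside both definitions.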
Correlations
Sequencing by example. Sequencing is parameterized [Cayuga]. The semicolon means “the next event among those that match the next goal”; it does not mean “after”. [Figure: timeline.]
Compilation by example. Each goal “corresponds” to two letters: move (m), the query should advance; accept (a), the next subgoal accepts. Any other event maps to the empty set. [Figure: automaton with a final state; edge labels “does not contain” and “does contain”.]
Subtle example. What about: [query not recoverable]. Any other event maps to the empty set. [Figure: automaton with a final state; edge labels “does not contain” and “does contain”.]
Motivating Apps. RFID apps: diary and active calendar application (alert if I go to a database meeting). Supply chain: alert if Mach 3 razors are being stolen; many independent HMMs. Elder care [Intel/UW]: alert if an elder takes their medicine with water; activity recognition. Financial applications on a predictive HMM: alert on a head-and-shoulders market.
Compile Select and Filter. Intuition: a goal maps to two letters: match (m), the event matches the filter; accept (a), the event is accepted by the select. The language and automaton are the same for both queries. [Figure: automaton with a final state; edge labels “does not contain” and “does contain”.]
Wrinkle in the language: Filter v. Selection. Compare “Alert the next time Joe is in 502 after he is in 501” with “Alert if the next place Joe is in after 501 is 502”. [Figure: timeline of At events, with Yes and No outcomes for the two queries.]
Recap of Algorithms. Regular queries: compiled to an NFA, then evaluated using the image; data complexity O(1). Extended regular queries: several regular queries multiplied together; the cost depends on the number of distinct people in the data, not on the number of timesteps.
Quality: Archived – Improve over Viterbi? Smoothing v. Viterbi (MAP). Lahar keeps track of Markovian correlations; Viterbi leverages correlations for the MAP estimate. [Plot: precision, recall, F1.] Approximately 30% gain in F1.