PIRS: Query Verification on Data Streams. Ke Yi, Hong Kong University of Science and Technology; Feifei Li, Florida State University; Marios Hadjieleftheriou,

Presentation transcript:

PIRS: Query Verification on Data Streams. Ke Yi, Hong Kong University of Science and Technology; Feifei Li, Florida State University; Marios Hadjieleftheriou, AT&T Labs; George Kollios, Boston University; Divesh Srivastava, AT&T Labs. Work done while the 1st and 2nd authors were working at AT&T Labs.

Publishing Data and Outsourcing Query Service. [Figure: an IP traffic stream coming from … passes through the network to Gigascope (analysis tool by …), which returns statistics as results.]

Revisiting the CISCO – AT&T Example. [Figure: the IP traffic stream passes through the network to Gigascope, which returns statistics.] Lawyers: sign the trust agreement. Computer scientists: could we help?

Concrete Example. Continuous query: SELECT SUM(packet_size) FROM IP_trace GROUP BY srcIP, destIP. IP stream: p1, p2, p3, ..., pm, where each tuple is (srcIP, destIP, packet_size). Answer (one running sum per group, groups 1 to n, reported over time): at time 5: 10KB, 2KB, 150KB, ..., 5KB; at time 10: 11KB, 130KB, 1MB, ..., 20KB.

Continuous Query Verification (CQV) on Data Streams. 1. The client registers a query, e.g. SELECT SUM(packet_size) FROM IP_Trace GROUP BY src_ip, dest_ip. 2. The server reports the answer (Group 1, Group 2, Group 3, ...) upon request. The server maintains the exact answer; the client maintains a synopsis X. Both client and server monitor the same stream from the source of streams.

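The Model for the Stream. [Figure: the stream S is a sequence of tuples of the form agg_attribute | group_id, e.g. 1|1, 9|1, 7|i, arriving at T=1, T=2, T=3. The vector V = (V_1, V_2, V_3, ..., V_i, ..., V_n) starts at all zeros, and each tuple adds its agg_attribute to the entry of its group.]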
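The stream model above can be sketched in a few lines. This is only an illustration under assumptions: the function name `aggregate` is hypothetical, and group id 5 stands in for the slide's unspecified group "i".

```python
from collections import defaultdict


def aggregate(stream):
    """Return V as a dict mapping group_id to the running SUM of agg_attribute."""
    v = defaultdict(int)  # every group starts at 0, like the all-zero vector V
    for agg_attribute, group_id in stream:
        v[group_id] += agg_attribute
    return dict(v)


# Tuples in the slide's "agg_attribute | group_id" form.
# Group 5 is an arbitrary stand-in for the slide's group "i".
stream = [(1, 1), (9, 1), (7, 5)]
totals = aggregate(stream)
```

This is what the server (the "Exact" baseline in the later experiment slides) has to store: one counter per group, so memory grows with the number of groups.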

Continuous Query Verification: CQV. [Figure: as the tuples of stream S arrive at T=1, T=2, T=3, the server updates V and the client updates its synopsis X. When the server reports a vector that differs from V, the client raises an alarm; when the reported vector equals V, there is no alarm.]

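PIRS: Polynomial Identity Random Synopsis. Choose a prime p > m/δ. Choose a number a uniformly at random from Z_p. Compute X(V) = (a − 1)^{v_1} (a − 2)^{v_2} ··· (a − n)^{v_n} mod p. Given the server's answer W, raise an alarm if X(W) and X(V) are not equal; otherwise raise no alarm.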

Incremental Update to PIRS. [Figure: the tuples of stream S arrive at T=1, T=2, ...; each one triggers an update to the matching entry, e.g. v_1 or v_i.] An update to group i with value u can be done in O(log u) time (exponentiation by squaring).
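A minimal sketch of the client-side synopsis with its incremental update. It assumes the synopsis has the product form X(V) = ∏ (a − i)^{v_i} mod p, which is consistent with the decomposability and exponentiation-by-squaring remarks elsewhere in the deck. The class and method names are hypothetical and are not taken from the authors' library.

```python
import random


class PIRS:
    """Sketch of a polynomial-identity synopsis (names are illustrative)."""

    def __init__(self, p, rng=random):
        self.p = p                  # prime modulus, assumed > m / delta
        self.a = rng.randrange(p)   # secret seed, uniform in Z_p
        self.x = 1                  # synopsis of the all-zero vector

    def update(self, group_id, u):
        # X <- X * (a - i)^u mod p. Three-argument pow() performs modular
        # exponentiation by squaring, hence O(log u) multiplications.
        self.x = self.x * pow(self.a - group_id, u, self.p) % self.p

    def synopsis_of(self, w):
        # Recompute the synopsis of a full vector reported by the server.
        x = 1
        for i, w_i in w.items():
            x = x * pow(self.a - i, w_i, self.p) % self.p
        return x

    def alarm(self, w):
        return self.synopsis_of(w) != self.x


# 2**61 - 1 is a Mersenne prime; the seeded generator only makes the demo repeatable.
client = PIRS(p=2**61 - 1, rng=random.Random(7))
for value, group in [(1, 1), (9, 1), (7, 5)]:
    client.update(group, value)
```

The client stores only `p`, `a` and `x`; the per-group totals are never kept. For a count query every update has u = 1, so the `pow` call reduces to a single multiplication.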

It Solves the CQV Problem! Theorem: Given any W ≠ V, PIRS raises an alarm with probability at least 1−δ. Proof idea: by the fundamental theorem of algebra, a polynomial with 1 as the leading coefficient is completely determined by its zeroes, so X(V) = X(W) happens at no more than m values of a. Since we have p > m/δ choices for a, the probability that X(V) = X(W) is at most δ.
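The counting argument can be checked by brute force on a tiny instance. The sketch below assumes the product form X(V) = ∏ (a − i)^{v_i} mod p and reads m as the larger of the two vectors' total weights (the polynomial degree). Both are my reading of the slides, and the prime 101 is chosen only to keep the loop small.

```python
def synopsis(v, a, p):
    """X(v) = prod over groups i of (a - i)^(v_i) mod p, with groups numbered from 1."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow(a - i, v_i, p) % p
    return x


def colliding_seeds(v, w, p):
    """All seeds a in Z_p for which a wrong answer w goes undetected."""
    return [a for a in range(p) if synopsis(v, a, p) == synopsis(w, a, p)]


p = 101                      # small prime, for illustration only
v = [3, 0, 2, 1]             # true per-group values
w = [3, 1, 2, 0]             # server's incorrect answer
m = max(sum(v), sum(w))      # degree bound on the two polynomials
bad = colliding_seeds(v, w, p)
```

Here X(V) − X(W) = −2 (a − 1)^3 (a − 3)^2, so the only seeds that hide the error are a = 1 and a = 3. That is well within the bound of m = 6, and it gives a miss probability of 2/101 for a uniformly random seed.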

Optimality of PIRS. Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words: p, a, X(V)), and spends O(1) time to process a tuple for a count query, or O(log u) time to process a tuple for a sum query. Theorem: Any synopsis that solves the CQV problem with error probability at most δ has to keep Ω(log(min{n, m}/δ)) bits.
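To make the space bound concrete, here is a hedged sketch that picks a modulus and counts the bits of the three stored values. The helper names are hypothetical. Taking p above n as well as above m/δ is my assumption, suggested by the log n term in the bound. Trial division is used only because the example is tiny.

```python
import math


def is_prime(q):
    """Trial division; fine for a small demo, too slow for large moduli."""
    if q < 2:
        return False
    for d in range(2, math.isqrt(q) + 1):
        if q % d == 0:
            return False
    return True


def choose_prime(m, delta, n):
    """Smallest prime strictly greater than max(m / delta, n)."""
    q = math.floor(max(m / delta, n)) + 1
    while not is_prime(q):
        q += 1
    return q


def synopsis_bits(p):
    """Bits needed for the three stored values p, a and X(V), each below or equal to p."""
    return 3 * p.bit_length()


p = choose_prime(m=100, delta=0.5, n=10)
bits = synopsis_bits(p)
```

Doubling m or halving δ adds about one bit to each of the three words, which is the O(log(m/δ) + log n) behaviour stated in the theorem.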

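Multiple Queries. [Figure: instead of keeping one synopsis per query (X_1 for Q_1 over V_{1..n1}, X_2 for Q_2 over V_{1..n2}), a single synopsis X is kept over the concatenated vector V_{1..(n1+n2)}. A tuple such as 9|1,8 triggers an update to v_1 and an update to v_8.] Theorem: our synopses use constant space for multiple queries.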
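One way to read the multiple-queries slide is that the group ids of the second query are shifted by n1, so both queries share one synopsis. The sketch below illustrates that reading. The function names are hypothetical, the product-form synopsis is assumed, and n1 = 7 is chosen only so that the second query's group 1 lands on index 8 as in the figure.

```python
P = 2**61 - 1  # a Mersenne prime used as the modulus in this sketch


def combined_index(query, group_id, n1):
    """Position of a query's group in the concatenated vector V_{1..(n1+n2)}."""
    return group_id if query == 1 else n1 + group_id


def update(x, a, index, u, p=P):
    """Fold value u for the given index into synopsis x."""
    return x * pow(a - index, u, p) % p


n1 = 7             # assumed number of groups in the first query
a = 123456789      # fixed here for repeatability; a real seed is random and secret
x = 1

# One incoming tuple with value 9 affects group 1 of Q1 and group 1 of Q2.
for query, group in [(1, 1), (2, 1)]:
    x = update(x, a, combined_index(query, group, n1), 9)
```

However many queries are registered, the client still holds only `P`, `a` and `x`. Only the per-tuple work grows with the number of queries a tuple contributes to.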

Handling Load Shedding. Semantic load shedding: tuples are dropped from certain groups, so a small number of groups have errors. Random load shedding: all groups have a small amount of error.

CQV with Semantic Load Shedding. Tuples are dropped at random according to their groups. [Figure: a stream of tuples such as 5|1, 1|1, 9|1, 7|i, 2|j, 4|k, some of which are dropped.] The server claims that at most γ groups have errors; the goal is to detect whether more than γ groups have errors. We have designed synopses that use O(γ log(1/δ) log n) bits of space and achieve error probability at most δ.

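PIRS^γ: An Exact Solution. [Figure: one layer consists of k buckets, each holding a PIRS; a group is hashed to one bucket, e.g. v_8 with b(8)=2. A layer raises an alarm if at least … buckets raise alarms. There are log(1/δ) such layers, and the synopsis raises an alarm if at least one layer raises an alarm.]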

PIRS^γ: An Exact Solution. Theorem: PIRS^γ requires O(γ² log(1/δ) log n) bits, spends O(log(1/δ)) time to process a tuple, and solves CQV with semantic load shedding.

Intuition on Approximation. [Figure: probability of raising an alarm plotted against the number of errors. The ideal synopsis is a step at γ; the approximation rises between γ⁻ and γ⁺.]

PIRS^±γ: An Approximate Solution. Theorem: PIRS^±γ requires O(γ log(1/δ) log n) bits and spends O(γ log(1/δ)) time to process a tuple.

CQV with Random Load Shedding. Tuples are dropped at random, so all groups have small errors. The goal is to detect whether any group has an error greater than a claimed threshold. Theorem: Any synopsis that solves this problem with error probability at most δ requires Ω(n) bits (shown by a reduction from the problem of estimating the infinite frequency moment, i.e. the number of occurrences of the most frequent item).

Sliding Window and Other Queries. It is easy to extend PIRS to work with the sliding window model since it is decomposable, i.e., X(v1+v2)=X(v1)*X(v2). Other queries can be handled if they can be transformed into group-by aggregation queries. Details are in the paper.
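The decomposability identity is easy to check numerically. The sketch assumes the product-form synopsis X(v) = ∏ (a − i)^{v_i} mod p; the function name and the fixed seed are illustrative only.

```python
P = 2**61 - 1  # a Mersenne prime used as the modulus in this sketch


def synopsis(v, a, p=P):
    """X(v) = prod over groups i of (a - i)^(v_i) mod p, for a dict group_id -> value."""
    x = 1
    for i, v_i in v.items():
        x = x * pow(a - i, v_i, p) % p
    return x


a = 987654321                 # fixed for repeatability; a real seed is random
v1 = {1: 4, 2: 1}             # aggregates over one sub-window
v2 = {1: 6, 3: 7}             # aggregates over the next sub-window
total = {1: 10, 2: 1, 3: 7}   # v1 + v2, added group by group

combined = synopsis(v1, a) * synopsis(v2, a) % P
```

Because `combined` equals `synopsis(total, a)`, a client could in principle keep one synopsis per sub-window and multiply those that fall inside the current window. That usage is my reading of the slide; the actual sliding-window scheme is in the paper.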

Some Experiments. We use real streams: World Cup data (WC) and IP traces from the AT&T network (IP). We run the following queries. WC: aggregate on response size and group by client id/object id (50M groups). IP: aggregate on packet size and group by source IP/destination IP (7M groups). Hardware for the client: 2.8GHz Intel Pentium 4 CPU, 512 MB memory, Linux machine.

Detection Accuracy. Across 100,000 random attacks, PIRS identifies all of them.

Memory Usage of Exact. PIRS uses a constant 3 words (27 bytes) at all times. Exact's memory usage grows linearly and is expensive.

Update Time (per tuple) of Exact. 1. Exact is fast when memory usage is small. 2. It becomes extremely slow once cache misses and memory swap operations set in.

Running Time Analysis. Average update time per tuple: Count: 0.98 μs; Sum: 8.01 μs (WC), 6.69 μs (IPs). IPs exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.

Multiple Queries: Exact Memory Usage. PIRS always uses a constant 3 words (27 bytes). Exact's memory usage is linear in the number of queries and increases over time.

Multiple Queries: Exact Update Time Per Tuple

Multiple Queries: PIRS Update Time Per Tuple

The Library. Download PIRS and other synopses at:

Conclusion. A space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data. It can be generalized to handle selection queries and sliding-window semantics. How about more complicated queries?

Thanks! Questions?

Problem and Goals. Assumption: the client and the DSMS observe the same stream. Problem: the client needs to verify the results. Goals: be memory- and update-efficient; tolerate a limited number of errors; tolerate small errors; support multiple queries.

Related Techniques to PIRS. Incremental cryptography: works on block operations (insert, delete) and cannot support arithmetic operations. Program verification: the server may pass the program-execution check but simply return random outputs. Fingerprinting techniques: PIRS is a fingerprinting technique.

CQV with Semantic Load Shedding

PIRS^±γ: An Approximate Solution. Theorem: PIRS^±γ 1. raises no alarm with probability at least 1−δ on any …; 2. raises an alarm with probability at least 1−δ on any …, for any c > −ln ln 2 ≈ 0.367. The proof uses the intuition of the coupon collector problem and the Chernoff bound.

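PIRS^±γ: An Approximate Solution. [Figure: one layer consists of k buckets, each holding a PIRS; a group v_i is hashed to one bucket, e.g. b_i=2. A layer raises an alarm if all k buckets raise alarms. There are log(1/δ) such layers, and the synopsis raises an alarm if a majority of the layers raise alarms.]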

Information Disclosure on Multiple Attacks. [Figure: PIRS computes X(V) on a random seed r; the server learns nothing about r.] Insight: the server could potentially rule out a δ portion of the seeds after each failed attack it is notified of!

Information Disclosure on Multiple Attacks. Theorem: For a total of k attacks made by Bob on PIRS, the probability that none of them succeeds is at least 1−kδ.

Proof of the Optimality

Proof of the Optimality