Query Assurance on Data Streams  Ke Yi (AT&T Labs, now at HKUST)  Feifei Li (Boston U, now at Florida State)  Marios Hadjieleftheriou (AT&T Labs) 

Presentation transcript:

Query Assurance on Data Streams
- Ke Yi (AT&T Labs, now at HKUST)
- Feifei Li (Boston U, now at Florida State)
- Marios Hadjieleftheriou (AT&T Labs)
- Divesh Srivastava (AT&T Labs)
- George Kollios (Boston U)

Outsourcing
- Manufacturing
- Software development
- Service
- Data
TRUST?

Data Outsourcing Model
- Owner: owns the data
- Servers: host (or process) the data and provide query services
- Clients: query the owner's data through servers
[Figure labels: owner, servers, clients (possibly = owner: the unified client model)]

Outsourced Database for Better Query Services
- Company with headquarters in the US
- Servers that are close to local clients and maintained by local business partners

Data Outsourcing Model
- Owner/client: owns the data and issues queries
- Servers: host (or process) the data and provide query services
[Figure labels: servers, owner/client (the unified client model)]

Model Comparison

| | 3-party model | 2-party model |
|---|---|---|
| Model | One data owner, a few servers, many clients | One data owner/client, one server |
| Motivation | Better serve clients in different locations | Owner does not have enough resources |
| Client | Client does not have access to data | Client has access to data |
| Techniques | Digital signatures, one-way hash functions, Merkle hash trees, etc. | ? |
| Previous work | A lot | Few |

Data Stream Outsourcing
[Figure: an IP traffic stream coming from a small business's network is sent to Gigascope (an analysis tool), which returns statistics and results.]

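Concrete Example
SELECT COUNT(*) FROM IP_trace GROUP BY srcIP, destIP
IP stream: packets p1, p2, p3, ..., pm, each carrying (srcIP, destIP)
Answer: one count per (srcIP, destIP) group, n groups in total
[Figure: answer table listing the count for each group]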

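The Model for the Stream
[Figure: a stream S of tuples arrives at times T=1, T=2, T=3, ...; each tuple names a group_id (e.g., 1 or i) and updates the corresponding entry of a vector V = (V1, V2, V3, ..., Vn), which starts as all zeros.]
Major issue: space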
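To make the space issue concrete, here is a minimal sketch of the exact approach for the COUNT(*) ... GROUP BY srcIP, destIP example: the client keeps one counter per group, so memory grows with the number of distinct groups. The function name `exact_group_counts` and the sample packets are illustrative assumptions, not part of the slides.

```python
from collections import Counter

def exact_group_counts(stream):
    """Maintain the full vector V exactly: one counter per (srcIP, destIP) group.

    `stream` is any iterable of (src_ip, dest_ip) pairs (hypothetical input format).
    """
    counts = Counter()
    for src_ip, dest_ip in stream:
        counts[(src_ip, dest_ip)] += 1  # v_i += 1 for the tuple's group
    return counts

# Hypothetical sample stream of four packets in two groups.
packets = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.3", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.2"),
]
v = exact_group_counts(packets)
```

The dictionary holds one entry per distinct group, which is the linear memory cost that a constant-size synopsis is meant to avoid.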

Information Security Issues
- The third party (server) cannot be trusted
  - Lazy service provider
  - Malicious intent
  - Compromised equipment
  - Unintentional errors (e.g., bugs)

A Simple Solution [Sion, VLDB 05]
- Accumulate b queries
- The owner computes r of them itself
- Compute the hashes of these results, with some fake ones
- Ask the server to identify these r queries
- Problems:
  - Can only prevent a (very) lazy service provider
    - How about malicious attacks?
  - Need to accumulate enough queries
    - What if there is only one query?
  - High cost: r queries need to be processed locally
  - High failure probability: 10%-30% (typically)

Continuous Query Verification: CQV
[Figure: as the stream S updates the vector V at T=1, T=2, T=3, the client applies the same updates to a synopsis X. At time T the synopsis X_T is checked against the vector reported by the server: an incorrect vector raises an alarm, the correct vector raises no alarm.]

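PIRS: Polynomial Identity Random Synopsis
- Choose a prime p (with p > m/δ)
- Choose a random number α
- Compute the synopsis X(V) of the true vector and X(W) of the server's answer
- Raise an alarm if they are not equal; otherwise no alarm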

Incremental Update to PIRS
[Figure: the stream S delivers an update to v_1 at T=1 and an update to v_i at T=2; each update is folded into the synopsis as it arrives.]
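The formulas on these two slides did not survive in the transcript. The sketch below assumes the polynomial-identity form X(V) = prod_i (α − i)^(v_i) mod p, which is consistent with the proof and cost statements on the later slides; the class name `PIRS`, its methods, and the choice of the prime 2^61 − 1 are illustrative assumptions rather than the authors' code.

```python
import secrets

class PIRS:
    """Sketch of a PIRS-style synopsis, assuming X(V) = prod_i (alpha - i)^(v_i) mod p."""

    def __init__(self, p, alpha=None):
        self.p = p
        # alpha is drawn uniformly from [0, p) unless fixed by the caller.
        self.alpha = secrets.randbelow(p) if alpha is None else alpha % p
        self.x = 1  # synopsis of the all-zero vector (empty product)

    def update(self, i, u=1):
        """Apply v_i += u (u >= 0) by multiplying in (alpha - i)^u mod p."""
        self.x = self.x * pow(self.alpha - i, u, self.p) % self.p

    def raises_alarm(self, w):
        """Compare against a claimed answer w (dict: group id -> value)."""
        xw = 1
        for i, wi in w.items():
            xw = xw * pow(self.alpha - i, wi, self.p) % self.p
        return xw != self.x

# Usage: 2**61 - 1 is a Mersenne prime; alpha is fixed only to make the run repeatable.
s = PIRS(p=2**61 - 1, alpha=123456789)
for group in (1, 1, 5):      # three count updates: v_1 = 2, v_5 = 1
    s.update(group)
ok = s.raises_alarm({1: 2, 5: 1})    # correct answer
bad = s.raises_alarm({1: 1, 5: 2})   # wrong answer
```

Only p, α and the running product are stored, so the state stays at three numbers regardless of how many groups the stream touches.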

It Solves the CQV Problem!
Theorem: Given any W ≠ V, PIRS raises an alarm with probability at least 1−δ; otherwise (W = V) it raises no alarm.
Proof sketch:
- A polynomial with 1 as the leading coefficient is completely determined by its zeroes (and their multiplicities), by the fundamental theorem of algebra.
- Hence, for W ≠ V, the two synopses viewed as polynomials in x are different, so X(V) = X(W) happens at no more than m values of x.
- Since we have p > m/δ choices for α, the probability that X(V) = X(W) is at most δ.
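The same argument written out as a short derivation. It assumes the synopsis has the form X(V) = f_V(α) mod p with f_V(x) = ∏ (x − i)^{v_i}, that group ids 1..n are distinct modulo p, and that m bounds the degree of both polynomials; these are assumptions reconstructed from the slide text, since the original formulas are not in the transcript.

```latex
\begin{align*}
f_V(x) &= \prod_{i=1}^{n} (x - i)^{v_i}, \qquad X(V) = f_V(\alpha) \bmod p, \\
V \neq W &\;\Longrightarrow\; f_V \neq f_W
  \quad \text{(both monic, with different root multiplicities)}, \\
\deg(f_V - f_W) \le m &\;\Longrightarrow\;
  \bigl|\{\, x \in \mathbb{Z}_p : f_V(x) = f_W(x) \,\}\bigr| \le m, \\
\Pr_{\alpha}\bigl[X(V) = X(W)\bigr] &\le \frac{m}{p} < \delta
  \quad \text{since } p > m/\delta .
\end{align*}
```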

Optimality of PIRS
Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words: p, α, X(V)), and spends O(1) time to process a tuple for a count query, or O(log u) time to process a tuple for a sum query.
Theorem: Any synopsis for solving the CQV problem with error probability at most δ has to keep Ω(log(min{n,m}/δ)) bits.

In Practice
- Failure probability
  - Choose the largest prime p that fits in a word
  - E.g., with 64-bit words the failure probability is δ = m/p < 2^-32 (assuming m < 2^32)
- Space requirement
  - p, α, X(V): 3 words!
- Time requirement
  - For count queries / selection queries: one subtraction, one multiplication, one mod
  - For sum queries: log(u) multiplications (exponentiation by squaring)
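The arithmetic on this slide can be checked with a short sketch. It assumes the synopsis update multiplies the running value by (α − i)^u mod p, and it uses 2^64 − 59 as the largest prime below 2^64 (worth verifying independently before relying on it); the function names are illustrative.

```python
from fractions import Fraction

P64 = 2**64 - 59  # assumed: largest prime that fits in a 64-bit word

def failure_bound(m, p=P64):
    """Upper bound m/p on the probability of missing a wrong answer."""
    return Fraction(m, p)

def count_update(x, alpha, i, p=P64):
    """Count query (u = 1): one subtraction, one multiplication, one mod."""
    return x * (alpha - i) % p

def sum_update(x, alpha, i, u, p=P64):
    """Sum query: (alpha - i)^u via Python's built-in modular pow,
    which uses O(log u) multiplications."""
    return x * pow(alpha - i, u, p) % p

bound = failure_bound(2**32 - 1)   # any m < 2**32
x_sum = sum_update(1, 12345, 7, 5)
x_cnt = 1
for _ in range(5):                 # five unit updates equal one update with u = 5
    x_cnt = count_update(x_cnt, 12345, 7)
```

Five count updates and one sum update with u = 5 give the same synopsis value, which is why a sum query costs only the extra exponentiation.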

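Multiple Queries
[Figure: instead of keeping separate synopses X1 and X2 for queries Q1 and Q2, a single synopsis is kept over both. The vectors V_{1..n1} and V_{1..n2} are treated as one vector V_{1..(n1+n2)}, and stream updates (e.g., to v_1 and v_8) are applied to the single synopsis.]
Theorem: our synopses use constant space for multiple queries.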
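One way to read the figure is that the answer vectors of all queries are concatenated and a single synopsis is kept over the result. The sketch below follows that reading, again assuming the product form (α − i)^(v_i) mod p; the class `MultiQueryPIRS` and its offset scheme are hypothetical, and p must exceed the total number of groups so that shifted ids stay distinct.

```python
from itertools import accumulate

class MultiQueryPIRS:
    """One synopsis over the concatenation of several queries' answer vectors."""

    def __init__(self, p, alpha, group_counts):
        self.p, self.alpha, self.x = p, alpha % p, 1
        # offsets[q] = total number of groups in the queries before query q
        self.offsets = [0] + list(accumulate(group_counts))[:-1]

    def _index(self, query, group):
        # Position of (query, group) in the concatenated vector; groups are 1-based.
        return self.offsets[query] + group

    def update(self, query, group, u=1):
        i = self._index(query, group)
        self.x = self.x * pow(self.alpha - i, u, self.p) % self.p

    def synopsis_of(self, answers):
        """answers: one dict (group -> value) per query, as reported by the server."""
        x = 1
        for q, ans in enumerate(answers):
            for g, w in ans.items():
                x = x * pow(self.alpha - self._index(q, g), w, self.p) % self.p
        return x

# Two queries with 3 and 2 groups; alpha fixed only to make the run repeatable.
mq = MultiQueryPIRS(p=2**61 - 1, alpha=987654321, group_counts=[3, 2])
mq.update(0, 1)        # query 0, group 1
mq.update(1, 1)        # query 1, group 1 (position 4 overall)
mq.update(1, 2, u=2)   # query 1, group 2 (position 5 overall)
good = mq.synopsis_of([{1: 1}, {1: 1, 2: 2}])
wrong = mq.synopsis_of([{1: 1}, {1: 2, 2: 1}])
```

The stored state is still p, α and one running product, independent of the number of queries.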

Some Experiments
- We use real streams:
  - World Cup Data (WC)
  - IP traces from the AT&T network (IP)
- We perform the following queries:
  - WC: aggregate on response size and group by client id/object id (50M groups)
  - IP: aggregate on packet size and group by source IP/destination IP (7M groups)
- Hardware for the client:
  - 2.8GHz Intel Pentium 4 CPU
  - 512 MB memory
  - Linux machine

Memory Usage of Exact
- PIRS uses only a constant 3 words (27 bytes) at all times.
- Exact's memory usage is linear and expensive.

Update Time (per tuple) of Exact
1. Exact is fast when memory usage is small.
2. It becomes extremely slow due to cache misses.
[Figure annotation: cache misses]

Running Time Analysis
Average update time:
- Count: 0.98 μs
- Sum: 8.01 μs (WC), 6.69 μs (IPs)
IPs exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.

Multiple Queries: Exact Memory Usage
- PIRS always uses only 3 words.
- Exact's memory usage is linear in the number of queries and increases over time.

CQV with Load Shedding

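PIRS^γ: An Exact Solution
[Figure: each item v_i is assigned to one of k buckets (e.g., b_i = 2), and each bucket keeps its own PIRS. A layer raises an alarm if at least γ buckets raise alarms. There are log 1/δ such layers, and the synopsis raises an alarm if at least one layer raises an alarm.]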

PIRS^γ: An Exact Solution
Theorem: PIRS^γ requires O(γ^2 log(1/δ) log n) bits, spends O(γ log(1/δ)) time to process a tuple, and solves CQV with semantic load shedding.

Intuition on Approximation
[Figure: probability of raising an alarm versus number of errors. The ideal synopsis is a step at γ; the approximation rises from no alarm to alarm between γ− and γ+.]

PIRS±γ: An Approximate Solution
Theorem: PIRS±γ requires O(γ log(1/δ) log n) bits and spends O(γ log(1/δ)) time to process a tuple.

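PIRS±γ: An Approximate Solution
Theorem: PIRS±γ
1. raises no alarm with probability at least 1−δ on any W below the lower error threshold γ−;
2. raises an alarm with probability at least 1−δ on any W above the upper error threshold γ+,
for any c > −ln ln 2 ≈ 0.367.
[The expressions for the two thresholds in terms of c are not preserved in the transcript.]
The proof uses the intuition of the coupon collector problem and the Chernoff bound.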

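PIRS±γ: An Approximate Solution
[Figure: each item v_i is assigned to one of k buckets (e.g., b_i = 2), and each bucket keeps its own PIRS. A layer raises an alarm if all k buckets raise alarms. There are log 1/δ such layers, and the synopsis raises an alarm if a majority of layers raise alarms.]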

PIRS ± γ : Experiments

Related Techniques to PIRS
- Incremental cryptography
  - Block operations (insert, delete); cannot support arithmetic operations
- Sketches
  - Provide approximate estimates
    - We want absolute accuracy
  - Often much more costly
    - Space O(1/ε) or O(1/ε^2)
- Fingerprinting techniques
  - PIRS is a fingerprinting technique
  - Polynomial identity verification

Thanks!
Questions?