+ Differential Privacy in the Streaming World Aleksandar (Sasho) Nikolov Rutgers University.

Slides:



Advertisements
Similar presentations
Fast Moment Estimation in Data Streams in Optimal Space Daniel Kane, Jelani Nelson, Ely Porat, David Woodruff Harvard MIT Bar-Ilan IBM.
Advertisements

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
An Optimal Algorithm for the Distinct Elements Problem
Data Stream Algorithms Frequency Moments
Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan And improvements with Kai-Min Chung.
Raef Bassily Penn State Local, Private, Efficient Protocols for Succinct Histograms Based on joint work with Adam Smith (Penn State) (To appear in STOC.
Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Finding Frequent Items in Data Streams Moses CharikarPrinceton Un., Google Inc. Kevin ChenUC Berkeley, Google Inc. Martin Franch-ColtonRutgers Un., Google.
Summarizing Distributed Data Ke Yi HKUST += ?. Small summaries for BIG data  Allow approximate computation with guarantees and small space – save space,
ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
Mining Data Streams.
Private Analysis of Graph Structure With Vishesh Karwa, Sofya Raskhodnikova and Adam Smith Pennsylvania State University Grigory Yaroslavtsev
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006
Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
Sketching in Adversarial Environments Or Sublinearity and Cryptography 1 Moni Naor Joint work with: Ilya Mironov and Gil Segev.
Sublinear time algorithms Ronitt Rubinfeld Blavatnik School of Computer Science Tel Aviv University TexPoint fonts used in EMF. Read the TexPoint manual.
What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Time-Decaying Sketches for Sensor Data Aggregation Graham Cormode AT&T Labs, Research Srikanta Tirthapura Dept. of Electrical and Computer Engineering.
A survey on stream data mining
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Statistic estimation over data stream Slides modified from Minos Garofalakis ( yahoo! research) and S. Muthukrishnan (Rutgers University)
One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.
Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)
Fast Approximate Wavelet Tracking on Streams Graham Cormode Minos Garofalakis Dimitris Sacharidis
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
Private Analysis of Graphs
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
By Graham Cormode and Marios Hadjieleftheriou Presented by Ankur Agrawal ( )
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
1 Streaming Algorithms for Geometric Problems Piotr Indyk MIT.
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
1 Sublinear Algorithms Lecture 1 Sofya Raskhodnikova Penn State University TexPoint fonts used in EMF. Read the TexPoint manual before you delete this.
PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Sublinear Algorithms via Precision Sampling Alexandr Andoni (Microsoft Research) joint work with: Robert Krauthgamer (Weizmann Inst.) Krzysztof Onak (CMU)
Boosting and Differential Privacy Cynthia Dwork, Microsoft Research TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A.
Sorting: Implementation Fundamental Data Structures and Algorithms Klaus Sutner February 24, 2004.
The Misra Gries Algorithm. Motivation Espionage The rest we monitor.
Calculating frequency moments of Data Stream
11 Lecture 24: MapReduce Algorithms Wrap-up. Admin PS2-4 solutions Project presentations next week – 20min presentation/team – 10 teams => 3 days – 3.
Beating CountSketch for Heavy Hitters in Insertion Streams Vladimir Braverman (JHU) Stephen R. Chestnut (ETH) Nikita Ivkin (JHU) David P. Woodruff (IBM)
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
New Algorithms for Heavy Hitters in Data Streams David Woodruff IBM Almaden Joint works with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut,
Stochastic Streams: Sample Complexity vs. Space Complexity
New Characterizations in Turnstile Streams with Applications
The Stream Model Sliding Windows Counting 1’s
The Variable-Increment Counting Bloom Filter
Finding Frequent Items in Data Streams
Understanding Generalization in Adaptive Data Analysis
Streaming & sampling.
Privacy-preserving Release of Statistics: Differential Privacy
Graph Analysis with Node Differential Privacy
Sublinear Algorithmic Tools 2
Spatial Online Sampling and Aggregation
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
Vitaly (the West Coast) Feldman
Lecture 7: Dynamic sampling Dimension Reduction
Range-Efficient Counting of Distinct Elements
Y. Kotidis, S. Muthukrishnan,
Range-Efficient Computation of F0 over Massive Data Streams
Lecture 6: Counting triangles Dynamic graphs & sampling
Heavy Hitters in Streams and Sliding Windows
Presentation transcript:

+ Differential Privacy in the Streaming World Aleksandar (Sasho) Nikolov Rutgers University

+ The Streaming Model Underlying frequency vector A = A [1], …, A[n] start with A[i] = 0 for all i. We observe an online sequence of updates: Increments only (cash register): Update is i t  A[i t ] := A[i t ] + 1 Fully dynamic (turnstile): Update is (i t, ±1)  A[i t ] := A[i t ] ± 1 Requirements: compute statistics on A Online, O(1) passes over the updates Sublinear space, polylog(n,m) 1, 4, 5, 19, 145, 14, 5, 5, 16, 4 +, -, +, -, +, +, -, +, -, +

+ Typical Problems Frequency moments: F k = |A[1]| k + … + |A[n]| k related: L p norms Distinct elements: F 0 = #{i: A[i] ≠ 0} k-Heavy Hitters: output all i such that A[i] ≥ F 1 /k Median: smallest i such that A[1] + … + A[i] ≥ F 1 /2 Generalize to Quantiles Different models: Graph problems: a stream of edges, increments or dynamic matchings, connectivity, triangle count Geometric problems: a stream of points various clustering problems

+ When do we need this? The universe size n is huge. Fast arriving stream of updates: IP traffic monitoring Web searches, tweets Large unstructured data, external storage: multiple passes make sense Streaming algorithms can provide a first rough approximation decide whether and when to analyze more fine tune a more expensive solution Or they can be the only feasible solution

+ Outline Introduction to small space streaming Small space & differential privacy Privacy under continual observation Pan-privacy

+ A taste: the AMS sketch for F 2 [Alon Matias Szegedy 96] h:[n]  {± 1} is 4-wise independent + h(i 1 ) = ± 1 h(i4)h(i4) h(i3)h(i3) h(i2)h(i2) X E[X 2 ] = F 2 E[X 4 ] 1/2 ≤ O(F 2 )

+ The Median of Averages Trick X 11 X 12 X 13 X 14 X 21 X 22 X 23 X 24 X 31 X 32 X 33 X 34 X 41 X 42 X 43 X 44 X 51 X 52 X 53 X 54 Average X1X1 X2X2 X3X3 X4X4 X5X5 Median X 1/ α 2 ln 1/ δ Average: reduces variance by α 2. Median: reduces probability of large error to δ.

+ Outline Introduction to small space streaming Small space & differential privacy Privacy under continual observation Pan-privacy

+ Defining Privacy for Streams We will use differential privacy. The database is represented by a stream online stream of transactions offline large unstructured database Need to define neighboring inputs: Event level privacy: differ in a single update 1, 4, 5, 19, 145, 14, 5, 5, 16, 4 1, 1, 5, 19, 145, 14, 5, 5, 16, 4 User level privacy: replace some updates to i with updates to j 1, 4, 5, 19, 145, 14, 5, 5, 16, 4 1, 4, 3, 19, 145, 14, 3, 5, 16, 4 We also allow the changed updates to be placed somewhere else

+ Streaming & DP? Large unstructured database of transactions Estimate how many distinct users initiated transactions? i.e. F 0 estimation Can we satisfy both the streaming and privacy constraints? F 0 has sensitivity 1 (under user privacy) Computing F 0 exactly takes Ω (n) space Classic sketches from streaming may have large sensitivity

+ Oblivious Sketch Flajolet and Martin [FM 85] show a sketch f(S) O(log n) bits of storage F 0 /2 ≤ f(S) ≤ 2F 0 with constant probability Obliviousness: distribution of f(S) is entirely determined by F 0 similar to functional privacy [Feigenbaum Ishai Malkin Nissim Strauss Wright 01] Why it helps: Pick noise η from discretized Lap(1/ ε ) Create new stream S’ to feed to f: If η < 0, ignore first η distinct elements If η > 0, insert elements n+1, …, n+ η Distribution of f(S’) is a function of max{F 0 + η, 0 }: ε -DP (user) Error: F 0 /2 – O(1/ ε )≤ f(S) ≤ 2F 0 + O(1/ ε ) Space: O(1/ ε + log n) can make log n w.h.p. by first inserting O(1/ ε ) elements

+ Open Problems When can a streaming estimate of a low-sensitivity function be computed privately, in small space? does privacy & small space ever require more error than either? Can we go beyond low-sensitivity, and local sensitivity? F 2 has high sensitivity and high local sensitivity Lipschitz extensions [Kasiviswanathan Nissim Raskhodnikova Smith 13] relevant? What can we say about graph problems, clustering problems? Private coresets [Feldman Fiat Kaplan Nissim 09]

+ Outline Introduction to small space streaming Small space & differential privacy Privacy under continual observation Pan-privacy

+ Continual Observation In an online stream, often need to track the value of a statistic. number of reported instances of a viral infection sales over time number of likes on Facebook Privacy under continual observation [Dwork Naor Pitassi Rothblum 10]: At each time step the algorithm outputs the value of the statistic The entire sequence of outputs is ε -DP (usually event level) Results: A single counter (number of 1’s in a bit stream) [DNPR10] Time-decayed counters [Bolot Fawaz Muthukrishnan Nikolov Taft 13] Online learning [DNPR10] [Jain Kothari Thakurta 12] [Smith Thakurka 13] Generic transformation for monotone algorithms [DNPR10]

+ Binary Tree Technique [DPNR10], [Chan Shi Song 10] Sensitivity of tree: log m Add Lap(log m/ ε ) to each node

+ Binary Tree Technique Each prefix: sum of log m nodes  polylog error per query

+ Open Problems What is the optimal error possible for the counter problem? Privacy under continual observation for statistics that are not easily decomposable? User level? Expect privacy under continual observation to be ever more relevant We usually want to track our statistics over time Work on it!

+ Outline Introduction to small space streaming Small space & differential privacy Privacy under continual observation Pan-privacy

+ Pan Privacy Differential privacy guarantees that the results of our computation are private What if data is requests by subpoena, leaked after a security breach, an unauthorized employee looks at it? Can we guarantee that intermediate states are also private? Makes sense for online data: not stored Pan-privacy [Dwork Naor Pitassi Rothblum Yekhanin 10]: For each t: the state of the algorithm after processing the t-th update and the final output are jointly ε -DP Can be event level or user level Strategy: keep private statistics on top of sketches

+ Warm-up: F 0 [DNPRY10] Solution: randomized response Two distributions: D 0 and D 1 on {-1,1} D 0 is 1 w.p. 1/2; D 1 is 1 w.p. (1 + ε )/2 Store a big table X[1], …, X[n] Initialize all X[i] from D 0 When update i t arrives, pick X[i t ] from D 1 Can compute O(n 1/2 / ε ) additive approximation X = (X[1] + … + X[n])/ ε E[X] = F 0 and E[X 2 ] = n/ ε 2

+ Cropped F 1 [Mir Muthukrishnan Nikolov Wright 11] Cropped moments: F k ( τ ) = |min{A[1], τ }| k + |min{A[2], τ }| k + … + |min{A[n], τ }| k We’ll be interested in F 1 ( τ ) Can pan-privately compute X s.t. F 1 ( τ )/2 – O( τ n 1/2 / ε ) ≤ X ≤ F 1 ( τ ) + O( τ n 1/2 / ε ) Idea: keep each A[i] mod τ, with initial noise What if A[i] = τ + 1? Multiply each A[i] by a random c i uniform in [1, 2] Small A[i] (≤ τ /2) get distorted by at most factor 2 For large A[i], c i A[i] mod τ is large on average Range is τ, so noise O( τ / ε ) per modular counter suffices A[i] c i A[i] 0 2τ2τ τ

+ Heavy Hitters [DNPRY10][MMNW11] Recall, the k-Heavy Hitters (k-HH) are i s.t. A[i] ≥ F 1 /k at most k of them Approximate the number of k-HH notation: H k a measure of how skewed the data is Will get pan-private estimator X s.t.: H k /2 – O(k 1/2 ) ≤ X ≤ H k log k + O(k 1/2 )

+ k-HH and Cropped F 1 Say we want to compute an estimate X in [H k, H ck ] Consider: (F 1 (F 1 /k) - F 1 (F 1 /ck))/(F 1 /k – F 1 /ck) k-Heavy Hitters contribute 1 ck-Heavy Hitters contribute between 0 and 1 Anything else contributes 0 Error of O(F 1 n 1/2 /k ε ) for F 1 (F 1 /k) is too much! Sketch to reduce the universe size n

+ Idea: Use a (CM-type) Sketch Hash [n] into [O(k)] (with a pairwise-independent hash) Compute the number of heavy buckets (weight ≥ F 1 /k) at least H k /2 (balls and bins) no bucket containing items of weight ≤ F 1 /(k * log k) is heavy Essentially keeping private statistics on a CM sketch A[1]A[2]A[3]A[4]A[5]A[6]A[7]A[8]A[9] A[10 ] B[1]B[2]B[3]B[4]

+ Lower bounds and Open Problems The O(n 1/2 ) additive error for F 0 is optimal also O(k 1/2 ) for H k, by reduction Idea: combine streaming-style LBs with reconstruction attacks [MMNW11] stop the algorithm at some time step and grab the private state different continuations of the stream: answer many counting queries from the same state invoke [Dinur Nissim 03] type attacks Lower bounds against many passes via connections to randomness extraction [McGregor Mironov Pitassi Reingold Talwar Vadhan 10] Do all problems of low streaming complexity admit accurate pan-private algorithm intuitively: less state  easier to make private

+ Summary Private analysis of massive online data presents new challenges small space continuous monitoring Data is not stored: can ask for algorithms private inside and out Tools from small-space streaming algorithms can be useful but we need to view them from a new angle