The Role of History and Prediction in Data Privacy Kristen LeFevre University of Michigan May 13, 2009.


1 The Role of History and Prediction in Data Privacy Kristen LeFevre University of Michigan May 13, 2009

2 Data Privacy
Personal information collected every day
–Healthcare, insurance information
–Supermarket transaction data
–RFID, GPS data
–E-mail
–Employment history
–Web search / clickstream

3 Data Privacy
Legal, ethical, technical issues surrounding
–Data ownership
–Data collection
–Data dissemination and use
Considerable recent interest from technical community
–High-profile mishaps and lawsuits
–Compliance with data-sharing mandates

4 Privacy Protection Technologies for Public Datasets
Goal: Protect sensitive personal information while preserving data utility
Privacy Policies and Mechanisms
Example Policies:
–Protect individual identities
–Protect the values of sensitive attributes
–Differential privacy [Dwork 06]
Example Mechanisms:
–Generalize (“coarsen”) the data
–Aggregate the data
–Add random noise to the data
–Add random noise to query results

5 Observations
Much work has focused on static data
–One-time snapshot publishing
–Disclosure by composing multiple different snapshots of a static database [Xiao 07, Ganta 08]
–Auditing queries on a static database [Chin 81, Kenthapadi 06, …]
What are the unique challenges when the data evolves over time?

6 Outline
Sample Problem: Continuously publishing privacy-sensitive GPS traces
–Motivation & problem setup
–Framework for reasoning about privacy
–Algorithms for continuous publishing
–Experimental results
Applications to other dynamic data (speculation)

7 GPS Traces (ongoing work w/ Wen Jin, Jignesh Patel)
GPS devices attached to phones, cars
Interest in collecting and distributing location traces in real time
–Real-time traffic reporting
–Adaptive pricing / placement of outdoor ads
Simultaneous concern for personal privacy
Challenge: Can we continuously collect and publish location traces without compromising individual privacy?

8 Problem Setting
[Figure: GPS users report their locations at each epoch (e.g., 7 AM, 7:05 AM) to a central trace repository, which applies the privacy policy and releases a “sanitized” location snapshot to the data recipient.]

9 Problem Setting
Finite population of $n$ users with unique identifiers $\{u_1, \dots, u_n\}$
Assume users’ locations are reported and published in discrete epochs $t_1, t_2, \dots$
Location snapshot $D(t_j)$
–Associates each user with a location during epoch $t_j$
Publish sanitized version $D^*(t_j)$

10 Threat Model
Attacker wants to determine the location of a target user $u_i$ during epoch $t_j$
Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)

11 Some Naïve Solutions
Strawman 1: Replace users’ identifiers ($\{u_1, \dots, u_n\}$) with pseudonyms ($\{p_1, \dots, p_n\}$)
–Problem: Once the attacker “unmasks” user $p_i$, he can track her location forever
Strawman 2: New pseudonyms ($\{p_1^j, \dots, p_n^j\}$) at each epoch $t_j$
–Problem: Users can still be tracked using multi-target tracking tools [Gruteser 05, Krumm 07]

12 Key Problem: Motion Prediction
[Figure: location points 1 to 6 and the user set {Alice, Bob, Charlie}, with one point labeled Alice.]
What if the speed limit is 60 mph?

13 Threat Model
Attacker wants to determine the location of a target user $u_i$ during epoch $t_j$
Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)
Motion prediction: Given one or more locations for $u_i$, attacker can predict (probabilistically) $u_i$’s location during following and preceding epochs

14 Privacy Principle: Temporal Unlinkability
Consider an attacker who is able to identify (locate) target user $u_j$ during $m$ sequential epochs
Under reasonable assumptions, he should not be able to locate $u_j$ with high confidence during any other epochs*
*Similar in spirit to “mix zones” [Beresford 03], which addressed a related problem in a less-formal way.

15 Sanitization Mechanism
Needed to select a sanitization mechanism; chose one for maximum flexibility
Assign each user $u_i$ a consistent pseudonym $p_i$
Divide users into clusters
–Within each cluster, break the association between pseudonym and location
Release candidate for $D(t_j)$: $D^*(t_j) = \{(C_1(t_j), L_1(t_j)), \dots, (C_B(t_j), L_B(t_j))\}$
–$\bigcup_{i=1}^{B} C_i(t_j) = \{p_1, \dots, p_n\}$
–$C_i(t_j) \cap C_h(t_j) = \emptyset$ for $i \neq h$
–Each $L_i(t_j)$ contains the locations of users in $C_i(t_j)$
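The cluster-based release can be illustrated with a minimal Python sketch. It assumes a snapshot is a dictionary from pseudonym to an (x, y) location and that the clusters are supplied by the caller; the function name `sanitize_snapshot` and the shuffling step are illustrative choices, not taken from the slides.

```python
import random
from typing import Dict, List, Tuple

Location = Tuple[float, float]


def sanitize_snapshot(
    snapshot: Dict[str, Location],
    clusters: List[List[str]],
    rng: random.Random,
) -> List[Tuple[frozenset, List[Location]]]:
    """Build a release candidate D*(t_j) from a snapshot D(t_j).

    snapshot maps each pseudonym to its location in this epoch (assumed layout).
    """
    seen = [p for cluster in clusters for p in cluster]
    # The clusters must partition the pseudonym set: pairwise disjoint and covering.
    if len(seen) != len(set(seen)) or set(seen) != set(snapshot):
        raise ValueError("clusters must partition the set of pseudonyms")
    release = []
    for cluster in clusters:
        locations = [snapshot[p] for p in cluster]
        # Shuffle so that list order no longer links a pseudonym to a location.
        rng.shuffle(locations)
        release.append((frozenset(cluster), locations))
    return release
```

Each released pair holds a cluster C_i(t_j) and its location list L_i(t_j); a recipient learns which locations belong to the cluster but not which member is at which location.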

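16 Sanitization Mechanism: Example
Pseudonyms $\{p_1, p_2, p_3, p_4\}$
[Figure: published clusters and location points at three epochs]
–$t_0$: clusters {p1,p2}, {p3,p4}; locations 1, 2, 3, 4
–$t_1$: clusters {p1,p2}, {p3,p4}; locations 5, 6, 7, 8
–$t_2$: clusters {p1,p3}, {p2,p4}; locations 9, 10, 11, 12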

17 Reasoning about Privacy
How can we guarantee temporal unlinkability under the threats of auxiliary information and motion prediction?
–(Using the cluster-based sanitization mechanism)
Novel framework with two key components
–Motion model describes location correlations between epochs
–Breach probability function describes an attacker’s ability to compromise temporal unlinkability

18 Motion Models
Model motion using an h-step Markov chain
–Conditional probability for user’s location, given his location during h prior (future) epochs
–Same motion model used by attacker and publisher
Forward motion model template
–$\Pr[\mathrm{Loc}(P, T_j) = L_j \mid \mathrm{Loc}(P, T_{j-1}) = L_{j-1}, \dots, \mathrm{Loc}(P, T_{j-h}) = L_{j-h}]$
Backward motion model template
–$\Pr[\mathrm{Loc}(P, T_j) = L_j \mid \mathrm{Loc}(P, T_{j+1}) = L_{j+1}, \dots, \mathrm{Loc}(P, T_{j+h}) = L_{j+h}]$
Independent and replaceable component
–For this work, used 1-step motion model based on velocity distribution (speed and direction)
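To make the 1-step forward template concrete, here is a hedged Python sketch of one possible motion model. It is not the model used in the talk: it considers speed only (no direction), assumes a Gaussian-shaped speed preference with a hard speed cap, and returns an unnormalized weight rather than a true probability. The function name and all parameter values are hypothetical.

```python
import math
from typing import Tuple

Location = Tuple[float, float]  # assumed planar coordinates in meters


def forward_motion_weight(
    prev: Location,
    cur: Location,
    epoch_seconds: float,
    mean_speed: float = 25.0,  # hypothetical typical speed, m/s
    speed_std: float = 8.0,    # hypothetical spread, m/s
    max_speed: float = 50.0,   # hypothetical hard cap, m/s
) -> float:
    """Unnormalized weight for Pr[Loc(P, T_j) = cur | Loc(P, T_{j-1}) = prev]."""
    # Speed implied by moving from prev to cur within one epoch.
    speed = math.dist(prev, cur) / epoch_seconds
    if speed > max_speed:
        return 0.0  # movement is physically implausible under the cap
    z = (speed - mean_speed) / speed_std
    return math.exp(-0.5 * z * z)
```

Because the model is an independent, replaceable component, any function with this shape (previous location, candidate location, weight) could be substituted.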

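19 Motion Models: Example
Pseudonyms $\{p_1, p_2, p_3, p_4\}$, epochs $t_0, t_1, t_2$
[Figure: clusters {p1,p2}, {p3,p4} at $t_0$ and $t_1$, with candidate locations a, b, c, d; individual positions of p1 to p4 shown at $t_2$]
Example motion-model probabilities:
–$\Pr[\mathrm{Loc}(p_1, t_1) = a \mid \mathrm{Loc}(p_1, t_0) = x]$
–$\Pr[\mathrm{Loc}(p_1, t_1) = b \mid \mathrm{Loc}(p_1, t_0) = x]$
–$\Pr[\mathrm{Loc}(p_1, t_1) = a \mid \mathrm{Loc}(p_1, t_2) = y]$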

20 Privacy Breaches
Forward breach probability
–$\Pr[\mathrm{Loc}(P, T_j) = L_j \mid D(T_{j-1}), \dots, D(T_{j-h}), D^*(T_j)]$
Backward breach probability
–$\Pr[\mathrm{Loc}(P, T_j) = L_j \mid D(T_{j+1}), \dots, D(T_{j+h}), D^*(T_j)]$
Privacy Breach: Release candidate $D^*(T_j)$ causes a breach iff either of the following is true for threshold $C$
–$\max_{P, L_j} \Pr[\mathrm{Loc}(P, T_j) = L_j \mid D(T_{j-1}), \dots, D(T_{j-h}), D^*(T_j)] > C$
–$\max_{P, L_j} \Pr[\mathrm{Loc}(P, T_j) = L_j \mid D(T_{j+1}), \dots, D(T_{j+h}), D^*(T_j)] > C$

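21 Privacy Breaches: Example
[Figure: clusters {p1,p2}, {p3,p4} at epochs $t_0$ and $t_1$; p1 and p2 are at locations x and y at $t_0$, and the locations published at $t_1$ are a, b, c, d]
–$e_1 = \Pr[\mathrm{Loc}(p_1, t_1) = a \mid \mathrm{Loc}(p_1, t_0) = x]$
–$e_2 = \Pr[\mathrm{Loc}(p_1, t_1) = b \mid \mathrm{Loc}(p_1, t_0) = x]$
–$e_3 = \Pr[\mathrm{Loc}(p_2, t_1) = a \mid \mathrm{Loc}(p_2, t_0) = y]$
–$e_4 = \Pr[\mathrm{Loc}(p_2, t_1) = b \mid \mathrm{Loc}(p_2, t_0) = y]$
$\Pr[\mathrm{Loc}(p_1, t_1) = a \mid D(T_0), D^*(T_1)] = \frac{e_1 e_4}{e_1 e_4 + e_2 e_3}$
…
Goal: Verify that all (forward and backward) breach probabilities < threshold $C$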
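The two-user quotient in this example can be read as a sum over the possible assignments of cluster members to published locations. The following is a sketch of that reasoning, assuming users move independently and every assignment is equally likely a priori; these assumptions, and the symbols $\sigma$ and $e_{q,\ell}$, are introduced here for illustration and are not stated on the slide.

```latex
% e_{q,\ell}: motion-model probability that user q moves to location \ell
% \sigma ranges over one-to-one assignments of the users in cluster C
% to the locations published for C
\Pr[\mathrm{Loc}(p, t_1) = \ell \mid D(T_0), D^*(T_1)]
  = \frac{\sum_{\sigma : \sigma(p) = \ell} \prod_{q \in C} e_{q,\sigma(q)}}
         {\sum_{\sigma} \prod_{q \in C} e_{q,\sigma(q)}}
```

For the cluster {p1, p2} with locations {a, b} there are only two assignments: p1 at a and p2 at b (weight e1 · e4), or p1 at b and p2 at a (weight e2 · e3). The probability that p1 is at a is therefore e1 · e4 divided by e1 · e4 + e2 · e3.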

22 Checking for Breaches
Does release candidate $D^*(T_j)$ cause a breach?
Brute force algorithm
–Exponential in release candidate cluster size
Heuristic pruning tools
–Reduce the search space considerably in practice
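A brute-force check for a single cluster might look like the following Python sketch. It assumes a 1-step model, independent users, and a caller-supplied `weight(user, location)` function giving each user's motion-model weight for a published location; the function names are hypothetical, and none of the pruning heuristics mentioned on the slide are included.

```python
from itertools import permutations
from typing import Callable, Dict, Hashable, Sequence, Tuple


def breach_probabilities(
    users: Sequence[Hashable],
    locations: Sequence[Hashable],
    weight: Callable[[Hashable, Hashable], float],
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Posterior Pr[user is at location] for one cluster, by full enumeration."""
    totals = {(u, loc): 0.0 for u in users for loc in locations}
    normalizer = 0.0
    # Enumerate every assignment of users to published locations:
    # k! assignments for a cluster of k users.
    for assignment in permutations(locations):
        w = 1.0
        for u, loc in zip(users, assignment):
            w *= weight(u, loc)
        normalizer += w
        for u, loc in zip(users, assignment):
            totals[(u, loc)] += w
    if normalizer == 0.0:
        raise ValueError("no assignment is consistent with the motion model")
    return {key: value / normalizer for key, value in totals.items()}


def causes_breach(users, locations, weight, threshold: float) -> bool:
    """True if any user/location posterior exceeds the threshold C."""
    posterior = breach_probabilities(users, locations, weight)
    return max(posterior.values()) > threshold
```

The same routine would serve for the backward check by passing backward motion-model weights. The factorial number of assignments is what makes pruning necessary for larger clusters.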

23 Publishing Algorithms
How to publish useful data, without causing a privacy breach?
Cluster-based sanitization mechanism offers two main options
–Increase cluster size (or change composition)
–Reduce publication frequency

24 Publishing Algorithms
General Case
–At each epoch $T_j$, publish the most compact release candidate $D^*(T_j)$ that does not cause a breach
–Need to delay publishing until epoch $T_{j+h}$ to check for backward breaches
–NP-hard optimization problem; proposed alternative heuristics
Special Case
–Durable clusters (same individuals at each epoch)
–Motion model satisfies symmetry property
–No need to delay publishing
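The general-case selection step can be sketched in Python as follows. This is not the heuristic proposed in the talk: it assumes some external procedure has already generated a short list of release candidates, that `cost` measures compactness (lower is more compact), and that `is_breach` wraps both the forward and the delayed backward check. All three names are hypothetical.

```python
from typing import Callable, Iterable, Optional, TypeVar

R = TypeVar("R")  # a release candidate, in whatever representation the caller uses


def choose_release(
    candidates: Iterable[R],
    cost: Callable[[R], float],
    is_breach: Callable[[R], bool],
) -> Optional[R]:
    """Return the most compact candidate that causes no breach, else None."""
    # Try candidates from most compact to least compact.
    for candidate in sorted(candidates, key=cost):
        if not is_breach(candidate):
            return candidate
    # Nothing is safe: skip this epoch, i.e. reduce publication frequency.
    return None
```

Returning `None` corresponds to the second option offered by the cluster-based mechanism: publish nothing for this epoch rather than release a breaching snapshot.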

25 Experimental Study
Used real highway traffic data from UM Transportation Research Institute
–GPS data sampled from cars of 72 volunteers
–Sampling rate (epoch) = 0.01 seconds
–Speed range 0-170 km/hour
Also synthetic data
–Able to control the generative motion distribution

26 Experimental Study
All static “snapshot” anonymization mechanisms vulnerable to motion prediction attacks
–Applied two representative algorithms (r-Gather [Aggarwal 06] and k-Condense [Aggarwal 04])
–Each produces a set of clusters with $\geq k$ users each
[Figure: result panels for r-Gather and k-Condense]

27 Speculation / Future Work
GPS example illustrates the importance, for privacy, of reasoning about data dynamics, history, and predictable patterns of change
Dynamic private data in other apps.
–E.g., longitudinal social science data
–Study subjects age predictably
–Most people don’t move very far
–Income changes predictably
Hypothesis: History and prediction are important in these settings, too!

