GDPR, Data Privacy, Anonymization, Minimization. . .Oh My!

1 GDPR, Data Privacy, Anonymization, Minimization. . .Oh My!
Steve Touw, Immuta

2 About Me/Immuta
CTO of Immuta
Immuta is a self-service platform where data owners, data scientists, and compliance officers eliminate friction and accelerate innovation. Our software enables enterprises to unlock data, control risk, and innovate faster with confidence.

3 Agenda
GDPR & data processing: why do YOU care?
Get out of GDPR jail free?
The Anonymization zoo
The “Data Control Plane”
Conclusion

4 General Data Protection Regulation (GDPR)

5 GDPR In A Nutshell
“The General Data Protection Regulation is the EU’s primary data governance regulation and realistically applies to any business using data from EU data subjects. It is the most forward-leading privacy regime on the planet, with fines of up to four percent of global revenue. With such staggering fines, breaching the GDPR is a risk that many enterprises quite literally may not be able to afford.” -Andrew Burt, Immuta
Slide figures: Apple: 78.4 billion; 3 billion

6 It’s All About Personal Data
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.

7 Let’s Talk Privacy

8 I know stuff about Judd and Leslie!
Photo credit: Gawker

9 New York Taxi & Limousine Commission
Data was released containing taxi pickups, dropoffs, location, time, amount, and tip amount, among others. Seems pretty harmless?

10 Well, Judd and Leslie May Not Think It’s Harmless
This photo was geotagged (with time), so by simply querying by medallion and time, we know how much Judd and Leslie tip!

11 This Is An Example Of a Link Attack
[Diagram: NY Taxi Data (Medallion & Pickup Time) linked to the photo (Medallion & Photo Time)]
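The link attack on this slide can be sketched in a few lines. This is only an illustration, not the method actually used against the taxi data: the trip records, the medallion values, the timestamps, and the five-minute matching window are all made-up assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical released trip records: (medallion, pickup_time, fare, tip)
trips = [
    ("9Y99", datetime(2013, 7, 8, 19, 34), 10.5, 2.10),
    ("9Y99", datetime(2013, 7, 8, 22, 5), 23.0, 0.00),
    ("7P42", datetime(2013, 7, 8, 19, 35), 8.0, 1.00),
]

# Hypothetical external observation: a geotagged, timestamped photo
# in which the cab's medallion is readable.
photo = {"medallion": "9Y99", "time": datetime(2013, 7, 8, 19, 36)}


def link(trips, photo, window=timedelta(minutes=5)):
    """Return trips with the photo's medallion picked up within `window` of the photo."""
    return [
        trip
        for trip in trips
        if trip[0] == photo["medallion"] and abs(trip[1] - photo["time"]) <= window
    ]


matches = link(trips, photo)
for medallion, pickup_time, fare, tip in matches:
    print(f"{medallion} at {pickup_time}: fare {fare}, tip {tip}")
```

Neither source names a passenger on its own; the join on medallion and time is what attaches a person to a fare and a tip.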

12 I Swear This Is Relevant... Back To GDPR

13 Yes...This Means The New York Taxi Commission has personal data by GDPR definition (we identified individuals indirectly). GDPR would apply to the New York Taxi Commission (but probably only if the data was generated in an EU city)! Are you having an oh no moment?

14 GDPR Purpose Restrictions
No room for interpretation:
Consent: personal data may be processed on the basis that the data subject has consented to such processing
Contractual necessity: processing is necessary in order to enter into or perform a contract with the data subject
Compliance with legal obligations
Vital interests: this essentially applies in "life-or-death" scenarios
Public interest: necessary for the performance of tasks carried out by a public authority or private organisation acting in the public interest
Room for interpretation by an auditor (riskier):
Legitimate interests: must be specified at time of collection and reasonable (accountability on the data controller)

15 Processing Principles
Fair, lawful and transparent processing: ability to tell the data subject what their data is being used for
The purpose limitation principle: what we just discussed
Data minimisation: only process the personal data that you actually need to process in order to achieve your goals
Accuracy: responsibility for taking all reasonable steps to ensure that personal data are accurate
Data retention periods: data should not be retained for longer than necessary in relation to the purposes for which they were collected
Data security: data are kept secure, both against internal and external threats
Accountability: enforcement of the Data Protection Principles

16 Those Principles and Purposes are Scary...Maybe…
“Once a dataset is truly anonymised and individuals are no longer identifiable, European data protection law no longer applies.” -Article 29 Working Party

17 Let’s Talk Anonymization

18 Pseudonymization
“the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” -GDPR Article 4(5)

19 Pseudonymization In the Wild
Back to our New York Taxi Data... They actually did go to the trouble of pseudonymizing the data by hashing the medallion id. But that didn’t matter...
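A minimal sketch of why hashing the medallion did not help. The slides do not say which hash was used, so the unsalted SHA-256 here is an assumption, and the medallion values and rows are invented for illustration.

```python
import hashlib


def pseudonymize(medallion: str) -> str:
    """Replace a medallion ID with an unsalted SHA-256 digest (assumed scheme)."""
    return hashlib.sha256(medallion.encode("utf-8")).hexdigest()


# Hypothetical released rows: (pseudonymized medallion, pickup time, tip)
released = [
    (pseudonymize("9Y99"), "2013-07-08 19:34", 2.10),
    (pseudonymize("9Y99"), "2013-07-08 22:05", 0.00),
    (pseudonymize("7P42"), "2013-07-08 19:35", 1.00),
]

# The same input always yields the same digest, so an attacker who reads the
# medallion off a photo can hash it the same way and pull every trip for that cab.
target = pseudonymize("9Y99")
trips_for_target = [row for row in released if row[0] == target]
print(len(trips_for_target))
```

The pseudonym is stable across rows, so the dataset stays exactly as linkable as before; only the label changed.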

20 More Link Attacks
[Diagram: NY Taxi Data column pairs that can be linked to external sources: Medallion & Pickup Time (to Medallion & Photo Time), Pickup Time & Pickup Loc, Pickup Loc & Dropoff Loc, Dropoff Loc & Dropoff Time, Dropoff Time & Amount (to Dropoff Time & Receipt)]

21 Cardinality is the Achilles Heel of Anonymization
What did all those columns we linked have in common? -- They have many unique values (high cardinality).
The more unique values, the more opportunity to pinpoint and link an external source.
These columns contain what is termed quasi-identifiers.
Quasi-identifiers aren’t personal data necessarily!
You’re hashing for anonymity, not privacy - thus removing utility! (I always wear a helmet and nothing else)
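One rough way to spot candidate quasi-identifiers is to measure how close each column is to having one distinct value per row. The sketch below assumes a tiny invented table; the column names and the `uniqueness` helper are hypothetical, not part of any tool named in the talk.

```python
def uniqueness(rows, column):
    """Distinct values divided by row count; 1.0 means every row has its own value."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)


# Hypothetical trip records
rows = [
    {"payment_type": "CRD", "pickup_time": "2013-07-08 19:34:12"},
    {"payment_type": "CSH", "pickup_time": "2013-07-08 19:35:40"},
    {"payment_type": "CRD", "pickup_time": "2013-07-08 22:05:03"},
    {"payment_type": "CRD", "pickup_time": "2013-07-09 08:11:57"},
]

for column in ("payment_type", "pickup_time"):
    print(column, uniqueness(rows, column))
```

Here `payment_type` scores 0.5 and `pickup_time` scores 1.0, so the timestamp is the column an attacker would reach for first when linking to an outside source.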

22 The Privacy vs Utility Tradeoff
This is what our data looks like now to prevent link attacks: remove all quasi-identifiers, remove all utility!

23 Pseudonymization Good, But Not Party Time:
“pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject, and is accordingly a useful security measure.” -Article 29 Working Party
In plain English: GDPR requires that you pseudonymize when you can because that minimizes risk; this is GDPR’s “privacy by design”.
So it does buy you something, but GDPR still applies.

24 The Anonymization Zoo
Let’s go through some other anonymization techniques. Will we get to party time?
K-Anonymization
Differential Privacy

25 K-Anonymization
Think of k-anonymization as a better way to hash than what we did for the taxi data in the prior slides, one that preserves more utility.
This is done by generalizing quasi-identifiers: making them more “coarse” so that they become homogeneous with their neighbors.
Each record is then indistinguishable from at least k-1 other records, forming an equivalence class.
[Example table: columns Coordinates, Zip Code, Age; for instance zip codes 20852, 20868 and 20878 generalized to 208*]
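The generalization step can be sketched as follows. This is a toy illustration under stated assumptions: the records are invented (the zip codes echo the slide's example), and the rules "keep the first three zip digits" and "bucket age by decade" are one arbitrary choice of coarsening, not a recommended scheme.

```python
from collections import Counter


def generalize(record):
    """Coarsen the quasi-identifiers: 3-digit zip prefix and a ten-year age band."""
    zip_code, age = record
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}")


def k_of(records):
    """Size of the smallest equivalence class after generalization."""
    classes = Counter(generalize(record) for record in records)
    return min(classes.values())


# Hypothetical records: (zip code, age)
records = [
    ("20852", 26),
    ("20868", 24),
    ("20878", 29),
    ("20901", 41),
    ("20910", 47),
]

print(generalize(records[0]))
print(k_of(records))
```

With these rows the smallest class has two members, so the generalized table is 2-anonymous: every record shares its coarsened zip and age band with at least one other record.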

26 Example: Generalizing By Zip Code
Homogeneity Attack
Black Female born in 1965, do we know their problem? -- YES
Black Male born in 1965, do we know their problem? -- No
Black Male, do we know their problem? -- No
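A homogeneity attack succeeds when everyone in an equivalence class shares the same sensitive value. The check below is a sketch on invented rows; the generalized zip and age values and the sensitive values are placeholders, not the slide's table.

```python
from collections import defaultdict


def distinct_sensitive_values(rows):
    """Map each equivalence class to the number of distinct sensitive values in it."""
    groups = defaultdict(set)
    for *quasi_identifiers, sensitive in rows:
        groups[tuple(quasi_identifiers)].add(sensitive)
    return {key: len(values) for key, values in groups.items()}


# Hypothetical k-anonymized rows: (zip, age band, sensitive value)
rows = [
    ("208**", "20-29", "flu"),
    ("208**", "20-29", "flu"),
    ("209**", "40-49", "flu"),
    ("209**", "40-49", "asthma"),
]

diversity = distinct_sensitive_values(rows)
leaking = [key for key, count in diversity.items() if count == 1]
print(leaking)
```

Both classes satisfy k = 2, yet anyone known to be in the first class is revealed to have the same sensitive value as the rest of it. Requiring at least l distinct values per class is the idea behind l-diversity.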

27 K-Anonymized Taxi Data
K-anonymized pickup & dropoff loc and time
Certainly more utility
But same problems...
Link attack on very unique pickup/dropoff
Homogeneity attack: everyone tipped the same
L-Diversity and T-Closeness have their own problems

28 K-Anonymization In the Wild
I’m not the only one that gets the joke now!

29 K-Anonymization, Better Utility, No Party
K-Anonymization provides no guarantees of privacy
K-Anonymization is computationally intensive to build - searching for K-perfection, L-Diversity, T-Closeness may be a waste of time
There’s still a privacy vs utility tradeoff to contend with
One should mask (pseudonymize) personal data and generalize quasi-identifiers to meet “privacy by design” principles whenever possible

30 The Privacy vs Utility Game
Let’s have some fun...

31 The movie title is our “private” data
We can generalize
We can mask the rest...

32 Challenge 1: Basically what NY Taxi Did

33 Challenge 2: More Anonymization Applied
19** 3 hours

34 Challenge 3: Perfectly Private, But Completely Useless
8.2 19** 1 hour 19** 86 723 user, 201 critic 438

35 The Anonymization Zoo
Let’s go through some other anonymization techniques. Will we get to party time?
K-Anonymization
Differential Privacy

36 Let’s Play Another Game...
Think of a number 1 - 6
Now I’m going to ask you a private question you may not want to answer in public
Did you vote, or would you have voted, for Brexit?
Now, if you thought of a “3” or answered “YES” to Brexit, then raise your hand when this counter gets to zero: 1 2 3
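The game above can be sketched as code to show how the presenter still learns the aggregate. The sketch assumes each person's number is uniform over 1 to 6 (real audiences are not that random) and that everyone answers honestly; the function names are hypothetical.

```python
import random


def raises_hand(true_answer: bool, rng: random.Random) -> bool:
    """One participant: raise a hand on a secret '3' or on a true YES."""
    secret_number = rng.randint(1, 6)
    return secret_number == 3 or true_answer


def estimate_yes_rate(hands_raised: int, audience: int) -> float:
    """Invert P(hand) = 1/6 + (5/6) * p to recover the YES rate p."""
    observed = hands_raised / audience
    return (observed - 1 / 6) / (5 / 6)


def simulate(true_yes_rate: float, audience: int, seed: int = 0) -> float:
    """Play the game with a simulated audience and return the estimated YES rate."""
    rng = random.Random(seed)
    hands = sum(
        raises_hand(rng.random() < true_yes_rate, rng) for _ in range(audience)
    )
    return estimate_yes_rate(hands, audience)


print(estimate_yes_rate(50, 100))
print(simulate(0.3, 100_000))
```

A raised hand no longer proves a YES, because it may have been a "3", yet the count of hands still yields a usable estimate. Note that in this particular game a hand kept down does reveal a NO, so the deniability only runs one way.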

37 Differential Privacy
‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ - Frank McSherry
House prices: $320k, $340k, $330k, $30M
Sensitivity of median = ~10k
Sensitivity of mean = ~30M
Chat notes captured with the slide:
klucar [10:31 AM]: no, with S/A we would first have much more data, and we would split the data up, calculate separate means for each group, then use the median of the groups.
[10:31]: so basically the 30M would get thrown out and you'd get a mean closer to the median
steve [10:32 AM]: right, this example is stupid because there’s only 4 rows. The reason S and A works is because there’d be more data, not because it’s S and A?
[10:32]: well, I guess it’s kinda both. I gotcha, though
[10:32 AM]: both.
steve [10:33 AM]: we get a better feel for global sensitivity
[10:33 AM]: yeah, if we did straight DP, we would have to look at what the most expensive house could ever be and add noise proportional to that house.
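To make the role of sensitivity concrete, here is a sketch using the Laplace mechanism, a standard way to add noise scaled to sensitivity divided by epsilon. The slide does not name a mechanism, so that choice is an assumption, as is epsilon = 1.0; the prices and the two sensitivity figures are taken from the slide as given.

```python
import random
import statistics

prices = [320_000, 340_000, 330_000, 30_000_000]


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_answer(true_value, sensitivity, epsilon, rng):
    """Release true_value plus noise proportional to sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
epsilon = 1.0  # assumed privacy parameter

# Sensitivities as stated on the slide: ~10k for the median, ~30M for the mean.
noisy_median = private_answer(statistics.median(prices), 10_000, epsilon, rng)
noisy_mean = private_answer(statistics.mean(prices), 30_000_000, epsilon, rng)

print(statistics.median(prices), noisy_median)
print(statistics.mean(prices), noisy_mean)
```

The true median is 335,000 and the true mean is 7,747,500. Because one $30M house can move the mean so far, the noise needed to hide that house is on the order of tens of millions, which swamps the answer, while the median needs noise only on the order of ten thousand.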

38 There’s a Catch! (Three of Them)
1. You can only ask “aggregate” questions of the data. For example, the count of hands raised, but not specifically whose hand. SUM, COUNT, AVERAGE, MIN, MAX
2. If you ask the same/similar question enough - you’ll find the right answer!! You know, statistics...if you flip a coin 100 times, you’re going to get really close to 50% each side. The “Privacy Budget”.
3. “Epsilon” (amount of privacy) is not intuitive and hard to assign

39 So What Would Differential Privacy Look Like In Our Movie Game?
Let’s pretend the rating was the sensitive piece of data
SELECT AVG(rating) WHERE title = ‘The Terminator’
Repeated answers: 7.3, 8.6, 8.3, 7.6
Average = 7.95
The more we ask, the more we pound away the noise
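The arithmetic on this slide, plus the "privacy budget" idea that limits it, can be sketched as below. The four answers are the slide's; the `PrivacyBudget` class, its total of 1.0, and the per-query cost of 0.25 are hypothetical values chosen only to illustrate the bookkeeping.

```python
# The four noisy answers shown on the slide
noisy_answers = [7.3, 8.6, 8.3, 7.6]
average = sum(noisy_answers) / len(noisy_answers)
print(round(average, 2))


class PrivacyBudget:
    """Hypothetical tracker: each query spends epsilon until none is left."""

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> None:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted; query refused")
        self.remaining -= epsilon


budget = PrivacyBudget(total_epsilon=1.0)
for _ in noisy_answers:
    budget.spend(0.25)  # assumed cost per query

print(budget.remaining)
```

Averaging repeated answers cancels the noise, which is why a system has to stop answering once the budget is spent: a fifth call to `budget.spend(0.25)` here would raise instead of returning another sample.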

40 Differential Privacy
Differential Privacy does provide guarantees of privacy!
But, there are still utility limitations:
You need to understand you can only ask general / aggregate type questions. This should be intuitive: you shouldn’t ask specifics of anonymized data
Very hard to do exploration with the privacy budget; you somewhat have to know the questions you intend to ask up front.
Intuition about privacy settings (epsilon) is hard to build. There are tricks you can do here

41 Don’t Rely On Anonymization Alone!
Recital 26**, talking about anonymization: “To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.”
**Note that anonymization is only ever mentioned in Recital 26. Recitals can be thought of as commentary, and some would consider them non-binding.
Until there’s GDPR guidance about when data is “reasonably likely” to be re-identified, early adopters will face an uncertain regulatory environment: RISK!!
Even with the guarantees of Differential Privacy, one still needs to meet the principles and purpose requirements for original collection!!
The effectiveness (and legality) of both anonymization and pseudonymization hinges on their ability to protect data subjects from re-identification.

42 What I Recommend
A rock solid governance solution in your organization:
Data Ownership
Access Policies
Appropriate Usage
Data Lifecycle
Always some level of anonymization and/or pseudonymization to meet the privacy by design requirements

43 Governance
Data Ownership: Who owns the data and makes decisions on how and if it can be accessed, and is held accountable for those decisions?
Access Policies: Who can access the data, what exactly can they see, and under what circumstances?
Appropriate Usage: What constitutes appropriate and inappropriate use of data internally and externally, particularly for automated decisions?
Data Lifecycle: How to manage acquiring, storing, selling, and purging your data?
Governance is not memos and glorified wikis - it’s actual enforcement through software!

44 A Complex Problem
You have data everywhere in many different storage technologies, and now complex data governance requirements to enforce:
Consent, Transparency, Retention, Anonymization, Legitimate interests, Minimization, Accountability
DON’T IMPLEMENT UNIQUELY PER DATABASE!
DON’T DATA LAKE FOR THE PURPOSE OF COMPLIANCE SIMPLIFICATION!

45 The Data Control Plane
[Diagram: The Data Control Plane, relating the governance functions (Data Ownership, Data Lifecycle, Access Policies, Appropriate Usage) to the requirements (Consent, Transparency, Retention, Anonymization, Legitimate interests, Minimization, Accountability)]

46 Tenets of a Data Control Plane
Simplicity: Easy to create privacy rules and expose authoritative views of data from any storage technology
Mutability: Ability to change rules and have that reflected in the data on the fly
Accessibility: Plane cannot force users to an API to access the data → Needs to be accessible by any language or tool
Context: State of access requests needs to be understood to enforce rules appropriately (link data to analytical context, e.g. purpose)
Visibility: All actions in the plane are audited, all policies are understandable

47 A Critical Component: Purposes
Purpose-based restrictions are the future of privacy controls
Purpose-based restrictions DO NOT fit in the identity management frameworks we’re used to
Identity: Roles, Groups, Authorizations - GRANTED TO ME
Purpose: Context, Dynamic, Layered - REACT TO MY CONTEXT
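The contrast between identity and purpose can be sketched as a policy check that takes both into account. Everything here is hypothetical: the `AccessRequest` type, the policy layout, and the role, purpose, and column names are invented for illustration and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    roles: frozenset  # identity: granted to the user ahead of time
    purpose: str      # context: declared for this particular piece of work


# Hypothetical policy: per column, the role required and the purposes allowed
POLICY = {
    "pickup_zone": {
        "role": "analyst",
        "purposes": {"demand_forecasting", "fraud_investigation"},
    },
    "tip_amount": {
        "role": "analyst",
        "purposes": {"fraud_investigation"},
    },
}


def visible_columns(request: AccessRequest) -> list:
    """Columns the request may see: the role must match AND the purpose must be allowed."""
    return sorted(
        column
        for column, rule in POLICY.items()
        if rule["role"] in request.roles and request.purpose in rule["purposes"]
    )


analyst = frozenset({"analyst"})
print(visible_columns(AccessRequest(analyst, "demand_forecasting")))
print(visible_columns(AccessRequest(analyst, "fraud_investigation")))
```

The same analyst sees different columns depending on the declared purpose, which is the part a role-only model cannot express.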

48 Conclusion
Don’t try to shortcut GDPR.
Always pseudonymize/anonymize when possible, but don’t use it to escape GDPR, at least not yet.
Necessity is the mother of invention: you’ll see your data science operations soar once governance is applied appropriately. Governance can be an enabler!

49 @steve_touw
