Presentation on theme: "GDPR, Data Privacy, Anonymization, Minimization. . .Oh My!"— Presentation transcript:
1 GDPR, Data Privacy, Anonymization, Minimization. . .Oh My! Steve Touw, Immuta
2 About Me/Immuta: CTO of Immuta. Immuta is a self-service platform where data owners, data scientists, and compliance officers eliminate friction and accelerate innovation. Our software enables enterprises to unlock data, control risk, and innovate faster with confidence.
3 Agenda: GDPR & data processing: why do YOU care? Get out of GDPR jail free? The Anonymization Zoo. The “Data Control Plane”. Conclusion.
5 GDPR In A Nutshell: “The General Data Protection Regulation is the EU’s primary data governance regulation and realistically applies to any business using data from EU data subjects. It is the most forward-leading privacy regime on the planet, with fines of up to four percent of global revenue. With such staggering fines, breaching the GDPR is a risk that many enterprises quite literally may not be able to afford.” -Andrew Burt, Immuta. Example: Apple, with 78.4 billion in revenue, could face a fine of roughly 3 billion.
6 It’s All About Personal Data Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
13 Yes...This Means: The New York Taxi Commission has personal data by the GDPR definition (we identified individuals indirectly). GDPR would apply to the New York Taxi Commission (but probably only if the data was generated in an EU city)! Are you having an “oh no” moment?
14 GDPR Purpose Restrictions. No room for interpretation: (1) Consent: personal data may be processed on the basis that the data subject has consented to such processing. (2) Contractual necessity: processing is necessary in order to enter into or perform a contract with the data subject. (3) Compliance with legal obligations. (4) Vital interests: this essentially applies in "life-or-death" scenarios. (5) Public interest: necessary for the performance of tasks carried out by a public authority or private organisation acting in the public interest. Room for interpretation by an auditor, and therefore riskier: (6) Legitimate interests: must be specified at time of collection and be reasonable (accountability on the data controller).
15 Processing Principles. Fair, lawful and transparent processing: ability to tell the data subject what their data is being used for. The purpose limitation principle: what we just discussed. Data minimisation: only process the personal data you actually need to process in order to achieve your goals. Accuracy: responsibility for taking all reasonable steps to ensure that personal data are accurate. Data retention periods: data should not be retained for longer than necessary in relation to the purposes for which they were collected. Data security: data are kept secure, against both internal and external threats. Accountability: enforcement of the Data Protection Principles.
16 Those Principles and Purposes are Scary...Maybe… “Once a dataset is truly anonymised and individuals are no longer identifiable, European data protection law no longer applies.”-Article 29 Working Party
18 Pseudonymization: “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” -GDPR Article 4(5)
19 Pseudonymization In the Wild: Back to our New York taxi data... They actually did go to the trouble of pseudonymizing the data by hashing the medallion ID. But that didn’t matter...
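To make "hashing the medallion ID" concrete, here is a minimal sketch of pseudonymization with a keyed hash. The key, the medallion values, and the trip rows are all made up for illustration, and HMAC-SHA256 is an assumed choice, not necessarily what was used on the real taxi data. Note that the pseudonym is stable: the same medallion always maps to the same value, so trips by one cab stay linked to each other.

```python
import hashlib
import hmac

# Assumption: a secret key stored separately from the released data.
SECRET_KEY = b"hypothetical-secret-key"

def pseudonymize(medallion_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated for readability."""
    digest = hmac.new(key, medallion_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

# Hypothetical trip rows; only the medallion column is treated as a direct identifier.
trips = [
    {"medallion": "5X41", "pickup_time": "2013-07-08 22:47"},
    {"medallion": "5X41", "pickup_time": "2013-07-09 01:12"},
    {"medallion": "7B22", "pickup_time": "2013-07-08 22:50"},
]

# The released data swaps the medallion for its pseudonym and leaves the rest untouched.
released = [
    {"medallion": pseudonymize(t["medallion"]), "pickup_time": t["pickup_time"]}
    for t in trips
]
```

The pickup time is released unchanged, which is exactly the kind of column the following slides exploit.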
20 More Link Attacks [Figure: columns of the NY taxi data linked to external sources. Medallion & pickup time link to the medallion & time in a photo; pickup time & pickup loc, pickup loc & dropoff loc, and dropoff loc & dropoff time are further link points; dropoff time & amount link to the dropoff time on a receipt.]
21 Cardinality is the Achilles Heel of Anonymization: What did all those columns we linked have in common? -- They have many unique values (high cardinality). The more unique values, the more opportunity to pinpoint a record and link it to an external source. These columns contain what are termed quasi-identifiers. Quasi-identifiers aren’t necessarily personal data! You’re hashing for anonymity, not privacy - thus removing utility! (I always wear a helmet and nothing else)
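A link attack is essentially a join on quasi-identifier columns. The sketch below uses invented trip rows and an invented external "sighting" (for example, a timestamped photo); the column names and helper functions are hypothetical. It shows that a hashed medallion does not help once pickup time and location single out one row.

```python
from collections import Counter

# Hypothetical released trips: medallion is hashed, quasi-identifiers are intact.
released_trips = [
    {"medallion": "a1f3", "pickup_time": "22:47", "pickup_loc": "W 54th St", "amount": 18.50},
    {"medallion": "9c0e", "pickup_time": "22:47", "pickup_loc": "Bleecker St", "amount": 9.00},
    {"medallion": "a1f3", "pickup_time": "23:30", "pickup_loc": "E 10th St", "amount": 12.25},
]

# Hypothetical external source, e.g. a timestamped photo of someone getting into a cab.
external_sightings = [
    {"person": "Person A", "pickup_time": "22:47", "pickup_loc": "W 54th St"},
]

def distinct_ratio(rows, column):
    """Fraction of rows carrying a distinct value in `column` (1.0 = every value unique)."""
    return len({row[column] for row in rows}) / len(rows)

def link(trips, sightings, keys):
    """Join on the quasi-identifier columns, keeping only unambiguous (single) matches."""
    counts = Counter(tuple(t[k] for k in keys) for t in trips)
    matches = []
    for sighting in sightings:
        key = tuple(sighting[k] for k in keys)
        if counts[key] == 1:
            trip = next(t for t in trips if tuple(t[k] for k in keys) == key)
            matches.append({"person": sighting["person"], **trip})
    return matches

# Pickup time plus pickup location pins Person A to exactly one trip and its fare.
linked = link(released_trips, external_sightings, ["pickup_time", "pickup_loc"])
```

The higher the `distinct_ratio` of a column (or combination of columns), the more often such a join returns a single match.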
22 The Privacy vs Utility Tradeoff: This is what our data looks like now to prevent link attacks: remove all quasi-identifiers, remove all utility!
23 Pseudonymization Good, But Not Party Time: “pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject, and is accordingly a useful security measure.” -Article 29 Working Party. In plain English: GDPR requires that you pseudonymize when you can because that minimizes risk; this is GDPR’s “privacy by design”. So it does buy you something, but GDPR still applies.
24 The Anonymization Zoo: Let’s go through some other anonymization techniques. Will we get to party time? K-Anonymization. Differential Privacy.
25 K-Anonymization: Think of k-anonymization as a better way to do what we tried with hashing on the taxi data in the prior slides, one that preserves more utility. This is done by generalizing quasi-identifiers, making them more “coarse” so that they become homogeneous with their neighbors. Each record is then indistinguishable from at least k-1 other records, forming an equivalence class. [Figure: example of generalizing the Coordinates, Zip Code, and Age columns, e.g. zip codes coarsened to 208*.]
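The idea of generalizing until every record hides among at least k-1 others can be sketched as follows. The records, the 3-digit zip prefix, and the 10-year age bands are illustrative assumptions; real k-anonymization tools search for the least destructive generalization rather than applying a fixed one.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: keep a 3-digit zip prefix and a 10-year age band."""
    decade = (record["age"] // 10) * 10
    return (record["zip"][:3] + "**", f"{decade}-{decade + 9}")

def k_of(records, quasi_id=generalize):
    """Size of the smallest equivalence class; the data is k-anonymous for this k."""
    classes = Counter(quasi_id(r) for r in records)
    return min(classes.values())

# Hypothetical records holding only the quasi-identifier columns.
records = [
    {"zip": "20852", "age": 24},
    {"zip": "20868", "age": 29},
    {"zip": "20878", "age": 25},
    {"zip": "10013", "age": 41},
    {"zip": "10027", "age": 47},
]

# Before generalization every record is unique (k = 1).
raw_k = k_of(records, quasi_id=lambda r: (r["zip"], r["age"]))

# After generalization the smallest equivalence class has two records (k = 2).
generalized_k = k_of(records)
```

Coarser bands raise k but also blur the data, which is the privacy vs utility tradeoff in miniature.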
26 Example: Generalizing By Zip Code. Homogeneity attack: Black female born in 1965, do we know their problem? -- YES. Black male born in 1965, do we know their problem? -- No. Black male, do we know their problem? -- No.
27 K-Anonymized Taxi Data: K-anonymized pickup & dropoff location and time. Certainly more utility. But same problems... Link attack on very unique pickup/dropoff. Homogeneity attack: everyone tipped the same. L-Diversity and T-Closeness have their own problems.
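A homogeneity attack works when every record in an equivalence class shares the same sensitive value, so knowing someone's class reveals the value. A rough check, using invented generalized trips and a hypothetical `tip_pct` column, counts the distinct sensitive values per class (the quantity L-Diversity puts a lower bound on):

```python
from collections import defaultdict

def sensitive_diversity(records, quasi_keys, sensitive_key):
    """Number of distinct sensitive values within each equivalence class."""
    classes = defaultdict(set)
    for record in records:
        classes[tuple(record[k] for k in quasi_keys)].add(record[sensitive_key])
    return {qid: len(values) for qid, values in classes.items()}

# Hypothetical k-anonymized trips (k = 3): zone and hour are already generalized.
trips = [
    {"pickup_zone": "Midtown", "pickup_hour": "22:00-23:00", "tip_pct": 20},
    {"pickup_zone": "Midtown", "pickup_hour": "22:00-23:00", "tip_pct": 20},
    {"pickup_zone": "Midtown", "pickup_hour": "22:00-23:00", "tip_pct": 20},
    {"pickup_zone": "SoHo", "pickup_hour": "22:00-23:00", "tip_pct": 0},
    {"pickup_zone": "SoHo", "pickup_hour": "22:00-23:00", "tip_pct": 15},
    {"pickup_zone": "SoHo", "pickup_hour": "22:00-23:00", "tip_pct": 25},
]

diversity = sensitive_diversity(trips, ["pickup_zone", "pickup_hour"], "tip_pct")

# Classes with a single sensitive value leak it to anyone who can place a rider there.
homogeneous = [qid for qid, distinct in diversity.items() if distinct == 1]
```

Here the Midtown class is 3-anonymous yet still reveals that everyone in it tipped 20 percent.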
28 K-Anonymization In the Wild I’m not the only one that gets the joke now!
29 K-Anonymization, Better Utility, No Party: K-Anonymization provides no guarantees of privacy. K-Anonymization is computationally intensive to build - searching for K-perfection, L-Diversity, T-Closeness may be a waste of time. There’s still a privacy vs utility tradeoff to contend with. One should mask (pseudonymize) personal data and generalize quasi-identifiers to meet “privacy by design” principles whenever possible.
30 The Privacy vs Utility Game Let’s have some fun...
31 The movie title is our “private” data: We can generalize. We can mask the rest….
35 The Anonymization Zoo: Let’s go through some other anonymization techniques. Will we get to party time? K-Anonymization. Differential Privacy.
36 Let’s Play Another Game... Think of a number from 1 to 6. Now I’m going to ask you a private question you may not want to answer in public: Did you, or would you have, voted for Brexit? Now, if you thought of a “3” or answered “YES” to Brexit, then raise your hand when this counter gets to zero: 1 2 3
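The game above is a form of randomized response: a raised hand does not prove a "yes", because one in six people raise it regardless. If p is the true fraction of "yes" answers, then P(hand raised) = 1/6 + (5/6)p, so the aggregate can still be recovered. A small simulation, with an assumed true rate of 30% and a simulated audience of 100,000:

```python
import random

def respond(true_answer: bool, rng: random.Random) -> bool:
    """Raise a hand if the secret die roll is 3 or the true answer is yes."""
    return rng.randint(1, 6) == 3 or true_answer

def estimate_yes_rate(raised: int, total: int) -> float:
    """Invert P(raise) = 1/6 + (5/6) * p to recover p, the true 'yes' rate."""
    observed = raised / total
    return (observed - 1 / 6) / (5 / 6)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_rate = 0.30        # assumption: made-up ground truth for the simulation

# Each audience member has a private true answer and a private die roll.
audience = [rng.random() < true_rate for _ in range(100_000)]
raised = sum(respond(answer, rng) for answer in audience)

# The presenter only sees the count of raised hands, yet gets close to the true rate.
estimate = estimate_yes_rate(raised, len(audience))
```

Each individual keeps plausible deniability, while the estimate of the overall rate gets better as the audience grows.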
37 Differential Privacy: ‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ - Frank McSherry
Example house prices: $320k, $340k, $330k, $30M. Sensitivity of median = ~10k. Sensitivity of mean = ~30M.
Speaker notes (chat excerpt on S/A, i.e. sample and aggregate):
klucar [10:31 AM]: no, with S/A we would first have much more data, and we would split the data up, calculate separate means for each group, then use the median of the groups.
[10:31]: so basically the 30M would get thrown out and you'd get a mean closer to the median
steve [10:32 AM]: right, this example is stupid because there’s only 4 rows. The reason S and A works is because there’d be more data, not because it’s S and A?
[10:32]: well, I guess it’s kinda both
I gotcha, though
[10:32 AM]: both.
steve [10:33 AM]: we get a better feel for global sensitivity
[10:33 AM]: yeah, if we did straight DP, we would have to look at what the most expensive house could ever be and add noise proportional to that house.
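"Add noise proportional to the most expensive house" can be sketched with the Laplace mechanism, one common way to achieve differential privacy (the slide does not say which mechanism is meant). The clamping bounds, the epsilon value, and the function names are assumptions for illustration; with values clamped to [lower, upper] and a fixed number of rows n, changing one record moves the mean by at most (upper - lower) / n.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the scaled difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_mean(values, lower, upper, epsilon, rng):
    """Clamp values to [lower, upper], then add noise scaled to the mean's sensitivity."""
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)  # max change from altering one record
    true_mean = sum(clamped) / len(clamped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
prices = [320_000, 340_000, 330_000, 30_000_000]

# Assumed bound: treat 1M as the most a house "could ever be"; the 30M outlier is clamped.
noisy = private_mean(prices, lower=0, upper=1_000_000, epsilon=1.0, rng=rng)
```

With only four rows the noise scale is 250,000, which swamps the answer; with many more rows the same bound yields far less noise, which is the point made in the speaker notes.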
38 There’s a Catch! (Three of Them) 1. You can only ask “aggregate” questions of the data. For example, the count of hands raised, but not specifically whose hand. SUM, COUNT, AVERAGE, MIN, MAX. 2. If you ask the same or a similar question enough times, you’ll find the right answer!! You know, statistics... if you flip a coin 100 times, you’re going to get really close to 50% each side. Hence the “Privacy Budget”. 3. “Epsilon” (the amount of privacy) is not intuitive and is hard to assign.
39 So What Would Differential Privacy Look Like In Our Movie Game? Let’s pretend the rating was the sensitive piece of data: SELECT AVG(rating) WHERE title = ‘The Terminator’. [Figure: noisy answers returned by repeated runs of the query; individual values garbled in the source.] Average = 7.95. The more we ask, the more we pound away the noise.
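Why repeated questions "pound away the noise", and how a privacy budget stops that, can be simulated as below. The true rating of 8.0, the sensitivity of 0.05, and the budget numbers are made up; the `BudgetedQuery` class is a hypothetical helper, not any product's API.

```python
import random
import statistics

class BudgetedQuery:
    """Answer one fixed query with Laplace noise until the privacy budget is spent."""

    def __init__(self, true_answer, sensitivity, total_epsilon, rng):
        self.true_answer = true_answer
        self.sensitivity = sensitivity
        self.remaining = total_epsilon
        self.rng = rng

    def ask(self, epsilon):
        # Small tolerance guards against floating-point drift in the running total.
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        scale = self.sensitivity / epsilon
        noise = scale * (self.rng.expovariate(1.0) - self.rng.expovariate(1.0))
        return self.true_answer + noise

# Assumptions: true average rating 8.0, sensitivity 0.05, total budget of epsilon = 5.
query = BudgetedQuery(true_answer=8.0, sensitivity=0.05, total_epsilon=5.0,
                      rng=random.Random(7))

# Fifty cheap queries at epsilon = 0.1 each use up the whole budget.
answers = [query.ask(0.1) for _ in range(50)]

# Any single answer is noisy, but the average of all fifty is much closer to the truth.
averaged = statistics.mean(answers)
```

Once the budget is gone, further calls to `ask` are refused, which is why exploration has to be planned up front.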
40 Differential Privacy: Differential Privacy does provide guarantees of privacy! But there are still utility limitations: (1) You need to understand that you can only ask general / aggregate type questions. This should be intuitive: you shouldn’t ask specifics of anonymized data. (2) It is very hard to do exploration with the privacy budget; you somewhat have to know the questions you intend to ask up front. (3) Intuition about privacy settings (epsilon) is hard to come by, although there are tricks you can do here.
41 Don’t Rely On Anonymization Alone! Recital 26**, talking about anonymization: “To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.” **Note that anonymization is only ever mentioned in Recital 26. Recitals can be thought of as commentary, and some would consider them non-binding. Until there’s GDPR guidance about when data is “reasonably likely” to be re-identified, early adopters will face an uncertain regulatory environment. RISK!! Even with the guarantees of Differential Privacy, one still needs to meet the principles and purpose requirements for the original collection!! The effectiveness (and legality) of both anonymization and pseudonymization hinge on their ability to protect data subjects from re-identification.
42 What I Recommend: A rock solid governance solution in your organization: Data Ownership, Access Policies, Appropriate Usage, Data Lifecycle. Always some level of anonymization and/or pseudonymization to meet the privacy by design requirements.
43 Governance. Data Ownership: Who owns the data, decides how and whether it can be accessed, and is held accountable for those decisions? Access Policies: Who can access the data, what exactly can they see, and under what circumstances? Appropriate Usage: What constitutes appropriate and inappropriate use of data internally and externally, particularly for automated decisions? Data Lifecycle: How do you manage acquiring, storing, selling, and purging your data? Governance is not memos and glorified wikis - it’s actual enforcement through software!
44 A Complex Problem: You have data everywhere in many different storage technologies, and now complex data governance requirements to enforce: Consent, Transparency, Retention, Anonymization, Legitimate interests, Minimization, Accountability. DON’T IMPLEMENT UNIQUELY PER DATABASE! DON’T DATA LAKE FOR THE PURPOSE OF COMPLIANCE SIMPLIFICATION!
45 The Data Control Plane [Figure: a data control plane relating the governance functions (Data Ownership, Data Lifecycle, Access Policies, Appropriate Usage) to the GDPR requirements (Consent, Transparency, Retention, Anonymization, Legitimate interests, Minimization, Accountability).]
46 Tenets of a Data Control Plane. Simplicity: Easy to create privacy rules and expose authoritative views of data from any storage technology. Mutability: Ability to change rules and have that reflected in the data on the fly. Accessibility: The plane cannot force users to an API to access the data; it needs to be accessible by any language or tool. Context: The state of access requests needs to be understood to enforce rules appropriately (link data to analytical context, e.g. purpose). Visibility: All actions in the plane are audited, and all policies are understandable.
47 A Critical Component: Purposes. Purpose-based restrictions are the future of privacy controls. Purpose-based restrictions DO NOT fit in the identity management frameworks we’re used to. Identity: Roles, Groups, Authorizations - GRANTED TO ME. Purpose: Context, Dynamic, Layered - REACT TO MY CONTEXT.
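The difference between identity ("granted to me") and purpose ("react to my context") can be illustrated with a toy policy check. Everything here is hypothetical: the column names, purposes, role name, and the `apply_policy` function are invented for the sketch and do not describe any particular product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessContext:
    user: str
    roles: frozenset  # identity: granted to the user ahead of time
    purpose: str      # context: declared per request and free to change

# Hypothetical policy: the purposes under which each column may be seen unmasked.
COLUMN_PURPOSES = {
    "trip_id": {"fraud_investigation", "traffic_planning"},
    "pickup_zone": {"fraud_investigation", "traffic_planning"},
    "card_number": {"fraud_investigation"},
}

def apply_policy(row: dict, ctx: AccessContext, required_role: str = "analyst") -> dict:
    """Identity gates access at all; purpose decides what is visible in this request."""
    if required_role not in ctx.roles:
        raise PermissionError(f"{ctx.user} lacks role {required_role!r}")
    return {
        column: value if ctx.purpose in COLUMN_PURPOSES.get(column, set()) else "***"
        for column, value in row.items()
    }

row = {"trip_id": 17, "pickup_zone": "Midtown", "card_number": "4111-0000-0000-0000"}
roles = frozenset({"analyst"})

# Same user, same roles, different declared purpose: the view of the data changes.
planning_view = apply_policy(row, AccessContext("sam", roles, "traffic_planning"))
fraud_view = apply_policy(row, AccessContext("sam", roles, "fraud_investigation"))
```

A role-only model would have to give this user either both views or neither; keying the rule on purpose lets the same identity see different data depending on why it is asking.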
48 Conclusion: Don’t try to shortcut GDPR. Always pseudonymize/anonymize when possible, but don’t use it to escape GDPR, at least not yet. Necessity is the mother of invention: you’ll see your data science operations soar once governance is applied appropriately. Governance can be an enabler!