Differential Privacy: Theoretical & Practical Challenges. Salil Vadhan, Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences.


Differential Privacy: Theoretical & Practical Challenges
Salil Vadhan
Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University
(on sabbatical at Shing-Tung Yau Center, Department of Applied Mathematics, National Chiao-Tung University)
Lecture at Institute of Information Science, Academia Sinica, November 9, 2015

Data Privacy: The Problem
Given a dataset with sensitive information, such as:
- Census data
- Health records
- Social network activity
- Telecommunications data
How can we enable "desirable uses" of the data while protecting the "privacy" of the data subjects? Desirable uses include:
- Academic research
- Informing policy
- Identifying subjects for drug trials
- Searching for terrorists
- Market analysis
- ...

Approach 1: Encrypt the Data

Name  | Sex | Blood | HIV?
Chen  | F   | B     | Y
Jones | M   | A     | N
Smith | M   | O     | N
Ross  | M   | O     | Y
Lu    | F   | A     | N
Shah  | M   | B     | Y

Problems?

Approach 2: Anonymize the Data

Name  | Sex | Blood | HIV?
Chen  | F   | B     | Y
Jones | M   | A     | N
Smith | M   | O     | N
Ross  | M   | O     | Y
Lu    | F   | A     | N
Shah  | M   | B     | Y

Problems? "Re-identification" is often easy [Sweeney `97].

Approach 3: Mediate Access
A trusted "curator" C sits between the dataset and the data analysts, answering queries q1, q2, q3, ... with answers a1, a2, a3, ...

Name  | Sex | Blood | HIV?
Chen  | F   | B     | Y
Jones | M   | A     | N
Smith | M   | O     | N
Ross  | M   | O     | Y
Lu    | F   | A     | N
Shah  | M   | B     | Y

Problems? Even simple "aggregate" statistics can reveal individual info. [Dinur-Nissim `03, Homer et al. `08, Mukatran et al. `11, Dwork et al. `15]
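The danger that "aggregate" statistics pose can be seen with a simple differencing attack on the slide's example table. The helper name and the query pair are ours, chosen for illustration:

```python
# Differencing attack: two innocuous-looking aggregate queries,
# subtracted, reveal one person's HIV status.

rows = [
    ("Chen",  "F", "B", "Y"),
    ("Jones", "M", "A", "N"),
    ("Smith", "M", "O", "N"),
    ("Ross",  "M", "O", "Y"),
    ("Lu",    "F", "A", "N"),
    ("Shah",  "M", "B", "Y"),
]

def count_hiv_positive(dataset):
    """Aggregate query: number of HIV-positive people in the dataset."""
    return sum(1 for r in dataset if r[3] == "Y")

# Query 1: count over everyone.  Query 2: count over everyone except Chen.
all_count = count_hiv_positive(rows)
without_chen = count_hiv_positive([r for r in rows if r[0] != "Chen"])

# The difference of the two "aggregates" is exactly Chen's HIV status.
chen_is_positive = (all_count - without_chen == 1)
print(chen_is_positive)  # True
```

Neither query mentions an individual's sensitive attribute directly, yet together they pinpoint it; this is the style of attack that motivates adding noise.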

Privacy Models from Theoretical CS

Model                                        | Utility                         | Privacy (what is protected)                  | Who Holds Data?
Differential Privacy                         | statistical analysis of dataset | individual-specific info                     | trusted curator
Secure Function Evaluation                   | any query desired               | everything other than result of query        | original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption | any query desired               | everything (except possibly result of query) | untrusted server

Differential privacy
[Dinur-Nissim `03 + Dwork, Dwork-Nissim `04, Blum-Dwork-McSherry-Nissim `05, Dwork-McSherry-Nissim-Smith `06]
A curator C answers data analysts' queries q1, q2, q3, ... with answers a1, a2, a3, ...

Sex | Blood | HIV?
F   | B     | Y
M   | A     | N
M   | O     | N
M   | O     | Y
F   | A     | N
M   | B     | Y

Requirement: the effect of each individual should be "hidden".

Differential privacy
[Dinur-Nissim `03 + Dwork, Dwork-Nissim `04, Blum-Dwork-McSherry-Nissim `05, Dwork-McSherry-Nissim-Smith `06]
Now the analyst is an adversary. Requirement: the adversary shouldn't be able to tell if any one person's data were changed arbitrarily, e.g. if the row (M, A, N) were removed or replaced by (F, A, Y):

Sex | Blood | HIV?
F   | B     | Y
F   | A     | Y    <- changed from (M, A, N)
M   | O     | N
M   | O     | Y
F   | A     | N
M   | B     | Y
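The "can't tell if one person's data changed" requirement can be made concrete with randomized response, a standard textbook example of an epsilon-differentially private mechanism (it is not taken from these slides; all names are ours):

```python
import math
import random

def randomized_response(bit, p_truth=0.75):
    """Report the true bit with probability p_truth, else flip it.
    With p_truth = 3/4 this mechanism is ln(3)-differentially private."""
    return bit if random.random() < p_truth else 1 - bit

def output_prob(true_bit, reported_bit, p_truth=0.75):
    """Exact probability that randomized_response(true_bit) == reported_bit."""
    return p_truth if reported_bit == true_bit else 1 - p_truth

# Differential privacy check: for every possible output, the probability
# ratio between the two possible inputs is bounded by e^epsilon.
eps = math.log(3)
for reported in (0, 1):
    ratio = output_prob(0, reported) / output_prob(1, reported)
    assert ratio <= math.exp(eps) + 1e-12
```

Because every output is plausible under either input (with probability ratio at most e^eps = 3), observing the output cannot reveal whether one person's bit was changed.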

Simple approach: random noise
Query: "What fraction of people are type B and HIV positive?" (true answer on the dataset above: 2/6)
The mechanism M releases the true answer plus random noise.
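A noisy answer of this kind is typically produced with the Laplace mechanism: a fraction query over n rows has sensitivity 1/n, so adding Lap(1/(eps*n)) noise gives eps-differential privacy. A minimal sketch (function names are ours; Python's random module has no Laplace sampler, so we invert the CDF):

```python
import math
import random

def laplace_noise(scale):
    """Sample Lap(0, scale) by inverting the Laplace CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_fraction(dataset, predicate, epsilon):
    """Fraction of rows satisfying predicate, plus Laplace noise.
    Changing one row moves the fraction by at most 1/n (the sensitivity),
    so noise with scale = 1/(epsilon * n) gives epsilon-DP."""
    n = len(dataset)
    true_frac = sum(1 for r in dataset if predicate(r)) / n
    return true_frac + laplace_noise(1.0 / (epsilon * n))

rows = [("F", "B", "Y"), ("M", "A", "N"), ("M", "O", "N"),
        ("M", "O", "Y"), ("F", "A", "N"), ("M", "B", "Y")]

# "What fraction of people are type B and HIV positive?"
ans = private_fraction(rows, lambda r: r[1] == "B" and r[2] == "Y", epsilon=1.0)
```

With epsilon = 1 and n = 6, the noise has standard deviation about 0.24, so a single answer reveals little about any one row while still being centered on the true value 2/6.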

Differential privacy
[Dinur-Nissim `03 + Dwork, Dwork-Nissim `04, Blum-Dwork-McSherry-Nissim `05, Dwork-McSherry-Nissim-Smith `06]
The curator C is randomized: the adversary sees noisy answers a1, a2, a3 to its queries q1, q2, q3 on the (possibly changed) dataset above.


Answering multiple queries
The curator C answers each query (e.g. "What fraction of people are type B and HIV positive?") on the same dataset with fresh random noise.
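Answering many queries costs privacy: by the basic composition property, k answers that are each eps/k-differentially private are together eps-differentially private, so the per-answer noise must grow with k. A sketch of this budget-splitting approach (all names ours; the dataset follows the slides' example):

```python
import math
import random

def laplace_noise(scale):
    """Sample Lap(0, scale) by inverting the Laplace CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def answer_queries(dataset, predicates, total_epsilon):
    """Answer k fraction queries, spending total_epsilon overall.
    Each query gets epsilon/k of the budget, so its Laplace scale
    (and hence its expected error) is k times that of a single query."""
    k = len(predicates)
    n = len(dataset)
    per_query_eps = total_epsilon / k
    answers = []
    for pred in predicates:
        frac = sum(1 for r in dataset if pred(r)) / n
        answers.append(frac + laplace_noise(1.0 / (per_query_eps * n)))
    return answers

rows = [("F", "B", "Y"), ("M", "A", "N"), ("M", "O", "N"),
        ("M", "O", "Y"), ("F", "A", "N"), ("M", "B", "Y")]

queries = [
    lambda r: r[1] == "B" and r[2] == "Y",   # type B and HIV positive
    lambda r: r[0] == "F",                   # female
    lambda r: r[2] == "Y",                   # HIV positive
]
answers = answer_queries(rows, queries, total_epsilon=1.0)
```

This linear loss in accuracy is what the more sophisticated algorithms cited later in the deck try to beat.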

Some Differentially Private Algorithms
- histograms [DMNS06]
- contingency tables [BCDKMT07, GHRU11, TUV12, DNT14]
- machine learning [BDMN05, KLNRS08]
- regression & statistical estimation [CMS11, S11, KST11, ST12, JT13]
- clustering [BDMN05, NRS07]
- social network analysis [HLMJ09, GRU11, KRSY11, KNRS13, BBDS13]
- approximation algorithms [GLMRT10]
- singular value decomposition [HR12, HR13, KT13, DTTZ14]
- streaming algorithms [DNRY10, DNPR10, MMNW11]
- mechanism design [MT07, NST10, X11, NOS12, CCKMV12, HK12, KPRU12]
- ...
See the Simons Institute Workshop on Big Data & Differential Privacy, 12/13.
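The first entry, histograms, is easy to sketch: bins are disjoint, and under the slides' change-one-row notion of neighboring datasets a changed row moves at most two bin counts by 1 each (its old bin and its new bin). A hedged sketch, not the [DMNS06] algorithm verbatim:

```python
import math
import random
from collections import Counter

def laplace_noise(scale):
    """Sample Lap(0, scale) by inverting the Laplace CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram(values, bins, epsilon):
    """Noisy count per bin.  Changing one row moves at most two bin
    counts by 1 each, so the histogram's L1 sensitivity is 2 and
    independent Lap(2/epsilon) noise per bin gives epsilon-DP."""
    counts = Counter(values)
    return {b: counts.get(b, 0) + laplace_noise(2.0 / epsilon) for b in bins}

# Blood-type column of the slides' example table.
blood_types = ["B", "A", "O", "O", "A", "B"]
hist = dp_histogram(blood_types, bins=["A", "B", "AB", "O"], epsilon=1.0)
```

Note the noise per bin is independent of the number of bins, which is what makes histograms one of the cheapest statistics to release privately.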

Differential Privacy: Interpretations
- Whatever an adversary learns about me, it could have learned from everyone else's data.
- The mechanism cannot leak "individual-specific" information.
- The above interpretations hold regardless of the adversary's auxiliary information.
- Composes gracefully (k repetitions of an ε-differentially private mechanism are kε-differentially private).
But:
- No protection for information that is not localized to a few rows.
- No guarantee that subjects won't be "harmed" by the results of analysis.

Amazing possibility: synthetic data
[Blum-Ligett-Roth `08, Hardt-Rothblum `10]
C releases a dataset of "fake" people. Utility: it preserves the fraction of people with every set of attributes!

Original:                      Synthetic ("fake" people):
Sex | Blood | HIV?             Sex | Blood | HIV?
F   | B     | Y                M   | B     | N
M   | A     | N                F   | B     | Y
M   | O     | N                M   | O     | Y
M   | O     | Y                F   | A     | N
F   | A     | N                F   | O     | N
M   | B     | Y
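One simple (though far less accurate) way to produce differentially private synthetic data, well short of the sophisticated algorithms cited on this slide, is the perturbed-histogram approach: add Laplace noise to the count of every possible row, renormalize, and sample "fake" rows. A sketch under our own assumptions (the tiny row domain and all names are ours):

```python
import math
import random
from collections import Counter
from itertools import product

def laplace_noise(scale):
    """Sample Lap(0, scale) by inverting the Laplace CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def synthetic_rows(dataset, num_fake, epsilon):
    """Perturbed-histogram synthetic data: noise every cell count,
    clip negatives, renormalize, then sample fake rows."""
    # Hypothetical tiny domain: sex in {F,M}, blood in {A,B,O}, HIV in {Y,N}.
    domain = list(product("FM", "ABO", "YN"))
    counts = Counter(dataset)
    noisy = {c: max(0.0, counts.get(c, 0) + laplace_noise(2.0 / epsilon))
             for c in domain}
    total = sum(noisy.values())
    if total <= 0:
        # Degenerate case: fall back to uniform sampling over the domain.
        weights = [1.0 / len(domain)] * len(domain)
    else:
        weights = [noisy[c] / total for c in domain]
    return random.choices(domain, weights=weights, k=num_fake)

rows = [("F", "B", "Y"), ("M", "A", "N"), ("M", "O", "N"),
        ("M", "O", "Y"), ("F", "A", "N"), ("M", "B", "Y")]
fakes = synthetic_rows(rows, num_fake=5, epsilon=1.0)
```

This works only when the attribute domain is small; for high-dimensional data the noise swamps the counts, which is exactly the regime where the cited algorithms (and the hardness results on the next slide) become relevant.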

Our result: synthetic data is hard
[Dwork-Naor-Reingold-Rothblum-V. `09, Ullman-V. `11]
Theorem: Any such C (producing "fake" people) requires exponential computing time, else we could break all of cryptography.

Our result: alternative summaries
[Thaler-Ullman-V. `12, Chandrasekaran-Thaler-Ullman-Wan `13]
C instead outputs a "rich summary" that contains the same statistical information as synthetic data, but can be computed in sub-exponential time!
Open: is there a polynomial-time algorithm?

Amazing Possibility II: Statistical Inference & Machine Learning
(C applied to the same dataset as above.)

DP Theory & Practice
Theory: differential privacy research has many intriguing theoretical challenges and rich connections with other parts of CS theory & mathematics, e.g. cryptography, learning theory, game theory & mechanism design, convex geometry, pseudorandomness, optimization, approximability, communication complexity, statistics, ...
Practice: interest from many communities in seeing whether DP can be brought to practice, e.g. statistics, databases, medical informatics, privacy law, social science, computer security, programming languages, ...

Challenges for DP in Practice

Some Efforts to Bring DP to Practice
- CMU-Cornell-PennState: "Integrating Statistical and Computational Approaches to Privacy"
- Google: "RAPPOR"
- UCSD: "Integrating Data for Analysis, Anonymization, and Sharing" (iDash)
- UT Austin: "Airavat: Security & Privacy for MapReduce"
- UPenn: "Putting Differential Privacy to Work"
- Stanford-Berkeley-Microsoft: "Towards Practicing Privacy"
- Duke-NISSS: "Triangle Census Research Network"
- Harvard: "Privacy Tools for Sharing Research Data"
- MIT/CSAIL/ALFA: "MoocDB Privacy tools for Sharing MOOC data"
- ...

Privacy Tools for Sharing Research Data
A SaTC Frontier project spanning Computer Science, Law, Social Science, and Statistics.
Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the funders of the work.

Target: Data Repositories

Datasets are restricted due to privacy concerns.
Goal: enable wider sharing while protecting privacy.

Challenges for Sharing Sensitive Data
- Complexity of law: thousands of privacy laws in the US alone, at the federal, state, and local levels, usually context-specific: HIPAA, FERPA, CIPSEA, Privacy Act, PPRA, ESRA, ...
- Difficulty of deidentification: stripping "PII" usually provides weak protection and/or poor utility [Sweeney `97].
- Inefficient process for obtaining restricted data: can involve months of negotiation between institutions and original researchers.
Goal: make sharing easier for researchers without expertise in privacy law/CS/statistics.

Vision: Integrated Privacy Tools
(Diagram of tools to be developed during the project.) Components: data tag generator; database of privacy laws & regulations; customized & machine-actionable terms of use; risk assessment and de-identification; differential privacy. Flow: consent from subjects and IRB proposal & review, then deposit in a repository, with open access to a sanitized data set, query access via differential privacy, or restricted access; plus policy proposals and best practices.

Many statistical analyses ("Zelig methods") can already be run through the Dataverse interface, without downloading the data.

A new interactive data exploration & analysis tool in Dataverse 4.0. Plan: use differential privacy to enable access to currently restricted datasets.

Goals for our DP Tools
- General-purpose: applicable to most datasets uploaded to Dataverse.
- Automated: no differential-privacy expert optimizing algorithms for a particular dataset or application.
- Tiered access: a DP interface for wide access to rough statistical information, helping users decide whether to apply for access to the raw data (cf. Census PUMS vs. RDCs).
(Limited) prototype on project website.

Differential Privacy: Summary

Privacy Models from Theoretical CS

Model                                        | Utility                         | Privacy (what is protected)                  | Who Holds Data?
Differential Privacy                         | statistical analysis of dataset | individual-specific info                     | trusted curator
Secure Function Evaluation                   | any query desired               | everything other than result of query        | original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption | any query desired               | everything (except possibly result of query) | untrusted server

See Shafi Goldwasser's talk at the White House-MIT Big Data Privacy Workshop, 3/3/14.