Private Data Management with Verification


Private Data Management with Verification
Yan Chen, Duke University
Advisor: Ashwin Machanavajjhala

Outline
- Motivation
- Private Verification: differentially private regression diagnostics
- Future work (ongoing): private verification of counting queries for data-dependent algorithms
- Future work (idea): private data synthesis
- Summary


Data Privacy

Differential Privacy
Definition 1 (ε-Differential Privacy): A randomized algorithm M satisfies ε-differential privacy if for any two neighboring datasets D1 and D2, and any set of outputs S,
Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S].
[C. Dwork et al., ICALP 2006]

Differential Privacy
Property 1 (Sequential Composition): If M1 and M2 satisfy ε1- and ε2-differential privacy, then releasing the results of both M1(D) and M2(D) satisfies (ε1+ε2)-differential privacy.
Property 2 (Parallel Composition): If D1 and D2 are disjoint subsets of D (D1 ∩ D2 = ∅), then releasing M1(D1) and M2(D2) satisfies max(ε1, ε2)-differential privacy.
Property 3 (Post-processing): For any algorithm M3, releasing M3(M1(D)) still satisfies ε1-differential privacy.
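Sequential composition is often operationalized as a simple privacy-budget accountant. A minimal sketch (the class and method names are illustrative, not from the talk):

```python
class PrivacyAccountant:
    """Track cumulative privacy loss under sequential composition."""

    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = 0.0

    def spend(self, epsilon):
        # Sequential composition (Property 1): the epsilons of
        # mechanisms run on the same dataset add up.
        if self.spent + epsilon > self.total_budget + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return epsilon

acct = PrivacyAccountant(total_budget=1.0)
acct.spend(0.4)  # e.g., mechanism M1
acct.spend(0.6)  # e.g., mechanism M2; the budget is now fully spent
```

Parallel composition would instead charge max(ε1, ε2) when the mechanisms touch disjoint partitions of the data.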

Laplace Mechanism
Definition 2 (Laplace Mechanism): For any function f: D → R^n, the Laplace mechanism M releases M(D) = f(D) + η, where η is a vector of independent random variables drawn from a Laplace distribution with scale Δ(f)/ε, and Δ(f) is the global sensitivity of f.
[C. Dwork et al., ICALP 2006]
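A minimal sketch of the Laplace mechanism as defined above (the function name and NumPy usage are illustrative):

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, rng=None):
    """Release f(D) + Lap(scale = sensitivity / epsilon), per Definition 2.

    f_value: the true query answer f(D) (scalar or vector).
    sensitivity: the global sensitivity Delta(f) of the query.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # Laplace scale parameter b = Delta(f)/eps
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(f_value))
    return f_value + noise

# Example: a counting query has global sensitivity 1.
noisy_count = laplace_mechanism(f_value=1234, sensitivity=1.0, epsilon=0.5)
```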

Private Data Management Framework
Roles: Data Curator, Data Synthesizer, Querier, Verifier

Framework - Open Questions
- Differentially private algorithms for private verification of different tasks
- Protection for data synthesis


Differentially Private Regression Diagnostics
Pipeline: generate model → evaluate model (regression diagnostics).
Algorithms exist for fitting linear/logistic regression while ensuring privacy, but there are no privacy-preserving techniques for regression diagnostics.

Differentially Private Regression Diagnostics
- PriRP: residual plot (an error measure for linear regression)
- PriROC: ROC curve (an error measure for logistic regression)

Residual Plot
Linear regression models the outcome as y = x·β + ε. Given the estimated model b, the residual of each point is r_i = y_i − x_i·b. The residual plot shows residuals vs. predicted values ŷ_i = x_i·b.
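The two coordinates of a residual plot can be sketched as follows (names are illustrative):

```python
import numpy as np

def residual_plot_points(X, y, b):
    """Given design matrix X, outcomes y, and estimated coefficients b,
    return (predicted values, residuals) -- the two coordinates of a
    residual plot."""
    y_hat = X @ b            # predictions y_hat_i = x_i . b
    return y_hat, y - y_hat  # residuals r_i = y_i - x_i . b
```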

Residual Plot

Private Residual Plot - PriRP
Two steps: 1. private bounds computation; 2. residual plot perturbation.

Private Residual Plot - PriRP: Private Bounds Computation
- The real bounds contain sensitive information about the data, and the sensitivity of the bound is infinite.
- Q: Find a bound (-b, b) such that at least a θ fraction of the points fall inside (-b, b) with high probability.
- SVT-based algorithm [C. Dwork 14], with queries q_i: how many points fall within the bound (-u·2^i, u·2^i)?
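The SVT-based bound search can be sketched as below. This is an illustrative reconstruction assuming standard sparse-vector noise scales, not the exact PriRP algorithm; each count q_i has sensitivity 1 under add/remove of one record.

```python
import numpy as np

def private_bound(residuals, u, theta, epsilon, max_i=20, rng=None):
    """Sparse-vector-style search for a bound (-b, b) that likely
    contains at least a theta fraction of the residuals.

    Queries q_i = #{r : |r| < u * 2**i} are asked in order; we stop at
    the first noisy count that clears the noisy threshold.
    """
    rng = rng or np.random.default_rng()
    n = len(residuals)
    # Noisy threshold once, fresh noise per query (standard SVT shape).
    noisy_T = theta * n + rng.laplace(scale=2.0 / epsilon)
    for i in range(max_i):
        b = u * 2 ** i
        q_i = np.sum(np.abs(residuals) < b)
        if q_i + rng.laplace(scale=4.0 / epsilon) >= noisy_T:
            return b
    return u * 2 ** max_i  # fall back to the largest bound tried
```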

Private Residual Plot - PriRP: Residual Plot Perturbation
Q: Estimate a 2D probability density inside a bounded region.
1. Discretization
2. Perturbation
3. Sampling
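The discretize/perturb/sample pipeline can be sketched as below. This is an illustrative reconstruction: the grid size, the per-cell noise scale (assuming add/remove neighbors, each count has sensitivity 1, so Lap(1/ε) per cell suffices by parallel composition), and the sampling step are assumptions, not the exact PriRP parameters.

```python
import numpy as np

def perturbed_residual_plot(pred, resid, bounds, epsilon,
                            grid=10, n_samples=500, rng=None):
    """Discretize the (prediction, residual) points inside the bounded
    region, add Laplace noise to each cell count, then sample synthetic
    plot points from the noisy histogram."""
    rng = rng or np.random.default_rng()
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    # 1. Discretization: 2D histogram over the bounded region.
    hist, xedges, yedges = np.histogram2d(
        pred, resid, bins=grid, range=[[x_lo, x_hi], [y_lo, y_hi]])
    # 2. Perturbation: each point falls in one cell (sensitivity 1).
    noisy = hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)
    probs = np.clip(noisy, 0.0, None).ravel() + 1e-12
    probs /= probs.sum()
    # 3. Sampling: pick cells, then place points uniformly within them.
    cells = rng.choice(grid * grid, size=n_samples, p=probs)
    ix, iy = np.divmod(cells, grid)
    xs = rng.uniform(xedges[ix], xedges[ix + 1])
    ys = rng.uniform(yedges[iy], yedges[iy + 1])
    return xs, ys
```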

Private Residual Plot - PriRP Empirical Evaluation (data scale = 5000)

Private Residual Plot - PriRP: Empirical Evaluation
Similarity between the real residual plot and the perturbed one:
- Discretize the bound of the real plot into a 10×10 equal-width grid.
- Compute the distribution of residuals over all grid cells c in the real and perturbed plots, denoted P(c) and P'(c).

Private Residual Plot - PriRP Empirical Evaluation

ROC curve

ROC curve
ROC curve: TPR vs. FPR over all possible thresholds θ. AUC: the area under the curve.
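The non-private TPR/FPR and AUC computations that a private ROC algorithm must approximate can be sketched as follows (function names are illustrative):

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Non-private TPR/FPR at each threshold theta -- the quantities
    PriROC must estimate under differential privacy."""
    pos = labels == 1
    neg = ~pos
    tprs, fprs = [], []
    for t in thresholds:
        pred_pos = scores >= t
        tprs.append(np.sum(pred_pos & pos) / max(np.sum(pos), 1))
        fprs.append(np.sum(pred_pos & neg) / max(np.sum(neg), 1))
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    # Trapezoidal area under the curve, sorting points by (FPR, TPR).
    order = np.lexsort((tprs, fprs))
    x, y = fprs[order], tprs[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))
```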

Private ROC Curve - PriROC
Three steps: 1. choosing thresholds; 2. computing TPRs and FPRs; 3. ensuring monotonicity.

Private ROC Curve - PriROC: Choosing Thresholds
1. Data-independent strategy: fix |Θ| = N+1, Θ = {0, 1/N, …, (N−1)/N, 1}. Problem: performs poorly when the predictions are skewed.
2. Data-dependent strategy: iteratively choose thresholds that divide the data evenly, i.e., iteratively find medians to use as thresholds (using smooth sensitivity, and handling invalid thresholds).
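The median-recursion idea behind the data-dependent strategy can be sketched non-privately as below; PriROC would replace each exact median with a smooth-sensitivity-based private median and handle invalid thresholds.

```python
import numpy as np

def median_thresholds(scores, depth):
    """Data-dependent thresholds: recursively split the scores at their
    median, yielding up to 2**depth - 1 thresholds that divide the data
    evenly. (Non-private illustration of the recursion only.)"""
    thresholds = []

    def split(part, d):
        if d == 0 or len(part) == 0:
            return
        m = np.median(part)
        split(part[part < m], d - 1)  # in-order traversal keeps
        thresholds.append(m)          # the thresholds sorted
        split(part[part > m], d - 1)

    split(np.asarray(scores, dtype=float), depth)
    return thresholds
```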

Private ROC Curve - PriROC: Computing TPRs and FPRs
Compute TPRs via prefix range (counting) queries over the predicted scores; FPRs are computed similarly.

Private ROC Curve - PriROC: Ensuring Monotonicity
To ensure monotonicity, apply the method from [Hay, VLDB 10].
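Monotonicity can be enforced by isotonic regression (pool adjacent violators), shown here as a simple stand-in for the constrained inference of [Hay, VLDB 10]. Since this is post-processing (Property 3), it costs no extra privacy budget.

```python
def enforce_monotone(values):
    """Make a noisy sequence non-decreasing via pool-adjacent-violators:
    merge adjacent blocks whose means violate the ordering, then emit
    each block's mean."""
    blocks = []  # list of (sum, count) for merged blocks
    for v in values:
        blocks.append((v, 1))
        # Merge while the last block's mean drops below the previous one's.
        while (len(blocks) > 1 and
               blocks[-1][0] / blocks[-1][1] < blocks[-2][0] / blocks[-2][1]):
            s2, c2 = blocks.pop()
            s1, c1 = blocks.pop()
            blocks.append((s1 + s2, c1 + c2))
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out
```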

Private ROC Curve - PriROC Empirical Evaluation

Private ROC Curve - PriROC: Empirical Evaluation
Metrics: AUC and symmetric difference.


Future Work - Verification
Counting queries:
1. Data-independent algorithms (easy), e.g., the Laplace mechanism.
2. Data-dependent algorithms (hard): the error is data dependent.

Future Work - Verification
Definition (sensitivity of a randomized algorithm): A randomized algorithm A: D → R with random variable stream N has sensitivity Δ if for any two neighboring datasets D and D', and any fixed values of N,
|A(D; N) − A(D'; N)| ≤ Δ.
Theorem: If a randomized algorithm A has sensitivity Δ, then releasing A(D) with Laplace noise of scale Δ/ε satisfies ε-differential privacy.

Future Work - Verification
Another interesting problem: given an error bound, release the output only when its error is within that bound with high probability.


Future Work - Data Synthesis
- Queries on the synthetic data release information about the synthetic data.
- Differentially private data synthesis is good for the privacy of the whole system, but introduces too much noise.
- A weaker privacy definition? The data synthesis process should still be protected.

Future Work - Data Synthesis
- What kind of weaker privacy definition can we use for generating synthetic data?
- Can the chosen weaker privacy definition be composed with differential privacy? How is the whole system protected?
- Even if the weaker privacy definition composes with differential privacy, what is the tightest composition result?
- For more complex data synthesis algorithms: can we empirically evaluate what they protect?


Summary
- We present a framework for private data management with verification and pose some open questions.
- We start with query verification for differentially private regression diagnostics, proposing the first differentially private algorithms PriRP (for linear regression) and PriROC (for logistic regression).
- We present our initial work on verification of data-dependent algorithms for counting queries.
- We briefly describe private data synthesis as another future direction.