1
Guilt-Free Data Dredging via Privacy
Omer Reingold, SRA
Joint with: Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, and Aaron Roth
3
False discovery — Just Getting Worse
“Trouble at the Lab” – The Economist
4
From: Trustworthy-broker@trustme.com
To:
Date: 2/27/15
Subject: Gr8 investment tip!!!
Hi! You don’t know me, but here is a tip! Go long on SHAK – it will go up today.
5
From: Trustworthy-broker@trustme.com
To:
Date: 2/28/15
Subject: Gr8 investment tip!!!
Hi again! Go short today. SHAK is going down.
6
From: Trustworthy-broker@trustme.com
To:
Date: 3/1/15
Subject: Gr8 investment tip!!!
Down again.
7
From: Trustworthy-broker@trustme.com
To:
Date: 3/2/15
Subject: Gr8 investment tip!!!
Up!
8
From: Trustworthy-broker@trustme.com
To:
Date: 3/3/15
Subject: Gr8 investment tip!!!
Down.
9
From: Trustworthy-broker@trustme.com
To:
Date: 3/4/15
Subject: Gr8 investment tip!!!
Up!
10
From: Trustworthy-broker@trustme.com
To:
Date: 3/5/15
Subject: Gr8 investment tip!!!
Up!
11
From: Trustworthy-broker@trustme.com
To:
Date: 3/6/15
Subject: Gr8 investment tip!!!
Up!
12
From: Trustworthy-broker@trustme.com
To:
Date: 3/7/15
Subject: Gr8 investment tip!!!
Down.
13
From: Trustworthy-broker@trustme.com
To:
Date: 3/8/15
Subject: Gr8 investment tip!!!
Down.
14
From: Trustworthy-broker@trustme.com
To:
Date: 3/8/15
Subject: Gr8 investment opportunity!!!
Hi there. I’m tired of giving out this great advice for free. Let me manage your money, and I’ll continue giving you my stock prediction tips in exchange for a small cut! BANK ACCOUNT NUMBER PLS!!
15
Hmm… The chance he was right 10 times in a row if he was just randomly guessing is only 1/1024 ≈ 0.001, so 𝑝 < 0.05. I can reject the null hypothesis that these predictions were luck. HERE IS MY MONEY!
16
What happened 100,000
17
What happened 50,000
18
What happened 25,000
19
After 10 days… There remain ≈100 people who have received perfect predictions. The individual recipient’s error was his failure to take into account the size of the pool.
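The arithmetic behind the shrinking pool can be checked with a few lines. This is only a sketch: it assumes the sender presumably tells half of the remaining pool "up" and the other half "down" each day and keeps only the half that saw a correct tip; the function names are illustrative and not from the talk.

```python
def surviving_recipients(pool: int, days: int) -> int:
    """Recipients who have seen only correct tips after `days` days.

    Assumption: each day the pool is split in half, one half is told "up"
    and the other "down", so exactly half (rounded down) stay in the pool.
    """
    for _ in range(days):
        pool //= 2
    return pool


def chance_of_perfect_streak(days: int) -> float:
    """Probability that a single random guesser is right `days` times in a row."""
    return 0.5 ** days


if __name__ == "__main__":
    print(surviving_recipients(100_000, 10))   # size of the pool after 10 days
    print(chance_of_perfect_streak(10))        # per-recipient chance under pure guessing
```

Each survivor sees a streak that would be very unlikely for one guesser, yet with a starting pool of 100,000 some such streaks are guaranteed to exist.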
20
“The Multiple Hypothesis Testing Problem”
A finding with p-value ≤ 0.05 has only a 5% probability of being realized “by chance”.
But we expect 5 such findings if we test 100 hypotheses, even if they are all wrong…
Even worse: Publication Bias
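The count of 5 can be written out explicitly. As a sketch, let m = 100 be the number of hypotheses tested and α = 0.05 the significance level (symbols introduced here for illustration), and suppose every null hypothesis is true. The second expression additionally assumes the tests are independent:

```latex
\[
\mathbb{E}[\#\,\text{false findings}] = m\,\alpha = 100 \times 0.05 = 5,
\qquad
\Pr[\text{at least one false finding}] = 1 - (1-\alpha)^{m} = 1 - 0.95^{100} \approx 0.994 .
\]
```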
21
In the real world: does ketchup prevent cancer?
22
Hack Your Way To Scientific Glory
Try to prove a hunch: the U.S. economy is affected by whether Republicans or Democrats are in office.
Using real data going back to 1948.
1,800 possible combinations.
Many possible settings of the learning algorithm.
“P hacking”, “data dredging”, very tempting.
23
Preventing false discovery
Decades-old subject in Statistics.
Powerful results such as the Benjamini-Hochberg work on controlling the False Discovery Rate.
Lots of tools: cross-validation, bootstrapping, holdout sets.
Theory focuses on non-adaptive data analysis.
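For readers unfamiliar with the Benjamini-Hochberg procedure mentioned above, here is a minimal sketch of the standard step-up rule. The function name and the example p-values are illustrative and do not come from the talk. The classical guarantee assumes the p-values are fixed in advance (and independent, or positively dependent), which is exactly the non-adaptive setting.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up rule.

    Sort the m p-values, find the largest rank i with p_(i) <= alpha * i / m,
    and reject every hypothesis whose p-value has rank at most i.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices of p-values, smallest first
    thresholds = alpha * np.arange(1, m + 1) / m  # alpha * i / m for rank i
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        last = np.max(np.flatnonzero(below))      # largest rank passing its threshold
        rejected[order[: last + 1]] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example))
```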
24
Non-adaptive data analysis
Data analyst:
Specify exact experimental setup (hypotheses to test, …)
Collect data
Run experiment
Observe outcome
Can’t reuse data after observing outcome.
25
Adaptive data analysis
Data analyst:
Specify exact experimental setup (hypotheses to test, …)
Collect data
Run experiment
Observe outcome
Revise experiment
26
Why adaptivity is troublesome
Exacerbates the multiple hypothesis testing problem exponentially – must account for all hypotheses that might have been tested.
27
An adaptive data analyst is a decision tree…
[Figure: decision tree of depth 𝑘 whose nodes are queries 𝑞ᵢ, branching on the observed outcomes]
28
An adaptive data analyst is a decision tree…
Must account not just for the 𝑘 queries actually made, but the Ω(2^𝑘) queries that could have been made given all possible outcomes.
[Figure: decision tree of queries, as on the previous slide]
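To see where the exponential count comes from, here is a sketch under the simplifying assumption (not stated on the slide) that every query has a binary outcome. The analyst's next query can depend on all earlier answers, so level i of the tree holds 2^i distinct potential queries:

```latex
\[
\#\{\text{potential queries}\} \;=\; \sum_{i=0}^{k-1} 2^{i} \;=\; 2^{k} - 1 \;=\; \Omega\!\left(2^{k}\right),
\qquad \text{e.g. } k = 10 \;\Rightarrow\; 1023 .
\]
```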
29
Adaptivity arises naturally.
Freedman’s Paradox: an equation 𝑌 = 𝑋𝛽 + 𝑧 is fitted erroneously with a little bit of adaptivity. Vast literature on addressing this and other special cases.
Natural learning procedures (like gradient descent) adaptively query the data.
Common practice to first “feel/plot/see the data” before coming up with hypotheses and setting the “right” parameters.
More insidious: studies conducted by researchers who have read papers that used the same data set must be considered adaptive. Will be the norm as large data sets are shared and reused!
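A small simulation in the spirit of Freedman’s Paradox may help. It is a sketch, not Freedman’s original setup: the sizes n, p and keep are arbitrary assumptions, and the data are pure noise, so the true 𝛽 is zero. Regressors are screened by their sample correlation with 𝑌 (the adaptive step) and the model is then refit on the survivors.

```python
import numpy as np


def freedman_demo(n=100, p=50, keep=10, seed=0):
    """Screen noise regressors by correlation with y, then refit on the survivors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # independent of X, so no regressor truly matters

    def r_squared(cols):
        # Ordinary least squares with an intercept on the chosen columns.
        A = np.column_stack([np.ones(n), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        return float(1.0 - resid @ resid / np.sum((y - y.mean()) ** 2))

    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    selected = np.argsort(-np.abs(corr))[:keep]  # adaptive step: keep the "best" columns
    return r_squared(selected), r_squared(np.arange(keep)), corr[selected]


if __name__ == "__main__":
    r2_selected, r2_arbitrary, _ = freedman_demo()
    print("R^2 after screening:", r2_selected)
    print("R^2 of an arbitrary subset of the same size:", r2_arbitrary)
```

The screened model typically reports a noticeably higher R² than an arbitrary subset of the same size, even though nothing here is predictive.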
30
Better Habits? For every hypothesis to be tested gather fresh data?
Bad habits sometimes prevail for a reason: easy, cheap, addictive.
Better if we can provide automatic correctness protection.
31
Standard holdout method
[Diagram: the data is split into training data, to which the data analyst has unrestricted access, and a holdout that is good for one validation]
Reusing the holdout can cause overfitting: Kaggle’s data analysis competition; a variant of Freedman’s Paradox.
Non-reusable: can’t use information from the holdout in the training stage adaptively.
32
Our Reusable holdout
[Diagram: the data analyst has unrestricted access to the training data; the reusable holdout can be used many times adaptively]
Essentially as good as using fresh data each time!
33
Our Reusable holdout
Check validity against the holdout via a (differentially private) mechanism.
Goals: provide useful information about the distribution; control information leakage sufficiently well to prevent overfitting.
[Diagram: the data analyst can use the reusable holdout many times adaptively]
Essentially as good as using fresh data each time!
34
Differential Privacy [Dwork-McSherry-Nissim-Smith 06]
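[Figure: a data set of individuals (Alice, Bob, Xavier, Chris, Donna, Ernie) is fed to an algorithm; the ratio of the probabilities Pr[r] of any output r is bounded when one individual’s data changes]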
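For reference, the “ratio bounded” condition is usually stated as follows. This is the standard definition of ε-differential privacy; the symbols M, S, S′, R and ε are notation chosen here, not taken from the slide. A randomized algorithm M is ε-differentially private if, for every pair of data sets S and S′ that differ in one individual’s record and every set R of possible outputs,

```latex
\[
\Pr[M(S) \in R] \;\le\; e^{\varepsilon} \cdot \Pr[M(S') \in R] .
\]
```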
35
Intuition
Differential privacy is a stability guarantee: changing one data point doesn’t affect the outcome much.
Stability implies generalization: “overfitting is not stable”.
Differential privacy composes well (helps deal with adaptivity).
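One concrete way to obtain this kind of stability is the Laplace mechanism. The sketch below is illustrative and not taken from the talk: the function name and parameters are assumptions. It releases the mean of values clipped to [−1, 1]; replacing one record changes that mean by at most 2/n, so Laplace noise of scale 2/(nε) makes the release ε-differentially private.

```python
import numpy as np


def private_mean(values, epsilon, rng=None):
    """Release the mean of values in [-1, 1] with epsilon-differential privacy."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(np.asarray(values, dtype=float), -1.0, 1.0)
    # Replacing one record moves the mean by at most 2/n (the sensitivity).
    sensitivity = 2.0 / clipped.size
    noise = rng.laplace(scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.uniform(-1.0, 1.0, size=1000)
    print("true mean:   ", data.mean())
    print("private mean:", private_mean(data, epsilon=0.5, rng=rng))
```

Smaller ε means more noise and a more stable (less data-dependent) answer; larger data sets need less noise for the same ε.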
36
Rich Algorithmic Literature
Counts, linear queries, histograms, contingency tables (marginals)
Location and spread (e.g., median, interquartile range)
Dimension reduction (PCA, SVD)
Support Vector Machines
Sparse regression/LASSO, logistic and linear regression, gradient descent
Clustering
Boosting, Multiplicative Weights
Combinatorial optimization, mechanism design
Privacy Under Continual Observation, Pan-Privacy
Kalman filtering
Statistical Queries learning model, PAC learning
…
The Algorithmic Foundations of Differential Privacy, Dwork and Roth, August 2014
37
Example: Thresholdout, an Instantiation of the Reusable Holdout [relies on HR10]
Input: training set St, holdout set Sh.
Budgets: m on the total number of queries, and B on the number of times overfitting is detected (Sh differs from St).
Other parameters: threshold T, tolerance τ.
Given a statistical query φ: X → [−1,1], do:
If B < 1 or m < 1, output “⊥”.
Else sample η, ξ ∼ Lap(τ).
(a) If |E_Sh[φ] − E_St[φ]| > T + η, output E_Sh[φ] + ξ and set B ← B − 1.
(b) Otherwise, output E_St[φ].
m ← m − 1.
Parameters: m can be exponential in |Sh|; B quadratic.
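The steps on this slide can be sketched in code roughly as follows. This is a minimal illustration that follows the slide’s simplified description, with both noise terms drawn from Lap(τ); it is not the authors’ reference implementation, and the class and argument names are assumptions. `None` stands in for “⊥”.

```python
import numpy as np


class Thresholdout:
    """Sketch of the Thresholdout steps as listed on the slide."""

    def __init__(self, train, holdout, threshold, tolerance,
                 budget, max_queries, seed=None):
        self.train = train        # S_t
        self.holdout = holdout    # S_h
        self.T = threshold        # threshold T
        self.tau = tolerance      # Laplace scale tau
        self.B = budget           # remaining overfitting detections
        self.m = max_queries      # remaining queries
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        """Answer a statistical query phi: X -> [-1, 1]; None plays the role of ⊥."""
        if self.B < 1 or self.m < 1:
            return None
        eta, xi = self.rng.laplace(scale=self.tau, size=2)
        mean_train = float(np.mean([phi(x) for x in self.train]))
        mean_holdout = float(np.mean([phi(x) for x in self.holdout]))
        self.m -= 1
        if abs(mean_holdout - mean_train) > self.T + eta:
            # (a) Training and holdout disagree: answer from the holdout, with noise.
            self.B -= 1
            return mean_holdout + xi
        # (b) They agree: the training estimate is returned, revealing little about S_h.
        return mean_train


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    train, holdout = rng.standard_normal(500), rng.standard_normal(500)
    mech = Thresholdout(train, holdout, threshold=0.04, tolerance=0.01,
                        budget=10, max_queries=100, seed=2)
    print(mech.query(lambda x: float(np.tanh(x))))  # a query with values in [-1, 1]
```

The holdout is touched only through noisy comparisons, and the budget B limits how many times its value is actually released, which is what limits the information leaked to the analyst.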
38
Conclusions and Further Research
Reusable holdout: provable correctness protection.
Utility and privacy can be in sync.
Parameters: can we improve? What are the limits?
Do we need differential privacy? (max-information)
Experiments: is it practical? Design practical methods for real data analysis problems.
Improved parameters and more general settings [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15].
Limitations for efficient mechanisms [Hardt, Ullman 14] and [Steinke, Ullman 15].
39
An illustrative experiment
Data set with 2n = 20,000 rows and d = 10,000 variables. Class labels in {-1,1}.
Analyst performs stepwise variable selection:
Split data into training/holdout of size n.
Select “best” k variables on training data.
Only use variables also good on holdout.
Build linear predictor out of k variables.
Find best k = 10, 20, 30, …
40
No correlation between data and labels
Data are random Gaussians.
Labels are drawn independently at random from {-1,1}.
Thresholdout correctly detects overfitting!
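To see why protection is needed in this no-signal setting, here is a scaled-down sketch of what happens when the holdout is reused naively, without Thresholdout. The sizes n, d, k, the selection rule (sign agreement between training and holdout correlations) and the function name are assumptions made for a quick run; they are not the exact setup or code behind the talk’s plots.

```python
import numpy as np


def naive_holdout_reuse(n=200, d=1000, k=100, n_fresh=4000, seed=0):
    """Variable selection that peeks at the holdout, on data with no signal at all."""
    rng = np.random.default_rng(seed)

    def sample(rows):
        X = rng.standard_normal((rows, d))        # random Gaussian attributes
        y = rng.choice([-1.0, 1.0], size=rows)    # labels independent of X
        return X, y

    X_tr, y_tr = sample(n)
    X_ho, y_ho = sample(n)
    X_fr, y_fr = sample(n_fresh)                  # fresh data, never used for selection

    c_tr = X_tr.T @ y_tr / n                      # per-variable correlation with the label
    c_ho = X_ho.T @ y_ho / n
    # "Only use variables also good on holdout": keep those whose sign agrees.
    agree = np.flatnonzero(np.sign(c_tr) == np.sign(c_ho))
    top = agree[np.argsort(-np.abs(c_tr[agree]))[:k]]
    w = np.sign(c_tr[top])                        # simple sign-weighted linear predictor

    def accuracy(X, y):
        return float(np.mean(np.sign(X[:, top] @ w) == y))

    return accuracy(X_tr, y_tr), accuracy(X_ho, y_ho), accuracy(X_fr, y_fr)


if __name__ == "__main__":
    train_acc, holdout_acc, fresh_acc = naive_holdout_reuse()
    print("training accuracy:", train_acc)
    print("holdout accuracy: ", holdout_acc)
    print("fresh accuracy:   ", fresh_acc)
```

The holdout estimate looks well above chance while accuracy on fresh data stays near 50%, so the reused holdout no longer measures generalization.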
41
High correlation
20 attributes are highly correlated with the target.
The remaining attributes are uncorrelated.
Thresholdout correctly detects the right model size!