Towards Robustness in Query Auditing Shubha U. Nabar Stanford University VLDB 2006 Joint Work With B. Marthi, K. Kenthapadi, N. Mishra, R. Motwani.



Data Mining vs. Privacy
- Large amounts of data are available in digital form
- Statisticians query the data to mine useful trends
- Potential for privacy breaches

Online Query Auditing
Given a stream of queries over a database containing private information, when should queries be denied to protect privacy?
Our focus:
- Statistical DBs: census, hospital, employee
- Only one private attribute, e.g., salary, disease
- Statistical queries over the private attribute: sum, max, mean
- Stream of queries of a single type from a single user

Online Query Auditing

Company Database
Name   Age  Sex  Salary
Alice  23   F    42K
Bob    25   M    50K
Carl   30   M    80K
Dave   21   M    35K

Adversary asks: sum of salaries of female employees. Answer: 42,000. Alice's salary = $42,000!

Online Query Auditing
In general, more complex queries can be posed and their answers put together to deduce information.
Task of the auditor: deny a query when the answers to the current and past queries can be "stitched together" to leak information.

Our Contributions
- An auditor for max queries
- An auditor for combinations of max and min queries
- A first analysis of the utility of an auditing scheme

Related Work
- Perturbing the data itself [W '65, AS '00, EGS '03, CDMSW '05]
- Perturbing the results supplied to the user [DN '03, DMNS '06]
Statisticians are unhappy with the addition of noise. Auditors provide exact answers, if any answer at all.

Previous Work
- Restricting size and overlap of queries [Dobkin, Jones, Lipton '79]
- Offline auditing [Chin '86]
- Auditing for Boolean attributes [Kleinberg, Papadimitriou, Raghavan '03]
- Auditing compliance with a Hippocratic database [Agrawal, Bayardo, Faloutsos, Kiernan, Rantzau, Srikant '04]
- Simulatable auditing [Kenthapadi, Mishra, Nissim '05]

Naïve Auditor
If the answer to the current query causes an element to be determined, deny.

Company Database
Name   Age  Sex  Salary
Alice  23   F    42K
Bob    25   M    50K
Carl   30   M    80K
Dave   21   M    35K

Adversary: max salary{Alice, Bob, Carl}? Answer: 80,000.
Adversary: max salary{Alice, Bob}? Denied. Carl's salary = $80,000!
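A minimal sketch of this failure mode (hypothetical helper and data, not code from the paper): the naive rule denies max{Alice, Bob} exactly when Carl alone holds the already-revealed maximum, so the denial itself gives Carl away.

```python
def naive_deny(answered, data, query):
    """Answer max over `query`, but deny if that answer combined with a
    previously answered superset query would uniquely determine one salary."""
    ans = max(data[p] for p in query)
    for prev_q, prev_ans in answered:
        rest = prev_q - query
        # If `query` is a proper subset of an answered query, its answer is
        # smaller, and exactly one person remains, that person is exposed.
        if query < prev_q and ans < prev_ans and len(rest) == 1:
            return None  # denied
    return ans

salaries = {"Alice": 42_000, "Bob": 50_000, "Carl": 80_000, "Dave": 35_000}
trail = [({"Alice", "Bob", "Carl"}, 80_000)]        # already answered
decision = naive_deny(trail, salaries, {"Alice", "Bob"})
# The denial tells the attacker that max{Alice, Bob} < 80,000,
# so Carl's salary must be exactly 80,000.
```

Without the earlier superset answer, the same query would simply be answered, which is why the attacker can read information out of the denial.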

Simulatability
Denials based on the answer to the current query may themselves cause a privacy breach.
Solution: if the attacker can simulate and predict the decision to deny ⇒ denials do not leak information.
Auditor: if there is any dataset consistent with past answers in which the current query causes a breach, deny.
- The attacker can check this condition himself
- Denials do not leak information
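The simulatable rule can be sketched by brute force over a tiny discrete domain (an illustrative toy, not the paper's algorithm): enumerate every dataset consistent with past answers, and deny if answering the new max query could fully disclose some value in any of them. The check never touches the true data, so the attacker can run it too.

```python
from itertools import product

def breached(datasets):
    """Full disclosure: some position takes the same value in every
    consistent dataset."""
    n = len(datasets[0])
    return any(len({d[i] for d in datasets}) == 1 for i in range(n))

def simulatable_deny(n, domain, past, q):
    """Deny the max query `q` (an index set) if ANY dataset consistent with
    `past` (a list of (index-set, max-answer) pairs) would, once q is
    answered, leave some value fully disclosed."""
    consistent = [d for d in product(domain, repeat=n)
                  if all(max(d[i] for i in s) == a for s, a in past)]
    for d in consistent:            # hypothesize d as the true dataset
        a = max(d[i] for i in q)    # the answer it would produce
        remaining = [e for e in consistent if max(e[i] for i in q) == a]
        if breached(remaining):
            return True
    return False
```

For example, after max{x_0, x_1, x_2} = 3 over the domain {1, 2, 3}, the query max{x_0, x_1} is denied: in any consistent dataset where that answer is below 3, x_2 is forced to 3. Re-asking max{x_0, x_1, x_2} stays safe.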

Goal
Find online, efficient, simulatable, high-utility auditors for various classes of queries.

Definition of Privacy Breach
Full disclosure: some private data point can be uniquely determined.
- e.g., max{x_a, x_b, x_c} = 10 and max{x_a, x_b} = 8 ⇒ x_c = 10
Partial disclosure (probabilistic compromise): a significant change in the attacker's confidence about some private data value.
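The full-disclosure example can be checked by brute force over integer datasets in {0, ..., 10}: the two max answers together force x_c, while x_a stays uncertain.

```python
from itertools import product

# Enumerate every dataset consistent with max{x_a,x_b,x_c} = 10
# and max{x_a,x_b} = 8.
consistent = [
    (a, b, c)
    for a, b, c in product(range(11), repeat=3)
    if max(a, b, c) == 10 and max(a, b) == 8
]
c_values = {c for _, _, c in consistent}   # x_c is fully disclosed
a_values = {a for a, _, _ in consistent}   # x_a still takes many values
```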

Probabilistic Compromise
Private data is known to be drawn according to a distribution D. The range of each data point is divided into intervals.
[Figure: the attacker sends query q_t to the SDB and receives answer a_t; the attacker's prior and posterior probabilities over the intervals are compared.]

Outline
- Problem Statement
- Previous Work
- Auditing Max Queries
- Auditing Max and Min Queries
- Utility
- Future Work
See paper for auditing against full disclosure.

Skeleton of Probabilistic Auditor
1. Attacker poses query q_t
2. Attacker has a posterior distribution over the answer to q_t, given previous answers
3. Auditor repeatedly:
   a. Samples a possible answer from this distribution
   b. Checks if the sampled answer would change the attacker's belief about some data point
4. If q_t is "unsafe" in a significant fraction of samples, deny
Need to estimate the posterior distributions in steps 2 and 3b.
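The skeleton can be rendered schematically as follows; `sample_answer` and `changes_belief` are placeholders for the posterior computations of steps 2 and 3b, which the rest of the talk instantiates.

```python
def probabilistic_audit(sample_answer, changes_belief,
                        n_samples=1_000, unsafe_fraction=0.05):
    """Deny q_t when a significant fraction of plausible answers would
    significantly change the attacker's belief about some data point."""
    unsafe = sum(changes_belief(sample_answer()) for _ in range(n_samples))
    return "deny" if unsafe / n_samples > unsafe_fraction else "answer"
```

For instance, if no sampled answer shifts beliefs the query is answered, and if most do it is denied, independent of the true data.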

Probabilistic Max Auditor
Assumption: the dataset is drawn uniformly at random from the set of duplicate-free points in [α, β]^n.
- For each x_i and any interval in [α, β], the prior probability is uniform
Given answers to a set of queries, what are the posterior probabilities?

Probabilistic Max Auditor
Given queries q_1 … q_t and answers a_1 … a_t, create a synopsis B_max.
- B_max contains predicates [max(S_1) = a_1], [max(S_2) < a_2], …
- The S_i's are disjoint
- B_max enables a succinct representation of the audit trail
- B_max enables computation of posterior probabilities

Determining Posterior Probabilities
Example: max{x_a, x_b, x_c} = 0.75.
[Figure: axes x_a, x_b, x_c with the corner points (0.75, 0, 0), (0, 0.75, 0), (0, 0, 0.75); the posterior of x_a is split across Pr{x_a ∈ [0, 0.25]}, Pr{x_a ∈ [0.25, 0.5]}, Pr{x_a ∈ [0.5, 0.75)}, and Pr{x_a = 0.75}.]
Pr{x_a = 0.75} = 1/3, since any one of x_a, x_b, or x_c is equally likely to be the max. With the remaining 2/3 probability, x_a is uniformly distributed in [0, 0.75).
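A Monte Carlo check of this posterior (the paper computes it exactly; this sketch only conditions uniform draws on the max): given max = m, each coordinate is equally likely to be the max, and the non-max coordinates are uniform on [0, m).

```python
import random

random.seed(0)
N = 20_000
at_max = low_half = 0
for _ in range(N):
    xs = [random.random() for _ in range(3)]
    m = max(xs)
    if xs[0] == m:
        at_max += 1                      # x_a is the max, i.e. x_a = 0.75
    elif xs[0] / m * 0.75 < 0.375:
        low_half += 1                    # x_a rescaled into [0, 0.375)

# Pr{x_a = 0.75} ~ 1/3, and Pr{x_a in [0, 0.375)} ~ (2/3) * (1/2) = 1/3.
```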

Probabilistic Max Auditor
1. Attacker poses query q_t
2. Attacker has a posterior distribution over the answer to q_t, given previous answers
3. Auditor repeatedly:
   a. Samples a possible answer from this distribution
   b. Checks if the sampled answer would change the attacker's belief about some data point
4. If q_t is "unsafe" in a significant fraction of samples, deny
Can give guarantees on the probability that the adversary learns new information.

Outline
- Problem Statement
- Previous Work
- Auditing Max Queries
- Auditing Max and Min Queries
- Utility
- Future Work

Probabilistic Max-and-Min Auditor
Computing posterior probabilities becomes harder. Given the queries, create a synopsis so that each data point occurs in at most one max predicate and one min predicate.

Equivalent Graph Coloring Problem
max{x_a, x_b, x_c} = 1, min{x_a, x_b} = 0.2, max{x_d, x_e} = 2, min{x_c, x_d, x_e} = 0.5
[Figure: a graph whose nodes are the predicate sets {a, b, c}, {a, b}, {d, e}, {c, d, e}.]
Every valid coloring corresponds to a set of consistent datasets.

Probabilistic Max-and-Min Auditor
We show:
- A consistent dataset can be sampled according to the posterior distribution by sampling a valid coloring according to a distribution P
- A valid coloring can be sampled according to P using a Markov chain over colorings
- The sampled colorings can be used to answer questions about the posterior distribution of the data points up to arbitrary precision
See paper for details.
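The paper's chain and its stationary distribution P are more involved, but the mechanism of walking between valid colorings can be sketched with simple Glauber dynamics over proper colorings of a small graph (an illustrative stand-in, not the paper's chain):

```python
import random

def glauber_colorings(edges, n, q, steps, seed=0):
    """Walk over proper q-colorings: repeatedly pick a vertex and recolor it
    uniformly among colors unused by its neighbors. Returns visit counts."""
    rng = random.Random(seed)
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # Greedy initial proper coloring.
    col = []
    for v in range(n):
        used = {col[u] for u in nbrs[v] if u < v}
        col.append(next(c for c in range(q) if c not in used))
    counts = {}
    for _ in range(steps):
        v = rng.randrange(n)
        allowed = [c for c in range(q)
                   if all(col[u] != c for u in nbrs[v])]
        col[v] = rng.choice(allowed)
        key = tuple(col)
        counts[key] = counts.get(key, 0) + 1
    return counts

# The path a-b-c with 3 colors has 3*2*2 = 12 proper colorings; the walk
# visits all of them with roughly equal frequency.
counts = glauber_colorings([(0, 1), (1, 2)], n=3, q=3, steps=30_000)
```

Because each move is symmetric (the allowed color set at a vertex does not depend on that vertex's own color), the chain's stationary distribution is uniform over proper colorings; the paper biases the sampling toward its distribution P instead.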

Outline
- Problem Statement
- Previous Work
- Auditing Max Queries
- Auditing Max and Min Queries
- Utility
- Future Work

Utility
Several dimensions of utility:
- How many queries are answered?
- What kinds of queries are answered?
- What can be computed?
- "Price of simulatability"
Here: expected time to first denial.

Utility of Sum Auditor
Consider full disclosure.
- No prior knowledge: data points come from an unbounded range
- Queries are chosen uniformly at random

Sum Auditor
[Figure: the answered sum queries written as a 0/1 linear system over (x_a, x_b, x_c, x_d, x_e) with right-hand side (a_1, a_2, a_3, a_4); elimination expresses individual values as combinations of the answers, e.g. a_2 − a_4 + a_3, a_4 − a_3, a_1 − a_3, a_4 − a_2.]
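This linear-algebra view can be made concrete: each answered sum query is a 0/1 indicator row over the data points, and a value x_i is fully disclosed exactly when the unit vector e_i lies in the row space of the answered rows. A sketch with exact rational arithmetic (hypothetical helper; the example query sets below are made up, not the ones in the figure):

```python
from fractions import Fraction

def rank(matrix):
    """Row-reduce a copy of `matrix` over the rationals; return its rank."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((j for j in range(r, len(m)) if m[j][col]), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for j in range(len(m)):
            if j != r and m[j][col]:
                f = m[j][col] / m[r][col]
                m[j] = [a - f * b for a, b in zip(m[j], m[r])]
        r += 1
    return r

def discloses(query_sets, n, i):
    """True iff the answered sum queries determine x_i, i.e. e_i is a
    linear combination of the query indicator rows."""
    rows = [[1 if k in q else 0 for k in range(n)] for q in query_sets]
    e_i = [1 if k == i else 0 for k in range(n)]
    return rank(rows) == rank(rows + [e_i])

# x_a + x_b, x_b + x_c, and x_a + x_c together reveal every value
# (e.g. x_a = (a_1 - a_2 + a_3) / 2); x_a + x_b alone reveals nothing.
```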

Utility of Sum Auditor
We show the expected time to first denial is
- ≥ n/4
- ≤ n + lg n
Good news for large databases: answers are not riddled with denials. We can't do much better: once n−1 independent queries have been answered, at least half of all queries will be denied on average.

Utility of Sum Auditor
In reality:
- Users do not choose queries uniformly at random
- Users cannot query arbitrary subsets of the data
- The database is frequently updated, so old information becomes irrelevant: e.g., after q_1 = x_a + x_b + x_c is answered and x_a is modified, q_2 = x_a + x_b will no longer be denied
Denials may not be so frequent in reality.

Utility: Experiments
- Plot 1: sum queries chosen uniformly at random
- Plot 2: sum queries with updates
- Plot 3: one-dimensional range sum queries

Future Work
- Ways to proactively enhance utility: deny innocuous queries in the present in the hope that more can be answered in the future
- Ward off denial-of-service attacks
- Devise auditors and study utility for more complex queries
- Remove assumptions about prior knowledge
- Solutions to collusion