Output Perturbation with Query Relaxation By: XIAO Xiaokui and TAO Yufei Presenter: CUI Yingjie.

Outline: Introduction & Motivation; The Histogram Approach; Query Relaxation; Experiments; Conclusion

Problem Setting
Microdata: sensitive personal data held by an organization, e.g., medical records or transaction histories. Such data is often opened to public access for purposes such as research.

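Risk to Privacy
An attacker knows Alice's age (20) and zipcode (15k). To infer Alice's income, s/he issues two queries:
$q_0$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 20] AND Zipcode ∈ [15k, 15k] AND Income ∈ [80k, +∞)
$q_0'$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 20] AND Zipcode ∈ [15k, 15k] AND Income ∈ (-∞, 80k)
If Alice is the only tuple matching the age and zipcode predicates, the exact answers $q_0 = 1$ and $q_0' = 0$ reveal that her income is at least 80k.
Table 1: the microdata table T (not reproduced in this transcript).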

Solutions
Output perturbation: inject small random noise into each query result.
ε-differential privacy: let Q be the set of previously answered queries. Given a new query q, the database determines whether answering {q} ∪ Q would violate ε-differential privacy.

Output Perturbation
Count queries: SELECT COUNT(*) FROM T WHERE pred($A_1$) AND ... AND pred($A_d$), where each pred($A_i$) has the form $A_i = *$ or $A_i \in [x_i, y_i]$.
Perturbed answer: given a query q, the database D returns q(T) + δ, where δ is a random variable following the Laplace distribution $f(\delta) = \frac{1}{2\lambda} e^{-|\delta|/\lambda}$ and λ is the noise magnitude.
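The perturbation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `laplace_noise` and `perturbed_count` are hypothetical (they do not come from the slides), and the noise is drawn by inverting the Laplace CDF with the standard library.

```python
import math
import random

def laplace_noise(lam, rng):
    """Draw one sample from the Laplace distribution with scale lam.

    Uses inverse-CDF sampling: for u uniform on (-1/2, 1/2),
    -lam * sgn(u) * ln(1 - 2|u|) follows f(x) = (1 / (2 lam)) * exp(-|x| / lam).
    """
    u = rng.random() - 0.5
    # Guard against log(0) in the (rare) case u == -0.5.
    mag = max(1.0 - 2.0 * abs(u), 1e-300)
    return -lam * math.copysign(1.0, u) * math.log(mag)

def perturbed_count(true_count, lam, rng):
    """Return q(T) + delta, the answer the database would release."""
    return true_count + laplace_noise(lam, rng)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print(perturbed_count(100, 5.0, rng))
```

Each released answer is unbiased (the noise has mean 0), and a larger λ gives stronger protection at the cost of accuracy.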

ε-Differential Privacy
Sibling tables: two microdata tables $T_1$ and $T_2$ that have the same schema and cardinality and differ in only one tuple, e.g., Alice's income is changed from 85k to 30k.
ε-differential privacy: let $Q = \{q_1, ..., q_m\}$ be any subset of the queries that have been answered by D, and $R = \{r_1, ..., r_m\}$ be a set of arbitrary real numbers. D ensures ε-differential privacy if the following inequality holds for any R and any pair of sibling tables $T_1$ and $T_2$:
$\Pr[\forall i,\ q_i(D) = r_i \mid \Delta_1] \le e^{\varepsilon} \cdot \Pr[\forall i,\ q_i(D) = r_i \mid \Delta_2]$
where $\Delta_i$ denotes the event that $T_i$ is the table on which D is constructed.

ε-Differential Privacy: An Example
A statistical database D is built on $T_1$. Q is the set of queries issued by an attacker, and $S_{rst}$ is the set of results returned by D. Suppose D were instead constructed on another table $T_2$ in which Alice's income is arbitrarily modified; it may still return $S_{rst}$. ε-differential privacy requires
$\Pr[D \text{ returns } S_{rst} \mid \text{Alice's income is NOT modified}] \le e^{\varepsilon} \cdot \Pr[D \text{ returns } S_{rst} \mid \text{Alice's income is modified}]$.
For small ε, $e^{\varepsilon} \approx 1 + \varepsilon$, which is close to 1, so the attacker can hardly tell the two cases apart. A smaller ε leads to better privacy.

Computation of ε-Differential Privacy
$L_1$ sensitivity: given a set Q of queries, its $L_1$ sensitivity is
$S_{L1}(Q) = \max_{T_1, T_2} \sum_{q \in Q} |q(T_1) - q(T_2)|$
where $T_1$ and $T_2$ are any two sibling tables.
An example: $Q = \{q_0, q_0'\}$, $T_1$ is Table 1, and $T_2$ changes Alice's income to 30k. We show that $S_{L1}(Q) = 2$.
Upper bound: for any pair of sibling tables, $|q_0(T_1) - q_0(T_2)| \le 1$ and $|q_0'(T_1) - q_0'(T_2)| \le 1$, so $S_{L1}(Q) \le 2$.
Lower bound: $q_0(T_1) = 1$, $q_0(T_2) = 0$, $q_0'(T_1) = 0$, $q_0'(T_2) = 1$, so $S_{L1}(Q) \ge |1 - 0| + |0 - 1| = 2$.
Hence $S_{L1}(Q) = 2$.
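The lower-bound half of this example can be checked mechanically. The sketch below is illustrative only: Table 1 is not available in the transcript, so the tables here are hypothetical (only Alice's age, zipcode and the two income values come from the slides), and the helper names `count` and `l1_distance` are made up for this example.

```python
# Hypothetical sibling tables; only Alice's row follows the slides.
T1 = [
    {"age": 20, "zip": 15000, "income": 85000},  # Alice
    {"age": 35, "zip": 22000, "income": 40000},  # made-up extra tuple
]
# T2 differs from T1 in exactly one tuple: Alice's income becomes 30k.
T2 = [dict(T1[0], income=30000), T1[1]]

def q0(t):
    """Predicate of q_0: Income in [80k, +inf)."""
    return t["age"] == 20 and t["zip"] == 15000 and t["income"] >= 80000

def q0_prime(t):
    """Predicate of q_0': Income in (-inf, 80k)."""
    return t["age"] == 20 and t["zip"] == 15000 and t["income"] < 80000

def count(pred, table):
    """Exact (unperturbed) COUNT(*) of tuples satisfying pred."""
    return sum(1 for t in table if pred(t))

def l1_distance(queries, ta, tb):
    """Sum over the queries of |q(ta) - q(tb)| for one pair of tables."""
    return sum(abs(count(q, ta) - count(q, tb)) for q in queries)

if __name__ == "__main__":
    print(l1_distance([q0, q0_prime], T1, T2))
```

Note that `l1_distance` evaluates a single pair of sibling tables, so it only yields a lower bound on the sensitivity; the sensitivity itself is the maximum over all sibling pairs.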

Computation of ε-Differential Privacy
Theorem 1: A statistical database D ensures ε-differential privacy if and only if $S_{L1}(Q) \le \varepsilon\lambda$.
Lemma 1: Deciding whether $S_{L1}(Q)$ is larger than a threshold is NP-hard.
Proof: by a reduction from the maximum 2-satisfiability (MAX-2-SAT) problem.
So the verification of ε-differential privacy is NP-hard.
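To see why the condition $S_{L1}(Q) \le \varepsilon\lambda$ is sufficient, the following short derivation may help. It is a sketch rather than the proof from the paper, and it assumes that each answer receives independent Laplace noise of magnitude λ and that probability densities stand in for the probabilities in the definition.

```latex
% Ratio of the densities of observing answers r_1, ..., r_m on T_1 versus T_2
\begin{align*}
\frac{\prod_{i=1}^{m} \frac{1}{2\lambda} e^{-|r_i - q_i(T_1)|/\lambda}}
     {\prod_{i=1}^{m} \frac{1}{2\lambda} e^{-|r_i - q_i(T_2)|/\lambda}}
  &= \exp\!\Big( \tfrac{1}{\lambda} \sum_{i=1}^{m}
       \big( |r_i - q_i(T_2)| - |r_i - q_i(T_1)| \big) \Big) \\
  &\le \exp\!\Big( \tfrac{1}{\lambda} \sum_{i=1}^{m} |q_i(T_1) - q_i(T_2)| \Big)
       && \text{(triangle inequality)} \\
  &\le \exp\!\big( S_{L1}(Q) / \lambda \big)
       && \text{(definition of } S_{L1}\text{)} \\
  &\le e^{\varepsilon}
       && \text{(since } S_{L1}(Q) \le \varepsilon\lambda\text{)}.
\end{align*}
```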

Some Definitions
Data space / query region: we regard the data space Ω of a table T as a d-dimensional space whose i-th dimension is $A_i$. The region of a query q is a rectangle r in Ω such that, if q has a predicate "$A_i \in [x_i, y_i]$", the projection of r on $A_i$ equals $[x_i, y_i]$.
Popularity / convergence: for any point p in the data space Ω, its popularity p(Q) is the number of query regions in Q that cover p. The convergence of Q, denoted C(Q), is the largest p(Q) over all points p ∈ Ω.
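These two definitions translate directly into code. The sketch below is an illustration, not the authors' implementation: regions are assumed to be closed axis-aligned rectangles given as one `(lo, hi)` pair per dimension, and the function names are hypothetical. It relies on the fact that, for such rectangles, the largest popularity is attained at a point whose i-th coordinate is the lower bound of some region on dimension i.

```python
from itertools import product

def covers(region, point):
    """True if the closed rectangle `region` contains `point`."""
    return all(lo <= x <= hi for (lo, hi), x in zip(region, point))

def popularity(point, regions):
    """p(Q): the number of query regions that cover the point."""
    return sum(covers(r, point) for r in regions)

def convergence(regions):
    """C(Q): the largest popularity of any point in the data space.

    Only points built from the regions' lower bounds are tested, which is
    enough for closed axis-aligned rectangles.
    """
    if not regions:
        return 0
    d = len(regions[0])
    lower_bounds = [{r[i][0] for r in regions} for i in range(d)]
    return max(popularity(p, regions) for p in product(*lower_bounds))

if __name__ == "__main__":
    # Made-up regions over (Age, Income in thousands).
    Q = [((20, 50), (40, 70)), ((30, 60), (50, 90)), ((55, 70), (10, 45))]
    print(popularity((30, 50), Q), convergence(Q))
```

The candidate set grows as the product of the number of distinct lower bounds per dimension, so this brute-force check is only practical for small query sets.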

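The Upper Bound of $S_{L1}(Q)$
Lemma 2: For any set Q of queries, $S_{L1}(Q) \le 2C(Q)$.
Proof sketch: two sibling tables differ in one tuple, which moves from a point $p_1$ to a point $p_2$ in Ω. Only queries whose regions cover $p_1$ or $p_2$ can change their answers, each by at most 1, so the total change is at most $p_1(Q) + p_2(Q) \le 2C(Q)$.
This bound motivates a simple approach to ensure ε-differential privacy: track the popularity of every point and answer a query only while $C(Q) \le \lambda\varepsilon/2$, which guarantees $S_{L1}(Q) \le \lambda\varepsilon$.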

A Histogram Approach
The above approach requires keeping a popularity value for every point, which is not practical. Instead, we can maintain a histogram H that partitions the data space Ω into rectangular buckets. Each bucket B has a counter B.c that records the number of queries intersecting it.
As long as every bucket satisfies B.c ≤ λε/2, ε-differential privacy is preserved. If a new query intersects a bucket whose counter is greater than or equal to λε/2, the query is rejected.
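The admission rule can be sketched as follows. This is a simplified illustration with assumptions: the bucket layout is fixed up front (the splitting described on the next slides is not modelled), the class name `BucketHistogram` and its methods are hypothetical, and a query is rejected when incrementing any intersected counter would exceed the limit, which coincides with the rule above whenever λε/2 is an integer.

```python
class BucketHistogram:
    """Fixed set of rectangular buckets, each with a query counter."""

    def __init__(self, buckets, limit):
        # Each bucket is a tuple of (lo, hi) pairs, one per dimension.
        self.buckets = list(buckets)
        self.counts = [0] * len(self.buckets)
        self.limit = limit  # the maximum permissible popularity, lambda * eps / 2

    @staticmethod
    def _intersects(a, b):
        """True if two closed rectangles share at least one point."""
        return all(alo <= bhi and blo <= ahi
                   for (alo, ahi), (blo, bhi) in zip(a, b))

    def try_answer(self, region):
        """Accept the query and update the counters, or reject it unchanged."""
        hit = [i for i, b in enumerate(self.buckets)
               if self._intersects(b, region)]
        if any(self.counts[i] + 1 > self.limit for i in hit):
            return False  # some bucket is already at the limit
        for i in hit:
            self.counts[i] += 1
        return True

if __name__ == "__main__":
    h = BucketHistogram([((0, 49),), ((50, 99),)], limit=2)
    for region in [((10, 20),), ((30, 60),), ((0, 5),), ((70, 80),)]:
        print(region, h.try_answer(region))
```

Because a bucket's counter is at least the popularity of every point inside it, the check is conservative: it may reject a query that the exact per-point test would accept, which is the motivation for splitting buckets.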

A Histogram Approach: Simple Split
Initially the histogram has a single bucket. When needed, a bucket B is split into B' and B'' in the way that minimizes B'.c + B''.c. The maximum number of buckets θ is a system parameter.
An example where the maximum permissible popularity λε/2 is 3:

A Histogram Approach: the Split Algorithm
Algorithm Split(B)   /* B is a bucket to be decomposed */
  U = the set of regions of the queries in Q that partially intersect B
  if U ≠ ∅
    remove B from H
    r∩ = the intersection of all the regions in U
    if r∩ = ∅
      split B into buckets B' and B'' with the minimum B'.c + B''.c, using the cutting lines passing the boundaries of the regions in U
    else
      repeatedly split B by the cutting lines passing the boundaries of r∩ until a bucket has extent r∩
    insert the new buckets into H with counters set to B.c

A Histogram Approach: A Complex Split
Query $q_4$: SELECT COUNT(*) FROM T WHERE Age = * AND Income ∈ [40000, 99999]

Limitation of Output Perturbation
Volume of a query: the percentage of points in Ω that satisfy the query.
Consider any solution that 1) ensures ε-differential privacy and 2) perturbs each answer with Laplace noise of magnitude λ, and let θ be the maximum number of queries it can process. If each query has a volume of at least s' and at most 1 - s' (0 < s' ≤ 1/2), then θ < λε / s'.
For queries with volume in (0, 1), such a solution can process at most n · λε queries.

Query Relaxation
Once the maximum number of supported queries is reached, all new queries are rejected. Instead of simply refusing a query, we may return a useful synthetic answer computed only from previously answered queries, so privacy is not violated. This process is called relaxation.
An example:
$q_1'$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 51] AND Income ∈ [40K, 70K]
$q_1$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 50] AND Income ∈ [40K, 70K]

Query Relaxation: Compound
Two disjoint sets $P^+$ and $P^-$ of queries constitute a compound P if 1) for each point p in Ω, $p(P^+) - p(P^-)$ equals 0 or 1, and 2) all points p satisfying $p(P^+) - p(P^-) = 1$ form a rectangle $r_{diff}$, called the difference region of P.
The synthetic answer of P is $\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)$.
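The two compound conditions and the synthetic answer can be illustrated with a brute-force check. This is a sketch under assumptions: the data space is taken to be a small integer grid so that every point can be enumerated, regions are closed rectangles given as `(lo, hi)` pairs, and the names `difference_region` and `synthetic_answer` are hypothetical.

```python
from itertools import product

def covers(region, point):
    """True if the closed rectangle `region` contains `point`."""
    return all(lo <= x <= hi for (lo, hi), x in zip(region, point))

def difference_region(pos, neg, domain):
    """Return r_diff if (pos, neg) is a compound over `domain`, else None.

    `domain` lists an integer (lo, hi) range per dimension; every grid
    point is enumerated, so this is only meant for tiny examples.
    """
    ones = []
    for p in product(*[range(lo, hi + 1) for lo, hi in domain]):
        diff = sum(covers(r, p) for r in pos) - sum(covers(r, p) for r in neg)
        if diff not in (0, 1):
            return None  # condition 1 fails
        if diff == 1:
            ones.append(p)
    if not ones:
        return None
    box = [(min(p[i] for p in ones), max(p[i] for p in ones))
           for i in range(len(domain))]
    size = 1
    for lo, hi in box:
        size *= hi - lo + 1
    # Condition 2: the points with difference 1 must fill their bounding box.
    return box if size == len(ones) else None

def synthetic_answer(pos_answers, neg_answers):
    """Sum of the released answers in P+ minus those in P-."""
    return sum(pos_answers) - sum(neg_answers)

if __name__ == "__main__":
    domain = [(0, 9), (0, 9)]
    pos = [((0, 9), (0, 5))]
    neg = [((0, 3), (0, 5))]
    print(difference_region(pos, neg, domain))
    print(synthetic_answer([120.5], [48.0]))  # made-up perturbed answers
```

Only answers that were already released enter `synthetic_answer`, which is why returning it does not consume any additional privacy budget.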

Relaxation Error
The relaxation error E(P, q) is calculated by a formula shown on the original slide (not reproduced in this transcript).
Let Q be the set of accepted queries and P a compound. A query q ∈ Q that is not in P is a positive (negative) patch if, after including it in $P^+$ ($P^-$), 1) P remains a compound and 2) E(P, q*) decreases.

Artificial Patches
We can dynamically generate a query, have the database process it normally, and use its perturbed answer to obtain a better synthetic answer for the denied query.
2d artificial queries are generated, each aligned with one boundary of $r_{diff}$. Each of them is then checked for whether it is a patch and whether answering it would violate ε-differential privacy.

Probabilistic Accuracy
We return the synthetic answer $\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)$ together with a relaxed query q*'. The synthetic answer has expected value q*'(T) and variance $2\lambda^2 \cdot |P^+ \cup P^-|$, where λ is the noise magnitude.
A tradeoff: more queries in P lower the relaxation error but increase the noise in the synthetic answer. The user may therefore specify an upper bound ξ on the size of a compound.
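The variance formula follows because each answer in the compound carries independent Laplace noise with variance $2\lambda^2$, and variances of independent terms add. A small helper makes the tradeoff concrete; the function names and the numbers in the demo are illustrative assumptions, not values from the slides.

```python
import math

def synthetic_answer_variance(lam, compound_size):
    """Variance of the synthetic answer: 2 * lam^2 per query in the compound.

    compound_size is |P+ ∪ P-|; since P+ and P- are disjoint, it equals
    |P+| + |P-|.
    """
    return 2.0 * lam ** 2 * compound_size

def synthetic_answer_std(lam, compound_size):
    """Standard deviation of the synthetic answer."""
    return math.sqrt(synthetic_answer_variance(lam, compound_size))

if __name__ == "__main__":
    # The noise grows with the square root of the compound size.
    for size in (1, 2, 4, 8):
        print(size, synthetic_answer_std(10.0, size))
```

This is the cost side of the tradeoff: a bound ξ on the compound size caps the standard deviation at $\lambda\sqrt{2\xi}$.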

An Illustration of Relaxation

Experiment Settings
Dataset: CENSUS. Computer: 3 GHz Pentium IV, 1 GB RAM.
Parameters: (not reproduced in this transcript)
Queries: SELECT COUNT(*) FROM CENSUS WHERE $A_1 \in [x_1, y_1]$ AND $A_2 \in [x_2, y_2]$. The center $z_i$ of the range $[x_i, y_i]$ is chosen in two different ways: 1) Data: $z_i = t[A_i]$, where t is a random tuple. 2) Uniform: $z_i$ is a random value in the domain of $A_i$.
The workload contains 20K queries.

Experiment: Processing Capability Without Relaxation
Two approaches are compared. Disjoint: reject a query if its region intersects the region of any previously answered query. Histogram: the histogram approach described above.

Experiment: Processing Capability Without Relaxation Effects of ε and s: The upper bound of capacity: n * λε. Queries with larger regions cause faster growth of C(Q).

Experiment: Quality of Relaxation
Effects of compound size: allowing larger compounds raises the chance of finding a good one. The actual compound size can be well below the bound ξ because of early termination.

Experiment: Quality of Relaxation Effects of ε: A greater ε allows more queries, thus a larger query set Q for relaxation, which enhances the relaxation quality.

Experiment: Quality of Relaxation Effects of s: Queries with larger regions cause faster growth of C(Q), which results in a smaller query set Q and a higher relaxation error.

Experiment: Computation Overhead
A greater ε results in a higher query processing capacity and a larger query set Q; a greater s results in a lower capacity and a smaller Q. A greater ξ results in larger compounds, and a greater θ in more buckets.

Conclusion & Future Work
We propose a practical solution (the histogram) to ensure ε-differential privacy.
We use query relaxation to overcome the limited query processing capacity.
Future work:
- Apply the approach to other kinds of queries (SUM, MIN, MAX, etc.).
- Consider updates to the database.
- Support other types of microdata besides relational tables.

THANKS