1 A Privacy Preserving Index for Range Queries Bijit Hore, Sharad Mehrotra, Gene Tsudik.


2 Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
A client wants to store data on a remote server and run queries on it, but does not trust the server.
Solution: encrypt the data and store it.
How do you query the encrypted data?
[Figure: DAS architecture. Trusted client: user, query translator, query post-processor. Untrusted service provider: server holding the encrypted and indexed client data. Flow: original query, query over encrypted data, encrypted results, true results.]

3 Data storage in DAS
Original table R (plain text), columns: eid, name, addr, shares, age, sal. Rows (eid, name, addr; the remaining values are lost in this transcript): 345 Tom Maple; 876 Mary Main; 234 John River; 780 Jerry Ocean.
Server-side table R^A (encrypted + indexed), columns: etuple, shares^A, age^A, sal^A, where the last three hold bucket tags. Rows: (CH$^*(G#!, X2, Y1, Z1); (^$*D%L*#, X3, Y2, Z2); (*%GH%&)$, X3, Y3, Z3).
Metadata (client-side storage): the buckets Z0, Z1, Z2, Z3, Z4.

4 Querying in DAS
Client-side query on the plain-text table R: Select * from R where R.sal ∈ [400K, 600K]
Server-side query on the encrypted + indexed table R^A: Select etuple from R^A where R^A.sal^A = Z1 ∨ R^A.sal^A = Z2
[Figure: the tables R and R^A with bucket tags, as on the previous slide.]
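The query translation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the bucket boundaries in `SAL_BUCKETS`, the column name `sal_A`, the table name `R_A`, and all function names are assumptions chosen so that the range [400K, 600K] maps to the tags Z1 and Z2 as in the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical client-side metadata: bucket tag -> inclusive salary range.
# The boundaries are invented for illustration; the slides do not give them.
SAL_BUCKETS: Dict[str, Tuple[int, int]] = {
    "Z0": (0, 299_999),
    "Z1": (300_000, 499_999),
    "Z2": (500_000, 699_999),
    "Z3": (700_000, 899_999),
    "Z4": (900_000, 1_099_999),
}


def translate_range(lo: int, hi: int,
                    buckets: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the tags of all buckets that overlap the query range [lo, hi]."""
    return [tag for tag, (b_lo, b_hi) in buckets.items()
            if b_lo <= hi and lo <= b_hi]


def server_query(tags: List[str]) -> str:
    """Build the server-side query over bucket tags (hypothetical names)."""
    if not tags:
        raise ValueError("query range overlaps no bucket")
    predicate = " OR ".join(f"sal_A = '{tag}'" for tag in tags)
    return f"SELECT etuple FROM R_A WHERE {predicate}"


def post_process(rows: List[dict], lo: int, hi: int) -> List[dict]:
    """Client-side filter: drop false positives after decryption."""
    return [row for row in rows if lo <= row["sal"] <= hi]


tags = translate_range(400_000, 600_000, SAL_BUCKETS)
print(tags)                # ['Z1', 'Z2']
print(server_query(tags))
```

The server only ever sees bucket tags; the final filter runs on the trusted client after decryption, which is why recall stays at 1 while precision depends on how the buckets are drawn.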

5 Issues in partitioning
How many buckets should one use?
How should the data be partitioned?

6 Data Privacy in DAS
Adversary (A): access to server-side data + malicious intentions.
Privacy issue in partitioned data: small range of a bucket B + 1 sample value from B → “almost total” disclosure of all elements in B.
Privacy goal of client: to hide all useful information from A.
Extreme option: put all values of an attribute in a single bucket!

7 Research challenges & our contributions
Precision: how to partition data
- Definition
- Optimal partitioning to maximize precision
Privacy: quantifying disclosure
- Adversary’s goals
- Measures of information disclosure
Privacy-precision trade-off
- Controlled diffusion algorithm
- Experiments & conclusion

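8 Precision of range queries
Given a partition of the data into M parts:
Precision(q) = 1 - (# false positives / # tuples returned for q); Recall = 1
Workload: all O(N^2) range queries are equiprobable (uniform)
# false positives ∝ ∑_B N_B * F_B, where N_B is the number of domain values covered by bucket B and F_B is the number of tuples in B
[Figure: frequency histogram over Salary (100K’s) with N = 10 (domain size) and M = 2 buckets, one with N_B = 5, F_B = 32 and one with N_B = 5, F_B = 18, so ∑_B N_B * F_B = 5*32 + 5*18 = 250. For the query q shown, Precision = 1 - 20/50 = 0.6.]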
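The precision definition and the cost term ∑ N_B * F_B can be checked with a short Python sketch. The histogram `FREQ` below is an assumption: the slide's actual frequencies are not in the transcript, so the values were chosen only to agree with the bucket totals the slide reports (32 and 18 tuples) and with a query that has 20 false positives out of 50 returned tuples. Function names are illustrative.

```python
from typing import Dict, List, Tuple

Bucket = Tuple[int, int]  # inclusive (low, high) range of domain values

# Hypothetical histogram over the domain 1..10 (salary in 100K's),
# invented to match the bucket totals on the slide.
FREQ: Dict[int, int] = {1: 4, 2: 4, 3: 4, 4: 10, 5: 10,
                        6: 5, 7: 5, 8: 3, 9: 3, 10: 2}
BUCKETS: List[Bucket] = [(1, 5), (6, 10)]  # M = 2


def bucket_stats(freq: Dict[int, int], bucket: Bucket) -> Tuple[int, int]:
    """Return (N_B, F_B): domain values covered by the bucket, tuples in it."""
    lo, hi = bucket
    n_b = hi - lo + 1
    f_b = sum(freq.get(v, 0) for v in range(lo, hi + 1))
    return n_b, f_b


def precision(freq: Dict[int, int], buckets: List[Bucket],
              query: Bucket) -> float:
    """Precision(q) = 1 - (# false positives / # tuples returned for q)."""
    q_lo, q_hi = query
    # The server returns every tuple of every bucket overlapping the query.
    returned = sum(bucket_stats(freq, b)[1] for b in buckets
                   if b[0] <= q_hi and q_lo <= b[1])
    true_hits = sum(f for v, f in freq.items() if q_lo <= v <= q_hi)
    if returned == 0:
        return 1.0
    return 1 - (returned - true_hits) / returned


cost = sum(n_b * f_b for n_b, f_b in (bucket_stats(FREQ, b) for b in BUCKETS))
print(cost)                               # 5*32 + 5*18 = 250
print(precision(FREQ, BUCKETS, (4, 7)))   # about 0.6, i.e. 1 - 20/50
```

A query that lines up exactly with one bucket, such as (1, 5) here, has precision 1, which is why wide buckets holding many tuples are the expensive ones under a uniform workload.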

9 Query optimal buckets (QOB)
Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # false positives, i.e. minimize ∑_{B=1}^{M} N_B * F_B
Recurrence (example with N = 10 (domain size) and M = 4): QOB(1,10,4) = QOB(1,7,3) + Cost(8,10), where QOB(1,7,3) is the optimal solution to a sub-problem and Cost(8,10) = N_B * F_B = 24 is the cost of the rightmost bucket
[Figure: frequency histogram over Salary (100K’s).]

10 QOB (cont.)
[Figure: frequency histogram over Salary (100K’s), partitioned into 4 buckets.]
Optimal cost = ∑_B N_B * F_B = 12*3 + 20*2 + 10*2 + 8*3 = 120
Time complexity = O(n^2 M), Space = O(nM)
n = # distinct values in dataset; M = # buckets
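The QOB recurrence lends itself to a standard dynamic program. The sketch below is one way to write it under stated assumptions, not the authors' code: the domain is taken to be dense (values 1..n, one frequency per value), N_B is taken as the number of domain values a bucket covers, and the histogram is invented so that its bucket totals agree with the numbers on the slides. The function name is illustrative.

```python
from itertools import accumulate
from typing import List, Tuple


def query_optimal_buckets(freq: List[int],
                          m: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Partition values 1..n into m contiguous buckets minimizing
    sum(N_B * F_B). freq[v - 1] is the frequency of value v.
    Runs in O(n^2 * m) time and O(n * m) space."""
    n = len(freq)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= len(freq)")
    prefix = [0] + list(accumulate(freq))

    def cost(i: int, j: int) -> int:
        # Cost of one bucket covering values i+1..j.
        return (j - i) * (prefix[j] - prefix[i])

    inf = float("inf")
    # best[k][j]: minimal cost of splitting values 1..j into k buckets.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    split = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    split[k][j] = i

    # Walk the split table backwards to recover bucket boundaries.
    buckets = []
    j = n
    for k in range(m, 0, -1):
        i = split[k][j]
        buckets.append((i + 1, j))
        j = i
    buckets.reverse()
    return best[m][n], buckets


# Hypothetical histogram over the domain 1..10, not taken from the slides.
freq = [4, 4, 4, 10, 10, 5, 5, 3, 3, 2]
print(query_optimal_buckets(freq, 4))
# (120, [(1, 3), (4, 5), (6, 7), (8, 10)])
```

With this particular histogram the four-bucket optimum is 36 + 40 + 20 + 24 = 120, and the last step of the recurrence is QOB(1,7,3) + Cost(8,10) = 96 + 24.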

11 Outline: Optimal data partitioning for range queries; Adversarial goals & privacy measures; Balancing privacy and precision; Experiments & conclusion

12 Adversary’s learning model
A needs to learn bucket properties to estimate sensitive values.
Model: A’s domain knowledge + sample values from buckets → A learns the distribution of values in buckets.
Worst-case assumption for privacy analysis: A knows the exact value distribution for every bucket.

13 Adversarial Goal (I)
Individual-centric information, e.g. “What is the salary of an individual I?”
Value Estimation Power (VEP) of A: the variance of the bucket distribution is an inverse measure of VEP.
[Figure: two distributions over a bucket range. Large variance: large average error of value estimation for the adversary (preferred). Small variance: small average error.]

14 Adversarial Goal (II)
Query-centric information, e.g. “Which individuals have salary ∈ [100k, 150k]?”
Set Estimation Power (SEP) of A: the entropy of the bucket distribution, H(X) = - ∑_i p_i log p_i, is an inverse measure of SEP.
[Figure: two distributions over a bucket range containing [100k, 150k]. High entropy + large variance: large average error of query-set estimation for the adversary (best case). Low entropy + large variance: small average error.]
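Both privacy measures are simple statistics of the values that land in one bucket. The following Python sketch computes them from a bucket's contents; it assumes the empirical (population) variance and base-2 entropy, neither of which the slides specify, and the function names and sample buckets are illustrative.

```python
from collections import Counter
from math import log2
from typing import Sequence


def bucket_variance(values: Sequence[float]) -> float:
    """Population variance of the values in one bucket (inverse of VEP)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def bucket_entropy(values: Sequence[float]) -> float:
    """Shannon entropy in bits, H(X) = -sum(p_i * log2(p_i)), of the
    empirical value distribution in one bucket (inverse of SEP)."""
    n = len(values)
    counts = Counter(values)
    # p * log2(1 / p) equals -p * log2(p) and avoids printing -0.0.
    return sum((c / n) * log2(n / c) for c in counts.values())


narrow = [100, 100, 100, 101]   # tight range, mostly one value
wide = [100, 110, 130, 150]     # spread out, all values distinct
for name, bucket in (("narrow", narrow), ("wide", wide)):
    print(name, bucket_variance(bucket), bucket_entropy(bucket))
```

A bucket can have large variance and still low entropy, for instance two far-apart values where one of them dominates, which is why the slides treat the two measures separately.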

15 Outline: Optimal data partitioning for range queries; Adversarial goals & privacy measures; Balancing privacy and precision; Experiments & conclusion

16 Privacy-Precision Trade-off
Optimal buckets might offer less privacy than desired:
- Small variance → partial disclosure of numeric value
- Small entropy → total disclosure with high probability (e.g. categorical data)
- Partial detection of query-sets (for all cases)
Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy.

17 The controlled diffusion algorithm: a simple observation
Let a query Q overlap only with B_0.
If the elements of B_0 are distributed randomly into CB_1, CB_2 & CB_3, Q now overlaps with CB_1, CB_2 & CB_3.
With the new buckets, the precision for Q drops by a factor of (|CB_1| + |CB_2| + |CB_3|) / |B_0|.
Any re-distribution scheme where this ratio ≤ K for every B_i → precision degradation is bounded above by K.
[Figure: query Q over bucket B_0 and composite buckets CB_1, CB_2, CB_3.]

18 Controlled diffusion algorithm
- Compute optimal buckets on data set D → B_1 … B_M
- Fix max degradation factor = K
- Initialize M empty composite buckets → CB_1 … CB_M
- Set target size of each CB to f_CB = |D|/M (equidepth)
- ∀ B_i: select d_i CB’s at random, where d_i = K*|B_i|/f_CB, and diffuse the elements of B_i into these uniformly at random
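The steps of the algorithm translate fairly directly into Python. The sketch below is a simplified reading of the slide, with several assumptions labeled in the code: d_i is rounded to the nearest integer and clamped to [1, M], each element picks one of its d_i target buckets independently, and the target size f_CB is used only to compute d_i rather than enforced as a hard capacity. The function names are illustrative.

```python
import random
from typing import List, Optional, Sequence


def diffusion_fanout(bucket_size: int, k: float, f_cb: float, m: int) -> int:
    """d_i = K * |B_i| / f_CB. Rounding and clamping to [1, m] are
    assumptions; the slide gives only the ratio."""
    return max(1, min(m, round(k * bucket_size / f_cb)))


def controlled_diffusion(optimal_buckets: Sequence[Sequence[float]],
                         k: float,
                         rng: Optional[random.Random] = None
                         ) -> List[List[float]]:
    """Diffuse M optimal buckets into M composite buckets."""
    rng = rng or random.Random()
    m = len(optimal_buckets)
    total = sum(len(b) for b in optimal_buckets)
    composite: List[List[float]] = [[] for _ in range(m)]
    if total == 0:
        return composite
    f_cb = total / m  # equidepth target size |D| / M
    for bucket in optimal_buckets:
        if not bucket:
            continue
        d_i = diffusion_fanout(len(bucket), k, f_cb, m)
        # Choose d_i distinct composite buckets at random ...
        targets = rng.sample(range(m), d_i)
        # ... and send each element to one of them uniformly at random.
        for value in bucket:
            composite[rng.choice(targets)].append(value)
    return composite


optimal = [list(range(0, 30)), list(range(30, 40)),
           list(range(40, 60)), list(range(60, 100))]
for cb in controlled_diffusion(optimal, k=2, rng=random.Random(7)):
    print(len(cb), sorted(cb)[:5])
```

Ignoring the clamping, each composite bucket is chosen by B_i with probability d_i/M and then receives |B_i|/d_i elements on average, so its expected size is |D|/M = f_CB, which is consistent with the equidepth target even though this sketch does not enforce it.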

19 Controlled Diffusion (Example)
[Figure: a frequency-vs-values histogram with the query optimal buckets (B1, B2, B3, ...) is diffused, with degradation factor K = 2, into the composite buckets CB1, CB2, CB3, CB4, which form the final set of buckets on the server.]
Metadata size increases from O(M) to O(KM)

20 Some features of the diffusion algorithm
Many consecutive optimal buckets might get diffused into a common set of CB’s
- → Observed precision degradation < K
Elements with the same value can go to multiple buckets
- → Gives an extra degree of freedom compared to hashing
- Not best for point queries
Random choice in the algorithm
- → Each bucket distribution approaches the data distribution as K increases, reducing the information the adversary gains by learning buckets

21 Outline: Optimal data partitioning for range queries; Adversarial goals & privacy measures; Balancing privacy and precision; Experiments & conclusion

22 Experiments
Data sets
- Synthetic data: 10^5 integers in [0, 999], chosen uniformly at random
- Real data: 10^4 real values in [-0.8, 8.0] from the “Corel Image” dataset (UCI KDD archive)
Query workloads (2 of size 10^4 each)
- End points chosen uniformly at random from the respective ranges

23 [Plots]
1. Relative decrease in precision of composite buckets
2. Relative increase in standard deviation in composite buckets
3. Relative increase in entropy in composite buckets

24 Composite buckets (sample)
[Figure: two sample plots, one for K = 6, M = 350 and one for K = 10, M = 250.]

25 Visualizing trade-offs for various bucketization parameters
E.g. the marked points show the average entropy & precision obtained for 100 buckets & a degradation factor of 2, and the same point in the precision vs. standard deviation trade-off space.
→ Provides an easy way to visualize the design space and choose parameters of interest.

26 Summary
An optimal algorithm for partitioning data for range queries
Statistical measures of data privacy
- Variance
- Entropy
Fast & simple algorithm for re-bucketizing data
- Bounded amount of precision degradation
- Substantial increase in privacy level

27 Related work
- Hacigumus et al., SIGMOD 2002, “Executing SQL over Encrypted Data in the Database Service Provider Model”.
- Damiani et al., ACM CCS 2003, “Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs”.
- Bouganim et al., VLDB 2002, “Chip-Secured Data Access: Confidential Data on Untrusted Servers”.

28 Thank you! Questions?

29 Privacy in DAS
Here the goal of “data privacy” is not just ensuring “non-disclosure of identity”. It is more general!
Privacy-preserving DM & Statistical DB
- Privacy criteria: protect against disclosure of identity
- Utility criteria: minimize information loss, i.e. maximize utility for data miners and retain as much aggregate-level information as possible
DAS
- Privacy criteria: hide as much information as possible (even at the aggregate level)
- Utility criteria: maintain only the necessary information required for server-side query evaluation (at the desired degree of accuracy)

30 Individual Privacy Measure
Average Squared Error of Estimation (ASEE): the error in approximating the true value of a r.v. X_B by another r.v. X_B' (learned by A)
ASEE(X_B, X_B') = Var(X_B) + Var(X_B') + (E(X_B) - E(X_B'))^2
The variance of the bucket distribution, Var(X_B), is our measure of individual privacy (a lower bound on ASEE)
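The ASEE identity follows from a short calculation. The derivation below assumes, as the slide does not state explicitly, that ASEE is the expected squared difference E[(X_B - X_B')^2] and that X_B and X_B' are independent; the symbols \mu and \mu' for the two means are introduced here only for brevity.

```latex
\begin{align*}
\mathrm{ASEE}(X_B, X_B')
  &= E\big[(X_B - X_B')^2\big] \\
  &= E[X_B^2] - 2\,E[X_B X_B'] + E[X_B'^2] \\
  &= \big(\mathrm{Var}(X_B) + \mu^2\big) - 2\mu\mu'
     + \big(\mathrm{Var}(X_B') + \mu'^2\big)
     && \text{(independence: } E[X_B X_B'] = \mu\mu'\text{)} \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X_B') + (\mu - \mu')^2 \\
  &\ge \mathrm{Var}(X_B).
\end{align*}
```

The last two terms are non-negative, so Var(X_B) bounds the adversary's error from below however good the estimate X_B' is, which is the sense in which the variance serves as a lower bound.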

31 Set-oriented Privacy Measure
Entropy of the bucket distribution is our measure of query-centric privacy: H(X) = - ∑_i p_i log p_i
- Measures the uncertainty associated with a r.v. (e.g. the true class of an element for categorical data)
- An inverse measure of the quality of the partial solution sets that A can derive for a query

32 Metadata size increase in diffusion
The metadata increases from O(M) to
K*|B_1|/f_CB + K*|B_2|/f_CB + … + K*|B_M|/f_CB = (K/f_CB) * (|B_1| + |B_2| + … + |B_M|) = (KM/|D|) * |D| = KM = O(KM)

