A Privacy Preserving Index for Range Queries


1 A Privacy Preserving Index for Range Queries
Bijit Hore, Sharad Mehrotra, Gene Tsudik

2 Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
A client wants to store data on a remote server and run queries on it, BUT he does not trust the server. Solution: encrypt the data and store it. How do you query the encrypted data?
[Figure: a user sends the original query to the trusted client; a query translator issues a query over encrypted data to the untrusted service provider, which stores the encrypted and indexed client data and returns encrypted results; a query post-processor on the client produces the true results.]
1) We start with some background for the present work.
2) Hacigumus et al., in their SIGMOD 2002 paper, proposed a system architecture to implement database as a service.
3) In their model only limited trust is placed on the service provider. As a result, data privacy becomes a key issue to address.
4) The performance criterion is to push as much query processing as possible to the server side.
5) A simplified version of the proposed architecture is shown in the figure.

3 Data storage in DAS: client-side storage (metadata) and server-side data
Original table (plaintext) R:

  eid  name   addr   shares  age  sal
  345  Tom    Maple  5400    32   390K
  876  Mary   Main   5800    22   423K
  234  John   River  6000    34   598K
  780  Jerry  Ocean  6200    48   632K

Server-side table (encrypted + indexed) RA: each row stores an encrypted tuple (etuple, e.g. CH$^*(G#!, ^$*D%L*#, *%GH%&)$) together with a bucket tag for each indexed attribute (sharesA in X1, X2, X3, ...; ageA in Y1, Y2, Y3, ...; salA in Z1, Z2, Z3, Z4).

1) Let us look at the data storage model in the Hacigumus architecture.
2) A relational table is shown on the left and its server-side representation on the right.
3) The idea is to encrypt each row of the table and keep it as a single etuple on the server.
4) For each attribute to be queried, partition the values into buckets in some manner (equidepth or equiwidth) and store the bucket tags of these tuples as the indexing information on the server instead of their true values.
5) Also store this value-to-bucket mapping information on the client, which is used for translating user queries to server-side queries. (Server-side queries can only distinguish between values belonging to different buckets, based on the corresponding indexing information.)

4 Select * from R where R.sal ∈ [400K, 600K]
Querying in DAS. Client-side query: Select * from R where R.sal ∈ [400K, 600K]. Server-side query: Select etuple from RA where RA.salA = z1 ∨ z2.
[Figure: the client-side table R (plaintext, with salaries 390K, 426K, 598K, 634K) and the server-side table RA (encrypted + indexed with bucket tags).]
1) Now let us look at a simple range query where the user wishes to select all records of individuals whose salary falls in the range 400K to 600K.
2) Using the metadata, this query is translated to the server-side query, where the server returns all rows whose salary value falls in bucket Z1 or Z2.
3) As a result, 3 rows are selected and returned to the client. The client side has to discard the one record which is a false positive before returning the correct result to the user, thus incurring an overhead.
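The query translation step above can be sketched in a few lines. This is a toy illustration, not the paper's code: the helper names and the bucket boundaries below are hypothetical, chosen only so the example matches the slide's z1 ∨ z2 result.

```python
# Client-side metadata: value range -> server-side bucket tag for "sal".
# (Hypothetical boundaries for illustration.)
SAL_BUCKETS = [((0, 400_000), "z1"), ((400_001, 600_000), "z2"),
               ((600_001, 800_000), "z3"), ((800_001, 1_000_000), "z4")]

def translate_range(lo, hi, buckets):
    """Return the tags of every bucket that overlaps [lo, hi]."""
    return [tag for (blo, bhi), tag in buckets if blo <= hi and lo <= bhi]

tags = translate_range(400_000, 600_000, SAL_BUCKETS)
# Client query:  SELECT * FROM R WHERE R.sal BETWEEN 400K AND 600K
# Server query:  SELECT etuple FROM RA WHERE RA.salA IN ('z1', 'z2')
server_sql = "SELECT etuple FROM RA WHERE RA.salA IN ({})".format(
    ", ".join(repr(t) for t in tags))
print(server_sql)
```

Note that the server can only filter at bucket granularity, which is why the false positive from z1 reaches the client.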

5 Issues in partitioning
How many buckets should one use? How should one partition the data? One constraint on the number of buckets is the bound on metadata size. We answer these questions in the following slides: the first experimentally, the second analytically.

6 “Almost total” disclosure of all elements in B
Data Privacy in DAS. Adversary (A): access to server-side data + malicious intentions. Privacy issue in partitioned data: a small range for a bucket B + 1 sample value from B gives "almost total" disclosure of all elements in B. Privacy goal of the client: hide all useful information from A, e.g. put all values of an attribute in a single bucket!
1) Now let us look at the privacy issues in DAS. As we stated earlier, in the DAS model of Hacigumus the server side is untrusted, therefore the GOAL is to hide ALL USEFUL INFORMATION from the adversary. In the partition-based approach, we would then like to put all elements in a single bucket, rendering all records indistinguishable from each other: TOTAL PRIVACY.
2) But in reality we need multiple buckets to enhance performance (the more buckets, the greater the average precision of queries). A scheme like equidepth or equiwidth bucketization might then map each bucket to a small range of values.
3) So if A has domain knowledge + sample values from buckets, he can localize the estimated value of an attribute to a small interval, resulting in disclosure of true values within a small margin of error.

7 Research challenges & our contributions
Precision: how to partition data (definition; optimal partitioning to maximize precision). Privacy: quantifying disclosure (adversary's goals; measures of information disclosure). Privacy-precision trade-off: controlled diffusion algorithm. Experiments & conclusion.
1) In this paper, we restrict the study to the domain of range queries over a static database with a single attribute.
2) The remaining talk is structured as follows:
3) We outline an algorithm for optimally partitioning a dataset into a specified number of buckets so as to maximize query precision, that is, minimize the number of false positives.
4) Then we describe our adversarial model, which is relevant to the DAS scenario. We identify a couple of important adversarial goals and respective measures of information disclosure in view of these goals. We will refer to these equivalently as our "privacy measures".
5) Subsequently, we propose a simple data re-distribution algorithm that balances privacy and precision of range queries.
6) Finally we end with some experimental results and conclusions.

8 Precision of range queries
Given a partition of the data into M parts:
Precision(q) = 1 - (# false positives / # tuples returned for q); Recall = 1.
Workload: all O(N^2) range queries are equiprobable (uniform).
# false positives ∝ Σ_B NB*FB = 5*32 + 5*18 = 250.
Example query q: Precision = 1 - 20/50 = 0.6.
[Figure: histogram over the salary domain (100K's) with N = 10 distinct values and frequencies 10, 10, 6, 4, 4, 4, 4, 4, 2, 2, partitioned into M = 2 buckets; the second bucket has NB = 5, FB = 18; a query q spans both buckets.]
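The precision definition above is easy to compute once the client can see the plaintext of the returned tuples. A minimal sketch (illustrative, using slide 4's three returned salaries rather than the histogram example):

```python
def precision(returned_values, lo, hi):
    """returned_values: plaintext values of the tuples the server returned
    for a range query over [lo, hi]. Recall is always 1 under bucketization,
    so precision = 1 - false_positives / returned."""
    matches = sum(lo <= v <= hi for v in returned_values)
    false_pos = len(returned_values) - matches
    return 1 - false_pos / len(returned_values)

# Slide 4's example: buckets z1 ∨ z2 return salaries 390K, 426K, 598K
# for the query [400K, 600K]; 390K is the one false positive.
p = precision([390_000, 426_000, 598_000], 400_000, 600_000)
print(p)   # 2 of the 3 returned tuples match
```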

9 Query optimal buckets (QOB)
Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # of false positives, i.e. minimize Σ_{B=1..4} NB*FB. Optimal solution to a sub-problem + cost of the rightmost bucket: QOB(1,10,4) = QOB(1,7,3) + Cost(8,10).
1) To keep the analysis simple, we consider a discrete numeric domain of size N for the data.
2) We make an assumption on the query workload: all O(N^2) queries are equiprobable.
3) Let us take the following example, where the dataset has 10 distinct values with the frequencies shown.
4) Say we want to partition these into 4 buckets. The goal is to minimize the TOTAL # of false positives over all queries, which translates to "minimizing the false positives contributed by each bucket".
5) It is shown in the paper that a single bucket's cost is proportional to (the range of the bucket) times (the total # of elements in it).
6) This problem has the optimal substructure property, which makes it amenable to dynamic programming. For this particular example, the optimal cost can be written as the sum of the optimal cost of the smaller left-hand problem using 3 buckets and the cost of the shown rightmost single bucket.
[Figure: the histogram (frequencies 10, 10, 6, 4, 4, 4, 4, 4, 2, 2 over the salary domain in 100K's, N = 10); the rightmost bucket has NB*FB = 24.]

10 Optimal cost = ∑NB*FB = 12*3 + 20*2 + 10*2 + 8*3 = 120
QOB (cont.) Optimal cost = Σ_{B=1..4} NB*FB = 12*3 + 20*2 + 10*2 + 8*3 = 120, over the four buckets B1 to B4.
1) Completing the example, this shows the optimal partition of the dataset. Note that this partitioning is not equidepth or equiwidth in general.
2) The time complexity of the algorithm is O(n^2 * M), where n is the # of distinct values occurring in the dataset.
3) The space complexity is O(nM), due to the matrices stored in the DP algorithm.
4) Though we assumed equiprobability of all queries, we can also tackle the case where the query workload follows any distribution; it simply needs an additional linear pre-processing phase to enable computation of bucket costs in the algorithm.
[Figure: the histogram partitioned into buckets B1 to B4.]
Time complexity = O(n^2 M), Space = O(nM); n = # distinct values in dataset, M = # buckets.
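The dynamic program behind QOB can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it uses the slide's cost model (bucket cost = range of bucket × number of elements in it) and the optimal-substructure recurrence described above. On the slide's histogram it recovers a four-bucket partition whose terms 12*3 + 20*2 + 10*2 + 8*3 sum to 120.

```python
def qob(freq, M):
    """Optimal M-bucket partition of a discrete domain with frequencies
    freq[0..n-1], minimizing sum over buckets of (width * element count)."""
    n = len(freq)
    pre = [0] * (n + 1)                 # prefix sums for O(1) counts
    for i, f in enumerate(freq):
        pre[i + 1] = pre[i] + f

    def cost(lo, hi):                   # 0-based, inclusive bucket [lo, hi]
        return (hi - lo + 1) * (pre[hi + 1] - pre[lo])

    INF = float("inf")
    # dp[m][i]: min cost of splitting the first i values into m buckets
    dp = [[INF] * (n + 1) for _ in range(M + 1)]
    cut = [[0] * (n + 1) for _ in range(M + 1)]
    dp[0][0] = 0
    for m in range(1, M + 1):
        for i in range(m, n + 1):
            for j in range(m - 1, i):   # last bucket covers values j..i-1
                c = dp[m - 1][j] + cost(j, i - 1)
                if c < dp[m][i]:
                    dp[m][i], cut[m][i] = c, j
    bounds, i = [], n                   # recover bucket boundaries
    for m in range(M, 0, -1):
        j = cut[m][i]
        bounds.append((j, i - 1))
        i = j
    return dp[M][n], list(reversed(bounds))

freq = [10, 10, 6, 4, 4, 4, 4, 4, 2, 2]   # the slide's histogram
best, buckets = qob(freq, 4)
print(best, buckets)
```

The triple loop gives the O(n^2 * M) time and the two matrices give the O(nM) space quoted on the slide.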

11 Optimal data partitioning for range queries
Outline
- Optimal data partitioning for range queries
- Adversarial goals & privacy measures
- Balancing privacy and precision
- Experiments & conclusion
1) We have an algorithm for optimal data partitioning to answer range queries.
2) Now let us look at the adversarial model and relevant privacy measures.

12 Adversary’s learning model
Need to learn bucket properties to estimate sensitive values. Model: A's domain knowledge + sample values from buckets; A learns the distribution of values in buckets. Worst-case assumption for privacy analysis: A knows the exact value distribution for every bucket.
1) As is evident from the data representation in DAS, the adversary needs to learn the bucket properties in order to infer something useful about the sensitive data values in a bucket.
2) Therefore in our model the adversary, along with his general domain knowledge, also has access to a sample of values belonging to each bucket, enabling him to learn statistical properties of the ensemble of data elements within each bucket.
3) In the most generic case, we assume that A learns the distribution of the values within each bucket.
4) As a result we are now interested in the information leakage that happens due to A learning each bucket's distribution.
5) For our analysis purposes, we assume the worst-case scenario, where the adversary has learned the exact distribution for each bucket. For instance, in the discrete numeric domain, the adversary would know the values in a bucket and their respective frequencies.

13 Individual Centric Information:
Adversarial Goal (I). Individual-centric information, e.g. "What is the salary of an individual I?" Value Estimation Power (VEP) of A: the variance of the bucket distribution is an inverse measure of VEP.
1) Now let us see what might comprise "useful information" for the adversary.
2) One type could be individual-centric information. An example could be as shown: what is the salary of an individual I?
3) We refer to the adversary's potential to answer such queries as his "Value Estimation Power" (VEP). Equivalently, when A knows that a random variable follows a specific distribution, how well can he localize the value of an instance of the r.v.?
4) As we show in the paper, the average error in A's estimation (guess) of this value is bounded below by the variance of the distribution it comes from, i.e. the bucket distribution.
5) Therefore the variance of the bucket distribution is an appropriate measure of the adversary's value estimation power.
6) Hence the greater the variance, the smaller the chance of disclosure.
[Figure: two bucket distributions; large variance over a large bucket range gives a large average error of value estimation for the adversary (preferred), small variance over a small bucket range gives a small error.]

14 Query Centric Information:
Adversarial Goal (II). Query-centric information, e.g. "Which individuals have salary ∈ [100K, 150K]?" Set Estimation Power (SEP) of A: the entropy of the bucket distribution is an inverse measure of SEP*. Best case: high entropy + large variance.
1) We identify a second kind of information that the adversary might be interested in, which we call "query-centric information".
2) An example could be where the adversary wants to identify all the records whose salary value falls in a given range, say [100K, 150K].
3) We denote the adversary's ability to answer such queries by his "Set Estimation Power" (SEP).
4) As it turns out, entropy is an appropriate measure of SEP. We show in the extended version of the paper how a notion of partial correctness of query sets can be tied to the average entropy of buckets in this application.
5) Look at the example on the left: the entropy of this distribution is small, though the variance is large. Since about half the elements are known by the adversary to be in the range [100K, 150K], a random classification of elements into the two classes "belongs to this range" and "does not belong to this range" will have pretty high accuracy.
6) Besides, entropy is a universal measure of the uncertainty associated with a random variable. Therefore we propose the entropy of the bucket distributions as our second measure of privacy.
H(X) = - Σ p_i log p_i
[Figure: a low-entropy, large-variance bucket distribution (left) versus a preferred high-entropy one (right), with the range [100K, 150K] marked; the former gives the adversary only a small average error of query-set estimation.]
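The two per-bucket privacy measures are straightforward to compute. A small sketch with toy bucket distributions (the specific value-to-frequency maps are made up for illustration): higher variance means lower VEP, higher entropy means lower SEP.

```python
import math

def bucket_measures(freq):
    """Variance and Shannon entropy of a bucket's value distribution,
    given as a dict mapping value -> frequency."""
    total = sum(freq.values())
    probs = {v: f / total for v, f in freq.items()}
    mean = sum(v * p for v, p in probs.items())
    variance = sum(p * (v - mean) ** 2 for v, p in probs.items())
    entropy = -sum(p * math.log2(p) for p in probs.values())
    return variance, entropy

# A bimodal bucket: large variance (good against value estimation)
# but low entropy (weak against set estimation) -- slide 14's caveat.
var_a, ent_a = bucket_measures({100: 50, 900: 50})
# A uniform bucket over ten values: maximal entropy for its size.
var_b, ent_b = bucket_measures({v: 10 for v in range(100, 200, 10)})
print(var_a, ent_a, var_b, ent_b)
```

The bimodal bucket's entropy is only 1 bit even though its variance is huge, which is exactly why the talk asks for high entropy *and* large variance.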

15 Optimal data partitioning for range queries
Outline
- Optimal data partitioning for range queries
- Adversarial goals & privacy measures
- Balancing privacy and precision
- Experiments & conclusion
1) Now we look at the privacy-precision trade-off in the data partitioning scheme and propose a new algorithm to obtain the desired degree of balance between the two.

16 Privacy-Precision Trade-off
Optimal buckets might offer less privacy than desired:
- Small variance → partial disclosure of a numeric value.
- Small entropy → total disclosure with high probability (e.g. categorical data); partial detection of query-sets (in all cases).
Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy.
1) As is evident from the discussion so far, there is a clear trade-off between the privacy attained and the precision of range queries. Query-optimal buckets might offer a very small degree of privacy.
2) Small variance of buckets can lead to partial disclosure of sensitive numeric values.
3) Small entropy can lead to total disclosure with a high probability (as in the case of categorical data), or at least detection of partially correct query sets even for numeric data.
4) Next we describe our trade-off algorithm, which allows the client to partition his data in a controlled manner, thereby achieving the desired degree of balance between privacy and precision.

17 The controlled diffusion algorithm
A simple observation: let a query Q overlap only with bucket B0. If the elements of B0 are distributed into CB1, CB2 & CB3 randomly, Q now overlaps with CB1, CB2 & CB3. With the new buckets, the precision for Q drops by a factor of (|CB1| + |CB2| + |CB3|) / |B0|. Under any re-distribution scheme where, for all Bi, this ratio is ≤ K, precision degradation is bounded above by K.

18 Controlled diffusion algorithm
1. Compute optimal buckets on dataset D → B1 … BM.
2. Fix the maximum degradation factor K.
3. Initialize M empty composite buckets CB1 … CBM.
4. Set the target size of each CB to f_CB = |D|/M (equidepth).
5. For each Bi, select d_i CBs at random, where d_i = K*|Bi|/f_CB.
6. Diffuse the elements of Bi into these uniformly at random.
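The steps above can be sketched as follows. This is a minimal illustrative reconstruction from the slide's pseudocode, not the paper's implementation; the rounding of d_i and the toy input are assumptions.

```python
import math
import random

def controlled_diffusion(optimal_buckets, K, seed=0):
    """Scatter each optimal bucket Bi over about K*|Bi|/f_CB composite
    buckets chosen at random, bounding precision degradation by K."""
    rng = random.Random(seed)
    D = sum(len(b) for b in optimal_buckets)
    M = len(optimal_buckets)
    f_cb = D / M                                  # equidepth target size
    composite = [[] for _ in range(M)]
    for b in optimal_buckets:
        d = max(1, math.ceil(K * len(b) / f_cb))  # d_i = K*|Bi|/f_CB
        targets = rng.sample(range(M), min(d, M))
        for x in b:                               # diffuse uniformly at random
            composite[rng.choice(targets)].append(x)
    return composite

buckets = [[1, 1, 2], [3, 4, 4, 4], [5, 6], [8, 9, 10]]
cbs = controlled_diffusion(buckets, K=2)
print([sorted(cb) for cb in cbs])
```

Every element lands in exactly one composite bucket, so the data is re-partitioned rather than replicated; only the client-side metadata grows (to O(KM), as slide 32 computes).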

19 Controlled Diffusion (Example)
Query-optimal buckets B1 to B4 are diffused, with degradation factor K = 2, into composite buckets CB1 to CB4 (the final set of buckets on the server). Metadata size increases from O(M) to O(KM).
[Figure: the histogram (frequencies 10, 10, 6, 4, 4, 4, 4, 4, 2, 2 over the value domain) partitioned into B1 to B4, and the resulting composite buckets, with each optimal bucket scattered over two CBs.]

20 Some features of the diffusion algorithm
Many consecutive optimal buckets might get diffused into a common set of CBs → the observed precision degradation is < K. Elements with the same value can go to multiple buckets → an extra degree of freedom compared to hashing (though not best for point queries). The random choices in the algorithm → each bucket's distribution approaches the overall data distribution as K increases → reducing the information gained by the adversary by learning buckets.

21 Optimal data partitioning for range queries
Outline
- Optimal data partitioning for range queries
- Adversarial goals & privacy measures
- Balancing privacy and precision
- Experiments & conclusion
1) So now we have given a simple algorithm to partition data, which allows one to trade off a bounded amount of query precision to enhance data privacy.
2) Let us now see some experimental results to see how effective this method is.

22 Query workloads (2 of size 10^4 each)
Experiments. Datasets:
- Synthetic data: 10^5 integers in [0, 999], drawn uniformly at random.
- Real data: 10^4 real values in [-0.8, 8.0] from the "Corel Image" dataset (UCI KDD archive).
Query workloads (2, of size 10^4 each): end points chosen uniformly at random from the respective ranges.

23 Relative decrease in precision of composite buckets
Relative increase in standard deviation in composite buckets; relative increase in entropy in composite buckets.
1) The top figure shows the ratio of the average query precision using optimal buckets to that using composite buckets, for the same # of buckets. Each set of points is for a fixed value of the "maximum allowed degradation factor" K.
2) The bottom-left figure shows the average increase in the standard deviation of the distribution within CBs relative to that within the optimal buckets, again for various values of K, against an increasing # of buckets.
3) The bottom-right figure shows the same for the entropy ratio.

24 Composite buckets (sample)
K = 6, M = 350 K = 10, M = 250

25 Visualizing trade-offs for various bucketization parameters
E.g., the marked points show the average entropy & precision we get for 100 buckets & a degradation factor of 2, and the same point in the precision vs. standard deviation trade-off space. This provides an easy way to visualize the design space and choose parameters of interest.
1) These plots show privacy and performance in a trade-off space.
2) Each class of points refers to a fixed value of K, and each point in a class denotes the characteristics for a particular total number of buckets used.
3) For example, the marked points show the average entropy against average precision when 100 composite buckets are used with a degradation factor of 2. Similarly, the lower figure shows the corresponding average standard deviation for these 100 buckets.

26 An optimal algorithm for partitioning data for range queries
Summary An optimal algorithm for partitioning data for range queries Statistical measures of data privacy Variance Entropy Fast & simple algorithm for re-bucketizing data Bounded amount of precision degradation Substantial increase in privacy level

27 Related work
Hacigumus et al., SIGMOD 2002, "Executing SQL over Encrypted Data in the Database-Service-Provider Model".
Damiani et al., ACM CCS 2003, "Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs".
Bouganim et al., VLDB 2002, "Chip-Secured Data Access: Confidential Data on Untrusted Servers".

28 THANK YOU ! Questions ?

29 Privacy-preserving DM & Statistical DB
Privacy in DAS: here the goal of "data privacy" is not just ensuring "non-disclosure of identity". It is more general!
DAS:
- Privacy criterion: hide as much information as possible (even at the aggregate level).
- Utility criterion: maintain only the information necessary for server-side query evaluation (at the desired degree of accuracy).
Privacy-preserving DM & statistical DBs:
- Privacy criterion: protect against disclosure of identity.
- Utility criterion: minimize information loss, i.e. maximize utility for data miners by retaining as much aggregate-level information as possible.

30 Individual Privacy Measure
Average Squared Error of Estimation (ASEE): the error in approximating the true value of a r.v. XB by another r.v. XB' (learned by A).
ASEE(XB, XB') = Var(XB) + Var(XB') + (E(XB) - E(XB'))^2
The variance of the bucket distribution, Var(XB), is our measure of individual privacy (a lower bound).
We first define the term "Average Squared Error of Estimation" for A: ASEE(X, X') is simply a measure of the error that A is expected to make when approximating the value of a random variable X by X'. It is shown in the paper that ASEE is given by the expression above when X and X' are independent; Var(X) denotes the variance of X and E(X) the expected value of X. Therefore ASEE is lower-bounded by the variance of the true distribution of X, which we can always compute irrespective of the variance of the estimator random variable! Intuitively, we are saying that if an element belongs to a diverse, heterogeneous set of elements, then less information is disclosed regarding the individual. (NOTE: in the single-attribute case X and X' are independent, but this might not hold for bucketization over multiple related attributes.)
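The ASEE identity for independent random variables is easy to check numerically. A quick sketch with two toy discrete distributions (made up for illustration, not from the paper):

```python
from itertools import product

def moments(dist):
    """Mean and variance of a discrete distribution {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum(p * (v - mean) ** 2 for v, p in dist.items())
    return mean, var

X = {0: 0.2, 5: 0.5, 9: 0.3}     # true bucket distribution
Xp = {1: 0.4, 6: 0.6}            # adversary's estimator distribution

# Left side: E[(X - X')^2] over the independent joint distribution.
asee = sum(px * py * (x - y) ** 2
           for (x, px), (y, py) in product(X.items(), Xp.items()))

mx, vx = moments(X)
my, vy = moments(Xp)
rhs = vx + vy + (mx - my) ** 2   # Var(X) + Var(X') + (E[X] - E[X'])^2
print(asee, rhs)                 # the two sides agree
```

Since Var(X') and the squared mean gap are non-negative, ASEE ≥ Var(X), which is the lower bound the slide uses as the privacy measure.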

31 Set oriented Privacy Measure
The entropy of the bucket distribution is our measure of query-centric privacy. It measures the uncertainty associated with a r.v. (e.g. the true class of an element for categorical data) and is an inverse measure of the quality of the partial solution sets* that A can derive for a query.
H(X) = - Σ p_i log p_i, where p_i denotes the proportion of elements from class i.
Shannon entropy is easy to visualize in terms of discrete variables, e.g. categorical data, and is a well-accepted measure of the "uncertainty" associated with a random variable. Entropy increases as the distribution approaches uniformity and as the domain size grows. Entropy is also shown to be an inverse measure of A's average ability to determine "partially correct" query sets, i.e. on average, how accurately can A select a set of elements whose values come from a specified range.
* Refer to the extended paper for a more rigorous definition of entropy and its connection to query-centric privacy.

32 Meta data size increase in diffusion
The metadata increases from O(M) to:
K*|B1|/f_CB + K*|B2|/f_CB + … + K*|BM|/f_CB
= (K/f_CB) * (|B1| + |B2| + … + |BM|)
= (KM/|D|) * |D|
= O(KM)

