Presentation on theme: "A Privacy Preserving Index for Range Queries"— Presentation transcript:
1 A Privacy Preserving Index for Range Queries
Bijit Hore, Sharad Mehrotra, Gene Tsudik
2 Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
A client wants to store data on a remote server and run queries on it, BUT he does not trust the server.
Solution: encrypt the data and store it.
How do you query the encrypted data?
[Figure: DAS architecture. Trusted client side: User, Query Translator, Query Post Processor. Untrusted service provider side: Server holding the encrypted and indexed client data. Flow: Original Query, Query over Encrypted Data, Encrypted Results, True Results.]
1) We start with some background for the present work.
2) Hacigumus et al., in their SIGMOD 2002 paper, proposed a system architecture to implement database as a service.
3) In their model only limited trust is placed in the service provider. As a result, data privacy becomes a key issue to address.
4) The performance criterion is to push as much query processing as possible to the server side.
5) A simplified version of the proposed architecture is shown in the figure.
3 Data storage in DAS
Client-side storage: the original table R (plain text) and the bucket metadata.

eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   423K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   632K

Server-side data: table RA (encrypted + indexed) with columns etuple, sharesA, ageA, salA. Each row holds one encrypted etuple (for example CH$^*(G#!) together with its bucket tags (for example X1, Y2, Z1).
1) Let us look at the data storage model in the Hacigumus architecture.
2) A relational table is shown on the left and its server-side representation is shown on the right.
3) The idea is to encrypt each row of the table and keep it as a single etuple on the server.
4) For each attribute to be queried, partition the values into buckets in some manner (equidepth or equiwidth) and store the bucket tags of these tuples as the indexing information on the server instead of their true values.
5) Also store this value-to-bucket mapping on the client, where it is used for translating user queries into server-side queries. (Server-side queries can only distinguish between values belonging to different buckets, based on the corresponding indexing information.)
4 Querying in DAS
Client-side query: Select * from R where R.sal ∈ [400K, 600K]
Server-side query: Select etuple from RA where RA.salA = Z1 ∨ RA.salA = Z2

Client-side table R (plain text):

eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   426K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   634K

Server-side table RA (encrypted + indexed): columns etuple, sharesA, ageA, salA, holding encrypted etuples and their bucket tags.
1) Now let us look at a simple range query where the user wishes to select all records of individuals whose salary falls in the range 400 thousand to 600 thousand.
2) Using the metadata, this query is translated to the server-side query, where the server returns all rows whose salary value falls in bucket Z1 or Z2.
3) As a result, 3 rows are selected and returned to the client. The client side has to discard the one record that is a false positive before returning the correct result to the user, thus incurring an overhead.
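The query flow on this slide can be sketched in a few lines of Python. This is only an illustration, not the implementation from the paper: the bucket boundaries in SAL_BUCKETS are assumptions chosen so that the outcome matches the slide (3 rows returned, 1 false positive), the helper names are hypothetical, and seal/unseal merely hex-encode the row as a stand-in for real encryption.

```python
import json

# Hypothetical client-side metadata: bucket tag -> [low, high) salary range in K.
# The boundaries are assumptions picked to reproduce the slide's outcome.
SAL_BUCKETS = {"Z1": (400, 500), "Z2": (500, 650), "Z3": (300, 400)}

ROWS = [
    {"eid": 345, "name": "Tom", "sal": 390},
    {"eid": 876, "name": "Mary", "sal": 426},
    {"eid": 234, "name": "John", "sal": 598},
    {"eid": 780, "name": "Jerry", "sal": 634},
]

def seal(row):
    # Placeholder only: hex-encoding is NOT encryption.
    return json.dumps(row).encode().hex()

def unseal(etuple):
    return json.loads(bytes.fromhex(etuple).decode())

def tag_for(sal):
    # Map a plaintext salary to its bucket tag using the client metadata.
    for tag, (lo, hi) in SAL_BUCKETS.items():
        if lo <= sal < hi:
            return tag
    raise ValueError(f"salary {sal} is outside every bucket")

def build_server_table(rows):
    # The server stores only the opaque etuple and the bucket tag.
    return [{"etuple": seal(r), "salA": tag_for(r["sal"])} for r in rows]

def translate(lo, hi):
    # Closed query range [lo, hi] against half-open buckets [blo, bhi).
    return {t for t, (blo, bhi) in SAL_BUCKETS.items() if blo <= hi and lo < bhi}

def server_select(table, tags):
    # Server-side query: it sees bucket tags only, never plaintext salaries.
    return [r["etuple"] for r in table if r["salA"] in tags]

def client_query(table, lo, hi):
    tags = translate(lo, hi)
    returned = [unseal(e) for e in server_select(table, tags)]
    # Post-processing: discard false positives on the client.
    result = [r for r in returned if lo <= r["sal"] <= hi]
    return tags, returned, result

if __name__ == "__main__":
    tags, returned, result = client_query(build_server_table(ROWS), 400, 600)
    print(sorted(tags), len(returned), [r["name"] for r in result])
```

The split of responsibilities is the point of the sketch: translate and the final filter run on the trusted client, while server_select needs nothing beyond the tags.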
5 Issues in partitioning
How many buckets should one use?
How to partition the data?
One question is how many buckets one should use. One constraint on this is the bound on metadata. We answer these questions in the following slides: the first one experimentally, the second one analytically.
6 Data Privacy in DAS
Adversary (A): access to server-side data + malicious intentions.
Privacy goal of client: to hide all useful information from A. Put all values of an attribute in a single bucket!
Privacy issue in partitioned data: small range of a bucket B + 1 sample value from B ⇒ "almost total" disclosure of all elements in B.
Now let us look at the privacy issues in DAS. An adversary is defined as anyone with access to the server-side data and malicious intentions.
1) As we stated earlier, in the DAS model of Hacigumus the server side is untrusted, so the goal is to hide all useful information from the adversary. In the partition-based approach we would therefore like to put all elements in a single bucket, rendering all records indistinguishable from each other: total privacy.
2) But in reality we need multiple buckets to enhance performance (the more buckets, the greater the average precision of queries). A scheme like equidepth or equiwidth bucketization might then map each bucket to a small range of values.
3) So if A has domain knowledge plus sample values from buckets, he can localize the estimated value of an attribute to a small interval, resulting in disclosure of true values within a small margin of error.
7 Research challenges & our contributions
Precision: how to partition data (definition; optimal partitioning to maximize precision)
Privacy: quantifying disclosure (adversary's goals; measures of information disclosure)
Privacy-precision trade-off (controlled diffusion algorithm)
Experiments & conclusion
1) In this paper, we restrict the study to the domain of range queries over a static database with a single attribute.
2) The remaining talk is structured as follows.
3) We outline an algorithm for optimally partitioning a dataset into a specified number of buckets so that query precision is maximized, that is, the number of false positives is minimized.
4) Then we describe our adversarial model, which is relevant to the DAS scenario. We identify a couple of important adversarial goals and the respective measures of information disclosure in view of these goals. We will refer to these as our "privacy measures".
5) Subsequently, we propose a simple data re-distribution algorithm that balances privacy and precision of range queries.
6) Finally we end with some experimental results and conclusions.
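8 Precision of range queries
Given a partition of data into M parts:
Precision(q) = 1 - (# false positives / # tuples returned for q)
Recall = 1
Workload: all O(N^2) range queries are equiprobable (uniform).
# false positives ∝ ∑_B N_B*F_B, where N_B is the range of bucket B and F_B is the number of elements in it.
Example (M = 2, N = 10): both buckets have N_B = 5, with F_B = 32 and F_B = 18, so ∑_B N_B*F_B = 5*32 + 5*18 = 250. For the query q shown, Precision = 1 - 20/50 = 0.6.
[Figure: frequency histogram over Salary (100K's), domain size N = 10, bar heights 10, 10, 6, 4, 4, 4, 4, 4, 2, 2 in extraction order, split into M = 2 buckets, with the query q marked]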
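The numbers on this slide can be checked with a short sketch. The left-to-right order of the histogram bars is not recoverable from the transcript, so FREQ below is an assumed ordering that is consistent with the slide's figures (F_B = 32 and 18 for the two halves); the query [4, 7] is likewise an assumed choice that yields 20 false positives out of 50 returned tuples.

```python
# Assumed frequencies for salary values 1..10 (in 100K's); the ordering is a
# reconstruction that agrees with the bucket totals quoted on the slides.
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]

def bucket_cost(freq, lo, hi):
    """N_B * F_B for a bucket covering values lo..hi (1-based, inclusive)."""
    return (hi - lo + 1) * sum(freq[lo - 1:hi])

def precision(freq, buckets, q_lo, q_hi):
    """Precision of range query [q_lo, q_hi] when the server can only
    return whole buckets that overlap the query."""
    returned = sum(
        sum(freq[lo - 1:hi])
        for lo, hi in buckets
        if lo <= q_hi and q_lo <= hi      # bucket overlaps the query range
    )
    true_hits = sum(freq[q_lo - 1:q_hi])
    if returned == 0:
        return 1.0                        # nothing returned, nothing wrong
    return 1 - (returned - true_hits) / returned

if __name__ == "__main__":
    two_buckets = [(1, 5), (6, 10)]
    print(sum(bucket_cost(FREQ, lo, hi) for lo, hi in two_buckets))
    print(precision(FREQ, two_buckets, 4, 7))
```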
9 Query optimal buckets (QOB)
Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # of false positives, i.e. (for M = 4) minimize ∑_{B=1}^{4} N_B*F_B.
Optimal substructure: QOB(1,10,4) = QOB(1,7,3) + Cost(8,10), that is, the optimal solution to a sub-problem plus the cost of the rightmost bucket (here N_B*F_B = 24).
[Figure: frequency histogram over Salary (100K's), domain size N = 10, with the rightmost bucket highlighted]
1) To keep the analysis simple, we consider a discrete numeric domain of size N for the data.
2) We make an assumption on the query workload: all the O(N^2) queries are equiprobable.
3) Let us take the following example, where the data set has 10 distinct values with the frequencies as shown.
4) Say we want to partition these into 4 buckets. The goal is to minimize the total # of false positives over all queries, which translates to minimizing the false positives contributed by each bucket.
5) It is shown in the paper that the cost of a single bucket is proportional to (the range of the bucket) times (the total # of elements in it).
6) This problem has the optimal substructure property, which makes it amenable to dynamic programming. For this particular example, the optimal cost can be written as the sum of the optimal cost of the smaller left-hand problem using 3 buckets and the cost of the rightmost single bucket shown.
10 QOB (cont.)
Optimal cost = ∑_{B=1}^{4} N_B*F_B = 12*3 + 20*2 + 10*2 + 8*3 = 120
Time complexity = O(n^2 M), space = O(nM); n = # distinct values in dataset, M = # buckets
[Figure: the frequency histogram over Salary (100K's) partitioned into the optimal buckets B1, B2, B3, B4]
1) Completing the example, this shows the optimal partition of the dataset. Note that this partitioning is not equidepth or equiwidth in general.
2) The time complexity of the algorithm is O(n^2 M), where n is the # of distinct values occurring in the dataset.
3) The space complexity is O(nM) due to the matrices stored in the DP algorithm.
4) Though we assumed that all queries are equiprobable, we can also tackle the case where the query workload follows any distribution. It simply needs an additional linear pre-processing phase to enable computation of bucket costs in the algorithm.
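The dynamic program described on the last two slides might look like the sketch below. It is a plain O(n^2 M) formulation written from the recurrence QOB(1,j,k) = min over i of QOB(1,i,k-1) + Cost(i+1,j), not the authors' code. FREQ is an assumed ordering of the histogram bars that agrees with the bucket totals quoted on the slides, and every domain value is assumed to occur in the data so that bucket range equals the number of values covered.

```python
from itertools import accumulate

# Assumed frequencies for salary values 1..10 (in 100K's).
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]

def qob(freq, m):
    """Return (minimum total cost, bucket bounds) for m buckets.
    Bounds are 1-based inclusive (lo, hi) pairs over the value domain."""
    n = len(freq)
    prefix = [0] + list(accumulate(freq))

    def cost(i, j):
        # N_B * F_B for the bucket holding values i+1..j (1-based).
        return (j - i) * (prefix[j] - prefix[i])

    inf = float("inf")
    # best[k][j]: cheapest way to cover the first j values with k buckets.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):       # last bucket is values i+1..j
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    cut[k][j] = i

    # Walk the recorded cut points backwards to recover the partition.
    bounds, j = [], n
    for k in range(m, 0, -1):
        i = cut[k][j]
        bounds.append((i + 1, j))
        j = i
    bounds.reverse()
    return best[m][n], bounds

if __name__ == "__main__":
    print(qob(FREQ, 4))
    print(qob(FREQ, 2))
```

The two tables best and cut are the O(nM) matrices mentioned in the notes; the prefix sums make each bucket cost an O(1) lookup.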
11 Outline
Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion
1) We have an algorithm for optimal data partitioning to answer range queries.
2) Now let us look at the adversarial model and the relevant privacy measures.
12 Adversary's learning model
Need to learn bucket properties to estimate sensitive values.
Model: A's domain knowledge + sample values from buckets ⇒ A learns the distribution of values in buckets.
Worst-case assumption for privacy analysis: A knows the exact value distribution for every bucket.
1) As is evident from the data representation in DAS, the adversary needs to learn the bucket properties in order to infer something useful about the sensitive data values in a bucket.
2) Therefore, in our model, the adversary has access to a sample of values belonging to each bucket in addition to his general domain knowledge. This enables him to learn statistical properties of the ensemble of data elements within each bucket.
3) In the most generic case, we assume that A learns the distribution of the values within each bucket.
4) As a result, we are now interested in the information leakage that happens due to A learning each bucket distribution.
5) For our analysis, we assume the worst-case scenario, where the adversary has learned the exact distribution for each bucket. For instance, in the discrete numeric domain, the adversary would know the values in a bucket and their respective frequencies.
13 Adversarial Goal (I)
Individual-centric information, e.g. "What is the salary of an individual I?"
Value Estimation Power (VEP) of A
Variance of the bucket distribution is an inverse measure of VEP.
[Figure: average error of value estimation for the adversary, shown for two bucket distributions over the bucket range. Large variance (preferred): large error. Small variance: small error.]
1) Now let us see what might comprise "useful information" for the adversary.
2) One type could be individual-centric information. An example is shown: what is the salary of an individual I?
3) We refer to the adversary's potential to answer such queries as his "Value Estimation Power" (VEP). Equivalently, we can ask: when A knows that a random variable follows a specific distribution, how well can he localize the value of an instance of the r.v.?
4) As we show in the paper, the average error in A's estimate (guess) of this value is bounded below by the variance of the distribution it comes from, i.e. the bucket distribution.
5) Therefore the variance of the bucket distribution is an appropriate inverse measure of the adversary's value estimation power.
6) Hence the greater the variance, the smaller the chance of disclosure.
14 Adversarial Goal (II)
Query-centric information, e.g. "Which individuals have salary ∈ [100k, 150k]?"
Set Estimation Power (SEP) of A
Entropy of the bucket distribution is an inverse measure of SEP.
H(X) = - ∑ p_i log p_i
[Figure: average error of query-set estimation for the adversary, shown for two bucket distributions over the bucket range with the query range [100k, 150k] marked. Low entropy + large variance: small error. Best case, high entropy + large variance: large error.]
1) We identify a second kind of information that the adversary might be interested in, which we call "query-centric information".
2) An example is where the adversary wants to identify all the records whose salary value falls in a given range, say [100K, 150K].
3) We denote the adversary's ability to answer such queries by his "Set Estimation Power" (SEP).
4) As it turns out, entropy is an appropriate measure of SEP. We show in the extended version of the paper how a notion of partial correctness of query sets can be tied to the average entropy of buckets in this application.
5) Look at the example on the left: the entropy of this distribution is small, though the variance is large. But since about half the elements are known by the adversary to be in the range [100k, 150k], a random classification of elements into the 2 classes "belongs to this range" and "does not belong to this range" will have a pretty high accuracy.
6) Besides, entropy is a universal measure of the uncertainty associated with a random variable. Therefore we propose the entropy of the bucket distributions as our second measure of privacy.
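Both privacy measures are easy to compute once a bucket is written as a value-to-count map. The sketch below is illustrative only: the two sample buckets are made up, the function names are hypothetical, and log base 2 is an assumption (the slide leaves the base of the logarithm unspecified).

```python
import math

def distribution(counts):
    """Turn a {value: count} bucket into {value: probability}."""
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items() if c > 0}

def bucket_variance(counts):
    """Variance of the bucket distribution (inverse measure of VEP)."""
    p = distribution(counts)
    mean = sum(v * pv for v, pv in p.items())
    return sum(pv * (v - mean) ** 2 for v, pv in p.items())

def bucket_entropy(counts):
    """Shannon entropy in bits, H(X) = -sum p_i log2 p_i (inverse measure of SEP)."""
    p = distribution(counts)
    return -sum(pv * math.log2(pv) for pv in p.values())

if __name__ == "__main__":
    narrow = {4: 10, 5: 10}             # two adjacent values, evenly split
    wide_skewed = {1: 1, 2: 48, 10: 1}  # wide range, almost all mass on one value
    for name, b in [("narrow", narrow), ("wide_skewed", wide_skewed)]:
        print(name, round(bucket_variance(b), 3), round(bucket_entropy(b), 3))
```

The second bucket illustrates why one measure is not enough: it has the larger variance but the smaller entropy, so it protects individual values better while revealing more about set membership.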
15 Outline
Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion
1) Now we look at the privacy-precision trade-off in a data partitioning scheme and propose a new algorithm to obtain the desired degree of balance between the two.
16 Privacy-Precision Trade-off
Optimal buckets might offer less privacy than desired:
Small variance ⇒ partial disclosure of numeric value
Small entropy ⇒ total disclosure with high probability (e.g. categorical data); partial detection of query-sets (for all cases)
Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy
1) As is evident from the discussion up to now, there is a clear trade-off between the privacy attained and the precision of range queries. Query optimal buckets might offer a very small degree of privacy.
2) Small variance of buckets can lead to partial disclosure of sensitive numeric values.
3) Small entropy, in turn, can lead to total disclosure with high probability (as in the case of categorical data), or at least to detection of partially correct query sets even for numeric data.
4) Next we describe our trade-off algorithm, which allows the client to partition his data in a controlled manner, thereby achieving the desired degree of balance between privacy and precision.
17 The controlled diffusion algorithm
A simple observation:
Let a query Q overlap only with B0.
If the elements of B0 are distributed into CB1, CB2 & CB3 randomly, Q now overlaps with CB1, CB2 & CB3.
With the new buckets, the precision for Q drops by a factor of (|CB1|+|CB2|+|CB3|) / |B0|.
Any re-distribution scheme where this ratio ≤ K for every Bi ⇒ precision degradation is bounded above by K.
[Figure: query Q over bucket B0, whose elements are diffused into composite buckets CB1, CB2, CB3]
18 Controlled diffusion algorithm
Compute the optimal buckets B1 … BM on data set D.
Fix the max degradation factor = K.
Initialize M empty composite buckets CB1 … CBM.
Set the target size of each CB to fCB = |D|/M (equidepth).
∀ Bi: select di CBs at random, where di = K*|Bi|/fCB, and diffuse the elements of Bi into these uniformly at random.
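The steps above can be sketched as follows. This follows the slide literally and is not the authors' implementation: the slide does not say how a fractional di is turned into an integer, so rounding it and clamping it to the range 1..M is an assumption, as are the function name and the seed parameter. The sample buckets assume the frequency ordering 4, 4, 4, 10, 10, 6, 4, 4, 2, 2 for values 1..10 and the optimal bounds from the QOB example.

```python
import random

def controlled_diffusion(buckets, k, seed=0):
    """Diffuse each optimal bucket into d_i randomly chosen composite buckets.
    Returns (composite buckets, chosen CB indices per optimal bucket).
    Expects a non-empty data set."""
    rng = random.Random(seed)
    m = len(buckets)
    total = sum(len(b) for b in buckets)
    f_cb = total / m                      # equidepth target size |D|/M
    composite = [[] for _ in range(m)]
    chosen = []
    for b in buckets:
        # Assumption: round d_i = K*|B_i|/f_CB and clamp it to 1..M.
        d = min(m, max(1, round(k * len(b) / f_cb)))
        targets = rng.sample(range(m), d)
        chosen.append(targets)
        for x in b:
            composite[rng.choice(targets)].append(x)
    return composite, chosen

def expand(freq, lo, hi):
    """List every element with a value in lo..hi (1-based, inclusive)."""
    return [v for v in range(lo, hi + 1) for _ in range(freq[v - 1])]

if __name__ == "__main__":
    freq = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]         # assumed ordering
    bounds = [(1, 3), (4, 5), (6, 7), (8, 10)]      # optimal buckets for M = 4
    optimal = [expand(freq, lo, hi) for lo, hi in bounds]
    cbs, chosen = controlled_diffusion(optimal, k=2)
    print([len(c) for c in chosen], [len(cb) for cb in cbs])
```

Because each element picks its target independently, the composite buckets only approximate the equidepth target size; nothing in this sketch enforces it.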
19 Controlled Diffusion (Example)
Degradation factor K = 2
Metadata size increases from O(M) to O(KM)
[Figure: top, the query optimal buckets B1 to B4 over the frequency histogram of values; bottom, the final set of composite buckets CB1 to CB4 stored on the server]
20 Some features of the diffusion algorithm
Many consecutive optimal buckets might get diffused into a common set of CBs ⇒ observed precision degradation < K.
Elements with the same value can go to multiple buckets ⇒ an extra degree of freedom compared to hashing; not best for point queries.
Random choice in the algorithm ⇒ each bucket distribution approaches the data distribution as K increases, reducing the information the adversary gains by learning buckets.
21 Outline
Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion
1) So now we have given an easy algorithm to partition data, which allows one to trade off a bounded amount of query precision to enhance data privacy.
2) Let us now look at some experimental results to see how effective this method is.
22 Experiments
Data sets:
Synthetic data: 10^5 integers in [0,999] chosen uniformly at random
Real data: 10^4 real values in [-0.8,8.0] from the "Corel Image" dataset (UCI KDD archive)
Query workloads (2 of size 10^4 each): end points chosen uniformly at random from the respective ranges
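For readers who want to reproduce a setup of this shape, the synthetic data set and its query workload could be generated roughly as below. The seeds and function names are hypothetical, and drawing the two end points independently and then sorting them is an assumption about how "end points chosen uniformly at random" was realized.

```python
import random

def synthetic_data(n=10**5, lo=0, hi=999, seed=0):
    """n integers drawn uniformly at random from [lo, hi]."""
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(n)]

def range_queries(n=10**4, lo=0, hi=999, seed=1):
    """n range queries whose end points are drawn uniformly from [lo, hi]."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        queries.append((min(a, b), max(a, b)))   # order the two end points
    return queries

if __name__ == "__main__":
    data = synthetic_data()
    workload = range_queries()
    print(len(data), len(workload), workload[:3])
```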
23 [Figures: relative decrease in precision of composite buckets (top); relative increase in standard deviation in composite buckets (bottom left); relative increase in entropy in composite buckets (bottom right)]
1) The top figure shows the ratio of the average query precision using optimal buckets to that using composite buckets, for the same # of buckets. Each set of points is for a fixed value of the maximum allowed degradation factor K.
2) The bottom left figure shows the average increase in the standard deviation of the distribution within CBs relative to that within the optimal buckets, again for various values of K, against an increasing # of buckets.
3) The bottom right figure shows the same for the entropy ratio.
24 Composite buckets (sample)
[Figures: sample composite buckets for K = 6, M = 350 and for K = 10, M = 250]
25 Visualizing trade-offs for various bucketization parameters
E.g. the marked points show the average entropy & precision we get for 100 buckets & a degradation factor of 2, and the same point in the precision vs standard deviation trade-off space.
Provides an easy way to visualize the design space and choose parameters of interest.
1) These plots show privacy and performance in a trade-off space.
2) Each class of points refers to a fixed value of K, and each point in a class denotes the characteristics for a given total number of buckets.
3) For example, the marked points show the average entropy against the average precision when 100 composite buckets are used with a degradation factor of 2. Similarly, the lower figure shows the corresponding average standard deviation for these 100 buckets.
26 Summary
An optimal algorithm for partitioning data for range queries
Statistical measures of data privacy: variance, entropy
Fast & simple algorithm for re-bucketizing data: bounded amount of precision degradation, substantial increase in privacy level
27 Related work
Hacigumus et al., SIGMOD 2002, "Executing SQL over Encrypted Data in the Database Service Provider Model".
Damiani et al., ACM CCS 2003, "Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs".
Bouganim et al., VLDB 2002, "Chip-Secured Data Access: Confidential Data on Untrusted Servers".
29 Privacy in DAS
Here the goal of "data privacy" is not just ensuring "non-disclosure of identity". It is more general!
Privacy-preserving DM & Statistical DB:
Privacy criteria: protect against disclosure of identity.
Utility criteria: minimize information loss, i.e. maximize utility for data miners and retain as much aggregate-level information as possible.
DAS:
Privacy criteria: hide as much information as possible (even at the aggregate level).
Utility criteria: maintain only the information necessary for server-side query evaluation (at the desired degree of accuracy).
30 Individual Privacy Measure
Average Squared Error of Estimation (ASEE): the error in approximating the true value of a r.v. X_B by another r.v. X_B' (learned by A).
ASEE(X_B, X_B') = Var(X_B) + Var(X_B') + (E(X_B) - E(X_B'))^2
The variance of the bucket distribution, Var(X_B), is our measure of individual privacy (a lower bound).
We first define the term "Average Squared Error of Estimation" for A: ASEE(X, X') is simply a measure of the error that A is expected to make when approximating the value of a random variable X by X'.
It is shown in the paper that ASEE is given by the expression above when X and X' are independent. Var(X) denotes the variance of X and E(X) denotes the expected value of X.
Therefore ASEE is lower bounded by the variance of the true distribution of X, which we can always compute irrespective of the variance of the estimator random variable.
Intuitively, we are saying that if an element belongs to a diverse, heterogeneous set of elements, then less information is disclosed regarding the individual.
(NOTE: In the single-attribute case, X and X' are independent, but this might not be the case in bucketization over multiple related attributes.)
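A short derivation shows where the expression and the lower bound come from. It assumes that ASEE is the expected squared difference between the two variables, which is a reading of the slide's wording rather than a definition quoted from the paper, and that X_B and X_B' are independent.

```latex
\begin{align*}
\mathrm{ASEE}(X_B, X_B') &= E\big[(X_B - X_B')^2\big] \\
  &= \mathrm{Var}(X_B - X_B') + \big(E[X_B - X_B']\big)^2 \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X_B') + \big(E(X_B) - E(X_B')\big)^2 \\
  &\ge \mathrm{Var}(X_B)
\end{align*}
```

The second line uses E[Y^2] = Var(Y) + (E[Y])^2, the third uses independence to split the variance of the difference, and the last holds because the two dropped terms are non-negative.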
31 Set-oriented Privacy Measure
Entropy of the bucket distribution is our measure for query-centric privacy.
Measures the uncertainty associated with a r.v. (e.g. the true class of an element for categorical data).
An inverse measure of the quality of partial solution sets* that A can derive for a query.
H(X) = - ∑ p_i log p_i, where p_i denotes the proportion of elements from class i.
Shannon entropy is easy to visualize in terms of discrete variables, such as categorical data.
Shannon entropy is a well-accepted measure of the "uncertainty" associated with a random variable.
Entropy increases as a distribution approaches uniformity and as the domain size increases.
Entropy is also shown to be an inverse measure of A's average ability to determine "partially correct" query-sets, i.e. how accurately, on average, A can select a set of elements which have values from a specified range.
* Refer to the extended paper for a more rigorous definition of entropy and its connection to query-centric privacy.
32 Meta data size increase in diffusion
The meta data increases from O(M) to
K*|B1|/fCB + K*|B2|/fCB + … + K*|BM|/fCB
= (K/fCB) * (|B1| + |B2| + … + |BM|)
= (KM/|D|) * |D| = O(KM), since fCB = |D|/M