Privacy Preserving OLAP Rakesh Agrawal, IBM Almaden Ramakrishnan Srikant, IBM Almaden Dilys Thomas, Stanford University.


1 Privacy Preserving OLAP Rakesh Agrawal, IBM Almaden Ramakrishnan Srikant, IBM Almaden Dilys Thomas, Stanford University

2 Horizontally Partitioned Personal Information. Each client $C_i$ ($i = 1, \dots, n$) holds an original row $r_i$ and sends only its perturbed version $p_i$ to the server, which assembles $p_1, p_2, \dots, p_n$ into table T for analysis. EXAMPLE: What number of children in this county go to college?

3 Vertically Partitioned Enterprise Information
Original Relation $D_1$ (ID, $C_1$): John 1; Alice 5; Bob 18
Perturbed Relation $D'_1$ (ID, $C_1$): John 1; Alice 7; Bob 18
Original Relation $D_2$ (ID, $C_2$ $C_3$): John 279; Alice 536
Perturbed Relation $D'_2$ (ID, $C_2$ $C_3$): John 359; Alice 537
Perturbed Joined Relation $D'$ (ID, $C_1$ $C_2$ $C_3$): John 1359; Alice 7537
EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

4 Talk Outline Motivation Problem Definition Query Reconstruction Privacy Guarantees Experiments

5 Privacy Preserving OLAP. Compute: select count(*) from T where $P_1$ and $P_2$ and $P_3$ and ... $P_k$, i.e. $\mathrm{COUNT}_T(P_1$ and $P_2$ and $P_3$ and ... $P_k)$. We need to: provide error bounds to the analyst; provide privacy guarantees to the data sources; scale to a larger number of attributes.

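6 Uniform Retention Replacement Perturbation. For each value, flip a coin with BIAS=0.2. HEADS: retain the value. TAILS: replace it with a value chosen uniformly at random (u.a.r.) from [1-5]. (Figure: example column values 1 1 3 4 2 5 4 3 1 3.)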

7 Retention Replacement Perturbation. Done for each column. The replacing pdf need not be uniform. Different columns can have different biases for retention.
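The perturbation described on slides 6 and 7 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `retention_replacement`, the `draw_replacement` callback and the sample column are assumptions made here, not part of the talk.

```python
import random

def retention_replacement(values, p, draw_replacement, rng):
    """Keep each value with probability p (heads); otherwise (tails)
    replace it with a fresh draw from the replacing distribution."""
    out = []
    for v in values:
        if rng.random() < p:
            out.append(v)                      # heads: retain
        else:
            out.append(draw_replacement(rng))  # tails: replace
    return out

# Hypothetical column with values in [1-5], retention bias 0.2,
# uniform replacing distribution on the integers 1..5.
rng = random.Random(0)
column = [1, 3, 2, 4, 3]
perturbed = retention_replacement(column, 0.2, lambda r: r.randint(1, 5), rng)
```

Because the replacing distribution is passed in as a callback, a non-uniform pdf or a different bias per column only changes the arguments, not the function.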

8 Talk Outline Motivation Problem Definition Query Reconstruction Inversion method Single attribute Multiple attributes Iterative method Privacy Guarantees Experiments

9 Single Attribute Example. What is the fraction of people in this building with age 30-50? Assume age is between 0 and 100. Whenever a person enters the building, they flip a coin with bias p=0.2 for heads. Heads: report the true age. Tails: report a random number drawn uniformly from 0-100. In total, 100 randomized numbers are collected. Of these, 22 are in 30-50. How many of the original ages are in 30-50?

10 Analysis. Out of 100: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction).

11 Analysis Contd. 20% of the 80 perturbed rows, i.e. 16 of them, satisfy Age[30-50]. The remaining 64 don't. Breakdown: 16 Perturbed, Age[30-50]; 64 Perturbed, NOT Age[30-50]; 20 Retained.

12 Analysis Contd. Since there were 22 randomized rows in [30-50], 22-16=6 of them come from the 20 retained rows. Breakdown: 16 Perturbed, Age[30-50]; 64 Perturbed, NOT Age[30-50]; 6 Retained, Age[30-50]; 14 Retained, NOT Age[30-50].

13 Scaling up
Total Rows | Age[30-50]
20 | 6
100 | ? = 30
Thus 30 people had age 30-50 in expectation.
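The arithmetic of slides 10 to 13 can be written as a single formula: subtract the expected number of replaced rows that fall in the range, then scale the remainder up by 1/p. The sketch below uses a hypothetical helper name, `estimate_original_count`, and plugs in the numbers from the example.

```python
def estimate_original_count(observed, n, p, b):
    """Estimate how many of the n original rows satisfy the predicate.

    observed: perturbed rows that satisfy the predicate (22 in the example)
    p: retention probability (0.2 in the example)
    b: probability that a replaced value satisfies the predicate (0.2)
    """
    expected_from_replaced = (1 - p) * b * n            # 16 in the example
    retained_matching = observed - expected_from_replaced  # 6 in the example
    return retained_matching / p                        # scale 20 rows up to 100

estimate = estimate_original_count(observed=22, n=100, p=0.2, b=0.2)
print(round(estimate, 6))  # approximately 30, up to floating-point rounding
```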

14 Multiple Attributes (k=2)
Query | Estimated on T | Evaluated on T'
count($\neg P_1 \wedge \neg P_2$) | $x_0$ | $y_0$
count($\neg P_1 \wedge P_2$) | $x_1$ | $y_1$
count($P_1 \wedge \neg P_2$) | $x_2$ | $y_2$
count($P_1 \wedge P_2$) | $x_3$ | $y_3$

15 Architecture

16 Formally: select count(*) from R where P. p = retention probability (0.2 in example). 1-p = probability that an element is replaced by a draw from the replacing p.d.f. b = probability that an element from the replacing p.d.f. satisfies predicate P (0.2 in example). a = 1-b.

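17 Transition matrix
$$\begin{pmatrix} \mathrm{Count}_T(\neg P) & \mathrm{Count}_T(P) \end{pmatrix} \begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix} = \begin{pmatrix} \mathrm{Count}_{T'}(\neg P) & \mathrm{Count}_{T'}(P) \end{pmatrix}$$
i.e. solve $xA = y$. $A_{00}$ = probability that an original element satisfies $\neg P$ and, after perturbation, still satisfies $\neg P$. $p$ = probability it was retained; $(1-p)a$ = probability it was perturbed and satisfies $\neg P$. Hence $A_{00} = (1-p)a + p$.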
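As a check on the transition matrix, the single-attribute example can be solved as xA = y directly. The sketch below is plain Python with hypothetical helper names (`transition_matrix`, `solve_xA_equals_y`); the perturbed counts y = [78, 22] come from the age example (78 rows outside 30-50, 22 inside).

```python
def transition_matrix(p, b):
    """2x2 matrix A with A[i][j] = P(perturbed state j | original state i).
    State 0 means "does not satisfy P", state 1 means "satisfies P"."""
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def solve_xA_equals_y(A, y):
    """Return x = y * A^{-1} for a 2x2 matrix A (row-vector convention)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[ A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det,  A[0][0] / det]]
    return [y[0] * inv[0][0] + y[1] * inv[1][0],
            y[0] * inv[0][1] + y[1] * inv[1][1]]

A = transition_matrix(p=0.2, b=0.2)
x = solve_xA_equals_y(A, [78, 22])
print([round(v, 6) for v in x])  # approximately [70, 30]
```

The second component reproduces the estimate of 30 obtained by the step-by-step analysis.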

18 Multiple Attributes. For k attributes, x and y are vectors of size $2^k$, and $x = y A^{-1}$, where $A = A_1 \otimes A_2 \otimes \cdots \otimes A_k$ [Tensor Product].
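For several attributes the same inversion applies with a tensor (Kronecker) product. A minimal sketch, assuming the index order of the k=2 table on slide 14 (the first predicate is the most significant bit) and using the identity that the inverse of a Kronecker product is the Kronecker product of the inverses. The helper names and the sample parameters (p, b per column, and the counts) are made up for illustration.

```python
def transition_matrix(p, b):
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[ A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det,  A[0][0] / det]]

def kron(A, B):
    """Kronecker product: entry ((i,k),(j,l)) equals A[i][j] * B[k][l]."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def reconstruct(y, matrices):
    """x = y * (A_1 (x) ... (x) A_k)^{-1}, built from the 2x2 inverses."""
    inv = [[1.0]]
    for A in matrices:
        inv = kron(inv, inverse_2x2(A))
    return vec_mat(y, inv)

# Round trip on hypothetical counts for k = 2.
A1, A2 = transition_matrix(0.2, 0.2), transition_matrix(0.5, 0.3)
x_true = [40.0, 10.0, 20.0, 30.0]
y = vec_mat(x_true, kron(A1, A2))   # expected perturbed counts
x_rec = reconstruct(y, [A1, A2])    # recovers x_true up to rounding
```

The matrix has $2^k \times 2^k$ entries, so this direct form is only practical for a small number of attributes.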

19 Error Bounds. In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9. Given $T \to T'$, with $n$ rows, $f(T)$ is $(n, \epsilon, \delta)$ reconstructible by $g(T')$ if $|f(T) - g(T')| < \max(\epsilon, \epsilon f(T))$ with probability greater than $(1-\delta)$. $\epsilon f(T) = 2$, $\delta = 0.1$ in the above example.

20 Results. The fraction, $f$, of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \epsilon, \delta)$ estimator for $f$ if $n > 4 \log(2/\delta) (p\epsilon)^{-2}$, by Chernoff bounds. The vector $x$ obtained by matrix inversion is the MLE (maximum likelihood estimator); this is shown using the Lagrange multiplier method and showing that the Hessian is negative.
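To get a feel for the row-count bound on this slide, it can be evaluated numerically. This is a sketch under assumptions: the error tolerance and failure probability are named `eps` and `delta` here, the logarithm is taken to be natural (the slide does not state the base), and the sample values are hypothetical.

```python
import math

def min_rows(p, eps, delta):
    """Right-hand side of the bound n > 4 * log(2/delta) * (p * eps)^(-2).
    Assumes a natural logarithm, which the slide does not specify."""
    return 4 * math.log(2 / delta) / (p * eps) ** 2

# Hypothetical setting: retention 0.2, tolerance 0.1, failure probability 0.1.
needed = min_rows(p=0.2, eps=0.1, delta=0.1)
print(math.ceil(needed))  # rows required under these assumptions
```

The bound grows as $1/p^2$, so halving the retention probability (more privacy) requires roughly four times as many rows for the same accuracy.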

21 Talk Outline Motivation Problem Definition Query Reconstruction Inversion method Iterative method Privacy Guarantees Experiments

22 Iterative Algorithm [AS00]
Initialize: $x^0 = y$
Iterate: $x_p^{T+1} = \sum_{q=0}^{t} y_q \, \dfrac{a_{pq}\, x_p^{T}}{\sum_{r=0}^{t} a_{rq}\, x_r^{T}}$ [By application of Bayes' rule]
Stop condition: two consecutive x iterates do not differ much.
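The update rule can be sketched as a short loop. This is an illustrative sketch, not the authors' implementation: the function name, the l1-based stopping test, the tolerance and the iteration cap are choices made here. `A[p][q]` plays the role of $a_{pq}$, the probability that original state p is observed as perturbed state q.

```python
def iterative_reconstruct(y, A, tol=1e-9, max_iters=10000):
    """Bayes-rule iteration: start from x = y and repeatedly redistribute
    each perturbed count y[q] over the original states p in proportion
    to A[p][q] * x[p]."""
    m = len(y)
    x = [float(v) for v in y]
    for _ in range(max_iters):
        # denom[q] = current predicted perturbed count for state q
        denom = [sum(A[r][q] * x[r] for r in range(m)) for q in range(m)]
        new_x = [sum(y[q] * A[p][q] * x[p] / denom[q]
                     for q in range(m) if denom[q] > 0)
                 for p in range(m)]
        # stop when two consecutive iterates barely differ
        if sum(abs(n - o) for n, o in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Single-attribute age example: p = 0.2, b = 0.2, perturbed counts [78, 22].
A = [[0.84, 0.16],
     [0.64, 0.36]]
x = iterative_reconstruct([78, 22], A)
```

On this example the iterates approach the inversion answer (about 70 and 30), and because every update multiplies nonnegative quantities, the estimates never go negative.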

23 Iterative Algorithm RESULT [AA01]: The Iterative Algorithm gives the MLE with the additional constraint that $0 < x_i$, $\forall\, 0 < i < 2^k - 1$.

24 Talk Outline Motivation Problem Definition Query Reconstruction Privacy Guarantees Experiments

25 Privacy Guarantees. Say that initially we know with probability < 0.3 that Alice's age > 25. After seeing the perturbed value, we can say so with probability > 0.95. Then we say there is a (0.3, 0.95) privacy breach.

26 Privacy Guarantees. Let X, Y be random variables where X = original value, Y = perturbed value. Let Q, S be subsets of their domains.
$(\rho_1, \rho_2)$ privacy breach: a priori probability $P[X \in Q] = p_q \le \rho_1$ and a posteriori probability $P[X \in Q \mid Y \in S] \ge \rho_2$, where $0 < \rho_1 < \rho_2 < 1$ and $P[Y \in S] > 0$.
$(s, \rho_1, \rho_2)$ privacy breach: as above, where $p_q / m_q < s$, i.e. Q is a rare set ($m_q$ = probability of Q under the replacing pdf).

27 $(s, \rho_1, \rho_2)$ vs $(\rho_1, \rho_2)$ metric
- Provides more privacy to rare sets, e.g. in market basket data, medicines are rarer than bread, so we provide more privacy for medicines than for bread
- For multiple columns, s expresses correlations
- Works for retention replacement perturbation on numeric attributes

28 $(s, \rho_1, \rho_2)$ Guarantees. The median value of s is 1. There is no $(s, \rho_1, \rho_2)$ privacy breach for $s < f(\rho_1, \rho_2, p)$ for retention replacement perturbation on single as well as multiple columns.

29 Application to Classification [AS00]. For the first split, to compute the split criterion/gini index:
Count(age[0-30] and class-var='-')
Count(age[0-30] and class-var='+')
Count(¬age[0-30] and class-var='-')
Count(¬age[0-30] and class-var='+')

30 Talk Outline Motivation Problem Definition Query Reconstruction Privacy Guarantees Experiments

31 Real data: Census data from the UCI Machine Learning Repository, having 32000 rows. Synthetic data: generated multiple columns of Zipfian data, with the number of rows varied between 1000 and 1000000. Error metric: $\ell_1$ norm of the difference between x and y, e.g. for 1-dim queries $|x_1 - y_1| + |x_0 - y_0|$.
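The error metric is simple to state in code. In this sketch the function name `l1_error` is hypothetical, and the two arguments are simply the two count (or fraction) vectors being compared; the sample vectors are made up.

```python
def l1_error(x, y):
    """l1 norm of the difference between two equally sized vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# 1-dim query: |x_0 - y_0| + |x_1 - y_1| on hypothetical fraction vectors.
err = l1_error([0.70, 0.30], [0.68, 0.32])
```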

32 Inversion vs Iterative Reconstruction. (Plots: 2 attributes, Census Data; 3 attributes, Census Data.) The iterative algorithm (MLE on constrained space) outperforms inversion (global MLE).

33 Privacy Obtained Privacy as a function of retention probability on 3 attributes of census data

34 Error vs Number of Columns: Census Data. (Plots: Inversion Algorithm; Iterative Algorithm.) Error increases exponentially with increase in number of columns.

35 Error as a function of number of Rows Error has square root n dependence on number of rows

36 Conclusion. It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained. The techniques have been tested experimentally on real and synthetic data. More experiments in the paper. PRIVACY PRESERVING OLAP is PRACTICAL.

37 References
[AS00] Agrawal, Srikant: Privacy Preserving Data Mining
[AA01] Agrawal, Aggarwal: On the Quantification of…
[W65] Randomized Response..
[EGS] Evfimievski, Gehrke, Srikant: Limiting Privacy Breaches..
Others in the paper..

38 Error vs Number of Columns: Iterative Algorithm: Zipf Data. The error in the iterative algorithm flattens out as its maximum value is bounded by 2.

39 Supported by Privacy Group at Stanford: Rajeev and Hector

