
Order Preserving Encryption for Numeric Data Rakesh Agrawal Jerry Kiernan Ramakrishnan Srikant Yirong Xu IBM Almaden Research Center.


1 Order Preserving Encryption for Numeric Data
Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, Yirong Xu
IBM Almaden Research Center

2 Outline
Motivation and Introduction
OPES encryption
Modeling the distribution
Experimental evaluation

3 Motivation
Encryption is rapidly becoming a requirement in a myriad of business settings (e.g., health care, financial, retail, government), driven by legislation (e.g., SB1386, HIPAA)
Encrypting databases unleashes a host of problems:
–Performance slowdown
–Incompatibility with standard database features, e.g., comparison predicates and the use of indexes
–Changes to applications for encryption: encryption functions now appear in queries

4 Order Preserving Encryption
A function E is an order preserving encryption function if, for any two plaintext values p_1 and p_2 with c_1 = E(p_1) and c_2 = E(p_2):
if (p_1 < p_2) then (c_1 < c_2)
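
The property on this slide can be checked mechanically for any candidate function. The sketch below is illustrative only: `is_order_preserving` and `toy_encrypt` are hypothetical names, and `toy_encrypt` is a simple increasing linear map used as a stand-in, not the OPES function.

```python
def is_order_preserving(encrypt, plaintexts):
    """Return True if p1 < p2 implies encrypt(p1) < encrypt(p2) for all pairs."""
    ordered = sorted(set(plaintexts))
    ciphertexts = [encrypt(p) for p in ordered]
    # Checking adjacent pairs is enough because < is transitive.
    return all(c1 < c2 for c1, c2 in zip(ciphertexts, ciphertexts[1:]))


def toy_encrypt(p):
    # Hypothetical example function, NOT the OPES scheme: strictly increasing.
    return 3 * p + 7
```

Because ciphertext order matches plaintext order, a comparison such as `sal > 100000` can be evaluated on the encrypted column directly.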

5 Threat Model
The storage system used by the DBMS is untrusted, i.e., vulnerable to compromise
The DBMS software is trusted
Ciphertext only attack
–The adversary has access to all (but only) encrypted values
Guard against percentile exposure
–An adversary should not be able to get even an estimate of true values

6 Design Goals
Query results from OPES will be sound and complete
Comparison operations will be performed without decrypting the operands
Standard database indexes can be used over encrypted data
Tolerate updates

7 Integration of Encryption and Query Processing
[Diagram: Queries, Translation layer, DBMS, Encrypted data and metadata]
Plaintext query: Select name from Emp where sal > 100000
Translated query: Select decrypt (“xsxx”) from “cwlxss” where “xescs” > OPESencrypt(100000)
Users have a plaintext view of an encrypted database
Plaintext queries are translated into equivalent queries over encrypted data
Tables are encrypted using standard as well as order preserving encryption
Comparison operators are directly applied over encrypted columns
We hereafter focus strictly on the OPES algorithms

8 Outline
Motivation and Introduction
OPES encryption
Modeling the distribution
Experimental evaluation

9 Approach
Plaintext data has unknown distribution
User selects the target (ciphertext) distribution
Ciphertext values exhibit the target distribution

10 Effect of OPES Encryption on Plaintext Distributions
[Figure: two panels, "Input: Gaussian, Target: Zipf" and "Input: Uniform, Target: Zipf", each showing the Original, Encrypted and Target distributions]

11 OPES Key Generation
[Diagram: a sample of source values from the plaintext distribution and a sample of target values from the ciphertext distribution are the inputs to OPES Key Generation, which outputs the OPES Key]

12 OPES Keys
[Diagram: Source, Uniform and Target distributions, with the labels "Source to uniform" and "Target to uniform"]

13 Two Step Encryption
Source (plaintext) to uniform
Uniform to target (ciphertext)

14 OPES Encryption
[Diagram: Source, Uniform and Target distributions, with Encrypt and Decrypt paths labeled Step I and Step II]
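
The two-step idea (source to uniform, then uniform to target) can be sketched with plain quantile mapping. This is a simplified illustration under stated assumptions, not the OPES algorithm: it interpolates linearly between sample points instead of using the bucketed spline model, it works on floats, and it assumes each sample is sorted, strictly increasing, has at least two points, and covers the values being mapped. All function names are hypothetical.

```python
import bisect


def to_uniform(value, sample):
    """Map a value to [0, 1] via a piecewise-linear CDF through the sorted sample."""
    n = len(sample)
    # Index k of the segment [sample[k], sample[k + 1]] that contains value.
    k = min(max(bisect.bisect_right(sample, value) - 1, 0), n - 2)
    frac = (value - sample[k]) / (sample[k + 1] - sample[k])
    return (k + frac) / (n - 1)


def from_uniform(u, sample):
    """Inverse of to_uniform: map u in [0, 1] back into the sample's range."""
    n = len(sample)
    pos = u * (n - 1)
    k = min(int(pos), n - 2)
    return sample[k] + (pos - k) * (sample[k + 1] - sample[k])


def encrypt(p, source_sample, target_sample):
    # Step I: source to uniform. Step II: uniform to target.
    return from_uniform(to_uniform(p, source_sample), target_sample)


def decrypt(c, source_sample, target_sample):
    # Reverse direction: target to uniform, then uniform to source.
    return from_uniform(to_uniform(c, target_sample), source_sample)
```

Both steps are strictly increasing, so their composition preserves order; in this sketch the two samples play the role of the key.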

15 Outline
Motivation and Introduction
OPES encryption
Modeling the distribution
Experimental evaluation

16 Modeling the Distribution
Histograms
–Equi-depth, equi-width, wavelets
–Number of buckets required unreasonably large
–Overfitting the model
Parametric
–Poor estimation for irregular distributions
Hybrid [Konig and Weikum 99]
–Query result size estimation
–Approach: partition the data into buckets; model the distribution within a bucket as a spline; fixed number of buckets

17 Our Approach
Hybrid [Konig and Weikum 99]
–Partition the data into buckets
–Model the distribution within each bucket as a linear spline
The number of buckets is not fixed
We use MDL to determine the number of bucket boundaries

18 MDL
The best model for encoding data minimizes the sum of the cost of
–Describing the model
–Describing data in terms of the model

19 Model Costs
Data Cost
–Using a mapping M from [p_l, p_h) to [f_l, f_h), the cost of encoding p_i is C(p_i) = log(f_i - E(i))
–DC(p_l, p_h) = C(p_l) + C(p_{l+1}) + … + C(p_{h-1})
Incremental Model Cost
–Fixed cost for each additional bucket: the boundary value and the boundary parameters (slope, scale factor)

20 Computing Boundaries
Growth phase
–[p_l, p_h) with h-l-1 sorted points {p_{l+1}, p_{l+2}, …, p_{h-1}}: compute the spline for [p_l, p_h), then compute [f_l, f_h) using the spline
–Find a further split point p_s with f_s having the maximum deviation from the expected value
Prune phase
–LB(p_l, p_h) = DC(p_l, p_h) - DC(p_l, p_s) - DC(p_s, p_h) - IMC
–GB(p_l, p_h) = LB(p_l, p_h) + GB(p_l, p_s) + GB(p_s, p_h)
–if (GB > 0), the split at p_s is retained
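
The growth and prune phases can be sketched as one recursive function. This is a rough illustration under several assumptions that are not on the slides: points in a bucket are modeled as evenly spaced (a zero-slope stand-in for the linear spline), the per-point cost is log2(1 + |deviation|) so the logarithm is always defined, `IMC` is an arbitrary constant, and growth stops when a bucket already fits the model. The names `expected`, `data_cost` and `split` are hypothetical.

```python
import math

IMC = 4.0  # assumed fixed incremental model cost per extra bucket (arbitrary value)


def expected(pts, l, h, i):
    """Value of the i-th point if points were evenly spread over [pts[l], pts[h])."""
    return pts[l] + (i - l) / (h - l) * (pts[h] - pts[l])


def data_cost(pts, l, h):
    """DC(p_l, p_h): sum of per-point costs C(p_l) + ... + C(p_{h-1})."""
    return sum(math.log2(1 + abs(pts[i] - expected(pts, l, h, i))) for i in range(l, h))


def split(pts, l, h):
    """Return (GB, retained boundary indices) for the bucket [pts[l], pts[h])."""
    interior = range(l + 1, h)
    if not interior:
        return 0.0, []
    # Growth phase: the candidate split is the point deviating most from its expected value.
    s = max(interior, key=lambda i: abs(pts[i] - expected(pts, l, h, i)))
    if abs(pts[s] - expected(pts, l, h, s)) < 1e-9:
        return 0.0, []  # assumed stopping rule: the bucket already fits the model
    gb_left, left = split(pts, l, s)
    gb_right, right = split(pts, s, h)
    # Prune phase: local benefit LB, then global benefit GB as on the slide.
    lb = data_cost(pts, l, h) - data_cost(pts, l, s) - data_cost(pts, s, h) - IMC
    gb = lb + gb_left + gb_right
    # Keep the split at s (and retained sub-splits) only if the global benefit is positive.
    return (gb, left + [s] + right) if gb > 0 else (gb, [])
```

For the sorted points `[0, 1, 2, 3, 4, 100, 101, 102, 103, 104]`, calling `split(pts, 0, len(pts) - 1)` keeps boundaries at the two points on either side of the gap, while evenly spaced input yields no boundaries.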

21 Scaling
[Figure: Source and Uniform axes showing buckets b-1, b and b+1]
The number of values in a bucket may be disproportionate to the size of the bucket

22 Updates
The scale factor ensures that each distinct plaintext value maps to distinct ciphertext values
Encrypted values need not be recomputed unless the distribution of plaintext values changes

23 Quality of Encryption
KS Statistical Test
–Can we disprove, to a certain required level of significance, the null hypothesis that two data sets are drawn from the same distribution function?
–If not, then the ciphertext distribution cannot be distinguished from the specified target distribution
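
A minimal sketch of the two-sample KS test described above, in plain Python. The slide does not state a significance level; the default `c_alpha=1.358` below is the standard large-sample coefficient for the 5% level and is an assumption here, as are the function names.

```python
import math


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are counted on both sides.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def cannot_reject_same_distribution(a, b, c_alpha=1.358):
    """True if the null hypothesis (same distribution) is NOT rejected."""
    n, m = len(a), len(b)
    # Asymptotic critical value; only a reasonable approximation for large samples.
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) <= critical
```

Here `a` would be the encrypted values and `b` a sample drawn from the chosen target distribution; a `True` result corresponds to the second bullet on the slide.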

24 Duplicates
Assumptions
–A large number of duplicates may leak information about the distribution of values
Alternatively,
–Map duplicates to distinct values
–If f = M(p) and f’ = M(p+1), then p maps to the range [f, f’)
–Equality expressed as a range
–Equi-joins can no longer be expressed
However, many numeric attributes (e.g., salary) may rarely be used in joins

25 Outline
Motivation and Introduction
OPES encryption
Modeling the distribution
Experimental evaluation

26 Experimental Evaluation
Percentile exposure
Updatability
Key size
Time overhead

27 Datasets
Census
–UCI KDD archive, PUMS census data (30,000 records)
Gaussian
Zipf
Uniform
Default: Source: Gaussian, Target: Zipf

28 Percentile Exposure
Source distribution   Target distribution   Average change in percentile
Census                Gaussian              37
Census                Zipf                  7
Census                Uniform               38
Gaussian              Zipf                  45
Gaussian              Uniform               17
Zipf                  Uniform               44

29 Time to Build the Model

30 Insertion Overhead

31 Cost of Additional Insertion

32 Retrieval Overhead

33 Retrieval Time

34 Related Work
Polynomial functions
–Ignores the distribution of plaintext/ciphertext values
Database as a service
–Requires post processing of query results
Privacy homomorphisms
–Comparison operations not investigated
Keyword searches on encrypted data
–Designed for keyword retrieval
–Range queries not supported
Smartcard-based schemes
–Infeasible for large ranges
Order-preserving hashing
–Protecting the hash values from cryptanalysis is not a concern, nor is deciphering plaintext values from hash values
–Designed for static collections

35 Closing Remarks
Ensuring safety without impeding the flow of information is a hard problem
Current choices
–Plaintext database
–Encrypted databases with loss of functionality or performance
Our approach focused on the trade-off between security and efficiency
We developed an algorithm which could easily be integrated with current systems

36 Backup

37 Encode
Encode(p) = z(s p^2 + p), p ∈ [0, p_h), s = q/(2r), z > 0
The distribution has density function qp + r
p is the source (target) value
s is the quadratic coefficient
z is the scale factor

38 Decode
Decode(f) = (-z ± sqrt(z^2 + 4zsf)) / (2zs), f ∈ [0, f_h), s = q/(2r), z > 0
f is the flattened value
s is the quadratic coefficient
z is the scale factor
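
Encode is proportional to the integral of the density qp + r from 0 to p, which is r(s p^2 + p) with s = q/(2r), and Decode is the quadratic-formula inverse of Encode. The sketch below checks the round trip numerically. The parameter values are arbitrary, the function names are hypothetical, and the choice of the "+" root plus the special case s = 0 are assumptions that hold when Encode is increasing on the bucket (1 + 2sp > 0).

```python
import math


def encode(p, s, z):
    """Encode(p) = z * (s * p^2 + p)."""
    return z * (s * p * p + p)


def decode(f, s, z):
    """Invert encode by solving z*s*p^2 + z*p - f = 0 for p."""
    if s == 0:
        return f / z  # the mapping is linear when the density is flat
    disc = z * z + 4 * z * s * f
    # The "+" root lies on the increasing branch of the parabola for s > 0 and s < 0.
    return (-z + math.sqrt(disc)) / (2 * z * s)
```

For example, with s = 0.01 and z = 2, `encode(7, 0.01, 2.0)` is about 14.98 and `decode` maps it back to about 7.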

39 Order Preserving Encryption
Ciphertext   Plaintext
1            28000
2            35000
…            …
C_n          P_n
[Table: columns No, Name, Position, Salary, Location]
Compute distinct attribute values in ascending order
Ciphertext is the index value
Effectively hides the distribution of plaintext values
The key size is proportional to the number of distinct attribute values
Any updates require recomputing the key and ciphertext values
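
The index-based scheme on this slide is small enough to sketch directly. The function names are hypothetical, and the 1-based numbering follows the table above; the error raised for an unseen value illustrates why any update forces the key to be rebuilt.

```python
import bisect


def build_key(values):
    """The key is the sorted list of distinct plaintext values."""
    return sorted(set(values))


def encrypt(key, p):
    """Ciphertext is the 1-based position of p in the key."""
    i = bisect.bisect_left(key, p)
    if i == len(key) or key[i] != p:
        raise KeyError(f"{p} is not in the key; the key must be rebuilt")
    return i + 1


def decrypt(key, c):
    """Look the plaintext up by its 1-based index."""
    return key[c - 1]
```

The key holds one entry per distinct value, which is the key-size drawback the slide lists.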

40 Target Distribution Requirement
Why isn’t the source-to-uniform transformation sufficient for order preserving encryption?
It is, but
–The target distribution may cause an adversary to make incorrect assumptions about the source distribution
–The organization of the source distribution cannot be inferred from the target

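41 Quadratic Coefficient
[Figure: a bucket with two sub-intervals b_1 and b_2, with sample points i_1, j_1 in b_1 and i_2, j_2 in b_2. The formulas are garbled in the transcript; the recoverable terms are the ratios (j_1 – i_1)/(v_j1 – v_i1) and (j_2 – i_2)/(v_j2 – v_i2), a difference v_b1 – v_b2, and the quantities q and s]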

42 Scale Factor
Constraints
–For all p ∈ [0, w): M(p+1) – M(p) ≥ 2
–w_f = Kn
Ensures that there is a distinct mapped value for each input value
The width of a bucket in the mapped space is a function of the number of elements n in the bucket
K is the minimum width needed across buckets

43 Scale Factor
z = Kn / (s w^2 + w)
K = max [x(s w_i^2 + w)], i = 1, …, m
x = 2 if s ≥ 0; x = 2/(1 + s(2w – 1)) if s < 0
The scale factor will stretch short buckets to the width of the largest bucket, further increasing the dimension of a bucket by a factor of the number of elements in the bucket

44 Slope
[Figure: buckets b-1 and b]
The values within a single bucket are unevenly distributed within the bucket

