Published by Cindy Wynter. Modified over 10 years ago.
1
Order Preserving Encryption for Numeric Data
Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, Yirong Xu
IBM Almaden Research Center
2
Outline
  Motivation and Introduction
  OPES encryption
  Modeling the distribution
  Experimental evaluation
3
Motivation
  Encryption is rapidly becoming a requirement in a myriad of business settings (e.g., health care, financial, retail, government), driven by legislation (e.g., SB1386, HIPAA)
  Encrypting databases unleashes a host of problems:
    Performance slowdown
    Incompatibility with standard database features, e.g., comparison predicates and the use of indexes
    Changes to applications for encryption: encryption functions now appear in queries
4
Order Preserving Encryption Function
Let E be an order preserving encryption function, let $p_1$ and $p_2$ be two plaintext values, and let $c_1 = E(p_1)$ and $c_2 = E(p_2)$. If $p_1 < p_2$, then $c_1 < c_2$.
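The definition can be checked mechanically on a set of values. The sketch below is only an illustration: `toy_encrypt` is a hypothetical stand-in for E (any strictly increasing function satisfies the property) and is not the OPES function from the talk.

```python
from itertools import combinations


def toy_encrypt(p: int) -> int:
    """Hypothetical stand-in for E; strictly increasing, so it preserves order."""
    return 7 * p + 13


def is_order_preserving(encrypt, plaintexts) -> bool:
    """Return True if p1 < p2 implies encrypt(p1) < encrypt(p2) for every pair."""
    values = sorted(set(plaintexts))
    # combinations() yields pairs (a, b) with a < b because values is sorted
    return all(encrypt(a) < encrypt(b) for a, b in combinations(values, 2))


salaries = [28000, 35000, 100000, 42000]
print(is_order_preserving(toy_encrypt, salaries))
```

A decreasing function such as `lambda p: -p` fails the same check, which is a quick way to confirm the test is not vacuous.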
5
Threat Model
  The storage system used by the DBMS is untrusted, i.e., vulnerable to compromise
  The DBMS software is trusted
  Ciphertext-only attack
    The adversary has access to all (but only) encrypted values
  Guard against percentile exposure
    An adversary should not be able to get even an estimate of true values
6
Design Goals
  Query results from OPES will be sound and complete
  Comparison operations will be performed without decrypting the operands
  Standard database indexes can be used over encrypted data
  Tolerate updates
7
Integration of Encryption and Query Processing
  Users have a plaintext view of an encrypted database
  We hereafter strictly focus on the OPES algorithms
  Comparison operators are directly applied over encrypted columns
  [Diagram: query flow from the user through a translation layer to the DBMS]
    Queries: plaintext queries are translated into equivalent queries over encrypted data
    Plaintext query: Select name from Emp where sal >
    Translation layer output: Select decrypt (“xsxx”) from “cwlxss” where “xescs” > OPESencrypt(100000)
    DBMS: tables are encrypted using standard as well as order preserving encryption
    Storage: encrypted data and metadata
8
Outline
  Motivation and Introduction
  OPES encryption
  Modeling the distribution
  Experimental evaluation
9
Approach
  Plaintext data has unknown distribution
  User selects the target (ciphertext) distribution
  Ciphertext values exhibit the target distribution
10
Effect of OPES Encryption on Plaintext Distributions
[Figure: original, encrypted, and target distributions for two cases. Input: Gaussian, Target: Zipf. Input: Uniform, Target: Zipf.]
11
OPES Key Generation
[Diagram: a sample of source values from the plaintext distribution and a sample of target values from the ciphertext distribution are the inputs to OPES key generation, which outputs the OPES key]
12
OPES Keys
[Diagram: two mappings, source to uniform and target to uniform]
13
Two Step Encryption
  Step I: source (plaintext) to uniform
  Step II: uniform to target (ciphertext)
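The two steps can be sketched with piecewise-linear CDFs built from sorted samples. This is a simplified illustration, not the OPES model: the talk uses MDL-chosen buckets with splines and scale factors, while the sketch assumes that each sample is a sorted list of at least two distinct values and that every plaintext lies inside the source sample's range. All function names are hypothetical.

```python
from bisect import bisect_right


def to_uniform(sample, x):
    """Piecewise-linear CDF of a sorted sample, evaluated at x (result in [0, 1])."""
    n = len(sample)
    if x <= sample[0]:
        return 0.0
    if x >= sample[-1]:
        return 1.0
    i = bisect_right(sample, x) - 1  # sample[i] <= x < sample[i + 1]
    frac = (x - sample[i]) / (sample[i + 1] - sample[i])
    return (i + frac) / (n - 1)


def from_uniform(sample, u):
    """Inverse of to_uniform: map u in [0, 1] back onto the sample's value range."""
    n = len(sample)
    pos = u * (n - 1)
    i = min(int(pos), n - 2)
    frac = pos - i
    return sample[i] + frac * (sample[i + 1] - sample[i])


def encrypt(p, source_sample, target_sample):
    u = to_uniform(source_sample, p)        # Step I: source to uniform
    return from_uniform(target_sample, u)   # Step II: uniform to target


def decrypt(c, source_sample, target_sample):
    u = to_uniform(target_sample, c)        # undo Step II
    return from_uniform(source_sample, u)   # undo Step I


source = [0, 10, 20, 30, 40]
target = [1, 2, 4, 8, 16]
print(encrypt(15, source, target), decrypt(3.0, source, target))
```

Because both steps are strictly increasing inside the sample range, their composition is order preserving there; values outside the range are clamped, which is why the range assumption matters.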
14
OPES Encryption
[Diagram: Step I maps source to uniform and Step II maps uniform to target; Decrypt is shown as a separate arrow]
15
Outline
  Motivation and Introduction
  OPES encryption
  Modeling the distribution
  Experimental evaluation
16
Modeling the Distribution
  Histograms
    Equi-depth, equi-width, wavelets
    Number of buckets required is unreasonably large
    Overfitting the model
  Parametric
    Poor estimation for irregular distributions
  Hybrid [Konig and Weikum 99]
    Query result size estimation
    Approach
      Partition the data into buckets
      Model the distribution within a bucket as a spline
      Fixed number of buckets
17
Our Approach
  Hybrid [Konig and Weikum 99]
    Partition the data into buckets
    Model the distribution within each bucket as a linear spline
  The number of buckets is not fixed
    We use MDL to determine the number of bucket boundaries
18
MDL (Minimum Description Length)
  The best model for encoding data minimizes the sum of the cost of
    Describing the model
    Describing data in terms of the model
19
Model Costs
  Data Cost
    Using a mapping M from $[p_l, p_h)$ to $[f_l, f_h)$, the cost of encoding $p_i$ is $C(p_i) = \log(f_i - E(i))$
    $DC(p_l, p_h) = C(p_l) + C(p_{l+1}) + \ldots + C(p_{h-1})$
  Incremental Model Cost
    Fixed cost for each additional bucket
      Boundary value
      Boundary parameters
        Slope
        Scale factor
20
Computing Boundaries
  Growth phase
    Given a bucket $[p_l, p_h)$ with $h-l-1$ sorted points $\{p_{l+1}, p_{l+2}, \ldots, p_{h-1}\}$
    Compute the spline for $[p_l, p_h)$
    Compute $[f_l, f_h)$ using the spline
    Find a further split point $p_s$ with $f_s$ having the maximum deviation from the expected value
  Prune phase
    $LB(p_l, p_h) = DC(p_l, p_h) - DC(p_l, p_s) - DC(p_s, p_h) - IMC$
    $GB(p_l, p_h) = LB(p_l, p_h) + GB(p_l, p_s) + GB(p_s, p_h)$
    If $GB > 0$, the split at $p_s$ is retained
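The prune phase can be sketched as a bottom-up pass over the tree of splits produced by the growth phase. The `Bucket` class and the cost numbers are hypothetical. The slide does not say what GB is for a bucket that was never split or whose split was pruned, so the sketch assumes it is 0 in both cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    data_cost: float                   # DC(p_l, p_h) for this bucket
    left: Optional["Bucket"] = None    # [p_l, p_s) if the growth phase split it
    right: Optional["Bucket"] = None   # [p_s, p_h)


def prune(bucket: Bucket, imc: float) -> float:
    """Return GB for the bucket, dropping splits whose GB is not positive."""
    if bucket.left is None or bucket.right is None:
        return 0.0  # assumption: an unsplit bucket contributes no benefit
    # Children first, so the computation runs bottom up
    gb_left = prune(bucket.left, imc)
    gb_right = prune(bucket.right, imc)
    lb = bucket.data_cost - bucket.left.data_cost - bucket.right.data_cost - imc
    gb = lb + gb_left + gb_right
    if gb > 0:
        return gb  # the split at p_s is retained
    bucket.left = bucket.right = None  # prune this split and everything below it
    return 0.0  # assumption: a pruned bucket behaves like an unsplit one


def count_buckets(bucket: Bucket) -> int:
    """Number of leaf buckets remaining in the model."""
    if bucket.left is None or bucket.right is None:
        return 1
    return count_buckets(bucket.left) + count_buckets(bucket.right)


# Hypothetical costs: the root split saves 20, the nested split loses 9
root = Bucket(100.0, Bucket(30.0, Bucket(14.0), Bucket(15.0)), Bucket(40.0))
print(prune(root, imc=10.0), count_buckets(root))
```

With these numbers the nested split costs more than it saves once IMC is charged, so only the top-level split survives.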
21
Scaling
  Number of values in a bucket may be disproportionate to the size of the bucket
  [Diagram: values (x) in source buckets b-1, b, and b+1, shown against the uniform space]
22
Updates
  The scale factor ensures that each distinct plaintext value maps to a distinct ciphertext value
  Encrypted values need not be recomputed unless the distribution of plaintext values changes
23
Quality of Encryption
  KS (Kolmogorov-Smirnov) Statistical Test
    Can we disprove, to a certain required level of significance, the null hypothesis that two data sets are drawn from the same distribution function?
    If not, then the ciphertext distribution cannot be distinguished from the specified target distribution
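A two-sample KS test compares the empirical CDFs of the two data sets. The sketch below computes the statistic directly and uses the standard large-sample approximation for the critical value. The function names are hypothetical, and the talk does not say which implementation the authors used.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every occurrence of x in both samples
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def reject_same_distribution(a, b, alpha=0.05):
    """True if the null hypothesis is rejected at level alpha (large-sample approximation)."""
    n, m = len(a), len(b)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


ciphertext_sample = list(range(100))
target_sample = list(range(100))
print(reject_same_distribution(ciphertext_sample, target_sample))
```

In the setting of the slide, one sample would be drawn from the ciphertext column and the other from the chosen target distribution. Failing to reject means the two cannot be told apart at that significance level.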
24
Duplicates
  Assumptions
    A large number of duplicates may leak information about the distribution of values
  Alternatively,
    Map duplicates to distinct values
      If $f = M(p)$ and $f' = M(p+1)$, then $p$ is mapped into the interval $[f, f')$
    Equality expressed as a range
    Equi-joins can no longer be expressed
      However, many numeric attributes (e.g., salary) may rarely be used in joins
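The interval idea can be sketched as follows. The mapping `toy_mapping` is a hypothetical integer-valued stand-in for M, and picking a random point in $[f, f')$ is an assumption about how duplicates get spread out. Random picks can still collide, so a real scheme would need more care to guarantee distinct values.

```python
import random


def toy_mapping(p: int) -> int:
    """Hypothetical order-preserving M with wide gaps between consecutive values."""
    return 1000 * p


def encrypt_with_duplicates(p, mapping, rng=random):
    """Map p to a random value in [M(p), M(p+1)) so repeated plaintexts spread out."""
    f, f_next = mapping(p), mapping(p + 1)
    return rng.randrange(f, f_next)


def equality_as_range(p, mapping):
    """Ciphertext interval [f, f') that replaces the predicate 'column = p'."""
    return mapping(p), mapping(p + 1)


rng = random.Random(0)
ciphertexts = [encrypt_with_duplicates(5, toy_mapping, rng) for _ in range(5)]
low, high = equality_as_range(5, toy_mapping)
matches = [c for c in ciphertexts if low <= c < high]
print(len(matches) == len(ciphertexts))
```

Order is still preserved across different plaintexts because the intervals for p and p+1 do not overlap, but two columns encrypted this way no longer share ciphertexts for equal values, which is why equi-joins are lost.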
25
Outline
  Motivation and Introduction
  OPES encryption
  Modeling the distribution
  Experimental evaluation
26
Experimental Evaluation
  Percentile exposure
  Updatability
  Key size
  Time overhead
27
Datasets
  Census: UCI KDD archive, PUMS census data (30,000 records)
  Gaussian
  Zipf
  Uniform
  Default: source Gaussian, target Zipf
28
Percentile Exposure
Average change in percentile, by source distribution and target distribution:
Census Gaussian 37 Zipf 7 Uniform 38 45 17 44
29
Time to Build the Model
30
Insertion Overhead
31
Cost of Additional Insertion
32
Retrieval Overhead
33
Retrieval Time
34
Related Work
  Polynomial functions
    Ignores the distribution of plaintext/ciphertext values
  Database as a service
    Requires post-processing of query results
  Privacy homomorphisms
    Comparison operations not investigated
  Keyword searches on encrypted data
    Designed for keyword retrieval
    Range queries not supported
  Smartcard-based schemes
    Infeasible for large ranges
  Order-preserving hashing
    Protecting the hash values from cryptanalysis is not a concern, nor is deciphering plaintext values from hash values
    Designed for static collections
35
Closing Remarks
  Ensuring safety without impeding the flow of information is a hard problem
  Current choices
    Plaintext database
    Encrypted databases with loss of functionality or performance
  Our approach focused on the trade-off between security and efficiency
  We developed an algorithm which could easily be integrated with current systems
  Speaker notes
    Protecting data without impeding the flow of information is an extremely hard problem.
    Today: no encryption, or if you encrypt, your performance goes to hell.
    This is a first stab at the problem, focusing on OPES to balance the trade-off: increasing security without affecting efficiency.
    In the first stab, we wanted something that is easily integrated with systems.
    The challenge is to have a complete set of techniques for a system that encrypts a database while still preserving the efficiency of operations.
36
Backup
37
Encode
  $\mathrm{Encode}(p) = z(sp^2 + p)$
  $p \in [0, p_h)$, $s = q/(2r)$, $z > 0$
  The distribution has density function $qp + r$
  $p$ is the source (target) value
  $s$ is the quadratic coefficient
  $z$ is the scale factor
38
Decode
  $\mathrm{Decode}(f) = \dfrac{-z \pm \sqrt{z^2 + 4zsf}}{2zs}$
  $f \in [0, f_h)$, $s = q/(2r)$, $z > 0$
  $f$ is the flattened value
  $s$ is the quadratic coefficient
  $z$ is the scale factor
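Decode is the quadratic-formula solution of $f = z(sp^2 + p)$ for $p$, so a round trip is easy to check numerically. The sketch assumes the `+` root is the one that recovers a non-negative $p$ (which holds while Encode is increasing on the bucket) and treats $s = 0$ as a separate linear case. The parameter values are made up.

```python
import math


def encode(p: float, s: float, z: float) -> float:
    """Encode(p) = z * (s * p^2 + p)."""
    return z * (s * p * p + p)


def decode(f: float, s: float, z: float) -> float:
    """Invert encode by solving z*s*p^2 + z*p - f = 0 for p."""
    if s == 0:
        return f / z  # the mapping is linear, so the quadratic formula does not apply
    # Assumption: the '+' root is the one that lies in [0, p_h)
    return (-z + math.sqrt(z * z + 4 * z * s * f)) / (2 * z * s)


f = encode(10, s=0.01, z=2.0)
print(f, decode(f, s=0.01, z=2.0))
```

The same root also works for a negative quadratic coefficient, for example `s = -0.01`, as long as the encoded function is still increasing over the bucket.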
39
Order Preserving Encryption
  [Table columns: No, Name, Position, Salary, Location, …]
  Compute distinct attribute values in ascending order
  Ciphertext is the index value

  Ciphertext  Plaintext
  1           28000
  2           35000
  …           …
  Cn          Pn

  Effectively hides the distribution of plaintext values
  The key size is proportional to the number of distinct attribute values
  Any updates require recomputing the key and ciphertext values
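This index-based scheme is small enough to sketch in full. The function names are hypothetical. The third salary value is invented for the example, while the first two follow the table on the slide.

```python
def build_key(values):
    """Key = distinct plaintexts in ascending order; ciphertext = 1-based index."""
    distinct = sorted(set(values))
    to_cipher = {p: i for i, p in enumerate(distinct, start=1)}
    return to_cipher, distinct


def encrypt(p, to_cipher):
    return to_cipher[p]  # raises KeyError for a value the key has never seen


def decrypt(c, distinct):
    return distinct[c - 1]


salaries = [35000, 28000, 35000, 51000]
to_cipher, distinct = build_key(salaries)
print([encrypt(p, to_cipher) for p in salaries])
```

The `KeyError` on an unseen value is the update problem named on the slide: inserting a new distinct salary shifts the indexes of every larger value, so the key and the stored ciphertexts must be recomputed.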
40
Target Distribution Requirement
  Why isn’t the source-to-uniform transformation sufficient for order preserving encryption?
  It is, but
    The target distribution may cause an adversary to make incorrect assumptions about the source distribution
    The organization of the source distribution cannot be inferred from the target
41
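Quadratic Coefficient
  [Diagram: values (x) along the axis v, with points b1 and b2 and index pairs (i1, j1) and (i2, j2) marked]
  $q = \dfrac{\dfrac{j_2 - i_2}{v_{j_2} - v_{i_2}} - \dfrac{j_1 - i_1}{v_{j_1} - v_{i_1}}}{v_{b_1} - v_{b_2}}$
  $s = \dfrac{q}{2\,\dfrac{j_1 - i_1}{v_{j_1} - v_{i_1}}}$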
42
Scale Factor Constraints
  $\forall p \in [0, w): M(p+1) - M(p) \ge 2$
    Ensures that there is a distinct mapped value for each input value
  $w^f = Kn$
    The width of a bucket in the mapped space is a function of the number of elements $n$ in the bucket
    $K$ is the minimum width needed across buckets
43
Scale Factor
  The scale factor will stretch short buckets to the width of the largest bucket, further increasing the dimension of a bucket by a factor of the number of elements in the bucket
  $z = \dfrac{Kn}{sw^2 + w}$
  $K = \max\,[x(s w_i^2 + w)], \quad i = 1, \ldots, m$
  $x = \begin{cases} 2, & s \ge 0 \\ 2/(1 + s(2w - 1)), & s < 0 \end{cases}$
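The case split for $x$ can be checked by hand. The derivation below assumes that the mapping $M$ has the Encode form $M(p) = z(sp^2 + p)$, a link the scale-factor slides do not state explicitly, and that $1 + s(2w - 1) > 0$ so that $M$ is increasing over the whole bucket.

```latex
\begin{align*}
M(p+1) - M(p) &= z\bigl(s(p+1)^2 + (p+1)\bigr) - z\bigl(sp^2 + p\bigr)
               = z\bigl(1 + s(2p+1)\bigr) \\
s \ge 0:\quad & \text{the gap is smallest at } p = 0,
               \text{ where } z(1+s) \ge 2 \text{ holds whenever } z \ge 2 \\
s < 0:\quad & \text{the gap is smallest at } p = w-1,
               \text{ where } z\bigl(1 + s(2w-1)\bigr) \ge 2
               \iff z \ge \frac{2}{1 + s(2w-1)}
\end{align*}
```

In both cases $x$ is a lower bound on $z$ that guarantees a gap of at least 2 between consecutive mapped values, which is the first constraint on the scale factor.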
44
Slope
  The values within a single bucket are unevenly distributed within the bucket
  [Diagram: a bucket between boundaries b-1 and b]