Stochastic Protection of Confidential Information in SDB: A hybrid of Query Restriction and Data Perturbation ( to appear in Operations Research) Manuel.


1 Stochastic Protection of Confidential Information in SDB: A Hybrid of Query Restriction and Data Perturbation (to appear in Operations Research). Manuel Nunez, Robert Garfinkel, and Ram Gopal. JHU, October 5, 2006

2 Motivation. Two general goals for a statistical database: protect confidential records and provide useful information. The goals are often in conflict, which forces a tradeoff. This problem is faced by census bureaus and similar agencies.

3 An Example
Record  Name        Job      Age  Co.  Salary
1       Robinson    Manager  27   A    55
2       Reese       Trainee  42   B    31
3       Furillo     Manager  63   C    107
4       Campanella  Trainee  28   B
5       Cox         Manager  55   B    63
6       Snider      Manager  57   A    82
7       Koufax      Trainee  21   D    29
8       Newcombe    Trainee  32   C    31
9       Hodges      Manager  35   D    60
10      Branca      Trainee  36   D    27
11      Loes        Manager  47   B    37
12      Roe         Trainee  28   D    42
13      Reiser      Manager  64   A    94
14      Gilliam     Manager  46   C    51

4 Types of Protection. Random data perturbation: add noise to the data and answer all queries. Query restriction/inference control: provide exact answers to some queries but refuse to answer others; keep track of answered queries (auditing). Camouflage/interval methods: provide interval answers to queries and answer all queries.

5 Exact Disclosure. The DB is a = [2, 5, 8], with two target SUM queries: q_1 = [1, 1, 1], whose answer is 15, and q_2 = [1, 1, 0], whose answer is 7. The user can solve the resulting linear system and learn that a_3 = 8.
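Written out, the two answered queries hand the user the small system below; subtracting the second equation from the first isolates the third record. The symbols a_1, a_2, a_3 denote the three entries of the example DB.

```latex
\begin{aligned}
a_1 + a_2 + a_3 &= 15 \\
a_1 + a_2 &= 7 \\
\Rightarrow\quad a_3 &= 15 - 7 = 8
\end{aligned}
```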

6 Another Perspective. Notice that a linear combination of q_1 and q_2 yields the canonical vector e_3 = [0, 0, 1], namely q_1 − q_2 = e_3. In general, a group of linear queries is "exactly safe" if none of the canonical vectors can be expressed as a linear combination of the queries in the group.
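The "exactly safe" condition can be checked numerically with a rank test: e_i is a linear combination of the query vectors exactly when appending e_i to them does not raise the rank. The sketch below is only an illustration; the function names `disclosed_subjects` and `is_exactly_safe` are made up here, queries are stored as rows, and indexes are 0-based (so a_3 is index 2).

```python
import numpy as np

def disclosed_subjects(queries):
    """Return the 0-based indexes i for which e_i lies in the span of the queries."""
    Q = np.atleast_2d(np.asarray(queries, dtype=float))
    base_rank = np.linalg.matrix_rank(Q)
    n = Q.shape[1]
    exposed = []
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = 1.0
        # e_i is a combination of the rows iff appending it leaves the rank unchanged.
        if np.linalg.matrix_rank(np.vstack([Q, e_i])) == base_rank:
            exposed.append(i)
    return exposed

def is_exactly_safe(queries):
    """A group of queries is exactly safe if no canonical vector is in its span."""
    return not disclosed_subjects(queries)

print(disclosed_subjects([[1, 1, 1], [1, 1, 0]]))  # [2]: a_3 is disclosed
print(is_exactly_safe([[1, 1, 1]]))                # True: the total alone reveals nothing
```

The rank computation is floating-point, so for large or badly scaled query matrices a tolerance-aware check would be needed.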

7 Degrees of Disclosure, each stated as the protection it calls for. Exact disclosure: the user is unable to learn the exact confidential value of any subject. Interval disclosure: the user is unable to learn that a confidential value lies within an interval pre-specified by the subject. Stochastic disclosure: the user is unable to obtain a random estimate of a confidential value that is accurate with high probability.

8 Previous Work on Query Restriction (J.O.C.). Interval disclosure with SUM and MIN (MAX) queries. Determine a heuristic restriction of the polytope that describes the user's knowledge, and use it to decide whether to answer the next query: would answering result in a "safe" polytope? Collusion, but not auditing, is a problem. "Success" is a function of the number of answered queries.

9 QR Continued. Queries arrive online, so decisions are made without knowing what comes next. Even if all queries were known in advance, finding a maximum-cardinality set to answer is NP-hard (Chin & Ozsoyoglu).

10 Previous Work on Camouflage (CVC) (Operations Research). Hide the confidential vector in the interior of a "safe" polytope Π. Answer every query q = f(x) with the interval [min {f(x) : x ∈ Π}, max {f(x) : x ∈ Π}]. Answers are deterministically correct. Depending on the query type, finding polynomial, minimum-access algorithms is not trivial!
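To make the interval answer concrete, here is a minimal sketch under a strong simplifying assumption: the polytope Π is taken to be a box l ≤ x ≤ u, and the query is linear. For a general polytope the two bounds would each require solving a linear program; the box case is used only because its bounds have a closed form. The function name `interval_answer` and the box bounds are made up for this illustration.

```python
import numpy as np

def interval_answer(q, lower, upper):
    """Interval [min q.x, max q.x] over the box lower <= x <= upper."""
    q, lower, upper = (np.asarray(v, dtype=float) for v in (q, lower, upper))
    # Each coordinate contributes its smallest (largest) possible term to the min (max).
    lo = np.sum(np.where(q >= 0, q * lower, q * upper))
    hi = np.sum(np.where(q >= 0, q * upper, q * lower))
    return float(lo), float(hi)

a = [2, 5, 8]        # confidential vector, hidden inside the box
lower = [1, 3, 6]    # hypothetical box bounds
upper = [4, 7, 9]
lo, hi = interval_answer([1, 1, 0], lower, upper)
print(lo, hi)        # 4.0 11.0, which contains the true answer 7
```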

11 CVC Continued. A set of linear queries that safely yield exact answers can be predetermined via a network flow formulation. Collusion is not a problem. The same cannot be said of "insider information".

12 Current Extensions of CVC. Dealing with insider threats: "data" vs. "process". Finding the best (smallest) camouflaging set based on the threat type and level. Is it necessarily a polytope?

13 Hybrid of Query Restriction and Data Perturbation. Provide an algorithm to determine which queries (from a given set) can be answered exactly without compromising confidentiality (the safe subset). Provide a protection mechanism to answer all other queries. Maintain consistency between exact answers and protected answers.

14 Our Approach. Given a target set of weighted queries, follow a 3-phase process: 1. Find the maximum-weight query subset that can be safely answered exactly. 2. For the other target queries, answer safe approximate queries exactly (optional). 3. Answer all non-target queries using a consistent perturbed DB.

15 Importance of Consistency. Suppose q is answered exactly. In the absence of consistency, a user who wants to determine a_i can ask a series of queries q′ = q + e_i to get a set of i.i.d. estimates of a_i. As the number of such queries gets large, the error in the resulting (averaged) estimate of a_i goes to zero.
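The averaging attack can be simulated in a few lines. This is a sketch under assumptions not stated on the slide: "absence of consistency" is modeled as drawing a fresh, independent Gaussian perturbation of the whole DB for every non-target query, and the names `inconsistent_answer` and `averaged_estimate` are hypothetical. The attacker subtracts the exact answer to q from each noisy answer to q + e_i and averages.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 5.0, 8.0])   # confidential values
q = np.array([1.0, 1.0, 0.0])   # query that is answered exactly
target = 2                      # attacker wants a[2]
e_i = np.zeros(3)
e_i[target] = 1.0
exact_q = q @ a                 # exact answer known to the attacker

def inconsistent_answer(query, sigma=1.0):
    """Answer from a freshly perturbed copy of the DB (no consistency)."""
    noise = rng.normal(0.0, sigma, size=a.shape)
    return query @ (a + noise)

def averaged_estimate(num_queries):
    """Average of (noisy answer to q + e_i) minus (exact answer to q)."""
    answers = [inconsistent_answer(q + e_i) for _ in range(num_queries)]
    return float(np.mean(answers) - exact_q)

estimate = averaged_estimate(10_000)
print(abs(estimate - a[target]))  # small: the error shrinks like 1/sqrt(num_queries)
```

A consistent perturbed DB removes this attack because repeated queries would all see the same perturbation, so averaging gains nothing.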

16 What Is Given to the User? Guaranteed exact answers to safe target queries; because these answers can be made public, user collusion poses no threat. Approximate answers to unsafe target queries, which ensures some degree of information for all target queries. Access to a perturbed DB for all non-target queries.

17 Model Assumptions. The DB has n subjects. There is only one confidential field, a ∈ R^n (it could be a stacking of any number of such fields). Every subject is identifiable by its record index. Set of subject indexes: N, |N| = n. Queries have nonnegative weights.

18 Phase 1: Query Restriction. Set of target queries: T. Query weights: w. Index set of the queries in T: M, |M| = m. Sum of weights for a subset K of M: W(K) = Σ_{j ∈ K} w_j.

19 Phase 1 Optimization Problem. Problem OPT: maximize the sum of weights Σ_{j ∈ K} w_j over K ∈ F, where F is a family of "safe" subsets of M. But before defining a safe set, let's talk about matroids …

20 Matroids. A modeling theory founded by H. Whitney in 1935. Many applications in combinatorial optimization: maximal spanning tree, matroid intersection, maximal partition/matching, etc.

21 Quick Definition. A matroid is a pair (M, F): M is a finite set and F is a family of subsets of M. Elements of F are called "independent" sets. Two properties: (1) if K is in F, then all subsets of K are in F; (2) if K and L are in F and |K| = |L| + 1, then some element of K can be added to L to create a new independent set. The rank of K, r(K), is the cardinality of the largest independent set contained in K.

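22 Example: MST. All acyclic sets of links (sub-trees and, more generally, forests) are independent sets, and the matroid is the collection of these sets. The rank of a subgraph is the number of links in its largest forest (its spanning tree, if the subgraph is connected). [Figure: example graph with link weights 100, 200, 300, 400, 500, 600.]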

23 Example: Sets of L.I. Vectors. Find a linear basis from a matrix. The matroid's independent sets are the subsets of linearly independent columns. A basis is an independent set of maximum cardinality. The rank of a submatrix is its column rank.

24 Non-Example. Consider an assignment problem. A set of cells is independent if no row or column appears more than once. It seems to be almost a matroid, but it's not!
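A brute-force check on a 2 × 2 assignment grid shows where the matroid definition breaks: the family is closed under taking subsets, but the exchange (augmentation) property fails. The helper names `independent` and `exchange_violations` are made up for this sketch.

```python
from itertools import combinations, product

def independent(cells):
    """A set of (row, col) cells is independent if no row or column repeats."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return len(set(rows)) == len(rows) and len(set(cols)) == len(cols)

def exchange_violations(size):
    """Pairs (K, L) of independent sets with |K| = |L| + 1 that cannot be exchanged."""
    cells = list(product(range(size), repeat=2))
    family = [frozenset(s) for k in range(size + 1)
              for s in combinations(cells, k) if independent(s)]
    bad = []
    for K in family:
        for L in family:
            if len(K) == len(L) + 1:
                # The exchange property needs some x in K \ L with L + x independent.
                if not any(independent(L | {x}) for x in K - L):
                    bad.append((sorted(K), sorted(L)))
    return bad

violations = exchange_violations(2)
print(len(violations))  # 4
print(([(0, 0), (1, 1)], [(0, 1)]) in violations)  # True
```

For instance, with K = {(0,0), (1,1)} and L = {(0,1)}, adding (0,0) to L repeats row 0 and adding (1,1) repeats column 1, so no element of K extends L.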

25 Main Matroid Result. Given a set of nonnegative weights assigned to the elements of M: if (M, F) is a matroid, then the greedy algorithm will find an independent set (i.e., a set in F) that maximizes the sum of the weights.
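The greedy algorithm referred to here can be sketched generically: scan the elements in order of decreasing weight and keep each one that preserves independence. The example below uses the linear-independence matroid on the columns of a small matrix; the function names and the data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def greedy_max_weight(weights, is_independent):
    """Greedy for a matroid given by an independence oracle on index lists."""
    chosen = []
    # Visit elements from heaviest to lightest.
    for j in sorted(range(len(weights)), key=lambda j: -weights[j]):
        if is_independent(chosen + [j]):
            chosen.append(j)
    return sorted(chosen)

# Columns are the vectors (1,0), (0,1), (1,1); any two are linearly independent.
columns = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
weights = [3.0, 1.0, 2.0]

def columns_independent(idx):
    """Oracle: the selected columns are linearly independent."""
    return np.linalg.matrix_rank(columns[:, idx]) == len(idx)

best = greedy_max_weight(weights, columns_independent)
print(best)  # [0, 2], total weight 5
```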

26 Matroid Intersection. Given k matroids (M, F_1), …, (M, F_k) and weights for the elements of M, the goal is to find a common independent set that maximizes the sum of the weights. Problem: the intersection of matroids is not, in general, a matroid. For general k (k ≥ 3), the problem is NP-hard. Yet a modified greedy algorithm works for the intersection of 2 matroids.

27 Matroids and Inference. Given the target query set T, let M be the index set of its queries. A subset K of M is safe w.r.t. subject i if the user cannot learn subject i's confidential record using linear combinations of the queries indexed by K. Let F_i be the family of safe subsets of M w.r.t. subject i. Then (M, F_i) is a matroid! A safe set is safe w.r.t. all subjects, that is, it lies in the matroid intersection.
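For a tiny instance, the safe family can be enumerated directly and the best safe set found by brute force. The sketch below is exponential in the number of queries and is meant only to illustrate the definitions; the query matrix, the weights, and the names `safe_wrt` and `best_safe_subset` are assumptions made for the example. Queries are stored as rows and all indexes are 0-based.

```python
from itertools import combinations
import numpy as np

def safe_wrt(Q, K, i):
    """True if e_i is NOT a linear combination of the queries indexed by K."""
    if not K:
        return True
    rows = Q[list(K)]
    e_i = np.zeros(Q.shape[1])
    e_i[i] = 1.0
    # Appending e_i raises the rank exactly when e_i is outside the span.
    return np.linalg.matrix_rank(np.vstack([rows, e_i])) > np.linalg.matrix_rank(rows)

def best_safe_subset(Q, weights):
    """Maximum-weight subset that is safe w.r.t. every subject (brute force)."""
    m, n = Q.shape
    best, best_w = (), 0.0
    for k in range(m + 1):
        for K in combinations(range(m), k):
            if all(safe_wrt(Q, K, i) for i in range(n)):
                w = sum(weights[j] for j in K)
                if w > best_w:
                    best, best_w = K, w
    return best, best_w

Q = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
weights = [10.0, 5.0, 4.0]
print(best_safe_subset(Q, weights))  # ((0,), 10.0)
```

Here {1, 2} is also safe (weight 9), while {0, 1} and {0, 2} each expose one subject, so the single heaviest query wins.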

28 Examples of Safe Sets Four target queries:

29 Independent (Safe) Sets

30 Rank Evaluation

31 Approximate Solutions to OPT. Matroid intersection greedy (MIG) algorithm: start with the full index set M_1 = M. At iteration t+1, remove one index from M_t to create the set M_{t+1}, choosing the index j that minimizes the ratio w_j / Δf_j. Stop when M_t becomes a safe set.

32 More About MIG. The denominator Δf_j roughly counts how many additional matroids the set M_{t+1} will become safe in. In other words, the best index to remove is one whose weight is low and whose removal makes M_{t+1} safe for many matroids. MIG finishes in no more than m iterations, and each iteration can be done in O(m^3 n^2) operations.
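The removal loop can be sketched as follows. This is a hedged approximation, not the authors' exact algorithm: the slides do not give the formula for the denominator, so here it is assumed to be the number of additional subjects for which the set becomes safe after removing index j, with ties and zero gains broken by lowest weight. The names `safe_count` and `mig` and the test data are hypothetical.

```python
import numpy as np

def safe_count(Q, idx):
    """Number of subjects i whose value is not revealed by the queries in idx."""
    n = Q.shape[1]
    if not idx:
        return n
    rows = Q[sorted(idx)]
    base = np.linalg.matrix_rank(rows)
    eye = np.eye(n)
    return int(sum(np.linalg.matrix_rank(np.vstack([rows, eye[i]])) > base
                   for i in range(n)))

def mig(Q, weights):
    """Remove indexes one at a time until the remaining set is safe for all subjects."""
    n = Q.shape[1]
    current = set(range(Q.shape[0]))
    while safe_count(Q, current) < n:
        here = safe_count(Q, current)

        def ratio(j):
            # Assumed denominator: additional subjects made safe by removing j.
            gain = safe_count(Q, current - {j}) - here
            return weights[j] / gain if gain > 0 else float("inf")

        # Lowest ratio first; fall back to lowest weight when ratios tie.
        j_star = min(sorted(current), key=lambda j: (ratio(j), weights[j]))
        current.remove(j_star)
    return sorted(current)

Q = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
weights = [10.0, 5.0, 4.0]
print(mig(Q, weights))  # [0]
```

On this instance the first pass removes query 2 (ratio 4/2), the second removes query 1 (ratio 5/1), and the remaining set {0} is safe with weight 10.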

33 Approximation Error. K is the set of indexes removed by MIG, so M \ K is safe. Z is the optimal value of OPT. Nemhauser + Wolsey bounds on Z involve H(d), the harmonic number: H(d) = 1 + 1/2 + … + 1/d.

34 Example. K = {2, 3}, M \ K = {1, 4}; K* = {2, 3, 4}, W(K*) = 40. Bounds: 20 < Z < 40.4.

35 Phase 2: Additional Safe Answers. The set S is the chosen set of exact-answer queries. What to do about a query q in T \ S? Answer a query "close to" q. Order the queries in T \ S according to weight. For instance, if q is a sum query, answer a safe query with a smaller query size. Or answer the closest query to q that is a linear combination of the queries in S.
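The last option, the closest linear combination of queries in S, can be sketched as a least-squares projection. "Closest" is interpreted here in the Euclidean norm, which is an assumption; the slides do not say which distance is meant. Queries in S are stored as columns, and the function name `closest_combination` is made up.

```python
import numpy as np

def closest_combination(S, q):
    """Least-squares coefficients c and the query S @ c nearest to q."""
    coeffs, *_ = np.linalg.lstsq(S, q, rcond=None)
    return coeffs, S @ coeffs

a = np.array([2.0, 5.0, 8.0])          # confidential values
S = np.array([[1.0], [1.0], [1.0]])    # one exact-answer query: the total
q = np.array([1.0, 1.0, 0.0])          # unsafe target query
coeffs, q_close = closest_combination(S, q)

# The answer to q_close needs only the exact answers already released for S.
answer = coeffs @ (S.T @ a)
print(q_close, answer)  # roughly [0.667 0.667 0.667] and 10.0
```

Because the substitute query is a combination of already-answered queries, releasing its answer gives the user nothing new.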

36 Phase 3: Constrained Perturbation. Goal: answer all queries with perturbed data a + Δa, making sure that the answers are consistent with the exactly answered target queries. Two almost equivalent methods: perturb and project onto the query hyperplane, or perturb along the hyperplane direction.

37 Perturb & Project
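One way to read "perturb and project" is: draw unconstrained noise, then remove its component in the span of the exactly answered queries, so that every exact answer is unchanged. The sketch below follows that reading under assumptions: Gaussian noise, queries stored as the columns of Q, and a hypothetical function name `perturb_and_project`.

```python
import numpy as np

def perturb_and_project(a, Q, sigma, rng):
    """Perturb a, then project the noise so that Q.T @ (a + noise) == Q.T @ a."""
    noise = rng.normal(0.0, sigma, size=a.shape)
    # Q @ pinv(Q) is the orthogonal projector onto the column space of Q.
    noise_in_span = Q @ (np.linalg.pinv(Q) @ noise)
    return a + (noise - noise_in_span)

rng = np.random.default_rng(1)
a = np.array([2.0, 5.0, 8.0])
Q = np.array([[1.0], [1.0], [1.0]])   # one exactly answered query: the total
a_tilde = perturb_and_project(a, Q, sigma=1.0, rng=rng)

print(Q.T @ a, Q.T @ a_tilde)  # both equal 15 up to rounding
```

Using the pseudoinverse keeps the projection well defined even if some exact-answer queries are linearly dependent.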

38 Directional Perturbation
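A plausible reading of directional perturbation is to generate the noise directly inside the subspace on which all exact answers stay fixed, rather than projecting afterwards. The sketch below takes that interpretation as an assumption: it builds an orthonormal basis of the null space of Q^T from an SVD and samples Gaussian coordinates in that basis. Queries are the columns of Q; `null_space_basis` and `directional_perturbation` are hypothetical names.

```python
import numpy as np

def null_space_basis(Q, tol=1e-10):
    """Orthonormal basis (as columns) of {d : Q.T @ d = 0}."""
    _, s, vt = np.linalg.svd(Q.T)      # full SVD: vt is n x n
    rank = int(np.sum(s > tol))
    return vt[rank:].T                 # trailing right-singular vectors

def directional_perturbation(a, Q, sigma, rng):
    """Add noise that moves a only along directions preserving Q.T @ a."""
    B = null_space_basis(Q)
    z = rng.normal(0.0, sigma, size=B.shape[1])
    return a + B @ z

rng = np.random.default_rng(2)
a = np.array([2.0, 5.0, 8.0])
Q = np.array([[1.0], [1.0], [1.0]])
a_tilde = directional_perturbation(a, Q, sigma=1.0, rng=rng)
print(Q.T @ a_tilde)  # equals 15 up to rounding
```

With Gaussian noise, this and the projection approach produce perturbations with the same distribution, which is consistent with the two methods being described as almost equivalent.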

39 Extending Protection. What to do to provide interval protection? What to do to provide stochastic protection from exact answers and from the perturbation?

40 Program G3LP. Let Q be a matrix whose columns are the exact-answer queries. Consider the linear program G3LP, for i ∈ N:

41 Interval Disclosure. If z_i* = u_i* − l_i* is optimal to G3LP, then the user will know that a_i lies in the interval [l_i*, u_i*]. Interval disclosure occurs when the width z_i* falls below a threshold chosen by subject i.

42 Stochastic Disclosure. Let X_i be a random estimate of a_i, and let l and u be known bounds on a_i. Given an interval width and a probability level, both positive, a_i is protected if it cannot be randomly estimated within any interval of that width or smaller with that probability or higher.

43 Protection Against Stochastic Threat from Deterministic Answers. Before the perturbation phase, systematically remove queries from the exact-answer set until the following condition holds for all subjects:

44 Continued. The problem of deciding which queries to remove is also hard. A greedy heuristic gives bounds similar to those of Phase 1.

45 Stochastic Threat from Perturbation. Based on the perturbation, confidence intervals on a_i can be obtained from Chebyshev's inequality. The solution is to generate a sequence of i.i.d. perturbations until a safe one is found.
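Chebyshev's inequality, P(|X − μ| ≥ kσ) ≤ 1/k², gives a distribution-free confidence interval once the standard deviation of the perturbation is known. The sketch below assumes the perturbed value is an unbiased estimate of a_i with known standard deviation sigma; the function names are hypothetical, and the slides' criterion for a "safe" perturbation is not reproduced here.

```python
import math

def chebyshev_half_width(sigma, confidence):
    """Half-width h with P(|X - mu| < h) >= confidence, by Chebyshev."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Setting 1/k^2 = 1 - confidence gives k = 1/sqrt(1 - confidence); h = k * sigma.
    return sigma / math.sqrt(1.0 - confidence)

def chebyshev_interval(observed, sigma, confidence):
    """Interval around an observed perturbed value that covers a_i with the given confidence."""
    h = chebyshev_half_width(sigma, confidence)
    return observed - h, observed + h

print(chebyshev_half_width(2.0, 0.75))      # 4.0
print(chebyshev_interval(10.0, 2.0, 0.75))  # (6.0, 14.0)
```

A user can narrow this interval only by learning more about the noise distribution, which is why the bound is a conservative measure of the stochastic threat.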

46 Numerical Results. Results are very encouraging: large numbers of queries were answered exactly. Developing a test bank was difficult because of the problem of finding optimal solutions. A class of interesting problems was found for which those solutions are easily determined.

