
1 On the Approximability of Geometric and Geographic Generalization and the Min-Max Bin Covering Problem. Michael T. Goodrich, Dept. of Computer Science, joint with Wenliang Du, David Eppstein, and George Lueker

2 Motivation Privacy is a concern with respect to information in relational databases –rows are associated with people –columns are attributes K-anonymity –No query should reveal fewer than K individuals image source: http://neodv8.blogspot.com/2007/09/neutral-mask-masterclass.html

3 Generalization Replace specific attributes with more general ones, so no category has fewer than K members. source: ℓ-Diversity: Privacy Beyond k-Anonymity Ashwin Machanavajjhala Johannes Gehrke Daniel Kifer Muthuramakrishnan Venkitasubramaniam Department of Computer Science, Cornell University

4 Data Types Linear: Easy greedy algorithm is optimal Unordered: arbitrary groupings possible GPS coordinates: group using rectangles Zip codes: should use proximity, not text image source: http://eagereyes.org/Applications/ZIPScribbleMap.html

5 Previous Work [Samarati, Sweeney, 98] introduce the concept of k-anonymization and generalization to achieve it. [Meyerson, Williams, 04] show optimal generalization of unordered data is NP-hard, but their proof requires as many attributes as people. Similar proofs are due to [Aggarwal et al., 05] and [Byun et al., 07]. [Khanna, Muthukrishnan, Paterson, 98] study a rectangle tiling problem similar to GPS coordinate generalization, showing 5/4-approximations are not possible unless P=NP. Lots more work on k-anonymization and its variants…

6 Our Results Zip codes: has a 4-approximation, but no 4/3-approximation unless P=NP GPS coordinates: has a 5-approximation, but no 4/3-approximation unless P=NP Unordered: is NP-hard but has a PTAS. Also, this version of the problem gives rise to a new type of bin-packing problem.

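7 Min-Max Bin Covering Pack the items into bins so that every bin has level at least k (the min), while minimizing the level of the fullest bin (the max). image source: http://www.developerfusion.com/article/5540/bin-packing/4/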
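As a minimal sketch of this objective, assuming items are positive numbers and a packing is a list of bins (both representations, and the name `packing_cost`, are illustrative choices rather than anything from the slides), the cost of a candidate packing could be checked as follows:

```python
def packing_cost(bins, k):
    """Return the level of the fullest bin, or None if the packing is infeasible.

    A packing is feasible when it has at least one bin and every bin has
    level (sum of its item sizes) at least k.
    """
    levels = [sum(b) for b in bins]
    if not levels or min(levels) < k:
        return None          # some bin is not covered
    return max(levels)       # the quantity Min-Max Bin Covering minimizes
```

For example, with k = 5 the packing `[[3, 2], [4, 1, 1]]` is feasible with cost 6, while `[[3], [4, 1]]` is infeasible because the first bin only reaches level 3.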

8 Min-Max Bin Cover is NP-hard Reduction from: Reduction method:

9 A Next-Fit Method: “Fold” Theorem: There is a linear-time algorithm, A, guaranteeing [bound not preserved in this transcript]. Proof idea: Put items of size at least k into their own bins, and use Next Fit for remaining items. –all but the last bin have level at most 2k − 2, as each has level at most k − 1 before its last item, and that item has size at most k − 1. –There may be one leftover bin with level less than k, which must be merged with some other bin.
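The proof idea can be sketched in code. This is an illustrative reading of the slide, not the authors' implementation: the function name `fold` is hypothetical, sizes are assumed to be positive integers whose total is at least k, and merging the leftover bin into the lowest-level existing bin is an assumption (the slide only says "some other bin").

```python
def fold(sizes, k):
    """Next-Fit style heuristic following the proof idea (hypothetical helper).

    Assumes positive integer sizes with sum(sizes) >= k. Returns a list of bins.
    """
    # Items of size at least k each get their own bin.
    bins = [[s] for s in sizes if s >= k]

    # Next Fit on the remaining items: keep filling the open bin
    # and close it as soon as its level reaches k.
    current, level = [], 0
    for s in sizes:
        if s >= k:
            continue
        current.append(s)
        level += s
        if level >= k:
            bins.append(current)
            current, level = [], 0

    # A leftover bin with level < k must be merged with another bin.
    if current:
        if not bins:
            raise ValueError("total size is below k; no feasible packing")
        target = min(bins, key=sum)   # assumption: merge into the lowest bin
        target.extend(current)
    return bins
```

On `fold([4, 4, 3], 5)` the first two items close a bin at level 8, and the leftover item of size 3 is folded into it, giving a single bin of level 11.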

10 Our PTAS: “Spread” Theorem: For each fixed ε > 0, there is a polynomial time algorithm A that, given some instance X of Min-Max Bin Covering, finds a solution satisfying A(X) ≤ (1 + ε)(Opt(X) + 1). Note: Normalize so k=1 and note that if there is an item of size > 3, then the Next-Fit Theorem gives an optimal solution. We can assume, wlog, that the optimal solution has cost at most 3.

11 The Spread Algorithm Warm-up Call items of size < ε “small” and others “large” –Note that any solution will have at most 3n bins. For any packing P, let the type of P be a packing where we throw out all small items and round all large items down to the largest smaller value that is a product of ε and a power of (1+ε).
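The rounding step can be sketched as below, assuming sizes are already normalized so that k = 1 and that "smaller" is read as "less than or equal to". The function name `round_down` is hypothetical.

```python
import math


def round_down(size, eps):
    """Round a large item down to the largest eps * (1 + eps)**j not exceeding it."""
    if size < eps:
        raise ValueError("small items are discarded, not rounded")
    # Initial guess for the exponent j from the logarithm base (1 + eps).
    j = math.floor(math.log(size / eps, 1 + eps))
    # Correct the guess in case floating-point error put it off by one.
    while eps * (1 + eps) ** (j + 1) <= size:
        j += 1
    while eps * (1 + eps) ** j > size:
        j -= 1
    return eps * (1 + eps) ** j
```

With eps = 0.5 the allowed values are 0.5, 0.75, 1.125, and so on, so an item of size 1.0 rounds down to 0.75. Each rounded value is within a factor of (1 + eps) of the original.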

12 More Warm-up There are a constant number of rounded values, for fixed ε; hence, a constant number of configurations – ways of filling a bin to at most 3 with rounded values. Represent a type by counts of each configuration, so that there are a polynomial number of types (with at most 3n bins). [Figure: an example type, listing a bin count for each of a constant number of configurations.]
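To make the counting argument concrete, here is a small sketch that enumerates configurations as multisets of rounded values with total at most 3 (with k normalized to 1). The helper names and the decision to include the empty configuration, which would stand for a bin holding only small items, are assumptions for illustration.

```python
def rounded_values(eps, cap=3.0):
    """All values eps * (1 + eps)**j that do not exceed cap."""
    vals = []
    v = eps
    while v <= cap:
        vals.append(v)
        v *= 1 + eps
    return vals


def configurations(eps, cap=3.0):
    """Enumerate multisets of rounded values whose total is at most cap."""
    vals = rounded_values(eps, cap)
    result = []

    def extend(start, chosen, total):
        result.append(tuple(chosen))          # record the current multiset
        for i in range(start, len(vals)):     # only non-decreasing indices,
            if total + vals[i] <= cap:        # so each multiset appears once
                extend(i, chosen + [vals[i]], total + vals[i])

    extend(0, [], 0.0)
    return result
```

For eps = 1 the rounded values are 1 and 2, and the enumeration yields six configurations: (), (1), (1, 1), (1, 1, 1), (1, 2), and (2). The count depends only on eps, not on the number of items, although it grows quickly as eps shrinks.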

13 The Spread Algorithm For each type T: 1. Let T’ be the packing with rounded values replaced with the corresponding original (large) values. 2. Pack small values into T’ using the greedy method of choosing the bin with the lowest level. 3. Merge pairs of smallest bins until every bin has a level of at least 1. Pick the type that minimizes the size of the largest bin.
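Steps 2 and 3 for a single candidate packing T’ can be sketched with a min-heap keyed on bin level. The function name `complete_and_merge`, the list-of-lists representation, and the choice to merge the two lowest-level bins at each step are assumptions made for this example; the enumeration of types is not shown.

```python
import heapq


def complete_and_merge(large_bins, small_items, k=1.0):
    """Greedy completion (step 2) and merging (step 3) for one packing T'."""
    if not large_bins:
        raise ValueError("need at least one bin to pack into")
    # Heap entries are (level, index, items); the unique index breaks ties
    # so that the item lists themselves are never compared.
    heap = [(sum(b), i, list(b)) for i, b in enumerate(large_bins)]
    heapq.heapify(heap)

    # Step 2: put each small item into the bin with the lowest level.
    for s in small_items:
        level, i, items = heapq.heappop(heap)
        items.append(s)
        heapq.heappush(heap, (level + s, i, items))

    # Step 3: merge the two smallest bins until every bin reaches level k.
    while len(heap) > 1 and heap[0][0] < k:
        l1, i1, b1 = heapq.heappop(heap)
        l2, _, b2 = heapq.heappop(heap)
        heapq.heappush(heap, (l1 + l2, i1, b1 + b2))

    return [items for _, _, items in heap]
```

If the total size of all items is below k, the loop ends with a single uncovered bin, so a caller would still need to check feasibility. The full algorithm would run this once per type and keep the result whose largest bin is smallest.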

14 Why it Works The type for the optimal solution is considered by the Spread algorithm. The T’ in this instance has cost at most (1+ε) times the optimal cost. During the greedy completion, the maximum bin must be at most (1+ε)Opt + ε, for otherwise we would have used more than the original set of items. When we merge bins, we may merge one with level less than 1 with one of level (1+ε)Opt + ε; hence a max of (1+ε)Opt + 1 + ε.
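As a short worked step, the merge bound on this slide matches the guarantee stated in the PTAS theorem, since the two expressions are algebraically identical (here ε is the fixed accuracy parameter and k is normalized to 1):

```latex
\begin{align*}
A(X) &\le \underbrace{(1+\varepsilon)\,\mathrm{Opt}(X) + \varepsilon}_{\text{after greedy completion}}
        \;+\; \underbrace{1}_{\text{merged bin of level } < 1} \\
     &= (1+\varepsilon)\,\mathrm{Opt}(X) + (1+\varepsilon) \\
     &= (1+\varepsilon)\bigl(\mathrm{Opt}(X) + 1\bigr).
\end{align*}
```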

15 Experimental Results Apply to names in the U.S. Census data: –FEMALE-1990: Female first names and their frequencies, for names with frequency at least 0.001%. –MALE-1990: Male first names and their frequencies, for names with frequency at least 0.001%. –LAST-1990: Surnames and their frequencies, for surnames with frequency at least 0.001%.

16 Fold versus Spread Apply to random and sorted orders, since both algorithms consider items according to their input order. Test each algorithm for increasing k. At certain threshold levels of k, the number of bins is reduced, which causes some “jaggedness” in the results.

17 Female-1990

18 Male-1990

19 Last-1990

20 Zip Code Generalization is NP-Hard Formally, 3-Regular Planar Partition into Paths of Length 2 (3PPPL2): Given a 3-regular planar graph G, can G be partitioned into paths of length 2?

21 Proof Sketch Reduction from 3-Dimensional Matching: –Given triples (x,y,z) from sets X,Y,Z, find a set of triples such that each member of X, Y, and Z belongs to exactly one triple.
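For readers unfamiliar with the source problem, the following sketch checks whether a chosen set of triples is a valid 3-dimensional matching. It only verifies a candidate solution (finding one is the NP-hard part), and the function name and argument layout are illustrative assumptions.

```python
def is_3d_matching(X, Y, Z, triples, chosen):
    """Check that `chosen` is a subset of `triples` covering X, Y, Z exactly once."""
    if not set(chosen) <= set(triples):
        return False                      # only the given triples may be used
    for axis, universe in enumerate((X, Y, Z)):
        covered = [t[axis] for t in chosen]
        # Each element must appear exactly once: no repeats, nothing missing.
        if len(covered) != len(set(covered)) or set(covered) != set(universe):
            return False
    return True
```

For instance, with X = {1, 2}, Y = {'a', 'b'}, Z = {'p', 'q'}, the triples (1, 'a', 'p') and (2, 'b', 'q') form a matching, whereas (1, 'a', 'p') and (1, 'b', 'q') do not because 1 is used twice and 2 is missed.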

22 Proof Sketch Crossover gadget:

23 Proof Sketch Crossover gadget:

24 Additional Results A 4/3-approximation algorithm for planar graphs. NP-hardness and a 4/3-approximation algorithm for two-dimensional points.

25 Conclusion We have shown that generalization is NP-hard and in some cases cannot be arbitrarily approximated unless P=NP. We have given approximation algorithms for the versions we study: –unordered data –planar graphs (generalized into connected components) –two-dimensional points (generalized with rectangles)

