
The Impact of Duality on Data Representation Problems. Panagiotis Karras, HKU, June 14th, 2007.


1 The Impact of Duality on Data Representation Problems. Panagiotis Karras, HKU, June 14th, 2007

2 Introduction Many data representation problems require the optimization of one parameter under a bound on one or more others. Classical approaches treat them in a direct manner, producing complicated solutions, and sometimes resorting to heuristics. Parameters involved have a monotonic relationship. Hence, an alternative approach is possible, based on dual problems.

3 Outline Histograms. Restricted Haar Wavelet Synopses. Unrestricted Haar and Haar+ Synopses. l-Diversification in 1D. Compact Hierarchical Histograms.

4 Histograms Approximate a data set [d_1, d_2, …, d_n] with B buckets, s_i = [b_i, e_i, v_i], so that a maximum-error metric is minimized. Classical solutions: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005. Recent solutions: Buragohain et al. ICDE 2007; Guha and Shim TKDE 19(7) 2007 (linear for )

5 Histograms Solve the error-bounded problem. Example with maximum absolute error bound ε = 2: the data 4 5 6 2 15 17 3 6 9 12 … are covered by the buckets [ 4 ][ 16 ][ 4.5 ][… Generalized to any weighted maximum-error metric. Each value d_i defines a tolerance interval. A bucket is closed when the running intersection of the tolerance intervals becomes empty. Complexity:
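
The greedy pass described on slide 5 can be sketched as follows. This is a minimal illustration for the unweighted maximum absolute error only; the function name `error_bounded_histogram` and the choice of the interval midpoint as the bucket value are assumptions made for the example, not details taken from the slides.

```python
def error_bounded_histogram(data, eps):
    """Greedy sketch: cover `data` with as few buckets as possible so that
    every value lies within `eps` of its bucket's representative value.

    Returns a list of (begin_index, end_index, value) triples.
    """
    buckets = []
    if not data:
        return buckets
    begin = 0
    # Running intersection of the tolerance intervals [d - eps, d + eps].
    lo, hi = data[0] - eps, data[0] + eps
    for i in range(1, len(data)):
        new_lo = max(lo, data[i] - eps)
        new_hi = min(hi, data[i] + eps)
        if new_lo > new_hi:
            # Intersection became empty: close the current bucket.
            # The midpoint is one valid representative (assumed choice).
            buckets.append((begin, i - 1, (lo + hi) / 2))
            begin = i
            lo, hi = data[i] - eps, data[i] + eps
        else:
            lo, hi = new_lo, new_hi
    buckets.append((begin, len(data) - 1, (lo + hi) / 2))
    return buckets


example = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print(error_bounded_histogram(example, 2))
```

On the slide's sample data with ε = 2, the first three buckets take the values 4, 16 and 4.5, matching the buckets shown on the slide.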

6 Histograms Apply to the space-bounded problem. Perform binary search in the domain of the error bound ε. Complexity: For error values requiring space, with actual error, run an optimality test: Error-bounded algorithm running under constraint instead of If requires space, then optimal solution has been reached. Independent of the number of buckets B.
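
The dual approach on slide 6 relies on the monotonic relationship between the error bound and the number of buckets needed. The sketch below shows that idea for the maximum absolute error, under stated assumptions: it uses plain numeric bisection down to a tolerance `tol` rather than the optimality test mentioned on the slide (whose details are not preserved in this transcript), and the names `buckets_needed` and `min_error_for_budget` are hypothetical.

```python
def buckets_needed(data, eps):
    """Number of buckets the greedy error-bounded pass uses for bound eps."""
    if not data:
        return 0
    count = 1
    lo, hi = data[0] - eps, data[0] + eps
    for d in data[1:]:
        new_lo, new_hi = max(lo, d - eps), min(hi, d + eps)
        if new_lo > new_hi:
            # Tolerance intervals no longer intersect: start a new bucket.
            count += 1
            new_lo, new_hi = d - eps, d + eps
        lo, hi = new_lo, new_hi
    return count


def min_error_for_budget(data, budget, tol=1e-9):
    """Bisection sketch: smallest eps (up to tol) whose error-bounded
    solution fits in `budget` buckets. Assumes budget >= 1."""
    if not data:
        return 0.0
    lo, hi = 0.0, (max(data) - min(data)) / 2  # one bucket always suffices at hi
    if buckets_needed(data, lo) <= budget:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if buckets_needed(data, mid) <= budget:
            hi = mid   # feasible: try a tighter error bound
        else:
            lo = mid   # infeasible: the bound must grow
    return hi


example = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print(min_error_for_budget(example, 4))
```

Each probe is a single linear pass, so the cost of the search depends on the number of probes and on n, not on the bucket budget.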

7 Restricted Haar Wavelet Synopses Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized. Classical solutions: Garofalakis and Kumar PODS 2004; Guha VLDB 2005. [Figure: Haar wavelet decomposition tree of the example data 34, 16, 2, 20, 20, 0, 36, 16.]
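
To make the restricted synopsis setting concrete, the following sketch decomposes the example data of slide 7 (34, 16, 2, 20, 20, 0, 36, 16), keeps a chosen subset of coefficients, and measures the resulting maximum absolute error. The function names and the convention that a detail coefficient is (left - right) / 2 are assumptions for illustration; the sketch does not implement any of the cited selection algorithms.

```python
def haar_decompose(data):
    """Return [overall average] + detail coefficients, coarse to fine.
    len(data) must be a power of two."""
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level = [(a - b) / 2 for a, b in pairs]      # detail coefficients
        averages = [(a + b) / 2 for a, b in pairs]   # pairwise averages
        details = level + details                    # coarser levels go first
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        values = [v for avg, det in zip(values, level)
                  for v in (avg + det, avg - det)]
        pos += len(level)
    return values


def max_abs_error(data, coeffs, kept):
    """Maximum absolute error when only coefficient indices in `kept` are
    retained and all others are treated as zero."""
    synopsis = [c if i in kept else 0 for i, c in enumerate(coeffs)]
    approx = haar_reconstruct(synopsis)
    return max(abs(d - a) for d, a in zip(data, approx))


data = [34, 16, 2, 20, 20, 0, 36, 16]
coeffs = haar_decompose(data)
print(coeffs)
print(max_abs_error(data, coeffs, {0, 2, 3}))
```

A restricted synopsis algorithm searches over the possible `kept` sets of a given size; the sketch only evaluates one such set.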

8 Restricted Haar Wavelet Synopses Solve the error-bounded problem (Muthukrishnan FSTTCS 2005). Local search within each of subtrees in bottom Haar tree levels. Complexity: Apply to the space-bounded problem. Complexity: no significant advantage.

9 Unrestricted Haar and Haar+ Synopses Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized. Classical solutions: Guha and Harb KDD 2005, SODA 2006. Complexity (time, space): [Figure: Haar+ tree (Karras and Mamoulis ICDE 2007) with nodes c0 to c9 and C1 to C3 over data values d0 to d3.]

10 Unrestricted Haar and Haar+ Synopses Solve the error-bounded problem. Complexity: Apply to the space-bounded problem. Complexity (time and space, for unrestricted Haar and for Haar+): significant time and space advantage.

11 l-Diversification in 1D Given a database table T(A_1, A_2, …, A_n), a quasi-identifier attribute set Q_T is a subset of attributes which can reveal the personal identity of records. An equivalence class with respect to the quasi-identifier attribute set Q_T is a set of records indistinguishable in the projection of T on Q_T. A database table T with quasi-identifier set Q_T and sensitive attribute S conforms to the l-diversity property iff each equivalence class in T with respect to Q_T has at least l well-represented values of S [Machanavajjhala et al. ICDE 2006]. Utility metric: extent of the equivalence class (group). Other parameter: outliers, i.e. records whose quasi-identifier values are suppressed.
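
The definitions on slide 11 can be illustrated with a small check. The sketch assumes the simplest reading of "well-represented", namely at least l distinct sensitive values per equivalence class; the function name, the record layout (dictionaries) and the sample values are all hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Check distinct l-diversity: every equivalence class (records that
    agree on all quasi-identifier attributes) must contain at least `l`
    distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for rec in records:
        key = tuple(rec[attr] for attr in quasi_identifiers)
        classes[key].add(rec[sensitive])
    return all(len(values) >= l for values in classes.values())


# Hypothetical table with a single, already generalized quasi-identifier.
table = [
    {"age": "10-30", "disease": "Flu"},
    {"age": "10-30", "disease": "Lead Poisoning"},
    {"age": "10-30", "disease": "Hyperthyroidism"},
    {"age": "30-50", "disease": "Flu"},
    {"age": "30-50", "disease": "Flu"},
    {"age": "30-50", "disease": "Parkinson's"},
]
print(is_distinct_l_diverse(table, ["age"], "disease", 2))
print(is_distinct_l_diverse(table, ["age"], "disease", 3))
```

With one quasi-identifier, as in the 1D setting of the talk, each equivalence class corresponds to one group of records sharing the same generalized value.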

12 l-Diversification in 1D A two-dimensional example. [Figure: records plotted by Age (10 to 90) and Postcode (1 to 7), with sensitive values Lead Poisoning, Parkinson's, Flu, and Hyperthyroidism.]

13 l-Diversification in 1D Study the problem in one dimension (a single quasi-identifier). Total order exists. Similar to histogram construction. Polynomially tractable. [Figure: sensitive value plotted against the quasi-identifier.]

14 l-Diversification in 1D Groups are consecutive in each sensitive value domain. The group order is the same in each domain. Example for l = 3. [Figure: sensitive value against quasi-identifier, with domains D1 to D4 and records r1 to r6.]

15 l-Diversification in 1D Groups are consecutive in each sensitive value domain. The group order is the same in each domain. Example for l = 3. [Figure: sensitive value against quasi-identifier, with domains D1 to D4 and records r1 to r6.]

16 l-Diversification in 1D Given an interval I of extent E, which includes c items with m different sensitive values, the number of possible boundaries/groups in I is: [Figure: sensitive value against quasi-identifier, with extents e and E marked.]

17 l-Diversification in 1D Solve the outlier minimization problem. Complexity (time, space): Apply to the accuracy maximization problem. Complexity: Apply to the privacy maximization problem. Complexity (time):

18 Compact Hierarchical Histograms Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized. Heuristic solutions: Reiss et al. VLDB 2006. Complexity (time, space): "The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al. VLDB 2006] [Figure: CHH tree with nodes c0 to c6 over data values d0 to d3.]

19 Compact Hierarchical Histograms Solve the error-bounded problem. Next-to-bottom level case. [Figure: node c_i with children c_2i and c_2i+1.]

20 Compact Hierarchical Histograms Solve the error-bounded problem. General, recursive case. Complexity (time, space): Apply to the space-bounded problem. Complexity: Polynomially tractable.

21 Conclusions Offline data representation problems under constraints are more easily solvable through their counterparts that optimize another parameter. Dual-problem-based algorithms are simpler, more scalable, more elegant, and more memory-parsimonious than the direct ones. In the CHH case, the dual-problem-based algorithm achieves an optimal solution to the maximum-error longest-prefix-match CHH partitioning problem, which was considered intractable. Future work: assessment of the privacy and CHH solutions.

22 Related Work
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998
S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004
M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004
S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005
S. Guha and B. Harb. Wavelet synopses for data streams: Minimizing non-Euclidean error. KDD 2005
S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005
S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006
F. Reiss, M. Garofalakis, and J. M. Hellerstein. Compact histograms for hierarchical identifiers. VLDB 2006
A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ICDE 2006
P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007

23 Thank you! Questions?

