Dilys Thomas, PODS 2006. Achieving Anonymity via Clustering. G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu.

Presentation transcript:

Achieving Anonymity via Clustering
G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu
Dilys Thomas, PODS 2006

Talk outline
- k-Anonymity model
- Achieving Anonymity via Clustering
- r-Gather clustering
- Cellular clustering
- Future Work

Medical Records

SSN   Name    DOB        Race    Zip code   Disease
614   Sara    03/04/76   Cauc    94305      Flu
615   Joan    07/11/80   Cauc    94307      Cold
629   Kelly   05/09/55   Cauc    94301      Diabetes
710   Mike    11/23/62   Afr-A   94305      Flu
840   Carl    11/23/62   Afr-A   94059      Arthritis
780   Joe     01/07/50   Hisp    94042      Heart problem
619   Rob     04/08/43   Hisp    94042      Arthritis

SSN and Name are identifying attributes; Disease is the sensitive attribute.

De-identified Medical Records

DOB        Race    Zip code   Disease (sensitive)
03/04/76   Cauc    94305      Flu
07/11/80   Cauc    94307      Cold
05/09/55   Cauc    94301      Diabetes
11/23/62   Afr-A   94305      Flu
11/23/62   Afr-A   94059      Arthritis
01/07/50   Hisp    94042      Heart problem
04/08/43   Hisp    94042      Arthritis

k-Anonymity model

The quasi-identifiers (DOB, Race, Zip code) act as approximate foreign keys: together they can uniquely identify you!

DOB        Race    Zip code   Disease (sensitive)
03/04/76   Cauc    94305      Flu
07/11/80   Cauc    94307      Cold
05/09/55   Cauc    94301      Diabetes
12/30/72   Afr-A   94305      Flu
11/23/62   Afr-A   94059      Arthritis
01/07/50   Hisp    94042      Heart problem
04/08/43   Hisp    94042      Arthritis

k-Anonymity Model [Swe00]

- Suppress some entries of the quasi-identifiers so that each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers.
- Individual records are hidden in a crowd of size k.

Anonymized Table

DOB        Race    Zip code   Disease
*          Cauc    *          Flu
*          Cauc    *          Cold
*          Cauc    *          Diabetes
11/23/62   Afr-A   *          Flu
11/23/62   Afr-A   *          Arthritis
*          Hisp    94042      Heart problem
*          Hisp    94042      Arthritis
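To make the definition concrete, here is a minimal Python sketch (the helper name and table encoding are our own, not from the paper) that checks whether a table is k-anonymous with respect to a chosen set of quasi-identifiers; the anonymized table above passes it for k = 2.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k rows of the table."""
    # Count how many rows share each quasi-identifier combination.
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# The anonymized table from the slide, 2-anonymous w.r.t. (DOB, Race, Zip).
table = [
    {"DOB": "*", "Race": "Cauc", "Zip": "*", "Disease": "Flu"},
    {"DOB": "*", "Race": "Cauc", "Zip": "*", "Disease": "Cold"},
    {"DOB": "*", "Race": "Cauc", "Zip": "*", "Disease": "Diabetes"},
    {"DOB": "11/23/62", "Race": "Afr-A", "Zip": "*", "Disease": "Flu"},
    {"DOB": "11/23/62", "Race": "Afr-A", "Zip": "*", "Disease": "Arthritis"},
    {"DOB": "*", "Race": "Hisp", "Zip": "94042", "Disease": "Heart problem"},
    {"DOB": "*", "Race": "Hisp", "Zip": "94042", "Disease": "Arthritis"},
]
print(is_k_anonymous(table, ["DOB", "Race", "Zip"], k=2))  # True
```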

k-Anonymity Optimization

- Minimize the number of generalizations/suppressions needed to achieve k-anonymity.
- It is NP-hard to find the minimum number of suppressions/generalizations [MW04].
- O(k) approximation for k-anonymity [AFK+05].
- Ω(k) lower bound on the approximation ratio (under a graph representation assumption).

Talk outline
- k-Anonymity model
- Achieving Anonymity via Clustering
- r-Gather clustering
- Cellular clustering
- Future Work

Original Table

         Age   Salary
Amy      25    50
Brian    27    60
Carol    29    100
David    35    110
Evelyn   39    120

Anonymity with Suppression

         Age   Salary
Amy      *     *
Brian    *     *
Carol    *     *
David    *     *
Evelyn   *     *

All attributes are suppressed.

Original Table (repeated for comparison)

         Age   Salary
Amy      25    50
Brian    27    60
Carol    29    100
David    35    110
Evelyn   39    120

Anonymity with Generalization

The Age and Salary values for Amy, Brian, Carol, David and Evelyn are each replaced by one of a set of pre-specified ranges (generalization allows only pre-specified ranges).
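As a small illustration of generalization with pre-specified ranges (the ranges below are made up for the example, they are not the slide's), a value is mapped to the range that contains it, or suppressed if none does:

```python
def generalize(value, ranges):
    """Map a numeric value to the pre-specified range that contains it."""
    for lo, hi in ranges:
        if lo <= value <= hi:
            return f"[{lo}-{hi}]"
    return "*"  # suppress if no pre-specified range matches

age_ranges = [(21, 30), (31, 40)]        # illustrative pre-specified ranges
salary_ranges = [(50, 100), (101, 150)]
print(generalize(25, age_ranges), generalize(110, salary_ranges))  # [21-30] [101-150]
```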

Original Table (repeated for comparison)

         Age   Salary
Amy      25    50
Brian    27    60
Carol    29    100
David    35    110
Evelyn   39    120

Anonymity with Clustering

         Age       Salary
Amy      [25-29]   [50-100]
Brian    [25-29]   [50-100]
Carol    [25-29]   [50-100]
David    [35-39]   [110-120]
Evelyn   [35-39]   [110-120]

Cluster centers are published:
  27 = (25+27+29)/3,   70 = (50+60+100)/3
  37 = (35+39)/2,      115 = (110+120)/2

Advantages of Clustering

- Clustering reduces the amount of distortion introduced, compared to suppressions/generalizations.
- Clustering allows constant-factor approximation algorithms.

Quasi-Identifiers form a Metric Space

Convert quasi-identifiers into points in a metric space, with a distance function D on points:
- D(X,X) = 0                      (reflexive)
- D(X,Y) = D(Y,X)                 (symmetric)
- D(X,Z) <= D(X,Y) + D(Y,Z)       (triangle inequality)

Metric Space

Converting (gender, zip code, DOB) into points in a metric space is not easy. Define a distance function on each attribute, e.g. on Zip code:
- D(Zip1, Zip2) = physical distance between the locations Zip1 and Zip2.
Then weight the attributes; the weighted sum of the attribute distances gives the metric.
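A minimal sketch of this weighted-sum construction in Python; the attribute set, the weights, and the crude per-attribute distances are illustrative assumptions, not the paper's choices:

```python
def zip_distance(zip1, zip2):
    """Illustrative per-attribute distance: ideally the physical distance
    between the two zip-code locations; here a crude numeric difference."""
    return abs(int(zip1) - int(zip2))

def age_distance(age1, age2):
    return abs(age1 - age2)

def record_distance(r1, r2, weights):
    """Weighted sum of per-attribute distances between two records."""
    per_attribute = {
        "zip": zip_distance(r1["zip"], r2["zip"]),
        "age": age_distance(r1["age"], r2["age"]),
    }
    return sum(weights[a] * d for a, d in per_attribute.items())

# Example usage with made-up weights.
w = {"zip": 0.01, "age": 1.0}
print(record_distance({"zip": "94305", "age": 25},
                      {"zip": "94042", "age": 39}, w))
```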

Clustering for Anonymity

- Cluster the quasi-identifiers so that each cluster has at least r members.
- For each cluster, publish the cluster center together with the number of points and the radius (see the sketch below).
- Tight clusters → the data stays useful for mining.
- A large number of points per cluster → anonymity.
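For example, given a cluster assignment, the published view of each cluster could be (center, number of points, radius), along the lines of the following sketch; the function name and the use of the attribute-wise mean as the center follow the Age/Salary example above, and this is our illustration rather than the paper's procedure:

```python
def publish_clusters(points, assignment):
    """points: list of numeric tuples; assignment: cluster id per point.
    Returns, for each cluster, its center (attribute-wise mean), the number
    of points, and the radius (max Euclidean distance from the center)."""
    clusters = {}
    for p, c in zip(points, assignment):
        clusters.setdefault(c, []).append(p)
    published = []
    for members in clusters.values():
        n = len(members)
        center = tuple(sum(v[i] for v in members) / n for i in range(len(members[0])))
        radius = max(sum((v[i] - center[i]) ** 2 for i in range(len(center))) ** 0.5
                     for v in members)
        published.append({"center": center, "count": n, "radius": radius})
    return published

# The Age/Salary table from the slides, clustered into {Amy, Brian, Carol}
# and {David, Evelyn}: the centers come out to (27, 70) and (37, 115).
pts = [(25, 50), (27, 60), (29, 100), (35, 110), (39, 120)]
print(publish_clusters(pts, [0, 0, 0, 1, 1]))
```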

Quasi-identifiers: Metric Space

Assume further that the distance metric has already been defined on the quasi-identifiers.

Talk outline
- k-Anonymity model
- Achieving Anonymity via Clustering
- r-Gather clustering
- Cellular clustering
- Future Work

r-Gather Clustering

Figure: three clusters, one with 10 points and radius 5, one with 20 points, and one with radius 20.
Objective: minimize the maximum radius, here 20.

Results

- A 2-approximation for minimizing the maximum radius subject to the cluster-size constraint.
- A matching lower bound of 2 for maximum-radius minimization.
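To make the r-gather objective concrete, here is a small helper (our own illustrative code, not the paper's 2-approximation algorithm) that validates the minimum cluster size and evaluates the maximum radius of a given clustering:

```python
def rgather_objective(points, assignment, centers, r, dist):
    """Check that every cluster has at least r points and return the
    maximum cluster radius (max distance of a point to its cluster center).
    Returns None if some cluster violates the size constraint."""
    clusters = {}
    for p, c in zip(points, assignment):
        clusters.setdefault(c, []).append(p)
    if any(len(members) < r for members in clusters.values()):
        return None  # not a feasible r-gather clustering
    return max(dist(p, centers[c]) for p, c in zip(points, assignment))

# Toy usage with 1-D points and absolute difference as the metric.
pts = [1, 2, 3, 10, 11, 12, 13]
assign = [0, 0, 0, 1, 1, 1, 1]
cents = {0: 2, 1: 11.5}
print(rgather_objective(pts, assign, cents, r=3, dist=lambda a, b: abs(a - b)))  # 1.5
```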

r-Gather Clustering (figure: the algorithm's clusters have radius at most 2d, twice the optimal radius)

Lower Bound: Reduction from 3-SAT

Figure: literal points X1^T, X1^F, X2^T, X2^F for the variables, groups of r-2 extra points, and a point for each clause, e.g. C1 on the literals X1 and X2.
The construction has an r-gather clustering of radius 1 iff the formula is satisfiable; otherwise the radius is ≥ 2.

Talk outline
- k-Anonymity model
- Achieving Anonymity via Clustering
- r-Gather clustering
- Cellular clustering
- Future Work

Cellular Clustering

Same three clusters as before: 10 points with radius 5, 20 points, and a third cluster with radius 20.

Cellular Clustering Metric

Cellular clustering cost = sum over clusters of (number of points in the cluster) * (cluster radius).
For the figure's clusters: 10*5 + 20*(radius of the second cluster) + (size of the third cluster)*20 = 1250.
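A tiny Python helper (ours, for illustration) that evaluates this cost for a list of (cluster size, cluster radius) pairs; the optional per-cluster facility cost anticipates the f_c term in the LP on the later slides:

```python
def cellular_cost(clusters, facility_cost=0.0):
    """clusters: list of (num_points, radius) pairs.
    Cost = sum of num_points * radius over clusters,
    plus an optional fixed facility cost per opened cluster."""
    return sum(n * r + facility_cost for n, r in clusters)

# Evaluates to the figure's total of 1250 (with no facility cost); the sizes
# and radii of the second and third clusters are placeholders consistent
# with that total, not values taken from the slide.
print(cellular_cost([(10, 5), (20, 10), (50, 20)]))  # 1250
```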

Cellular Clustering

- A primal-dual 4-approximation algorithm for cellular clustering.
- A constant-factor approximation with a minimum cluster-size constraint: each cluster has at least r points.

Cellular Clustering: Linear Program

Minimize   Σ_c ( Σ_i x_ic d_c + f_c y_c )      (cellular cost plus facility cost)

subject to:
  Σ_c x_ic ≥ 1    for every point i     (each point belongs to a cluster)
  x_ic ≤ y_c      for every i, c        (a cluster must be opened for a point to belong to it)
  0 ≤ x_ic ≤ 1                          (points may belong to clusters fractionally)
  0 ≤ y_c ≤ 1                           (clusters may be opened fractionally)
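Below is a hedged sketch of this LP relaxation using the PuLP modeling library; the toy instance (points, candidate clusters, and the d_c and f_c values) is invented for illustration, and the model mirrors the slide's formulation rather than the paper's exact implementation:

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

# Toy instance: 4 points, 2 candidate clusters with radii d_c and facility costs f_c.
points = [0, 1, 2, 3]
clusters = [0, 1]
d = {0: 2.0, 1: 3.0}   # per-point cellular cost (radius) of each candidate cluster
f = {0: 5.0, 1: 4.0}   # facility (opening) cost of each candidate cluster

prob = LpProblem("cellular_clustering_lp", LpMinimize)
x = {(i, c): LpVariable(f"x_{i}_{c}", lowBound=0, upBound=1) for i in points for c in clusters}
y = {c: LpVariable(f"y_{c}", lowBound=0, upBound=1) for c in clusters}

# Objective: cellular cost + facility cost.
prob += (lpSum(x[i, c] * d[c] for i in points for c in clusters)
         + lpSum(f[c] * y[c] for c in clusters))

# Each point must (fractionally) belong to at least one cluster.
for i in points:
    prob += lpSum(x[i, c] for c in clusters) >= 1

# A cluster must be opened at least as much as any point uses it.
for i in points:
    for c in clusters:
        prob += x[i, c] <= y[c]

prob.solve()
print(value(prob.objective), {c: value(y[c]) for c in clusters})
```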

Dual Program

Maximize   Σ_i α_i

subject to:
  Σ_i β_ic ≤ f_c        (1)
  α_i - β_ic ≤ d_c      (2)
  α_i ≥ 0,  β_ic ≥ 0

Overview of the algorithm: first grow the α_i while keeping β_ic = 0 until constraint (2) becomes tight, then grow β_ic at the same rate until constraint (1) becomes tight.

Future Work

- Improve the approximation ratio for cellular clustering.
- Improve the running time: currently r-gather takes O(n^2) time, while cellular clustering is a linear program over n^2 variables. Linear or even sub-linear time algorithms are desirable.
- Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.

THANK YOU! QUESTIONS?