1 Database Laboratory Regular Seminar 2013-08-05 TaeHoon Kim

2 /21 Contents
1. Introduction
2. Related Work
3. Problem Statement
4. Distributed Anonymization
5. R-Tree Generalization
6. Performance Analysis
7. Conclusion

3 /21 1. Introduction
- Cloud computing is a long-dreamed vision of computing: cloud consumers can remotely store their data in the cloud to enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources.
- Successful third-party cases on EC2:
  - Nimbus Health [2]: manages patient medical records.
  - ShareThis [3]: a social content-sharing network that has shared 340 million items across 30,000 web sites.

4 /21 1. Introduction
- Vulnerable data privacy
  - Unfortunately, such data sharing is subject to constraints imposed by the privacy of individuals.
  - Consistent with related work on cloud security [4][6][7][8], researchers have shown that attackers could effectively target and observe information in third-party clouds [9].
- To protect data privacy, the sensitive information of individuals should be preserved.
  - Partition-based privacy-preserving data publishing techniques: k-anonymity, (a,k)-anonymity, l-diversity, t-closeness, m-invariance, etc.

5 /21 1. Introduction
- Privacy-preserving data publishing for a single dataset has been extensively studied: generalization, suppression, perturbation.
- Xiong et al. [5]: data anonymization for horizontally partitioned datasets.
  - A distributed anonymization protocol, but it only gave a uniform approach that exerts the same level of protection for all data providers.
- How to design a new distributed anonymization protocol over cloud servers?
  - We propose a new distributed anonymization protocol.
  - We design an algorithm which inserts data objects into an R-tree for anonymization on top of the k-anonymity and l-diversity principles.

6 /21 2. Related Work
- Privacy-preserving data publishing
  - k-anonymity [11], (a,k)-anonymity [12], l-diversity [13], t-closeness [30], and m-invariance [14] each define a criterion for judging whether a published dataset provides a certain level of privacy preservation.
  - In this study, our distributed anonymization protocol is built on top of the k-anonymity and l-diversity principles.
  - We propose a new anonymization algorithm that inserts all data objects into an R-tree to achieve high-quality generalization.

7 /21 2. Related Work
- Distributed anonymization solutions
  - Naïve solution: each data provider implements data anonymization independently. Since the data is anonymized before integration, the main drawback of this solution is low data utility.
  - Another solution assumes the existence of a third party that can be trusted by all data providers. A trusted third party is not always feasible: compromise of the server by attackers could lead to a complete privacy loss for all participating parties and data subjects.

8 /21 2. Related Work
- Jiang et al. [26] presented a two-party framework along with an application.
- Zhong et al. [27] proposed provably private solutions without disclosing data from one site to the other.
- Xiong et al. [25] presented a distributed anonymization protocol.
- In contrast to the above work, our work is aimed at outsourcing data providers' private datasets to cloud servers for data sharing.

9 /21 3. Problem Statement
- The union of all local databases is denoted as the microdata set D, as given in Definition 1 (formalized below).
- Each site produces a local anonymized database d_i* that meets its own privacy principle k_i, since data providers have different privacy requirements for publishing.
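As a minimal restatement of the setting (the exact wording of Definition 1 is not reproduced on the slide, and the union form of the published database D* is an assumption), the microdata set is the union of the n local databases, and each published table must satisfy its owner's anonymity parameter:

```latex
D = \bigcup_{i=1}^{n} d_i, \qquad
d_i^{*} \text{ is } k_i\text{-anonymous for each site } i, \qquad
D^{*} = \bigcup_{i=1}^{n} d_i^{*}
```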

10 /21 3. Problem Statement
- Each site produces a local anonymized database d_i*.
- (Figure: Node 1, Node 2, Node 3.)

11 /21 3. Problem Statement (Goal)
- Privacy for data objects based on anonymity (see the sketch below)
  - k-anonymity [11][19]: a set of at least k records must be indistinguishable from each other based on the quasi-identifier (QI) attributes.
  - l-diversity [13]: each equivalence class contains at least l diverse sensitive values.
- Privacy between data providers
  - Our second privacy goal is to avoid attacks between data providers: each individual dataset reveals nothing about its data to the other data providers apart from the virtual anonymized database.
  - We use the distributed anonymization algorithm to build a virtual k-anonymous database and ensure that each locally anonymized table d_i* is k_i-anonymous, using an R-tree.
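As an illustrative sketch (not the paper's code), the two anonymity goals can be checked per equivalence class; an equivalence class is represented simply as the list of sensitive values of the tuples sharing the same generalized QI values, which is a hypothetical representation:

```java
import java.util.*;

// Minimal sketch: verify k-anonymity and l-diversity over equivalence classes.
public class AnonymityCheck {

    // k-anonymity: every equivalence class contains at least k records.
    static boolean isKAnonymous(List<List<String>> classes, int k) {
        for (List<String> eq : classes) {
            if (eq.size() < k) return false;
        }
        return true;
    }

    // l-diversity (simple distinct-values form): every equivalence class
    // contains at least l distinct sensitive values.
    static boolean isLDiverse(List<List<String>> classes, int l) {
        for (List<String> eq : classes) {
            if (new HashSet<String>(eq).size() < l) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> classes = Arrays.asList(
            Arrays.asList("flu", "cancer", "flu"),      // 3 records, 2 distinct values
            Arrays.asList("asthma", "flu", "diabetes")  // 3 records, 3 distinct values
        );
        System.out.println("2-anonymous: " + isKAnonymous(classes, 2)); // true
        System.out.println("2-diverse:   " + isLDiverse(classes, 2));   // true
    }
}
```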

12 /21 4. Distributed Anonymization
- Protocol
  - The main idea of the distributed anonymization protocol is to use secure multi-server computation protocols to realize the R-tree generalization method in the cloud setting.
- Notation (see the sketch below)
  - I : d-dimensional rectangle which is the bounding box of the QI group's QI values.
  - Num : the total number of data objects in the equivalence class.
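A minimal sketch of the per-equivalence-class summary implied by the notation above; the slide only fixes the meaning of I and Num, so the class and field names here are assumptions:

```java
// Sketch of a per-QI-group summary (field names assumed, not from the paper).
// low/high : corners of the d-dimensional bounding rectangle I of the group's QI values
// num      : total number of data objects in the equivalence class
public class QIGroupSummary {
    final double[] low;
    final double[] high;
    final int num;

    QIGroupSummary(double[] low, double[] high, int num) {
        this.low = low.clone();
        this.high = high.clone();
        this.num = num;
    }

    public static void main(String[] args) {
        // Example: the Node0 QI group from the next slide, [11-13] x [5200-5300]; the count 4 is made up.
        QIGroupSummary g = new QIGroupSummary(new double[]{11, 5200}, new double[]{13, 5300}, 4);
        System.out.println("group of " + g.num + " tuples, I = ["
            + g.low[0] + "-" + g.high[0] + "][" + g.low[1] + "-" + g.high[1] + "]");
    }
}
```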

13 /21 4. Distributed Anonymization
- Example of generalization (see the sketch below)
  - Equivalence class (QI group) of Node0: from [11-13][5200-5300] to [11-30][5200-5300].
  - Equivalence class (QI group) of Node1: from [73-80][5200-5300] to [65-80][5400-5500].
  - Equivalence class (QI group) of Node2: from [65-76][5200-5300] to [65-80][5400-5500].
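A sketch of the generalization step on the first example: enlarging an equivalence class's bounding box I so that it also covers another box. The second box used here is an assumption chosen so that the result matches the Node0 interval above:

```java
import java.util.Arrays;

// Sketch: generalization as bounding-box enlargement (union of two d-dimensional rectangles).
public class Generalize {

    // Returns the smallest rectangle covering both inputs; a rectangle is a {low[], high[]} pair.
    static double[][] union(double[][] a, double[][] b) {
        int d = a[0].length;
        double[] low = new double[d], high = new double[d];
        for (int i = 0; i < d; i++) {
            low[i] = Math.min(a[0][i], b[0][i]);
            high[i] = Math.max(a[1][i], b[1][i]);
        }
        return new double[][]{low, high};
    }

    public static void main(String[] args) {
        // Node0's class [11-13][5200-5300] enlarged to cover [14-30][5200-5300] (assumed second box)
        // yields [11-30][5200-5300], matching the first generalization on the slide.
        double[][] node0 = {{11, 5200}, {13, 5300}};
        double[][] other = {{14, 5200}, {30, 5300}};
        System.out.println(Arrays.deepToString(union(node0, other)));
        // prints [[11.0, 5200.0], [30.0, 5300.0]]
    }
}
```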

14 /21 4. Distributed Anonymization
- Example of the split process
  - When e3 is inserted, the R-tree node splits into two groups, with e1 and e3 in one group.
  - When e4 comes, e1 and e3 are kept in one group and e2 and e4 are put in the other.
  - At last, when e5 comes, e2 and e4 form one group and e5 the other.

15 /21 5. R-Tree Generalization
- Index structure (see the sketch below)
  - Leaf node entry (I, SI)
    - I : d-dimensional rectangle which is the bounding box of the QI group's QI values.
    - SI : sensitive information for a tuple.
  - Non-leaf node entry (I, childPointer)
    - I : covers all rectangles in the lower node's entries.
    - childPointer : the address of a lower node in the R-tree.
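A minimal Java sketch of the two entry types described above; the class names and the use of an object reference as the child pointer are illustrative, not taken from the paper:

```java
// Sketch of the R-tree entry layout used for generalization (names are illustrative).
public class RTreeEntries {

    // d-dimensional bounding rectangle I, shared by both entry types.
    static class Rect {
        final double[] low, high;
        Rect(double[] low, double[] high) { this.low = low; this.high = high; }
    }

    // Leaf entry (I, SI): I bounds the QI values, SI is the tuple's sensitive information.
    static class LeafEntry {
        final Rect I;
        final String SI;
        LeafEntry(Rect I, String SI) { this.I = I; this.SI = SI; }
    }

    // Non-leaf entry (I, childPointer): I covers all rectangles in the child node's entries.
    static class NonLeafEntry {
        final Rect I;
        final Object childPointer;  // reference to a lower node in the R-tree
        NonLeafEntry(Rect I, Object childPointer) { this.I = I; this.childPointer = childPointer; }
    }

    public static void main(String[] args) {
        LeafEntry leaf = new LeafEntry(new Rect(new double[]{11, 5200}, new double[]{13, 5300}), "flu");
        NonLeafEntry inner = new NonLeafEntry(leaf.I, leaf);
        System.out.println("leaf SI = " + leaf.SI + ", has child = " + (inner.childPointer != null));
    }
}
```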

16 /21 5. R-Tree Generalization
- Insertion (see the sketch below)
  - At the root level, the algorithm chooses the entry whose rectangle needs the least area enlargement to cover the new object a; here R1 is selected, since its rectangle does not need to be enlarged, while the rectangle of R2 would need to expand considerably.
- Node splitting (when a leaf node overflows)
  - Pick two seeds: the pair of entries that would require the largest area enlargement when covered by a single rectangle.
  - The remaining entries are then assigned, one at a time, to one of the two groups.
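A sketch of the two decisions described above for 2-D axis-aligned rectangles: choosing the subtree whose rectangle needs the least area enlargement, and picking the two split seeds whose combined bounding box wastes the most area (a simplified reading of the classic quadratic split; all helper names are assumptions):

```java
import java.util.*;

// Sketch: R-tree insertion heuristics (least-enlargement subtree choice and split-seed picking).
// A rectangle is double[4] = {xlow, ylow, xhigh, yhigh}; 2-D only, for brevity.
public class RTreeHeuristics {

    static double area(double[] r) { return (r[2] - r[0]) * (r[3] - r[1]); }

    // Area of the smallest rectangle covering both a and b.
    static double unionArea(double[] a, double[] b) {
        double xl = Math.min(a[0], b[0]), yl = Math.min(a[1], b[1]);
        double xh = Math.max(a[2], b[2]), yh = Math.max(a[3], b[3]);
        return (xh - xl) * (yh - yl);
    }

    // Choose the entry whose rectangle needs the least area enlargement to cover newRect.
    static int chooseSubtree(List<double[]> entries, double[] newRect) {
        int best = -1;
        double bestEnlargement = Double.MAX_VALUE;
        for (int i = 0; i < entries.size(); i++) {
            double enlargement = unionArea(entries.get(i), newRect) - area(entries.get(i));
            if (enlargement < bestEnlargement) { bestEnlargement = enlargement; best = i; }
        }
        return best;
    }

    // Pick the pair of entries whose covering rectangle wastes the most area as split seeds.
    static int[] pickSeeds(List<double[]> entries) {
        int[] seeds = {0, 1};
        double worst = -1;
        for (int i = 0; i < entries.size(); i++)
            for (int j = i + 1; j < entries.size(); j++) {
                double waste = unionArea(entries.get(i), entries.get(j))
                             - area(entries.get(i)) - area(entries.get(j));
                if (waste > worst) { worst = waste; seeds = new int[]{i, j}; }
            }
        return seeds;
    }

    public static void main(String[] args) {
        List<double[]> entries = Arrays.asList(
            new double[]{0, 0, 10, 10},    // R1
            new double[]{50, 50, 60, 60}); // R2
        double[] a = {2, 2, 4, 4};         // new object a, already inside R1
        System.out.println("chosen subtree index: " + chooseSubtree(entries, a));   // 0 (R1)
        System.out.println("split seeds: " + Arrays.toString(pickSeeds(entries)));  // [0, 1]
    }
}
```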

17 /21 6. Performance Analysis
- Experimental environment
  - Amazon's EC2 platform.
  - Implemented in Java 1.6.0_13 and run on a set of EC2 computing units.
  - Each computing unit is a small EC2 instance with a 1.7 GHz Xeon processor, 1.7 GB of memory, and a 160 GB hard disk.
  - Computing units are connected via 250 Mbps network links.
- We use three different datasets with uniform, Gaussian, and Zipf distributions to evaluate our distributed anonymization scheme.

18 /21 6. Performance Analysis
- Dataset and setup
  - All 100K tuples are located in one centralized database.
  - The data are distributed among the 10 nodes, and we use the distributed anonymization approach presented in Section 4.
  - The R-tree generalization algorithm was used to generalize the database to be k-anonymous.
- DM (discernibility metric) assigns each tuple r_i* in D* a penalty which is determined by the size of the equivalence class containing it (see the sketch below).
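A sketch of the discernibility metric as described above: each tuple is charged the size of its equivalence class, so DM(D*) = sum over classes E of |E|^2. This is the standard formulation of DM and is assumed to match the paper's usage:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: discernibility metric (DM). Each tuple r_i* in D* is penalized by the size
// of the equivalence class containing it, i.e. DM = sum over classes E of |E|^2.
public class DiscernibilityMetric {

    static long dm(List<Integer> classSizes) {
        long total = 0;
        for (int size : classSizes) {
            total += (long) size * size;  // each of the |E| tuples pays a penalty of |E|
        }
        return total;
    }

    public static void main(String[] args) {
        // Example: three equivalence classes of sizes 3, 5, and 2 -> DM = 9 + 25 + 4 = 38.
        System.out.println("DM = " + dm(Arrays.asList(3, 5, 2)));
    }
}
```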

19 /21 6. Performance Analysis
- Absolute error = | actual - estimate | (see the example below)
  - Actual is the correct answer count of the range query.
  - Estimate is the size of the candidate set computed from the anonymized table.
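A one-line illustration of the error measure above; the query counts used here are made up:

```java
// Sketch: absolute error of a range query over the anonymized table.
public class AbsoluteError {
    static long absoluteError(long actual, long estimate) {
        return Math.abs(actual - estimate);
    }

    public static void main(String[] args) {
        long actual = 120;    // correct answer count on the original data (hypothetical)
        long estimate = 150;  // candidate-set size computed from the anonymized table (hypothetical)
        System.out.println("absolute error = " + absoluteError(actual, estimate)); // 30
    }
}
```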

20 /21 Conclusion
- Two directions have been presented:
  - A distributed anonymization protocol for privacy-preserving data publishing from multiple data providers in a cloud system.
  - A new anonymization algorithm using an R-tree index structure.
- Future work
  - Developing a protocol toolkit incorporating more privacy principles, such as differential privacy.
  - Building indexes on top of anonymized cloud data to offer more efficient and reliable data analysis.

21 /21 Q/A
- Thank you for listening to my presentation.

22 /21 References
- http://www.cs.cmu.edu/~jblocki/Slides/K-Anonymity.pdf

23 /21 Differential privacy aims to provide a means to maximize the accuracy of queries over statistical databases while minimizing the chance of identifying individual records.

