Privacy Issues in Graph Data Publishing Summer intern: Qing Zhang (from NC State University) Mentors: Graham Cormode and Divesh Srivastava.


Outline
- Privacy in graph data publishing
- Applying existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Microdata publishing
Data publishing:
- Macrodata: pre-aggregated statistics (N.R. Adam et al., ACM Computing Surveys, 1989)
- Microdata: individual records
Concerns in microdata release:
- Privacy of individual tuples: privacy of atomic values (e.g. SSN); association between a tuple's attributes
- Accuracy of aggregate query answering

Graph data
- Relationships among entities; no sensitive attributes
- The private information is the association itself
- Many graphs of interest are sparse
Examples:
- General graph: social network, etc. (who talks to whom)
- Bipartite graph (focus of our work): customer shopping records, etc. (who bought what)

Example Graph Data

Author table:
Author ID | Name
A1 | Andy
A2 | Bob
A3 | Cathy

Paper table:
Paper ID | Title | Conference | Year
P1 | A | SIGMOD | 2006
P2 | B | SIGMOD | 2007
P3 | C | VLDB | 2007
P4 | D | ICDE | 2007

(author, paper) association:
(A1, P1), (A1, P2), (A2, P2), (A2, P3), (A2, P4), (A3, P1), (A3, P4)

[Diagram: the corresponding bipartite graph between authors and papers]
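The example association can be loaded into adjacency maps to read off the node degrees on each side. A minimal sketch (variable names are illustrative, not from the project's code):

```python
from collections import defaultdict

# Toy (author, paper) association from the slide's example tables.
edges = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"),
         ("A2", "P3"), ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

author_papers = defaultdict(set)
paper_authors = defaultdict(set)
for a, p in edges:
    author_papers[a].add(p)
    paper_authors[p].add(a)

# 1st-order statistics: the degree of each node on either side.
author_degree = {a: len(ps) for a, ps in author_papers.items()}
paper_degree = {p: len(au) for p, au in paper_authors.items()}
```

Here A2 has degree 3 (papers P2, P3, P4) and P2 has degree 2 (authors A1, A2), matching the tables above.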

Privacy-preserving microdata sharing
Current status: focus on protecting the association between quasi-identifiers and a single sensitive attribute (disease, salary, etc.)
Related work:
- k-anonymity (Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems '02)
- l-diversity (A. Machanavajjhala et al., ICDE '06)
- t-closeness (N. Li et al., ICDE '07)
- (k,e)-anonymity (Q. Zhang, N. Koudas, D. Srivastava, T. Yu, ICDE '07)

Related work
- Attacking anonymized social networks (L. Backstrom et al., WWW '07): active attack inserts nodes/links; passive attack colludes and observes the graph
- Privacy risks of public mentions (D. Frankowski et al., SIGIR '07): link the movie rating and movie review databases
- How to break anonymity of the Netflix Prize dataset (A. Narayanan et al., UT Austin): attack through background information
- How to assemble pieces of a graph privately (K. Frikken et al., WPES '06): distributed graph construction via multi-party computation

Focus of our work
- Privacy protection in bipartite graphs: protect individual link information across the two parties, e.g. the (author, paper) association
- Maintain aggregate graph statistics, e.g. average number of coauthors, graph diameter, shortest-path distribution, etc.
- Not considered by previous work

Dataset we work on
DBLP (conference data only):
- distinct authors, distinct papers, author-paper pairs
- Most papers by one author: 290
- Most authors on one paper: 115
Graph statistics we are looking at:
- 1st-order statistics (node degree): number of papers of each author; number of authors of each paper
- 2nd-order statistics: coauthors of each author; copapers of each paper
- Higher-order statistics: walking more steps along the bipartite graph
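The 2nd-order statistic (each author's coauthor set) is one walk along the bipartite graph: author to paper to the paper's other authors. A sketch on the toy data standing in for DBLP:

```python
from collections import defaultdict

# Toy (author, paper) pairs; the real DBLP dataset is much larger.
edges = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"),
         ("A2", "P3"), ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

paper_authors = defaultdict(set)
for a, p in edges:
    paper_authors[p].add(a)

# 2nd-order statistic: the coauthor set of each author
# (every other author who shares at least one paper with them).
coauthors = defaultdict(set)
for authors in paper_authors.values():
    for a in authors:
        coauthors[a] |= authors - {a}
```

Higher-order statistics repeat this author-to-paper-to-author step more times, up to the diameter of the graph.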

Outline
- Privacy in graph data sharing
- Applying existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Anonymization by permutation
- Publish the (author, paper) relation with papers permuted w.r.t. authors
  - Global permutation
  - Various partition mechanisms
- Study the resulting graph statistics
2nd-order statistics studied:
- Coauthor times: avg number of papers coauthored by each coauthor pair
- Coauthor distribution: avg number of coauthors of each author
- Copaper times: avg number of authors shared by each copaper pair
- Copaper distribution: avg number of copapers of each paper
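A minimal sketch of global permutation on the toy graph: shuffle the paper column against the author column, so every author keeps their paper count but the actual links are randomized (names are illustrative, not the project's code):

```python
import random
from collections import Counter

edges = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"),
         ("A2", "P3"), ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

# Global permutation: keep the author column fixed and shuffle the
# paper column. The multiset of values in each column is unchanged,
# so every node's degree (1st-order statistic) is preserved.
authors = [a for a, _ in edges]
papers = [p for _, p in edges]
random.shuffle(papers)
anonymized = list(zip(authors, papers))
```

Degrees survive by construction, but which authors now share a paper is essentially random, which is why the 2nd-order statistics below fall apart.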

Coauthor times
[Chart: number of coauthor pairs vs. number of papers coauthored, comparing the source statistics with the statistics after global permutation]

Coauthor distribution

More bad news
Source distribution:
- The author with the most coauthors (363) has 247 publications (7th most)
- The author with the most publications (290) has only 44 coauthors
- The correlation is weak
After global permutation:
- The most coauthors is 779; the corresponding author has 287 papers (2nd most)
- The author with the most papers (290) has 722 coauthors
- A false correlation is created!

Other experiments
- Results are the same for the copaper statistics
- And for other partitioning mechanisms: on authorCt, year, conference, etc.

Observations
- Permutation of the (author, paper) relation preserves 1st-order statistics: degrees are just counts
- It cannot maintain even 2nd-order graph statistics: it breaks clustering properties by removing links within clusters and introducing fake links across clusters
- Other anonymization techniques are needed to maintain graph statistics

Outline
- Privacy in graph data sharing
- Applying existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Publish tuple-level statistics
Publish two tables:
- AuthorDegree(authorID, degree)
- coAuthor(authorID, coAuthorID)
From these two tables we can get, for any author:
- the 1st-order degree D1
- the set of 2nd-order degrees {D2}, by joining the two tables
But we may leak more information: D1 together with the set {D2} can serve as a signature that identifies entities.

Privacy risk
k-identifiable: an entity shares the same signature with k-1 other entities; 1-identifiable means uniquely identifiable.
Count the number of authors who share each signature:
- Maximum k = 20015, from the authors with D1=1 and D2=0 (single author, single paper pairs)
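The k-identifiability count can be sketched by grouping authors by their (D1, {D2}) signature; k for an author is the size of their group. A toy version assuming small hand-built tables (not the DBLP run):

```python
from collections import defaultdict

# Stand-ins for the published AuthorDegree and coAuthor tables,
# built from the toy example graph.
author_papers = {"A1": {"P1", "P2"}, "A2": {"P2", "P3", "P4"}, "A3": {"P1", "P4"}}
coauthors = {"A1": {"A2", "A3"}, "A2": {"A1", "A3"}, "A3": {"A1", "A2"}}

def signature(a):
    """(D1, sorted multiset of the coauthors' degrees {D2})."""
    d1 = len(author_papers[a])
    d2 = tuple(sorted(len(author_papers[c]) for c in coauthors[a]))
    return (d1, d2)

# Group authors by signature; k = size of an author's group.
groups = defaultdict(list)
for a in author_papers:
    groups[signature(a)].append(a)
k_of = {a: len(groups[signature(a)]) for a in author_papers}
```

On this toy graph A1 and A3 share the signature (2, (2, 3)) and are 2-identifiable, while A2's signature (3, (2, 2)) is unique, so A2 is 1-identifiable.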

Attack simulation: authors

Observations
- Too many entities can be uniquely identified: 37 by D1 only, by {D1, {D2}}
- In order to protect your privacy, you'd better not publish any paper
- Or publish just one paper without collaborating with anyone else, because many others are doing the same

Outline
- Privacy in graph data sharing
- Applying existing microdata anonymization techniques
- A simple graph anonymization technique
- Understanding attacks
- Plan of future work

Understanding attacks
Given an anonymization scheme that preserves statistics, explore the attacker's abilities:
- What background information is available?
- What strategy should the attacker take?
- How much knowledge can the attacker gain?
- What is the cost of the attack?
Starting point: publishing complete statistics, i.e. the complete author set of each paper and the complete paper set of each author.

Publishing complete statistics
Example:
- Author sets (one per paper): {a1, a4}, {a1, a2}, {a2, a3}, {a3, a4}
- Paper sets (one per author): {p1, p2}, {p2, p3}, {p3, p4}, {p1, p4}
[Diagram: the corresponding bipartite graph on a1..a4 and p1..p4]

Graph theoretic analysis
- Publishing complete statistics can be seen as publishing two isomorphic bipartite graphs, each with the labels removed on one side
- Re-identification then becomes a bipartite graph isomorphism problem

Solution to the problem
- The hardness of bipartite graph isomorphism is unknown; effective solutions may exist for graphs with specific properties
- The previous nth-order signature can serve as a greedy solution, where n is bounded by the diameter of the graph
- More information leaks when background information is available: node information, edge information

Attacker with background information
With node information:
- Node a3 is known in the previous example, so {a3, p3} and {a3, p4} are known to the attacker
- a1 can then be uniquely identified: it is the only node at distance 4 from a3
- a1 can further help label the isomorphic matching
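The distance argument can be checked with a breadth-first search from the known node. A sketch on the 8-cycle example graph from the previous slide:

```python
from collections import defaultdict, deque

# The example graph: p1={a1,a4}, p2={a1,a2}, p3={a2,a3}, p4={a3,a4}.
edges = [("a1", "p1"), ("a4", "p1"), ("a1", "p2"), ("a2", "p2"),
         ("a2", "p3"), ("a3", "p3"), ("a3", "p4"), ("a4", "p4")]
adj = defaultdict(set)
for a, p in edges:
    adj[a].add(p)
    adj[p].add(a)

def bfs_dist(src):
    """Shortest-path distance from src to every node (BFS)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

dist = bfs_dist("a3")
# The attacker who knows a3 looks for nodes with a unique distance:
far = [v for v, d in dist.items() if d == 4]
```

On this graph `far` contains only a1: knowing a3 pins down a1 immediately, and the pair then anchors the rest of the isomorphic matching.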

Attacker with background information
With edge information:
- Edge {a3, p3} is known
- {a1, p1} can be recovered by enumerating all possible worlds (disjunctive reasoning)
- This is a finer-grained attack model

Plan of future work (1)
- Detailed study of the "set of all isomorphisms" problem: algorithms and hardness; how different background information helps
- Publish other statistics: binary/triple/... subsets of authors/papers; e.g. for {a1, a2, a3}, publish {a1, a2}, {a2, a3}, {a1, a3}
  - Maintains more statistics than permutation
  - Maintains more privacy than publishing complete author sets
  - How to evaluate this quantitatively?
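The binary-subset idea is straightforward to sketch: instead of a paper's complete author set, publish each of its 2-element subsets (names here are illustrative):

```python
from itertools import combinations

# A paper's complete author set, as in the slide's example.
author_set = {"a1", "a2", "a3"}

# Publish every 2-element subset instead of the full set.
pairs = sorted(combinations(sorted(author_set), 2))
```

This keeps every coauthor pair visible (so pairwise statistics survive) while withholding which pairs co-occur in a single paper, which is the privacy/utility middle ground the plan aims to quantify.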

Plan of future work (2)
- Other possible signatures: shortest paths to other nodes (compute pairwise shortest paths, sort the vector as a signature)
- Other datasets: IMDB data
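A sketch of the shortest-path signature on the earlier 8-cycle example, assuming the sorted distance vector is the signature:

```python
from collections import defaultdict, deque

edges = [("a1", "p1"), ("a4", "p1"), ("a1", "p2"), ("a2", "p2"),
         ("a2", "p3"), ("a3", "p3"), ("a3", "p4"), ("a4", "p4")]
adj = defaultdict(set)
for a, p in edges:
    adj[a].add(p)
    adj[p].add(a)

def distance_signature(src):
    """Sorted vector of shortest-path distances from src to all other nodes."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return tuple(sorted(d for n, d in dist.items() if n != src))
```

On this particular graph every node gets the same signature (1, 1, 2, 2, 3, 3, 4), because an 8-cycle is fully symmetric; the signature only discriminates on less regular graphs, which is part of what the planned study would measure.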

Thanks!