Privacy Issues in Graph Data Publishing Summer intern: Qing Zhang (from NC State University) Mentors: Graham Cormode and Divesh Srivastava.


1 Privacy Issues in Graph Data Publishing Summer intern: Qing Zhang (from NC State University) Mentors: Graham Cormode and Divesh Srivastava

2 Outline Privacy in graph data publishing Apply existing microdata anonymization techniques A simple graph anonymization technique Understanding attacks Plan of future work

3 Microdata publishing Data publishing Macrodata Pre-aggregated statistics (N.R. Adam et al. ACM Computing Surveys, 1989.) Microdata Individual records Concerns in microdata release Privacy of individual tuple Privacy of atomic values (e.g. SSN) Association between tuple's attributes Accuracy of aggregate query answering

4 Graph data Relationship among entities No sensitive attributes Private information is the association Many graphs of interest are sparse Examples: General graph social network, etc. Who talks to whom Bipartite graph (focus of our work) customer shopping record, etc. Who bought what

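5 Example Graph Data

Author
Author ID  Name
A1         Andy
A2         Bob
A3         Cathy

Paper
Paper ID  Title  Conference  Year
P1        A      SIGMOD      2006
P2        B      SIGMOD      2007
P3        C      VLDB        2007
P4        D      ICDE        2007

(author, paper) Association
Author ID  Paper ID
A1         P1
A1         P2
A2         P2
A2         P3
A2         P4
A3         P1
A3         P4

(Figure: bipartite graph with author nodes A1, A2, A3 on one side and paper nodes P1, P2, P3, P4 on the other; each edge is one (author, paper) association.)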

6 Privacy-preserving microdata sharing Current status Focus on the protection of the association between quasi identifiers and a single sensitive attribute Disease, salary, etc. Related work k-anonymity (Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems '02) l-diversity (A. Machanavajjhala et al., ICDE '06) t-closeness (N. Li et al., ICDE '07) (k,e)-anonymity (Q. Zhang, N. Koudas, D. Srivastava, T. Yu, ICDE '07)

7 Related work Attacking anonymized social networks (L. Backstrom et al., WWW '07) Active attack: insert nodes/links Passive attack: collude and observe graph Privacy risks of public mentions (D. Frankowski et al., SIGIR '07) Link the movie score and movie review databases How to break anonymity of the Netflix Prize dataset (A. Narayanan et al., UT Austin) Attack through background information How to assemble pieces of a graph privately (K. Frikken et al., WPES '06) Distributed graph construction via multi-party computation

8 Focus of our work Privacy protection in bipartite graph Protect individual link information across two parties e.g. (author,paper) association Maintain aggregate graph statistics e.g. average number of coauthors, diameter of graph, shortest path distribution, etc. Not considered by previous work

9 Dataset Working on DBLP (conference data only) 402023 distinct authors, 541243 distinct papers 1401349 author-paper pairs Largest number of papers by one author: 290 Largest number of authors on one paper: 115 Graph statistics we are looking at 1st-order statistics (node degree) Number of papers of each author Number of authors of each paper 2nd-order statistics Coauthors of each author Copapers of each paper Higher-order statistics Walking more steps along the bipartite graph
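
The first- and second-order statistics listed on this slide can be sketched on the toy author/paper association from the example slide. This is only an illustration under stated assumptions, not the code used on DBLP; the function names `first_order` and `coauthors` are hypothetical.

```python
from collections import defaultdict

# Toy (author, paper) association from the "Example Graph Data" slide.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def adjacency(edges):
    """Build both sides of the bipartite graph as dicts of sets."""
    papers_of, authors_of = defaultdict(set), defaultdict(set)
    for author, paper in edges:
        papers_of[author].add(paper)
        authors_of[paper].add(author)
    return papers_of, authors_of

def first_order(edges):
    """1st-order statistics: node degree on each side."""
    papers_of, authors_of = adjacency(edges)
    author_degree = {a: len(ps) for a, ps in papers_of.items()}
    paper_degree = {p: len(au) for p, au in authors_of.items()}
    return author_degree, paper_degree

def coauthors(edges):
    """2nd-order statistics: the set of coauthors of each author."""
    papers_of, authors_of = adjacency(edges)
    result = {}
    for author, papers in papers_of.items():
        found = set()
        for paper in papers:
            found |= authors_of[paper]
        found.discard(author)  # an author is not their own coauthor
        result[author] = found
    return result
```

Copapers of each paper follow by swapping the two columns of `EDGES` before calling `coauthors`.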

10 Outline Privacy in graph data sharing Apply existing microdata anonymization techniques A simple graph anonymization technique Understanding attacks Plan of future work

11 Anonymization by permutation Publish the (author, paper) relation Permute paper w.r.t. author Global permutation Various partition mechanisms Study the graph statistics 2nd-order statistics studied coauthor times Avg number of papers coauthored by each coauthor pair coauthor distribution Avg number of coauthors of each author copaper times Avg number of authors shared by each copaper pair copaper distribution Avg number of copapers of each paper
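
A minimal sketch of global permutation, assuming the published relation is a list of (author, paper) rows and the whole paper column is shuffled against the author column. The helper names and the `seed` parameter are hypothetical. Degrees are counted over rows, since a shuffle can in principle produce the same pair twice.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations

# Toy (author, paper) association from the "Example Graph Data" slide.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def permute_papers(edges, seed=0):
    """Global permutation: shuffle the paper column w.r.t. the author column."""
    rng = random.Random(seed)
    authors = [a for a, _ in edges]
    papers = [p for _, p in edges]
    rng.shuffle(papers)
    return list(zip(authors, papers))

def degree_counts(edges):
    """Row counts per author and per paper (1st-order statistics)."""
    return Counter(a for a, _ in edges), Counter(p for _, p in edges)

def coauthor_pair_counts(edges):
    """'coauthor times': number of papers shared by each coauthor pair."""
    authors_of = defaultdict(set)
    for author, paper in edges:
        authors_of[paper].add(author)
    counts = Counter()
    for authors in authors_of.values():
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts

permuted = permute_papers(EDGES)
# The shuffle only reorders one column, so row counts per node are unchanged.
assert degree_counts(permuted) == degree_counts(EDGES)
```

Comparing `coauthor_pair_counts(EDGES)` with `coauthor_pair_counts(permuted)` is one way to see how far the second-order statistics drift after the shuffle.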

12 coauthor times Source statistics After global permutation
Number of papers coauthored  Number of coauthor pairs
1                            3411970
2                            1377
3                            8

13 coauthor distribution

14 More bad news Source distribution The author with the most coauthors (363) has 247 publications (7th) The author with the most publications (290) has only 44 coauthors Correlation is weak After global permutation The largest number of coauthors is 779; the corresponding author has 287 papers (2nd most) The author with the most papers (290) has 722 coauthors False correlation created!

15 Other experiments Results are the same for copaper statistics other partitioning mechanisms On authorCt, year, conference, etc.

16 Observations Permutation of (author, paper) relation guarantees preservation of 1st-order statistics Degrees are just counts Cannot maintain even 2nd-order graph statistics Breaks the clustering properties Removes links within clusters Introduces fake links among clusters Need other anonymization techniques to maintain graph statistics

17 Outline Privacy in graph data sharing Apply existing microdata anonymization techniques A simple graph anonymization technique Understanding attacks Plan of future work

18 Publish tuple-level statistics Publish two tables AuthorDegree(authorID, degree) coAuthor(authorID, coAuthorID) From these two tables, we can get the 1st-order degree D1 of any author the set of 2nd-order degrees {D2} By joining the two tables We may leak more information! D1 and the set {D2} can serve as signatures to identify entities
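
A sketch of how the two published tables could be joined into a signature. The slide does not define {D2} precisely; the assumption here is that it is the multiset of degrees of an author's coauthors, which is what a join of coAuthor with AuthorDegree yields. The function names are hypothetical.

```python
from collections import defaultdict

# Toy (author, paper) association from the "Example Graph Data" slide.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def published_tables(edges):
    """Derive AuthorDegree(authorID, degree) and coAuthor(authorID, coAuthorID)."""
    papers_of, authors_of = defaultdict(set), defaultdict(set)
    for author, paper in edges:
        papers_of[author].add(paper)
        authors_of[paper].add(author)
    author_degree = {a: len(ps) for a, ps in papers_of.items()}
    co_author = {(a, b) for authors in authors_of.values()
                 for a in authors for b in authors if a != b}
    return author_degree, co_author

def signatures(author_degree, co_author):
    """Join the two tables: signature = (D1, sorted degrees of coauthors).

    Assumption: {D2} is read as the multiset of coauthor degrees.
    """
    coauthor_degrees = defaultdict(list)
    for author, other in co_author:
        coauthor_degrees[author].append(author_degree[other])
    return {a: (d1, tuple(sorted(coauthor_degrees[a])))
            for a, d1 in author_degree.items()}

author_degree, co_author = published_tables(EDGES)
sigs = signatures(author_degree, co_author)
```

On the toy data A1 and A3 end up with the same signature while A2's signature is unique, so even this tiny example has one author who can be singled out.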

19 Privacy Risk k-identifiable An entity shares the same signature with k-1 other entities 1-identifiable means uniquely identifiable Count the number of authors who have the same signature Maximum k=20015, coming from the authors who have D1=1 and D2=0 (single-author, single-paper pairs).
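
The k-identifiable count can be sketched as a group-by over signatures. The signature values below are hypothetical placeholders; any hashable signature works, and the function names are not from the slides.

```python
from collections import Counter

def k_identifiability(signatures):
    """Map each entity to k, the number of entities sharing its signature."""
    group_size = Counter(signatures.values())
    return {entity: group_size[sig] for entity, sig in signatures.items()}

def uniquely_identifiable(signatures):
    """Entities that are 1-identifiable, i.e. whose signature is unique."""
    ks = k_identifiability(signatures)
    return sorted(entity for entity, k in ks.items() if k == 1)

# Hypothetical signatures of the form (D1, sorted coauthor degrees).
example = {"A1": (2, (2, 3)), "A2": (3, (2, 2)), "A3": (2, (2, 3))}
ks = k_identifiability(example)
```

Here `ks` assigns k=2 to A1 and A3 and k=1 to A2, so `uniquely_identifiable(example)` returns only A2.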

20 Attack simulation-author

21 Observations Too many entities can be uniquely identified 37 by D1 only, 134426 by {D1, {D2}} In order to protect your privacy You'd better not publish any paper Or, just publish one paper, without collaborating with anyone else Because many others are doing the same

22 Outline Privacy in graph data sharing Apply existing microdata anonymization techniques A simple graph anonymization technique Understanding attacks Plan of future work

23 Understanding attacks Given an anonymization scheme that preserves statistics, explore attacker's ability What background information is available What strategy to take How much knowledge can he gain What's the cost of attack Starting point: publishing complete statistics Publish complete author sets of each paper, and complete paper sets of each author

24 Publishing complete statistics Example Author set: {{a1,a4}, {a1,a2}, {a2,a3}, {a3,a4}} Paper set: {{p1,p2}, {p2,p3}, {p3,p4}, {p1,p4}} (Figure: bipartite graph on authors a1, a2, a3, a4 and papers p1, p2, p3, p4.)
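
A sketch of what "publishing complete statistics" produces. The slide shows only the two published sets, not the underlying edges; the edge list below is an assumption, chosen to be one graph consistent with both sets, and `complete_statistics` is a hypothetical name.

```python
# Assumed underlying graph: an 8-cycle a1-p1-a4-p4-a3-p3-a2-p2-a1.
EDGES = [("a1", "p1"), ("a1", "p2"), ("a2", "p2"), ("a2", "p3"),
         ("a3", "p3"), ("a3", "p4"), ("a4", "p4"), ("a4", "p1")]

def complete_statistics(edges):
    """Return (author sets of each paper, paper sets of each author).

    Each view drops the labels of the side it is keyed on: author sets
    carry no paper IDs, and paper sets carry no author IDs.
    """
    authors_of, papers_of = {}, {}
    for author, paper in edges:
        authors_of.setdefault(paper, set()).add(author)
        papers_of.setdefault(author, set()).add(paper)
    author_sets = sorted(tuple(sorted(s)) for s in authors_of.values())
    paper_sets = sorted(tuple(sorted(s)) for s in papers_of.values())
    return author_sets, paper_sets

author_sets, paper_sets = complete_statistics(EDGES)
```

Sorting the output removes any ordering that would otherwise link the i-th author set to the i-th paper.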

25 Graph theoretic analysis Can be seen as publishing two isomorphic bipartite graphs Each graph removes labels on one side Bipartite graph isomorphism problem (Figure: bipartite graph on authors a1, a2, a3, a4 and papers p1, p2, p3, p4.)

26 Solution to the problem Hardness of bipartite isomorphism is unknown Effective solutions may exist for graphs with specific properties The previous nth-order signature can serve as a greedy solution n is bounded by the diameter of the graph More information leaks when background information is available node information edge information

27 Attacker with background information With node information Node a3 is known in previous example {a3, p3}, {a3, p4} is known to the attacker a1 can then be uniquely identified It's the only node with distance 4 to a3 a1 can further help labeling of the isomorphic matching (Figure: bipartite graph on authors a1, a2, a3, a4 and papers p1, p2, p3, p4, with a3 and a1 highlighted.)
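
The distance argument can be checked with a breadth-first search. The edge list is an assumption: it is one graph consistent with the author and paper sets of the previous example (an 8-cycle), and `distances` is a hypothetical helper.

```python
from collections import deque

# Assumed graph consistent with the published author and paper sets.
EDGES = [("a1", "p1"), ("a1", "p2"), ("a2", "p2"), ("a2", "p3"),
         ("a3", "p3"), ("a3", "p4"), ("a4", "p4"), ("a4", "p1")]

def distances(edges, source):
    """Hop distance from source to every reachable node (plain BFS)."""
    adj = {}
    for author, paper in edges:
        adj.setdefault(author, set()).add(paper)
        adj.setdefault(paper, set()).add(author)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

dist = distances(EDGES, "a3")
at_distance_4 = sorted(n for n, d in dist.items() if d == 4)
```

Under this assumed graph `at_distance_4` contains only a1, while a2 and a4 both sit at distance 2 and so remain indistinguishable from each other by distance alone.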

28 Attacker with background information With edge information Edge {a3,p3} is known {a1, p1} can be recovered By enumerating all possible worlds Disjunctive reasoning It's a finer-grained attack model (Figure: bipartite graph on authors a1, a2, a3, a4 and papers p1, p2, p3, p4, showing the alternative placements a3', a3'' and a1', a1''.)
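
A brute-force sketch of the possible-worlds reasoning on the four-author example. The modelling is an assumption: a world is an assignment of paper labels to the published (unlabelled) author sets that reproduces the published paper sets, and an edge is "recovered" if it appears in every world that contains the known edge. Function names are hypothetical, and the enumeration is factorial in the number of papers, so it only suits toy inputs.

```python
from itertools import permutations

AUTHOR_SETS = [{"a1", "a4"}, {"a1", "a2"}, {"a2", "a3"}, {"a3", "a4"}]
PAPER_SETS = [{"p1", "p2"}, {"p2", "p3"}, {"p3", "p4"}, {"p1", "p4"}]

def canonical(sets):
    """Order-independent form of a collection of sets."""
    return sorted(tuple(sorted(s)) for s in sets)

def possible_worlds(author_sets, paper_sets, known_edges=()):
    """Enumerate labelled graphs consistent with both published views."""
    papers = sorted(set().union(*paper_sets))
    target = canonical(paper_sets)
    worlds = []
    for labels in permutations(papers):
        # Attach one paper label to each published author set.
        edges = {(a, p) for p, authors in zip(labels, author_sets) for a in authors}
        papers_of = {}
        for author, paper in edges:
            papers_of.setdefault(author, set()).add(paper)
        consistent = canonical(papers_of.values()) == target
        if consistent and all(e in edges for e in known_edges):
            worlds.append(edges)
    return worlds

def certain_edges(worlds):
    """Edges present in every remaining world (disjunctive reasoning)."""
    return set.intersection(*worlds) if worlds else set()

worlds = possible_worlds(AUTHOR_SETS, PAPER_SETS, known_edges=[("a3", "p3")])
recovered = certain_edges(worlds)
```

With no background knowledge the intersection over all worlds is empty; fixing the single edge (a3, p3) leaves worlds that all agree on (a1, p1) as well.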

29 Plan of future work (1) Detailed study of the "set of all isomorphisms" problem Algorithm and hardness How different background information helps Publish other statistics Binary/triple/... sets of authors/papers For {a1, a2, a3}, publish {a1, a2}, {a2, a3}, {a1, a3} Maintain more statistics than permutation Maintain more privacy than publishing complete author sets How to evaluate it quantitatively
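
The "binary/triple sets" idea can be sketched with `itertools.combinations`: instead of a paper's complete author set, publish all of its size-k subsets. The function name `sub_sets` and the `size` parameter are hypothetical.

```python
from itertools import combinations

def sub_sets(author_set, size=2):
    """All size-k subsets of one paper's author set, in sorted order."""
    return list(combinations(sorted(author_set), size))

pairs = sub_sets({"a1", "a2", "a3"})
```

A paper with fewer than `size` authors contributes nothing under this scheme, which is one of the effects a quantitative evaluation would need to account for.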

30 Plan of future work (2) Other possible signatures Shortest path to other nodes Compute pairwise shortest paths Sort the vector as signature Other datasets IMDB data
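
One possible reading of the shortest-path signature, sketched on the toy data from the example slide: take the hop distance from a node to every other node and sort the resulting vector. How unreachable nodes should be handled is not stated on the slide; this sketch simply leaves them out. The function name is hypothetical.

```python
from collections import deque

# Toy (author, paper) association from the "Example Graph Data" slide.
EDGES = [("A1", "P1"), ("A1", "P2"), ("A2", "P2"), ("A2", "P3"),
         ("A2", "P4"), ("A3", "P1"), ("A3", "P4")]

def shortest_path_signature(edges, source):
    """Sorted vector of BFS distances from source to all other reachable nodes."""
    adj = {}
    for author, paper in edges:
        adj.setdefault(author, set()).add(paper)
        adj.setdefault(paper, set()).add(author)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return tuple(sorted(d for node, d in dist.items() if node != source))

sig_a1 = shortest_path_signature(EDGES, "A1")
sig_a2 = shortest_path_signature(EDGES, "A2")
sig_a3 = shortest_path_signature(EDGES, "A3")
```

On this toy graph A1 and A3 receive the same vector and A2 a different one, matching the grouping given by the degree-based signature.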

31 Thanks!

