
1 Social Computing in Big Data Era – Privacy Preservation and Fairness Awareness
Xintao Wu, Nov 19, 2015

2 Drivers of Data Computing
6A's: Anytime, Anywhere Access to Anything by Anyone Authorized
4V's: Volume, Velocity, Variety, Veracity
Reliability, Security, Privacy, Usability

3 4V's

4 AVC Denial Log Analysis
Volume and Velocity: 1 million log files per day, each with thousands of entries
Stack: S3, Hive, and EMR on AWS
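As a hedged sketch of how such a pipeline might be driven from Python: the slide names only the AWS components (S3, Hive, EMR), so the bucket name, key prefix, and the Hive table and its columns below are illustrative assumptions, not details from the talk.

```python
# A minimal sketch, assuming boto3 is configured with AWS credentials and that
# AVC denial logs land in an S3 bucket; bucket/prefix names and the Hive
# table schema are hypothetical.
import boto3

s3 = boto3.client("s3")

# Volume check: count the day's log files (the slide cites ~1M per day;
# list_objects_v2 returns at most 1000 keys per page, so this is one page).
resp = s3.list_objects_v2(Bucket="avc-denial-logs", Prefix="2015/11/19/")
print("log files on first page:", resp["KeyCount"])

# A Hive query (run as an EMR step) aggregating denials per source context;
# the table avc_denials and its columns are assumed for illustration.
HIVE_QUERY = """
SELECT scontext, tclass, COUNT(*) AS denials
FROM avc_denials
WHERE dt = '2015-11-19'
GROUP BY scontext, tclass
ORDER BY denials DESC
LIMIT 20;
"""
```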

5 Social Media Customer Analytics
Network topology (friendship, followship, interaction)
Unstructured text (e.g., blog, tweet), retweet sequence, product and review, transaction database
Variety, Veracity: 10GB of tweets per day

Structured profile:
name  sex  age  disease  salary
Ada   F    18   cancer   25k
Bob   M    25   heart    110k
...

Anonymized profile:
id  sex  age  address  income
5   F    Y    NC       25k
3   M    Y    SC       110k

Challenges: entity resolution, patterns, temporal/spatial analysis, scalability, visualization, sentiment, privacy
Funding: Belk and Lowe's, UNCC Chancellor's special fund

6 A Single View to the Customer
[Figure: linking a customer's social media, gaming/entertainment, banking/finance, and purchase data with our known history]

7 Outline
Introduction
Privacy Preserving Social Network Analysis
 Input perturbation
 Output perturbation
Anti-discrimination Learning

8 Privacy Breach Cases
Nydia Velázquez (1994)
 Medical records on her suicide attempt were disclosed
AOL Search Log (2006)
 Anonymized release of 650K users' search histories lasted less than 24 hours
Netflix Contest (2009)
 The $1M contest was cancelled due to a privacy lawsuit
23andMe (2013)
 The FDA ordered 23andMe to discontinue its genetic testing service over genetic privacy concerns

9 Acxiom
Privacy
 In 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack."
 In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
Security
 In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.

10 Privacy Regulation -- Forrester
[World map legend: Most restricted / Restricted / Some restrictions / Minimal restrictions / Effectively no restrictions / No legislation or no information]

11 Privacy Protection Laws
USA
 HIPAA for health care
 Gramm-Leach-Bliley Act of 1999 for financial institutions
 COPPA for children's online privacy
 State regulations, e.g., California State Bill 1386
Canada
 PIPEDA 2000 - Personal Information Protection and Electronic Documents Act
European Union
 Directive 95/46/EC - Provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy
Contractual obligations
 Individuals should have notice about how their data is used and have opt-out choices

12 Privacy Preserving Data Mining
ssn  name  zip    race   ...  age  sex  income  ...  disease
...  ...   28223  Asian  ...  20   M    85k     ...  Cancer
...  ...   28223  Asian  ...  30   F    70k     ...  Flu
...  ...   28262  Black  ...  20   M    120k    ...  Heart
...  ...   28261  White  ...  26   M    23k     ...  Cancer
...  ...   ...    ...    ...  ...  ...  ...     ...  ...
...  ...   28223  Asian  ...  20   M    110k    ...  Flu

69% of individuals are unique on zip and birth date; 87% on zip, birth date, and gender
Approaches: generalization (k-anonymity, l-diversity, t-closeness), randomization
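A minimal sketch of generalization toward k-anonymity, assuming pandas; the generalization rules (zip prefix, decade age bands) are illustrative choices, and the toy rows mirror the table above.

```python
# Generalize quasi-identifiers, then check whether every group reaches size k.
import pandas as pd

df = pd.DataFrame({
    "zip":     ["28223", "28223", "28262", "28261", "28223"],
    "age":     [20, 30, 20, 26, 20],
    "sex":     ["M", "F", "M", "M", "M"],
    "disease": ["Cancer", "Flu", "Heart", "Cancer", "Flu"],
})

# Generalization: truncate zip to 3 digits, bucket age by decade.
df["zip"] = df["zip"].str[:3] + "**"
df["age"] = (df["age"] // 10 * 10).astype(str) + "s"

# k-anonymity check: every (zip, age, sex) group must contain >= k records.
k = 2
sizes = df.groupby(["zip", "age", "sex"]).size()
print(df)
print("k-anonymous for k=%d:" % k, bool((sizes >= k).all()))
```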

13 Privacy Preserving Data Mining (cont.)

14 Social Network Data
Data owner --release--> Data miner

Original table:
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k

Released (anonymized) table:
id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k

15 Threat of Re-identification
The attacker attacks the released table from slide 14.
Privacy breaches:
 Identity disclosure
 Link disclosure
 Attribute disclosure

16 Privacy Preservation in Social Network Analysis
Input Perturbation
 K-anonymity
 Generalization
 Randomization
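A minimal sketch of input perturbation by edge randomization (the generic Rand Add/Del scheme, not the feature-preserving variants from the papers below), assuming networkx; the perturbation magnitude m is an illustrative choice.

```python
# Rand Add/Del: delete m true edges uniformly at random, then add m false ones,
# so the released graph keeps the original edge count.
import random
import networkx as nx

def rand_add_del(G, m):
    H = G.copy()
    H.remove_edges_from(random.sample(list(H.edges()), m))
    H.add_edges_from(random.sample(list(nx.non_edges(G)), m))
    return H

G = nx.karate_club_graph()
H = rand_add_del(G, m=10)
print(G.number_of_edges(), H.number_of_edges())  # edge count is preserved
```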

17 Our Work
Feature preserving randomization
 Spectrum preserving randomization (SDM08)
 Markov chain based feature preserving randomization (SDM09)
Reconstruction from randomized graph (SDM10)
Link privacy (from the attacker's perspective)
 Exploiting node similarity features (PAKDD09, Best Student Paper Runner-up Award)
 Exploiting graph space via Markov chain (SDM09)

18 Spectrum Preserving Randomization [SDM08]
Spectral Switch:
 To increase the eigenvalue: [edge-switch rule, figure lost in transcription]
 To decrease the eigenvalue: [edge-switch rule, figure lost in transcription]
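The switch rules themselves were figures lost in transcription. As a hedged stand-in, the sketch below performs degree-preserving edge switches (i,j),(k,l) -> (i,l),(k,j) and keeps only those moving the leading adjacency eigenvalue in the desired direction, testing each candidate numerically rather than using the paper's closed-form criterion on eigenvector entries.

```python
# A minimal numerical sketch of a spectral switch, assuming numpy/networkx.
import random
import networkx as nx
import numpy as np

def leading_eigenvalue(G):
    A = nx.to_numpy_array(G)
    return np.linalg.eigvalsh(A)[-1]  # largest eigenvalue of the symmetric A

def spectral_switch(G, increase=True, tries=200):
    H, lam = G.copy(), leading_eigenvalue(G)
    for _ in range(tries):
        (i, j), (k, l) = random.sample(list(H.edges()), 2)
        # Candidate switch (i,j),(k,l) -> (i,l),(k,j); skip if it would break
        # simplicity (repeated node or already-existing edge).
        if len({i, j, k, l}) < 4 or H.has_edge(i, l) or H.has_edge(k, j):
            continue
        H2 = H.copy()
        H2.remove_edges_from([(i, j), (k, l)])
        H2.add_edges_from([(i, l), (k, j)])
        lam2 = leading_eigenvalue(H2)
        if (lam2 > lam) == increase:  # keep only moves in the desired direction
            H, lam = H2, lam2
    return H  # degree sequence is unchanged; spectrum is steered

G = nx.karate_club_graph()
print(leading_eigenvalue(G), leading_eigenvalue(spectral_switch(G, increase=True)))
```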

19 Reconstruction from Randomized Graph [SDM10]
We can reconstruct a graph G* from the randomized graph G~ such that G* is closer to the original G, without incurring much privacy loss.
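A hedged sketch of the idea, assuming the reconstruction works via a low-rank approximation of the randomized adjacency matrix; the paper's actual estimator may differ in its details, and the rank r here is an illustrative choice.

```python
# Approximate the randomized adjacency matrix by its top-r eigen-components,
# then re-binarize by keeping the m highest-scoring node pairs (m = edge count).
import networkx as nx
import numpy as np

def low_rank_reconstruct(H_rand, r):
    A = nx.to_numpy_array(H_rand)
    w, V = np.linalg.eigh(A)
    idx = np.argsort(np.abs(w))[::-1][:r]      # top-r eigenpairs by magnitude
    A_r = (V[:, idx] * w[idx]) @ V[:, idx].T   # rank-r approximation
    m = H_rand.number_of_edges()
    iu = np.triu_indices_from(A_r, k=1)
    keep = np.argsort(A_r[iu])[::-1][:m]       # m largest upper-triangle scores
    A_hat = np.zeros_like(A)
    A_hat[iu[0][keep], iu[1][keep]] = 1
    return nx.from_numpy_array(A_hat + A_hat.T)

# In practice H_rand would be the randomized release (e.g., from rand_add_del).
G = nx.karate_club_graph()
print(low_rank_reconstruct(G, r=5).number_of_edges())
```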

20 Exploiting Graph Space [SDM09]
[Figure: comparison against the original graph]

21 PSNet (NSF-0831204)

22 Output Perturbation
The data owner holds the original table (as on slide 14); the data miner issues a query f and receives the query result plus noise.
The noisy answer cannot be used to derive whether any individual is included in the database.

23 Differential Guarantee [Dwork, TCC06]
Neighboring databases differ in a single record; here x' is x with Ada's record removed:

x:                        x' (Ada opted out):
Ada    cancer
Bob    heart              Bob    heart
Cathy  cancer             Cathy  cancer
Dell   flu                Dell   flu
Ed     cancer             Ed     cancer
Fred   flu                Fred   flu

For f = count(#cancer), mechanism K releases f(x) + noise = 3 + noise on x and f(x') + noise = 2 + noise on x'. The noise makes the two output distributions nearly indistinguishable, achieving opt-out.

24 Differential Privacy
A randomized mechanism K gives ε-differential privacy if for all neighboring databases x, x' and all subsets S of Range(K):
  Pr[K(x) ∈ S] ≤ exp(ε) · Pr[K(x') ∈ S]
ε is a privacy parameter: smaller ε = stronger privacy

25 Calibrating Noise
Laplace mechanism: release f(x) + Lap(Δf/ε), where the Laplace distribution Lap(b) has density (1/2b)·exp(−|z|/b)
Sensitivity of function f:
 Global sensitivity: Δf = max ||f(x) − f(x')||₁ over all neighboring databases x, x'
 Local sensitivity: the same maximum with x fixed at the actual database
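A minimal sketch of the Laplace mechanism, assuming numpy; the toy data is slide 23's disease column and ε is an illustrative value.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps):
    """Release value + Lap(sensitivity/eps) noise."""
    return value + np.random.laplace(loc=0.0, scale=sensitivity / eps)

diseases = ["cancer", "heart", "cancer", "flu", "cancer", "flu"]
true_count = diseases.count("cancer")  # 3; a count query has sensitivity 1
print(true_count, laplace_mechanism(true_count, sensitivity=1, eps=0.5))
```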

26 Sensitivity
Function f        Sensitivity Δf
Count(#cancer)    1
Sum(salary)       u (domain upper bound)
Avg(salary)       u/n
(over the table from slide 14)
Data mining tasks can be decomposed into a sequence of simple functions.
For vector-valued outputs, sensitivity is measured in L1 distance.
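A hedged illustration of the sensitivity table, assuming numpy; the salaries are the toy values from slide 14 (in $k), u is an assumed domain upper bound, and the u/n sensitivity for Avg assumes n is public.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps):
    return value + np.random.laplace(0.0, sensitivity / eps)

salaries = [25, 110, 70, 65, 300, 20, 45, 95, 70]
u, n, eps = 300, len(salaries), 1.0

print(laplace_mechanism(sum(s > 100 for s in salaries), 1, eps))  # Count: Δf = 1
print(laplace_mechanism(sum(salaries), u, eps))                   # Sum:   Δf = u
print(laplace_mechanism(np.mean(salaries), u / n, eps))           # Avg:   Δf = u/n
```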

27 Challenge in OSN
Degree sequence: changing one edge moves two degree entries by 1, e.g., [1,1,3,3,3,3,2] vs. [1,1,3,3,2,2,2], so Δf = 2 and noise from Lap(2/ε) suffices.
Number of triangles: one edge can participate in up to n-2 triangles (from n-2 down to 0), so Δf = n-2 and huge noise is needed.
High sensitivity!
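A minimal sketch of the low-sensitivity case, releasing a noisy degree sequence with Lap(2/ε) noise, assuming networkx/numpy; ε is an illustrative choice.

```python
# Global sensitivity is 2: one edge change moves two degree entries by 1 each.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
eps = 1.0
degrees = np.array([d for _, d in G.degree()])
noisy = degrees + np.random.laplace(0.0, 2 / eps, size=degrees.size)
print(sorted(np.round(noisy, 1), reverse=True)[:5])  # top of the noisy sequence
```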

28 Advanced Mechanisms
Possible theoretical approaches:
 Smooth sensitivity
 Exponential mechanism
 Functional mechanism
 Sampling

29 Our Work
DP-preserving clustering coefficient (ASONAM12)
DP-preserving spectral graph analysis (PAKDD13)
Linear refinement of DP-preserving query answering (PAKDD13, Best Application Paper)
DP-preserving graph generation based on degree correlation (TDP13)
Regression model fitting under differential privacy and model inversion attack (IJCAI15)
DP-preservation for deep auto-encoders (AAAI16)

30 SMASH (NIH R01GM103309)

31 Genetic Privacy (NSF 1502273 and 1523115)
BIBM13 Best Paper Award

32 Outline
Introduction
Privacy Preserving Social Network Analysis
 Input perturbation
 Output perturbation
Anti-discrimination Learning

33 What is discrimination?
 Discrimination refers to unjustified distinctions among individuals based on their membership in a certain group.
 Federal laws and regulations disallow discrimination on several grounds: gender, age, marital status, sexual orientation, race, religion or belief, disability or illness, ...
 These attributes are referred to as the protected attributes, and the groups they define as the protected groups.

34 Predictive Learning
Finding evidence of discrimination
Building non-discriminatory classifiers

35 Motivating Example
name   sex  age  program  acceptance
Ada    F    18   cancer   +
Bob    M    25   heart    -
Cathy  F    20   cancer   +
Ed     M    60   cancer   -
Fred   M    24   flu      -
...

Suppose 2000 applicants: 1000 M and 1000 F
Acceptance ratio: 36% M vs. 24% F
Do we have discrimination here?

36 Discrimination Discovery
 Assume a causal Bayesian network that faithfully represents the data.
 Protected attribute with values c+, c-; decision attribute with values e+, e-.
 ∆P = P(e+|c+) − P(e+|c−)
 Discriminatory effect if ∆P > τ, where τ is a threshold for discrimination depending on law (e.g., 5%).
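A minimal sketch of computing the risk difference ∆P = P(e+|c+) − P(e+|c−), assuming pandas; the toy data mirrors the motivating example on slide 35 (1000 M / 1000 F applicants, 36% vs. 24% acceptance), with F treated as the protected group c+.

```python
import pandas as pd

df = pd.DataFrame({
    "sex":      ["F"] * 1000 + ["M"] * 1000,
    "accepted": [1] * 240 + [0] * 760 + [1] * 360 + [0] * 640,
})

p_plus  = df.loc[df.sex == "F", "accepted"].mean()  # P(e+ | c+)
p_minus = df.loc[df.sex == "M", "accepted"].mean()  # P(e+ | c-)
delta_p = p_plus - p_minus                          # -0.12 here
print(delta_p, "discriminatory" if abs(delta_p) > 0.05 else "ok")  # tau = 5%
```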

37 Motivating Examples
 Case I: ∆P = 0.1
 Case II: ∆P = -0.01
[Figures: causal network examples lost in transcription]

38 Motivating Examples
 Case II: ∆P = -0.01
 Case III: ∆P = 0.104
[Figures with path-specific effects (∆P+, ∆P-) lost in transcription]

39 Discrimination Analysis
Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong rather than on individual merit. (Wikipedia)
Tweet discrimination analysis aims to detect whether a tweet contains discrimination against gender, race, age, etc.

40 A Typical Deep Learning Pipeline for Text Classification
Text → Word Representation → Deep Learning Model → Text Representation → Softmax Classifier
(semantic composition: word → text)
Deep learning models: Multilayer Perceptron, Recursive Neural Network, Recurrent Neural Network, Convolutional Neural Network

41-44 Word Embeddings (figure built up over four slides)
[Figure: a tweet's words mapped to word embeddings, fed through an LSTM-RNN, mean-pooled into a tweet representation, and classified with logistic regression]
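A hedged PyTorch sketch of this architecture; the vocabulary size, embedding and hidden dimensions are illustrative assumptions, not values from the talk, and logistic regression is realized as a single linear layer with a sigmoid.

```python
# Word embeddings -> LSTM-RNN -> mean pooling -> logistic regression.
import torch
import torch.nn as nn

class TweetClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)          # logistic regression head

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))      # (batch, seq_len, hidden)
        tweet_repr = h.mean(dim=1)                   # mean pooling over time
        return torch.sigmoid(self.out(tweet_repr)).squeeze(-1)

model = TweetClassifier()
tokens = torch.randint(0, 10000, (2, 20))            # two dummy 20-token tweets
print(model(tokens))                                 # P(discriminatory) per tweet
```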

45 Summary
1. Preserving Privacy Values
2. Educating Robustly and Responsibly
3. Big Data and Discrimination
4. Law Enforcement & Security
5. Data as a Public Resource

46 Acknowledgement
Collaborators:
 UNCC: Aidong Lu, Xinghua Shi, Yong Ge
 Oregon: Jun Li, Dejing Dou
 PeaceHealth: Brigitte Piniewski
 UIUC: Tao Xie
DPL members:
 UNCC PhD graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying
 UNCC PhD students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
 UofA: Lu Zhang (postdoc), Yongkai Wu, Cheng Si, Miao Xie, Shuhan Yuan
Funding support: [sponsor logos]

47 Genome-Wide Association Study

