Xintao Wu, Aug 25, 2014: Research Overview


1 Xintao Wu, Aug 25, 2014: Research Overview

2 Outline
- Introduction
- Privacy Preserving Social Network Analysis: input perturbation; output perturbation
- Fraud Detection in Social Networks: spectral analysis of graph topology; detecting random link attacks; detecting weak anomalies
- Sample Projects
- Conclusions and Future Work

3 Trustworthy Computing
Trustworthy = reliability, security, privacy, usability
Sample research challenges:
- Understand and capture emergent behaviors and interactions among regular users, fraudsters, and victims
- Design secure, survivable, persistent systems that hold up under attack
- Enable privacy protection in collecting, analyzing, and sharing personal data

4 Privacy Breach Cases
- Nydia Velázquez (1994): a medical record of her suicide attempt was disclosed
- AOL search log (2006): the anonymized release of 650K users' search histories lasted for less than 24 hours
- Netflix contest (2009): the $1M contest was cancelled due to a privacy lawsuit
- 23andMe (2013): the FDA ordered the genetic testing to be discontinued, due to genetic privacy

5 Acxiom
Privacy
- In 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack."
- In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
Security
- In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.

6 Privacy Regulation (Forrester)
[Figure legend: most restricted; restricted; some restrictions; minimal restrictions; effectively no restrictions; no legislation or no information]

7 Privacy Protection Laws
USA
- HIPAA for health care
- Gramm-Leach-Bliley Act of 1999 for financial institutions
- COPPA for children's online privacy
- State regulations, e.g., California Senate Bill 1386
Canada
- PIPEDA: Personal Information Protection and Electronic Documents Act
European Union
- Directive 95/46/EC: provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy
Contractual obligations
- Individuals should have notice about how their data is used and have opt-out choices

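8 Privacy Preserving Data Mining
ssn | name | zip   | race  | … | age | sex | income | … | disease
    |      | 28223 | Asian | … | 20  | M   | 85k    | … | Cancer
    |      | 28223 | Asian | … | 30  | F   | 70k    | … | Flu
    |      | 28262 | Black | … | 20  | M   | 120k   | … | Heart
    |      | 28261 | White | … | 26  | M   | 23k    | … | Cancer
    |      | …     | …     | … | …   | …   | …      | … | …
    |      |       | Asian | … | 20  | M   | 110k   | … | Flu
- 69% unique on zip and birth date
- 87% unique on zip, birth date, and gender
Techniques:
- Generalization (k-anonymity, l-diversity, t-closeness)
- Randomization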
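
The uniqueness figures and the generalization idea above can be illustrated with a minimal sketch. It assumes a toy table modeled on the slide's rows; the helper names `smallest_group` and `generalize`, the 20-year age bucket, and the choice of quasi-identifiers are hypothetical and not taken from the talk.

```python
from collections import Counter

# Toy records modeled on the slide's table; values are illustrative only.
records = [
    {"zip": "28223", "age": 20, "sex": "M", "disease": "Cancer"},
    {"zip": "28223", "age": 30, "sex": "F", "disease": "Flu"},
    {"zip": "28262", "age": 20, "sex": "M", "disease": "Heart"},
    {"zip": "28261", "age": 26, "sex": "M", "disease": "Cancer"},
]

QUASI_IDENTIFIERS = ["zip", "age", "sex"]  # assumed quasi-identifier set


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows that share all quasi-identifier values.

    A table is k-anonymous with respect to these attributes when this is >= k.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


def generalize(row):
    """Coarsen quasi-identifiers (hypothetical scheme):
    3-digit zip prefix, 20-year age bucket, sex suppressed."""
    low = row["age"] // 20 * 20
    out = dict(row)
    out["zip"] = row["zip"][:3] + "**"
    out["age"] = f"{low}-{low + 19}"
    out["sex"] = "*"
    return out


generalized = [generalize(r) for r in records]

print(smallest_group(records, QUASI_IDENTIFIERS))      # 1: every row is unique
print(smallest_group(generalized, QUASI_IDENTIFIERS))  # 4: the toy table is 4-anonymous
```

The sensitive attribute (disease) is left untouched here; l-diversity and t-closeness add constraints on how it is distributed within each group.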

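9 Social Network Data
Data owner -> release -> Data miner
Original table (data owner):
name   | sex | age | disease | salary
Ada    | F   | 18  | cancer  | 25k
Bob    | M   | 25  | heart   | 110k
Cathy  | F   | 20  | cancer  | 70k
Dell   | M   | 65  | flu     | 65k
Ed     | M   | 60  | cancer  | 300k
Fred   | M   | 24  | flu     | 20k
George | M   | 22  | cancer  | 45k
Harry  | M   | 40  | flu     | 95k
Irene  | F   | 45  | heart   | 70k
Released table (data miner):
id | sex | age | disease | salary
5  | F   | Y   | cancer  | 25k
3  | M   | Y   | heart   | 110k
6  | F   | Y   | cancer  | 70k
1  | M   | O   | flu     | 65k
7  | M   | O   | cancer  | 300k
2  | M   | Y   | flu     | 20k
9  | M   | Y   | cancer  | 45k
4  | M   | M   | flu     | 95k
8  | F   | M   | heart   | 70k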

10 Threat of Re-identification
The attacker attacks the released table (id, sex, age, disease, salary) from slide 9.
Privacy breaches:
- Identity disclosure
- Link disclosure
- Attribute disclosure

11 Privacy Preservation in Social Network Analysis
Input perturbation
- K-anonymity
- Generalization
- Randomization
Output perturbation
- Background on differential privacy
- Differential privacy preserving social network mining

12 Our Work
Feature preserving randomization
- Spectrum preserving randomization (SDM08)
- Markov chain based feature preserving randomization (SDM09)
Reconstruction from randomized graph (SDM10)
Link privacy (from the attacker perspective)
- Exploiting node similarity feature (PAKDD09 Best Student Paper Runner-up Award)
- Exploiting graph space via Markov chain (SDM09)

13 PSNet (NSF )

14 Output Perturbation
Data miner -> Query f -> Data owner (holds the original table of name, sex, age, disease, salary from slide 9)
Data owner -> Query result + noise -> Data miner
The noisy result cannot be used to derive whether any individual is included in the database.
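
The "cannot be used to derive whether any individual is included" property is usually formalized as differential privacy. As a sketch in standard notation (the privacy parameter $\epsilon$ and the output set $S$ do not appear in the transcript and are introduced here as assumptions): a randomized mechanism $K$ is $\epsilon$-differentially private if, for any two databases $x$ and $x'$ that differ in one individual's record and any set $S$ of possible outputs,

```latex
\Pr[K(x) \in S] \;\le\; e^{\epsilon} \, \Pr[K(x') \in S]
```

A smaller $\epsilon$ forces the two output distributions closer together, so the answer reveals less about whether any one record is present.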

15 Differential Guarantee [Dwork, TCC06]
Database x (name, disease): Ada, cancer; Bob, heart; Cathy, cancer; Dell, flu; Ed, cancer; Fred, flu
Query f = count(#cancer)
- On x, mechanism K returns f(x) + noise = 3 + noise
- On the neighboring database x', with one cancer record opted out, K returns f(x') + noise = 2 + noise
Achieving opt-out
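
One common way to produce the "+ noise" term for a count query is the Laplace mechanism. The following is a minimal sketch under stated assumptions: the slide does not say which mechanism or privacy parameter is used, so `epsilon=0.5`, the helper names, and the choice of Ada as the record that opts out are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def count_cancer(rows):
    """The query f from the slide: count(#cancer)."""
    return sum(1 for _, disease in rows if disease == "cancer")


def noisy_count(rows, epsilon, rng):
    """Laplace mechanism for a count query.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    return count_cancer(rows) + laplace_noise(1.0 / epsilon, rng)


x = [("Ada", "cancer"), ("Bob", "heart"), ("Cathy", "cancer"),
     ("Dell", "flu"), ("Ed", "cancer"), ("Fred", "flu")]
# Assumption for illustration: Ada is the individual who opts out.
x_prime = [row for row in x if row[0] != "Ada"]

rng = random.Random(0)
print(count_cancer(x), count_cancer(x_prime))  # 3 2
print(noisy_count(x, epsilon=0.5, rng=rng))        # 3 + noise
print(noisy_count(x_prime, epsilon=0.5, rng=rng))  # 2 + noise
```

Because the noise scale (here 2) is comparable to the difference between the two true counts, a single noisy answer gives little evidence about which database produced it.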

16 Our Work
DP-preserving clustering coefficient (ASONAM12)
- Divide and conquer
- Smooth sensitivity
DP-preserving spectral graph analysis (PAKDD13)
- LNPP: based on Laplace noise perturbation
- SBMF: based on the exponential mechanism and MBF density
Linear refinement of DP-preserving query answering (PAKDD13 Best Application Paper)
DP-preserving graph generation based on degree correlation (TDP13)

17 SMASH (NIH R01GM103309)

18 Outline
- Introduction
- Privacy Preserving Social Network Analysis: input perturbation; output perturbation
- Fraud Detection: spectral analysis of graph topology; detecting random link attacks; detecting weak anomalies
- Sample Projects
- Conclusions and Future Work

19 Cyber Fraud
Cyber crime
- Costs the US economy $400 billion annually
OSN fraud and attack
- Sybil attack, spam, viral marketing, fraudulent auction, brand jacking, denial of service, etc.
- Fake followers on Twitter (used in viral marketing) are worth $360 million annually on the black market.

20 Fraud Characterization
- Individual vs. collusive
- Robot vs. money-motivated regular user
- Random vs. selective target
- Static vs. dynamic
Topology-based Detection
Traditional topology-based detection methods:
- incur high computational cost
- have difficulty detecting collaborative attacks or subtle anomalies

21 Random Link Attack [Shrivastava, ICDE08]
- An abstraction of collaborative attacks, including spam, viral marketing, etc.
- The attacker creates some fake nodes and uses them to attack a large set of randomly selected regular nodes.
- Fake nodes also mimic the real graph structure among themselves to evade detection.
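
The attack described above can be simulated on a toy graph. This is a minimal sketch and not the construction from the cited paper: the function name, the ring-shaped regular graph, and the use of a fixed edge probability `inner_prob` to stand in for "mimicking the real graph structure" are all assumptions made for illustration.

```python
import random


def add_random_link_attack(adj, n_fake, victims_per_fake, inner_prob, rng):
    """Add fake nodes to an undirected graph stored as {node: set(neighbors)}.

    Each fake node links to `victims_per_fake` regular nodes chosen uniformly
    at random; fake nodes also link to each other with probability
    `inner_prob` (a crude stand-in for mimicking real structure).
    Returns the list of fake node ids.
    """
    regular = sorted(adj)  # snapshot of regular nodes before adding fakes
    fakes = [f"fake{i}" for i in range(n_fake)]
    for f in fakes:
        adj[f] = set()
    # Outer links: fake node -> randomly selected regular victims.
    for f in fakes:
        for v in rng.sample(regular, victims_per_fake):
            adj[f].add(v)
            adj[v].add(f)
    # Inner links among the fake nodes themselves.
    for i, a in enumerate(fakes):
        for b in fakes[i + 1:]:
            if rng.random() < inner_prob:
                adj[a].add(b)
                adj[b].add(a)
    return fakes


# Regular graph: a ring of 20 nodes (illustrative only).
n = 20
graph = {u: {(u - 1) % n, (u + 1) % n} for u in range(n)}

fakes = add_random_link_attack(graph, n_fake=3, victims_per_fake=5,
                               inner_prob=0.5, rng=random.Random(1))
for f in fakes:
    victims = sorted(v for v in graph[f] if v not in fakes)
    print(f, "attacks", victims)
```

Because victims are picked at random, they rarely know each other, which is the kind of structural signal that topology-based detectors look for.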

22 Spectral Graph Analysis based Fraud Detection
- Examine the spectral space of graph topology.
- Setting: a network with n nodes and m edges that is undirected and unweighted, without considering link/node attribute information
- Adjacency matrix A (symmetric)
- Adjacency eigenspace
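
Since A is real and symmetric, it has a full set of real eigenvalues and orthonormal eigenvectors, and the adjacency eigenspace refers to this decomposition. A sketch in notation that is assumed here (the symbols $\lambda_i$ and $\mathbf{q}_i$ are not in the transcript):

```latex
A = \sum_{i=1}^{n} \lambda_i \, \mathbf{q}_i \mathbf{q}_i^{T},
\qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n,
\qquad \mathbf{q}_i^{T} \mathbf{q}_j =
\begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}
```

The eigenvectors paired with the largest eigenvalues carry most of the community structure of the graph.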

23 Eigenspace
[Figure labels: Principal; Minor]

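24 Projecting Node in Spectral Space [SDM09]
- Spectral coordinate: [formula not captured in transcript]
- k-orthogonal line pattern: one condition when nodes u, v are from the same community, another when nodes u, v are from different communities [formulas not captured in transcript]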
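
The line pattern can be seen on an idealized example. The sketch below assumes that a node's spectral coordinate is its row in the matrix of the k leading adjacency eigenvectors (the transcript lost the actual formula), and it uses two disconnected cliques, where the pattern is exact; with a few links between communities the lines would only be approximately orthogonal. The eigenvectors are approximated with plain power iteration to keep the code dependency-free; all names are hypothetical.

```python
import math
import random


def top_eigenvectors(A, k, iters=500, seed=0):
    """Approximate the k leading eigenvectors of a symmetric 0/1 adjacency
    matrix by power iteration with deflation."""
    n = len(A)
    # Every eigenvalue of A is >= -(max degree), so A + shift*I has no
    # negative eigenvalues and power iteration finds the algebraically largest.
    shift = max(sum(row) for row in A)
    rng = random.Random(seed)
    found = []
    for _ in range(k):
        x = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            # Deflation: remove components along eigenvectors already found.
            for y in found:
                dot = sum(a * b for a, b in zip(x, y))
                x = [a - dot * b for a, b in zip(x, y)]
            x = [sum(A[i][j] * x[j] for j in range(n)) + shift * x[i]
                 for i in range(n)]
            norm = math.sqrt(sum(a * a for a in x))
            x = [a / norm for a in x]
        found.append(x)
    return found


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) *
                  math.sqrt(sum(q * q for q in b)))


# Two disconnected communities: a 4-clique (nodes 0-3) and a 3-clique (nodes 4-6).
communities = [range(0, 4), range(4, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for community in communities:
    for u in community:
        for v in community:
            if u != v:
                A[u][v] = 1

vectors = top_eigenvectors(A, k=2)
# Spectral coordinate of node u: (first eigenvector[u], second eigenvector[u]).
coords = [tuple(vec[u] for vec in vectors) for u in range(n)]

print(round(cosine(coords[0], coords[3]), 6))  # same community: vectors are parallel
print(round(cosine(coords[0], coords[5]), 6))  # different communities: orthogonal
```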

25 Example: Polbook Network
[Figure: spectral coordinates]

26 Evaluation on Web Spam Challenge Data [ICDE11]
- Data: a snapshot of websites in the .UK domain (2007), with 114K nodes and 1.8M links, plus a mix of 8 RLAs with varied sizes and connection patterns
- SPCTRA: based on spectral space
- GREEDY: based on outer-triangles [Shrivastava, ICDE08]
- Much faster: 36s vs. 26h

27 Outline
- Introduction
- Privacy Preserving Social Network Analysis: input perturbation; output perturbation
- Fraud Detection: spectral analysis of graph topology; detecting random link attacks; detecting weak anomalies
- Sample Projects
- Conclusions and Future Work

28 Privacy Preserving Data Mining (NSF CAREER)

29 Genetic Privacy (NSF SCH, pending): BIBM13 Best Paper Award

30 oSafari (NSF SaTC)

31 Manipulation in E-Commerce (NSF III, pending)
[Diagram labels: Structured Topic Analysis; Spectral Bipartite Graph Analysis; D-S based Evidence Fusion; Bot-committed; Money-motivated; Reviews; Ratings; Ranks]

32 GWAS
Genome-wide association studies (GWAS) typically focus on associations between single-nucleotide polymorphisms (SNPs) and human traits such as common diseases.

33 Privacy Preserving Database Application Testing (NSF )
[Diagram labels: ER; Data; DDL; Catalog; Production db; RNRS; Conflict resolution; Disclosure Assessment; Rule Analyzer; R’NR’S’; Schema & Domain Filter; Schema’; Domain’; Data Generator; Mock DB; User]

34 Data Generation for Testing DB Applications (NSF )
How to generate data to cover paths?

35 Outline
- Introduction
- Privacy Preserving Social Network Analysis: input perturbation; output perturbation
- Fraud Detection: spectral analysis of graph topology; detecting random link attacks; detecting weak anomalies
- Sample Projects
- Conclusions and Future Work

36 Big Data Computing
Drowning in data
- Volume, Velocity, Variety, and Veracity
- 2.5 exabytes every day
- Web data, healthcare, e-commerce, social network
Advancing technology
- Cheap storage and processing power
- Growth in huge data centers
- Data is in the "cloud": Amazon AWS, Hadoop, Azure
- Computing is in the "cloud"

37 Social Media Customer Analytics
Data sources:
- Network topology (friendship, followship, interaction)
- Structured profile
- Transaction database
- Unstructured text (e.g., blog, tweet)
- Retweet sequence
- Product and review
Example tables:
- (name, sex, age, disease, salary): Ada, F, 18, cancer, 25k; Bob, M, 25, heart, 110k; …
- (id, sex, age, address, income): 5, F, Y, NC, 25k; 3, M, Y, SC, 110k
Topics: entity resolution, patterns, temporal/spatial, scalability, visualization, sentiment, privacy
Velocity, Variety: 10GB of tweets per day
Belk and Lowe's; Chancellor's special fund

40 Samsung AVC Denial Log Analysis
- Volume and velocity: 1 million log files per day, each with thousands of entries
- Tools: S3, Hive, and EMR

41 Drivers of Data Computing
- 6A's: Anytime, Anywhere, Access to Anything by Anyone Authorized
- 4V's: Volume, Velocity, Variety, Veracity
- Reliability, Security, Privacy, Usability

42 Thank You! Questions?
- Collaborators: Aidong Lu, Xinghua Shi, Jun Li (Oregon), Dejing Dou (Oregon), Tao Xie (UIUC)
- Doctoral graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying
- Doctoral students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)

