Privacy Preserving Outlier Detection using Locality Sensitive Hashing

1 Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Nisarg Raval, Madhuchand Rushi Pillutla, Piyush Bansal, K Srinathan and C V Jawahar
CSTAR, IIIT Hyderabad, India

2 Motivation

3 Motivation Trusted Third Party (TTP)

4 Motivation
Trusted Third Party (TTP). Can we avoid the TTP?

5 Motivation Simulate Trusted Third Party

6 Privacy Preserving Outlier Detection
Alice and Bob each have a database of customer behavior. Together they want to find the fraudulent customers (outliers) in their respective databases. Only the outliers should be revealed; all other individual data should remain private.

7 Outlier Detection Approach
Statistics based: Barnett et al. John Wiley 1994
Density based: Papadimitriou et al. ICDE 2003
Distance based: Knorr et al. VLDB 1998; Ramaswamy et al. SIGMOD 2000; Wang et al. ICDE 2011

8 Privacy Preserving Data Mining
Heuristic based: Atallah et al. KDEW 1999; Verykios et al. KDE 2003
Reconstruction based: Agrawal et al. SIGMOD 2000; Rizvi et al. VLDB 2002
Cryptography based: Lindell et al. CRYPTO 2000; Clifton et al. SIGKDD 2002; Verykios et al. SIGMOD 2004

9 Related Work
Vaidya et al. ICDM 2004: pairwise distance computation using Secure Distance and Secure Comparison protocols.
Zhou et al. EBISS 2009: Homomorphic Encryption and Randomization.

10 Related Work
Vaidya et al. ICDM 2004: pairwise distance computation using Secure Distance and Secure Comparison protocols.
Zhou et al. EBISS 2009: Homomorphic Encryption and Randomization.
Quadratic cost: approximately 10^12 operations on 1 million data points.
Our method is 10000 times faster on 1 million data points!
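The 10^12 figure follows from counting pairs. As a rough check, writing n for the number of data points (a symbol introduced here, not on the slide), the number of pairwise distances is:

```latex
\binom{n}{2} = \frac{n(n-1)}{2} \approx 5 \times 10^{11}
\quad \text{for } n = 10^{6},
\qquad \text{which is of order } n^{2} = 10^{12}.
```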

11 Outlier Detection Distance based outliers [Knorr et al. VLDB 1998]
An object is an outlier if a very large fraction of all objects lies outside the specified radius around it.

13 Outlier Detection Distance based outliers [Knorr et al. VLDB 1998]
An object is an outlier if a very large fraction of all objects lies outside the specified radius around it.
Figure labels: Non Neighbors, Neighbors.

14 Outlier Detection Distance based outliers [Knorr et al. VLDB 1998]
An object is an outlier if a very large fraction of all objects lies outside the specified radius around it.
Figure labels: Outlier, Non Neighbors, Neighbors.
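The definition can be checked directly by brute force, which is the quadratic baseline the talk wants to avoid. The following is only a sketch: the function name, the Euclidean metric, and the parameters `radius` and `fraction` are illustrative assumptions, not taken from the slides.

```python
import math

def distance_based_outliers(points, radius, fraction):
    """Return indices of points for which at least `fraction` of the
    other points lie farther away than `radius` (brute force, O(n^2))."""
    outliers = []
    n = len(points)
    for i, p in enumerate(points):
        # Count the non-neighbors of p: points outside the radius.
        far = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) > radius
        )
        if far >= fraction * (n - 1):
            outliers.append(i)
    return outliers

# Four clustered points and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = distance_based_outliers(points, radius=1.0, fraction=0.9)
print(result)
```

Every point is compared with every other point, so the cost grows quadratically with the database size.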

16 Our Approach Converse of the definition
An object is a non-outlier if it has enough neighbors within the specified radius.

18 Our Approach Converse of the definition
An object is a non-outlier if it has enough neighbors within the specified radius.
Figure labels: Neighbors, Non Neighbors.

19 Our Approach Converse of the definition
An object is a non-outlier if it has enough neighbors within the specified radius.
Figure labels: Non-Outlier, Neighbors, Non Neighbors.

20 Our Approach Converse of the definition
An object is a non-outlier if it has enough neighbors within the specified radius. It is easy to find a small number of neighbors!
Figure labels: Non-Outlier, Neighbors, Non Neighbors.

21 Locality Sensitive Hashing (LSH)
Property: Similar objects are hashed to the same bin.
Condition, Hash Family: [formulas missing]
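For reference, the standard textbook definition of a locality-sensitive hash family is given below. It is not recovered from the slide itself, and the symbols (d for the distance, r_1, r_2, p_1, p_2 for the thresholds and probabilities) are the usual ones rather than the authors' notation.

```latex
\text{A family } \mathcal{H} \text{ is } (r_1, r_2, p_1, p_2)\text{-sensitive, with } r_1 < r_2 \text{ and } p_1 > p_2, \text{ if for all } u, v:
\begin{aligned}
d(u, v) \le r_1 &\;\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(u) = h(v)\bigr] \ge p_1, \\
d(u, v) \ge r_2 &\;\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(u) = h(v)\bigr] \le p_2 .
\end{aligned}
```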

22 Centralized Algorithm
Outlier Detection -> Find Non Outliers -> Near Neighbor Queries -> LSH
Reference: Madhuchand Rushi Pillutla, Nisarg Raval, Piyush Bansal, Kannan Srinathan and C.V. Jawahar. LSH Based Outlier Detection and Its Application in Distributed Setting. CIKM 2011.

23 Distributed Settings Vertically distributed data
Vertically distributed data: each player has different attributes for the same set of objects.

Person ID   Bank Records   Tax Records   Police Records
P1          BR1            TR1           PR1
P2          BR2            TR2           PR2
P3          BR3            TR3           PR3
P4          BR4            TR4           PR4

24 Distributed Settings Vertically distributed data
Vertically distributed data: each player has different attributes for the same set of objects.
Horizontally distributed data: each player has the same attributes for a subset of the total objects.

Person ID   Bank Records   Tax Records   Police Records
P1          BR1            TR1           PR1
P2          BR2            TR2           PR2
P3          BR3            TR3           PR3
P4          BR4            TR4           PR4

25 Privacy in Vertical Distribution
Two phases:
1. Generate the global LSH bin structure using information from all the players.
2. Find non-outliers using the generated global bin structure.

26 Privacy in Vertical Distribution
Two phases:
1. Generate the global LSH bin structure using information from all the players.
2. Find non-outliers using the generated global bin structure.
How do we generate the LSH bin structure privately?

27 Privacy in Vertical Distribution
Two phases:
1. Generate the global LSH bin structure using information from all the players.
2. Find non-outliers using the generated global bin structure.
Phase 1 uses Private Hash Evaluation: LSH based on p-stable distributions.
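For orientation, a commonly used p-stable LSH function (for p = 2, the Gaussian case) has the form h(v) = floor((a.v + b) / w), where a has Gaussian entries and b is uniform in [0, w). The sketch below shows the non-private, centralized version; the helper name `make_pstable_hash` and the parameter values are illustrative assumptions, and the slides do not state which parameters the authors use.

```python
import math
import random

def make_pstable_hash(dim, width, rng):
    """Build one hash function h(v) = floor((a.v + b) / width).

    `a` is drawn from a standard Gaussian (a 2-stable distribution) and
    `b` uniformly from [0, width]; both stay fixed for the lifetime of h.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / width)

    return h

rng = random.Random(0)
h = make_pstable_hash(dim=3, width=4.0, rng=rng)
label = h((1.0, 2.0, 3.0))   # integer bin label for this vector
print(label)
```

The only data-dependent quantity is the dot product a.v, which is why the private version reduces to evaluating that dot product securely.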

28 Privacy in Vertical Distribution
Two phases:
1. Generate the global LSH bin structure using information from all the players.
2. Find non-outliers using the generated global bin structure.
Secure Evaluation of the Dot Product (a.v):
Each player generates the entries of vector a that correspond to the dimensions of v it holds.
Each player adds up the corresponding products to obtain its share of the dot product.
The Secure Sum protocol combines the shares into the final dot product.
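The three steps can be sketched in a single process as follows. This is only an illustration under stated assumptions: the ring-style masking used for Secure Sum, the fixed-point `SCALE`, the public `MODULUS`, and all function names are choices made here, and the slides do not say which Secure Sum protocol the authors use. The entries of a are fixed constants for readability; in the protocol each player would draw its own entries at random.

```python
import random

MODULUS = 2**61 - 1   # assumed public modulus, far larger than any real sum
SCALE = 10**6         # assumed fixed-point scaling for real-valued shares

def local_share(a_part, v_part):
    """Partial dot product over the dimensions one player owns."""
    return sum(a * v for a, v in zip(a_part, v_part))

def secure_sum(shares, rng):
    """Ring-style secure sum, simulated locally.

    The initiator hides its value behind a random mask, every other
    player adds its own value to the running total, and the initiator
    removes the mask at the end, so no player sees another's share.
    """
    encoded = [round(s * SCALE) % MODULUS for s in shares]
    mask = rng.randrange(MODULUS)
    running = (mask + encoded[0]) % MODULUS
    for value in encoded[1:]:
        running = (running + value) % MODULUS
    total = (running - mask) % MODULUS
    if total > MODULUS // 2:      # map back to a signed value
        total -= MODULUS
    return total / SCALE

# Three players, each holding some dimensions of v and the matching part of a.
a_parts = [(0.5, -1.0), (2.0,), (0.25, 4.0)]
v_parts = [(1.0, 2.0), (3.0,), (-4.0, 0.5)]
shares = [local_share(a, v) for a, v in zip(a_parts, v_parts)]
dot = secure_sum(shares, random.Random(42))
print(dot)
```

In a real deployment the running total would travel over the network from player to player, and the masking scheme would need to resist collusion between neighbors in the ring.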

29 Privacy in Vertical Distribution
Two phases:
1. Generate the global LSH bin structure using information from all the players.
2. Find non-outliers using the generated global bin structure.
Phase 2 performs near neighbor queries with Secure Distance and Secure Comparison. Many neighbors -> quadratic communication.

30 Privacy in Vertical Distribution
Two phases:
1. Generate the global LSH bin structure using information from all the players.
2. Find non-outliers using the generated global bin structure.
Phase 2 performs near neighbor queries with Secure Distance and Secure Comparison. Many neighbors -> quadratic communication.
Can we break the quadratic bound?

31 Approximate Near Neighbor Queries
The definition of outliers is subjective.
Unlike traditional LSH queries: NO explicit distance calculation, and no communication required.
Flowchart: Hash Objects -> Count Neighbors -> enough neighbors? Yes: Non Outlier. No: Outlier.
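The flowchart amounts to bucketing objects and counting bin occupancy, with no distance computation at all. A minimal sketch with a single hash table follows; the function name, the threshold `min_neighbors`, and the toy grid hash standing in for a real LSH function are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def classify_by_bin_count(objects, hash_fn, min_neighbors):
    """Label each object using only the occupancy of its hash bin."""
    bins = defaultdict(list)
    for idx, obj in enumerate(objects):          # Hash Objects
        bins[hash_fn(obj)].append(idx)

    labels = {}
    for members in bins.values():                # Count Neighbors
        neighbors = len(members) - 1             # exclude the object itself
        verdict = "non-outlier" if neighbors >= min_neighbors else "outlier"
        for idx in members:
            labels[idx] = verdict
    return labels

# Toy 1-D data; floor(x) is a stand-in for a real LSH function.
objects = [(0.1,), (0.2,), (0.3,), (7.5,)]
labels = classify_by_bin_count(objects, lambda v: math.floor(v[0]), min_neighbors=2)
print(labels)
```

Objects sharing a bin are treated as neighbors, so the result is approximate by construction.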

32 Need for Pruning
Number of queries = number of objects in the database, and databases are very large.
Flowchart: Hash Objects -> Count Neighbors -> enough neighbors? Yes: Non Outlier. No: Outlier.
Can we reduce the number of queries?

33 Pruning
Flowchart: Hash Objects -> Count Neighbors -> enough neighbors? Yes / No: Outlier.

34 Pruning
Neighbors of a non-outlier are also non-outliers.
Flowchart: Hash Objects -> Count Neighbors -> enough neighbors? Yes: Non Outliers. No: Outlier.

35 Pruning
Neighbors of a non-outlier are also non-outliers.
Flowchart: Hash Objects -> Count Neighbors -> enough neighbors? Yes: Non Outliers. No: Outlier.
Less than 1% of the total database needs to be processed!
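The pruning rule can be sketched as follows: once an object is confirmed as a non-outlier, every object sharing a bin with it is marked as a non-outlier as well and is never queried. The function name, the use of several hash tables, and the toy hash functions are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import defaultdict

def find_outliers_with_pruning(objects, hash_fns, min_neighbors):
    """Return (outlier indices, number of neighbor queries actually run)."""
    # One table per hash function: bin label -> set of object indices.
    tables = []
    for h in hash_fns:
        table = defaultdict(set)
        for idx, obj in enumerate(objects):
            table[h(obj)].add(idx)
        tables.append(table)

    non_outliers, outliers, queries = set(), [], 0
    for idx, obj in enumerate(objects):
        if idx in non_outliers:
            continue                      # pruned: no query needed
        queries += 1
        neighbors = set()
        for h, table in zip(hash_fns, tables):
            neighbors |= table[h(obj)]    # everything colliding with obj
        neighbors.discard(idx)
        if len(neighbors) >= min_neighbors:
            non_outliers.add(idx)
            non_outliers |= neighbors     # neighbors are non-outliers too
        else:
            outliers.append(idx)
    return outliers, queries

objects = [(0.1,), (0.2,), (0.3,), (0.4,), (9.0,)]
hash_fns = [lambda v: math.floor(v[0]), lambda v: math.floor(v[0] + 0.5)]
found, queries = find_outliers_with_pruning(objects, hash_fns, min_neighbors=2)
print(found, queries)
```

In this toy run only two of the five objects are queried, which is the effect the slide quantifies for large databases.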

36 Privacy in Horizontal Distribution
The data is the union of the sets of objects held by all players.
Steps:
1. Generate the local LSH bin structure.
2. Perform local pruning.
3. Communicate to obtain global neighbor information.
4. Perform global pruning.

37 Privacy in Horizontal Distribution
The data is the union of the sets of objects held by all players.
Steps:
1. Generate the local LSH bin structure.
2. Perform local pruning.
3. Communicate to obtain global neighbor information.
4. Perform global pruning.
How do we obtain global neighbor information privately?

38 Private Global Bin Structure
Construct the global LSH bin labels: Secure Union protocol.
Add the counts of objects in corresponding bins: Secure Sum protocol.
Perform global pruning using the global bin structure.
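A single-process sketch of the two steps is shown below. The per-bin counts are combined with additive secret sharing as one possible realization of Secure Sum; the plain set union is only a placeholder for the Secure Union protocol and offers no privacy. The names, the `MODULUS`, and the example bin labels are assumptions made for illustration.

```python
import random

MODULUS = 2**31 - 1   # assumed public modulus, larger than any bin count

def split_into_shares(value, n_players, rng):
    """Split `value` into n random shares that sum to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_players - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def global_bin_counts(local_counts, rng):
    """Combine per-player {bin label: count} maps into global counts."""
    n = len(local_counts)
    labels = set().union(*local_counts)   # placeholder for Secure Union
    result = {}
    for label in sorted(labels):
        received = [0] * n                # received[j]: shares sent to player j
        for counts in local_counts:
            shares = split_into_shares(counts.get(label, 0), n, rng)
            for j, share in enumerate(shares):
                received[j] = (received[j] + share) % MODULUS
        # Each player publishes only its sum of received shares.
        result[label] = sum(received) % MODULUS
    return result

alice = {(0, 1): 3, (2, 2): 1}
bob = {(0, 1): 2, (5, 0): 1}
merged = global_bin_counts([alice, bob], random.Random(7))
print(merged)
```

Each individual share is uniformly random, so a player learns nothing from the shares it receives beyond the final per-bin totals.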

39 Approximation Error
LSH is probabilistic. Probability of being near neighbor is at least
False neighbors may cause pruning of an outlier: False Negatives.
How do we reduce False Negatives?

40 Reducing False Negatives
Bin Threshold (BT): an object counts as a neighbor only if it appears in at least BT bins.
Increasing BT will decrease False Negatives.
Flowchart: Hash Objects -> Count Neighbors -> enough neighbors? Yes: Non Outlier. No: Outlier.
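A small sketch of the Bin Threshold rule: with several hash tables, count in how many tables a candidate collides with the query object and keep it only if that count reaches BT. The data layout (`bin_labels[t][i]` is the bin of object i in table t) and the function name are assumptions made here.

```python
from collections import Counter

def neighbors_with_bin_threshold(query, bin_labels, bin_threshold):
    """Return objects sharing a bin with `query` in at least BT tables."""
    hits = Counter()
    for labels in bin_labels:                 # one label list per hash table
        for idx, label in enumerate(labels):
            if idx != query and label == labels[query]:
                hits[idx] += 1
    return sorted(idx for idx, count in hits.items() if count >= bin_threshold)

# Three hash tables over four objects (hypothetical bin labels).
bin_labels = [
    [0, 0, 0, 1],
    [2, 2, 3, 2],
    [5, 5, 5, 6],
]
loose = neighbors_with_bin_threshold(0, bin_labels, bin_threshold=1)
strict = neighbors_with_bin_threshold(0, bin_labels, bin_threshold=3)
print(loose, strict)
```

Raising the threshold discards objects that collide only occasionally, which are the likeliest false neighbors.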

41 Bin Threshold
The Bin Threshold may remove actual neighbors.
A high Bin Threshold reduces pruning efficiency: False Positives.
How do we reduce False Positives without increasing False Negatives?

42 Reducing False Positives
Multiple runs:
Iteration 1: Compute Parameters -> Generate Bin Structure (LSH) -> Find Near Neighbors -> Prune Non Outliers (Pruning)
Iteration 2: Compute Parameters -> Generate Bin Structure (LSH) -> Find Near Neighbors -> Prune Non Outliers (Pruning)
...
Iteration n: Compute Parameters -> Generate Bin Structure (LSH) -> Find Near Neighbors -> Prune Non Outliers (Pruning)
Output: Intersection of Results -> Final Set of Outliers
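The final step is a plain set intersection over the per-iteration results: an object is reported only if every run flagged it. A minimal sketch, with a hypothetical function name and made-up object ids:

```python
def intersect_runs(outlier_sets):
    """Keep only the objects reported as outliers in every iteration."""
    if not outlier_sets:
        return set()
    return set.intersection(*map(set, outlier_sets))

# Outlier ids reported by three independent iterations (hypothetical).
runs = [{3, 7, 9}, {3, 9, 12}, {3, 9}]
final_outliers = intersect_runs(runs)
print(sorted(final_outliers))
```

A true outlier that is never pruned survives every run, while a false positive has to be missed by the pruning in each run to stay in the output.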

43 Analysis

Setting       Rounds   Communication   Computation
Centralized
Horizontal    3
Vertical      2

The security of the algorithm depends on the security of the Secure Union and Secure Sum protocols.

44 Experimental Results
Datasets:

Dataset     Objects   Attributes
Corel       68040     32
MiniBooNE   130064    50
Landsat     275465    60
Darpa       458301    23
Household             3

45 Effect of Bin Threshold ( BT )
Increasing BT will increase the detection rate but also increase false positives.
Optimal BT: high detection rate, low false positives.
Plot panels: Corel, Landsat, Darpa, Household.

46 Effect of Iterations on False Positives
False positives decrease exponentially with the number of iterations.
A very small number of iterations is needed to achieve a low false positive rate.
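One way to see why the decrease is exponential, under the simplifying assumption (made here, not stated on the slide) that iterations use independent hash functions and that a given non-outlier is wrongly reported in a single run with probability q:

```latex
\Pr[\text{reported in all } n \text{ runs}] = q^{n},
\qquad \text{e.g. } q = 0.3,\; n = 4 \;\Rightarrow\; 0.3^{4} = 0.0081 .
```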

47 Communication
Less than quadratic; superior to the previously known best results.
Up to [...] times less communication on datasets of size 10^6!
Plot panels: Corel, Landsat, Darpa, Household.

48 Performance

Dataset
Corel       0.011   0.62   82     147
MiniBooNE   0.006   0.04   157    366
Landsat     0.014   0.36   331    929
Darpa       0.031   0.68   550    1103
Household   0.009   0.14   1200   1326

False Positives can be considered as borderline outliers!

49 Conclusion Approximate Outlier Detection
Efficient private algorithms for both vertical and horizontal distribution.
Efficient pruning based on LSH.
Scalable to large and high-dimensional data.
Trade-off between accuracy and cost.

50 CSTAR Supported by Microsoft Research India Travel Grant

