Privacy-Preserving Databases and Data Mining Yücel SAYGIN

Privacy-Preserving Databases and Data Mining Yücel SAYGIN ysaygin@sabanciuniv.edu http://people.sabanciuniv.edu/~ysaygin/

Privacy and data mining There are two aspects of data mining when we look at it from a privacy perspective Being able to mine the data without seeing the actual data Protecting the privacy of people against the misusage of data

How can we protect the sensitive knowledge against data mining? Types of sensitive knowledge that could be extracted via data mining techniques are Patterns (Association rules, sequences) Clusters that describe the data Classification models for prediction

Association Rule Hiding Large amounts of customer transaction data is collected in supermarket chains to find association rules in customer buying patterns lots of research conducted on finding association rules efficiently and tools were developed. Association rule hiding algorithms are deterministic with given support and confidence thresholds Therefore association rules are a good starting point.

Motivating examples Sniffing prozac users

Association Rule Hiding  Rules: “Body ® Head”  Ex1: “Diapher ® Beer”  Ex2: “Internetworking with TCP/IP ® ” Interconnections: bridges, routers,…”  parameters: (support, confidence)  Minimum Support, and Confidence Thresholds are used to prune the non-significant rules

Min. support 50% Min. confidence 70%

Algorithms for Rule Hiding What we try to achieve is: Let D be the source database Let R be the set of significant association rules that are mined from D with certain thresholds Let ri be a sensitive rule in R Transform D into D’ so that all rules in R can still be mined from D’ except ri It was proven that optimal hiding of association rules with minimal side effects is NP-Hard

Heuristic Methods We developed heuristics to deal with the problem. Different techniques are implemented based on: Modifying the database by inserting false data or by removing some data. Inserting unknown values to fuzzify the rules

Basic Approach for Rule Hiding Reduce the support of confidential rules Reduce the confidence of rules This way prevent tools to discover these rules The challenge is the data quality Our metric for data quality is the number of rules that can still be mined and the number of rules that appear as a side effect We developed heuristic algorithms to minimize the newly appearing rules, and to minimize the accidentally hidden rules.

Basics of the heuristic algorithms If we want to remove an item from a transaction to reduce the support or the confidence Which item should we start from Which transaction should we choose to hide the selected item We can either Select an item and a transaction in round robin fashion, I.e., select the next item from the next transaction that supports that item, and move to another item and another transaction. Select the item that will probably have the minimal impact on the other rules

Basics of rule hiding conf(X=>Y) = sup(XY)/sup(X) Decreasing the confidence of a rule can be done by: Increasing the support of X in transactions not supporting Y Decreasing the support of Y in transactions supporting both X and Y Decreasing support of rule can be done by Decreasing the support of the corresponding large itemset XY

Min. support 20% Min. confidence 80%

Hiding AB->C by increasing support of AB

Hiding AB->C by decreasing support of ABC

Hiding AB->C by decreasing the support of C

Rule Hiding by Fuzzification In some applications where publishing wrong data is not acceptable, then unkown values may be inserted to blur the rules. When unknowns values are inserted, support and confidence values would fall into a range instead of a fixed value. Similar heuristics for rule hiding can be employed to minimize the side effects

Support and confidence Becomes a range of values

Classification model as a threat to privacy Document classification for authorship identification Main idea: based on a database of documents and authors, assign the most probable author to a new document It’s a possible threat to privacy when the text needs to stay anonymous

The fact that each author uses a characteristic frequency distribution over words and phrases helps us Feature representation used: T: total number of tokens V: total number of types C: total number of characters Classify the document by a learning algorithm and then try to perturb the classification Classification model as a threat to privacy

Another Motivating Application Given a set of attribute values that are confidential and therefore downgraded by inserting unknown values for the place of actual ones before being released. Can someone build a classification model using the rest of the attributes to predict the hidden value?

Classification Models as a threat to privacy How do we prevent a row to be classified as class C by perturbing the data. Main challenge is that (unlike association rule mining) resulting classification models depend on the technique, selected training data, and pruning methodology. Purpose is to decrease the accuracy of the classification model. Approach : inserting unknown values to the selected attribute values in the rest of the database.

Mining the data without actually seeing it Things that we need to consider are: Data type Data mining technique Data distribution Centralized Distributed (vertically or horizontally)

Reference: Rakesh Agrawal and Ramakrishnan Srikant. “Privacy- Preserving Data Mining”. SIGMOD, 2000, Dallas, TX. They developed a technique for consturcting a classification model on perturbed data. The data is assumed to be stored in a centralized database And it is outsourced to a third party for mining, therefore the confidential values need to be handled The following slides are based on the slides by the authors of the paper above Classification on perturbed data

Reconstruction Problem Original values x 1, x 2,..., x n from probability distribution X (unknown) To hide these values, we use y 1, y 2,..., y n from probability distribution Y Given x 1 +y 1, x 2 +y 2,..., x n +y n the probability distribution of Y Estimate the probability distribution of X.

Intuition (Reconstruct single point) Use Bayes' rule for density functions

Reconstructing the Distribution Combine estimates of where point came from for all the points: Gives estimate of original distribution.

Reconstruction: Bootstrapping f X 0 := Uniform distribution j := 0 // Iteration number repeat f X j+1 (a) := (Bayes' rule) j := j+1 until (stopping criterion met)

Shown to work in experiments on large data sets.

Algorithms “Global” Algorithm Reconstruct for each attribute once at the beginning “By Class” Algorithm For each attribute, first split by class, then reconstruct separately for each class. See SIGMOD 2000 paper for details.

Experimental Methodology Compare accuracy against – Original: unperturbed data without randomization. – Randomized: perturbed data but without making any corrections for randomization. Test data not randomized. Synthetic data benchmark. Training set of 100,000 records, split equally between the two classes.

Quantifying Privacy Add a random value between -30 and +30 to age. If randomized value is 60 know with 90% confidence that age is between 33 and 87. Interval width  amount of privacy. Example: (Interval Width : 54) / (Range of Age: 100)  54% randomization level @ 90% confidence

Privacy Preserving Distributed Data Mining Consider the case where data is distributed horizontally or vertically to multiple sites. Each site is autonomous and does not want to share their actual data Lets consider the following scenario: There are multiple hospitals that have their own local database, and they would like to participate in a scientific study that will analyze the results of treatements for different patients The privacy concern here is that, a hospital would not like to share the knowledge unless the other site also has it, to protect the privacy of itself and its operation Another scenario: Two bookstores would like to learn what books are sold together so that they make some offers to their companies (Amazon does that actually)

Case study: Association rules How do we mine association rules from distributed sources while preserving the privacy of the data owners? The confidential information in this case is: The data itself The fact that a local site supports a rules with certain confidence and certain support (No company wants to loose competitive advantage, and would not like to reveal anything if it will not benefit from the release of the data) Privacy preserving distributed association rule mining methods use distributed rule mining techniques

Distributed rule mining We know how rules are mined from centralized databases The distributed scenario is similar Consider that we have only two sites S1 and S2, which have databases D1 (with 3 transactions) and D2 (with 5 transactions)

Distributed rule mining We would like to mine the databases as if they are parts of a single centralized database of 8 transactions In order to do this, we need to calculate the local supports For example the local support of A in D1 is 100% The local support of the itemset {A,B,C} in D1 is 66%, and the local support of {A,B,C} in D2 is 40%.

Distributed rule mining Assume that the minimum support threshold is 50% then {A,B,C} is frequent in D1, but it is not frequent in D2. However when we assume that the databases are combined then the support of {A,B,C} in D1 U D2 is 50% which means that an itemset could be locally frequent in one database, but not frequent in another database. And it can be frequent globally In order for an itemset ot be frequent globally, it should be frequent in at least one database

Distributed rule mining The algorithm is based on apriori which prunes the rules by looking at the support Apriori also uses the fact that an itemset is frequent only if all its subsets are frequent Therefore only frequent itemsets should be used to generated larger frequent itemsets

Distributed rule mining 1)The local sites will find their frequent itemsets. 2)They will broadcast the frequent itemsets to each other 3)Individual sites will count the frequencies of the itemsets in their local database 4)They will broadcast the result to every site 5)Every site can now find globally frequent itemsets

Distributed rule mining Ex: 50% min supp threshold ● We will start from a singletons and calculate the frequencies of items ● In D1 A (freq 3), B (freq 2), C (freq 3) are frequent, in D2 A (freq 4), B (freq 3), C (freq 3) are frequent ● They will broadcast the results to each other and each site will update the counts of A, B, C by adding the local counts

Distributed rule mining Ex: 50% min supp threshold ● Each site will eliminate the items that are not globally frequent. In this case all of A, B, C are globally frequent. Now ● Now using the frequent items, each site will generate candidates of size 2 which are {A,B}, {A,C}, {B,C} ● And the same steps will be applied

Now we would like to do the same thing but preserve the privacy of the individual sites The basic notions we need for that are Commutative encryption And Secure multi-party computation An encryption is commutative if the following two equations hold for any given feasible encryption keys K1, K2,... Kn, any M, and any permutations of i,j E Ki1 (... E Kin (M)) = EK Kj1 (...E kjn (M)) For different M1, and M2 the probablity of collusion is very low RSA is a famous commutative encryption technique

A simple application of commutative encryption Assume that person A has salary S1, and person B has salary S2. How can they know wheather their salaries are equal to each other? (without revealing their salaries) Assume that A, and B have their own encryption keys, say K1, and K2. And we go from there!

Distributed PP Association Rule Mining For distributed association rule mining, each site needs to distribute its locally frequent itemsets to the rest of the sites Instead of circulating the actual itemsets, the ecrypted versions are circulated Example: S1 contains A, S2 contains B, S3 contains A. Each of them have their own keys, K1, K2, K3. At the end of step 1, each all sites will have items encrypted by all sites. The encrypted items are then passed to a common site to eliminate the duplicates and to start decryption. This was they will not know who has sent which item. Decryption can now start and after everybody finished decrypting, then they will have the actual items.

Distributed PP Association Rule Mining Now we need to see if the global support of an item is larger than the threshold. We we do not want to reveal the supports, since support of an item is assumed to be confidential. A secure multi-party computation technique is utilized for this Assume that there are three sites, and each of them has {A,B,C} and freq in S1 is 5 (out of 100 transactions), in S2 is 6 (out of 300), and in S3 20 (out of 300), and minimum support is 5%. S1 selects a random number, say 17 S1 adds the difference 5 – 5%x100 to 17 and sends the result (17) to S2 S2 adds 6 – 5%x200 to 17 and sends the result (13) to S3. S3 adds 20 – 5%x300 to 13 and sends the result (18) back to S1 18 > the chosen random number (17), so {A,B,C} is globally frequent.

Distributed PP Association Rule Mining This technique assumes a semi-honest model Where each party follows the rules of the protocol using its correct input, but it is free to later use what it sees during execution of the protocol to compromise security. Cost of encryption is the key issue since it is heavily used in this method.

Privacy-Preserving Databases and Data Mining Yücel SAYGIN

Similar presentations

Presentation on theme: "Privacy-Preserving Databases and Data Mining Yücel SAYGIN"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Privacy-Preserving Databases and Data Mining Yücel SAYGIN

Similar presentations

Presentation on theme: "Privacy-Preserving Databases and Data Mining Yücel SAYGIN"— Presentation transcript:

Similar presentations

About project

Feedback