Differentially Private Recommendation Systems Jeremiah Blocki Fall 2009 18739A: Foundations of Security and Privacy.

Similar presentations
I have a DREAM! (DiffeRentially privatE smArt Metering) Gergely Acs and Claude Castelluccia, INRIA 2011.

By Klejdi Muca & Stephen Quinn. A method used by companies like IMDB or Netflix to turn raw data into useful information; for example, it helps companies.
Item Based Collaborative Filtering Recommendation Algorithms
Wavelet and Matrix Mechanism CompSci Instructor: Ashwin Machanavajjhala 1Lecture 11 : Fall 12.
Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia.
Ragib Hasan Johns Hopkins University Spring 2011 Lecture 8 04/04/2011 Security and Privacy in Cloud Computing.
Private Analysis of Graph Structure With Vishesh Karwa, Sofya Raskhodnikova and Adam Smith Pennsylvania State University Grigory Yaroslavtsev
The End of Anonymity Vitaly Shmatikov. Tastes and Purchases slide 2.
Jeff Howbert Introduction to Machine Learning Winter Collaborative Filtering Nearest Neighbor Approach.
COLLABORATIVE FILTERING Mustafa Cavdar Neslihan Bulut.
Adding Privacy to Netflix Recommendations Frank McSherry, Ilya Mironov (MSR SVC) Attacks on Recommender Systems — No “blending in”, auxiliary information.
Seminar in Foundations of Privacy 1.Adding Consistency to Differential Privacy 2.Attacks on Anonymized Social Networks Inbal Talgam March 2008.
Rubi’s Motivation for CF  Find a PhD problem  Find “real life” PhD problem  Find an interesting PhD problem  Make Money!
Differential Privacy 18739A: Foundations of Security and Privacy Anupam Datta Fall 2009.
1 Preserving Privacy in Collaborative Filtering through Distributed Aggregation of Offline Profiles The 3rd ACM Conference on Recommender Systems, New.
Recommendations via Collaborative Filtering. Recommendations Relevant for movies, restaurants, hotels…. Recommendation Systems is a very hot topic in.
Malicious parties may employ (a) structure-based or (b) label-based attacks to re-identify users and thus learn sensitive information about their rating.
April 13, 2010 Towards Publishing Recommendation Data With Predictive Anonymization Chih-Cheng Chang †, Brian Thompson †, Hui Wang ‡, Danfeng Yao † †‡
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Foundations of Privacy Lecture 11 Lecturer: Moni Naor.
Differentially Private Transit Data Publication: A Case Study on the Montreal Transportation System Rui Chen, Concordia University Benjamin C. M. Fung,
Multiplicative Weights Algorithms CompSci Instructor: Ashwin Machanavajjhala 1Lecture 13 : Fall 12.
Defining and Achieving Differential Privacy Cynthia Dwork, Microsoft.
Performance of Recommender Algorithms on Top-N Recommendation Tasks RecSys 2010 Intelligent Database Systems Lab. School of Computer Science & Engineering.
Relational Database Concepts. Let’s start with a simple example of a database application Assume that you want to keep track of your clients’ names, addresses,
Ragib Hasan University of Alabama at Birmingham CS 491/691/791 Fall 2011 Lecture 16 10/11/2011 Security and Privacy in Cloud Computing.
By Rachsuda Jiamthapthaksin 10/09/ Edited by Christoph F. Eick.
Privacy risks of collaborative filtering Yuval Madar, June 2012 Based on a paper by J.A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten & V. Shmatikov.
Thwarting Passive Privacy Attacks in Collaborative Filtering Rui Chen Min Xie Laks V.S. Lakshmanan HKBU, Hong Kong UBC, Canada UBC, Canada Introduction.
Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size Cynthia Dwork, Microsoft.
1 Recommender Systems Collaborative Filtering & Content-Based Recommending.
Netflix Netflix is a subscription-based movie and television show rental service that offers media to subscribers: Physically by mail Over the internet.
The Sparse Vector Technique CompSci Instructor: Ashwin Machanavajjhala 1Lecture 12 : Fall 12.
Personalized Social Recommendations – Accurate or Private? A. Machanavajjhala (Yahoo!), with A. Korolova (Stanford), A. Das Sarma (Google) 1.
EigenRank: A Ranking-Oriented Approach to Collaborative Filtering IDS Lab. Seminar Spring 2009 강민석 May 21st, 2009 Nathan.
PRISM: Private Retrieval of the Internet’s Sensitive Metadata Ang ChenAndreas Haeberlen University of Pennsylvania.
A Content-Based Approach to Collaborative Filtering Brandon Douthit-Wood CS 470 – Final Presentation.
Evaluation of Recommender Systems Joonseok Lee Georgia Institute of Technology 2011/04/12 1.
EigenRank: A ranking oriented approach to collaborative filtering By Nathan N. Liu and Qiang Yang Presented by Zachary 1.
Recommender Systems Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Credits to Bing Liu (UIC) and Angshul Majumdar.
Differential Privacy Some contents are borrowed from Adam Smith’s slides.
Cosine Similarity Item Based Predictions 77B Recommender Systems.
Pearson Correlation Coefficient 77B Recommender Systems.
Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made.
The world’s libraries. Connected. Managing your Private and Public Data: Bringing down Inference Attacks against your Privacy Group Meeting in 2015.
Differential Privacy (1). Outline  Background  Definition.
Collaborative Filtering via Euclidean Embedding M. Khoshneshin and W. Street Proc. of ACM RecSys, pp , 2010.
Privacy Preserving in Social Network Based System PRENTER: YI LIANG.
Experimental Study on Item-based P-Tree Collaborative Filtering for Netflix Prize.
MovieMiner A collaborative filtering system for predicting Netflix user’s movie ratings [ECS289G Data Mining] Team Spelunker: Justin Becker,
Item-Based Collaborative Filtering Recommendation Algorithms Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl GroupLens Research Group/ Army.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Item-Based Collaborative Filtering Recommendation Algorithms
A hospital has a database of patient records, each record containing a binary value indicating whether or not the patient has cancer. -suppose.
Privacy Issues in Graph Data Publishing Summer intern: Qing Zhang (from NC State University) Mentors: Graham Cormode and Divesh Srivastava.
Homework 1 Tutorial Instructor: Weidong Shi (Larry), PhD
ACHIEVING k-ANONYMITY PRIVACY PROTECTION USING GENERALIZATION AND SUPPRESSION International Journal on Uncertainty, Fuzziness and Knowledge-based Systems,
Privacy-preserving Release of Statistics: Differential Privacy
Differential Privacy in Practice
Collaborative Filtering Nearest Neighbor Approach
M.Sc. Project Doron Harlev Supervisor: Dr. Dana Ron
Advanced Artificial Intelligence
Q4 : How does Netflix recommend movies?
Differential Privacy (2)
Ensembles.
ITEM BASED COLLABORATIVE FILTERING RECOMMENDATION ALGORITHEMS
Collaborative Filtering Non-negative Matrix Factorization
Published in: IEEE Transactions on Industrial Informatics
Some contents are borrowed from Adam Smith’s slides
Differential Privacy (1)
Presentation transcript:

Differentially Private Recommendation Systems Jeremiah Blocki Fall 2009 18739A: Foundations of Security and Privacy

Netflix $1,000,000 Prize Competition
Queries: On a scale of 1 to 5, how would John rate “The Notebook” if he watched it?

User/Movie | … | 300     | The Notebook | …
John       | … | 4       | Unrated      | …
Mary       | … | Unrated | …            | …
Sue        | … | 2       | 5            | …
Joe        | … | 5       | 1            | …

Netflix Prize Competition
Note: the N x M table is very sparse (M = 17,770 movies, N = 500,000 users).
To Protect Privacy:
- Each user was randomly assigned a globally unique ID
- Only 1/10 of the ratings were published
- The ratings that were published were perturbed a little bit

User/Movie | … | 13,537          | 13,538         | …
258,964    | … | (4, 10/11/2005) | Unrated        | …
258,965    | … | Unrated         | …              | …
258,966    | … | (2, 6/16/2005)  | (5, 6/18/2005) | …
258,967    | … | (5, 9/15/2005)  | (1, 4/28/2005) | …


Netflix Prize Competition
Goal: Make accurate predictions, as measured by Root Mean Squared Error (RMSE):

$\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(p_i - r_i)^2}$, where $p_i$ are the predicted ratings and $r_i$ the actual ratings.

Algorithm                      | RMSE
BellKor's Pragmatic Chaos      | < 0.8573 (met the challenge: 10% improvement over baseline)
Netflix’s Cinematch (Baseline) | 0.9525

Outline
- Breaking the anonymity of the Netflix prize dataset
- Recap of differential privacy; definition of approximate differential privacy
- Prediction algorithms
- Privacy-preserving prediction algorithms
- Remaining issues

Netflix FAQ: Privacy Q: “Is there any customer information in the dataset that should be kept private?” A: “No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy [... ] Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?” Source: How to Break Anonymity of the Netflix Prize Dataset (Narayanan and Shmatikov)

De-Anonymizing the Netflix Dataset
The Internet Movie Database (IMDB) is a huge source of auxiliary information!
Challenge: Even the same user may not provide the same ratings on IMDB and Netflix, and the dates may not correspond exactly.
Adversary:
- Knows approximate ratings (±1)
- Knows approximate dates (14-day error)
Source: How to Break Anonymity of the Netflix Prize Dataset (Narayanan and Shmatikov)

De-Anonymizing the Netflix Dataset
- Does anyone really care about the privacy of movie ratings?
- The Video Privacy Protection Act of 1988 exists, so some people do.
- In some cases, knowing movie ratings allows you to infer an individual’s political affiliation, religion, or even sexual orientation.
Source: How to Break Anonymity of the Netflix Prize Dataset (Narayanan and Shmatikov)

Privacy in Recommender Systems
- Netflix might base its recommendations to me on both:
  - My own rating history
  - The rating history of other users
- Maybe it isn’t such a big deal if the recommendations Netflix makes to me leak information about my own history, but they should preserve the privacy of other users
- With auxiliary information, it would be possible to learn about someone else’s rating history
- Active attacks are a source of auxiliary information
Calandrino, et al. Don’t review that book: Privacy risks of collaborative filtering, 2009.

Recall Differential Privacy [Dwork et al. 2006]
A randomized sanitization function κ has ε-differential privacy if, for all data sets D1 and D2 differing in at most one element and all subsets S of the range of κ:
Pr[κ(D1) ∈ S] ≤ e^ε · Pr[κ(D2) ∈ S]

Achieving Differential Privacy
- How much noise should be added?
- Intuition: f(D) can be released accurately when f is insensitive to the individual entries x1, …, xn (the more sensitive f is, the higher the noise that must be added)
[Diagram: a user asks the curator of database D = (x1, …, xn) “Tell me f(D)” and receives f(D) + noise]

Sensitivity of a function
- Definition: Δf = max ‖f(D1) − f(D2)‖₁, taken over all pairs of databases D1, D2 differing in at most one element
- Examples: Δcount = 1, Δhistogram = 1
- Note: Δf does not depend on the actual database

Sensitivity with Laplace Noise
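For reference, the standard mechanism behind this slide releases f(D) + Lap(Δf/ε), i.e. Laplace noise scaled to the L1 sensitivity. A minimal sketch (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value + Lap(sensitivity / epsilon) noise.

    Satisfies epsilon-differential privacy when `sensitivity` is the
    L1 sensitivity (Delta f) of the query that produced true_value."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a count query has L1 sensitivity 1 (illustrative numbers).
noisy_count = laplace_mechanism(true_value=480_189, sensitivity=1.0, epsilon=0.1)
```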

Approximate Differential Privacy
A randomized sanitization function κ has (ε, δ)-differential privacy if, for all data sets D1 and D2 differing in at most one element and all subsets S of the range of κ:
Pr[κ(D1) ∈ S] ≤ e^ε · Pr[κ(D2) ∈ S] + δ
Source: Dwork et al. 2006

Approximate Differential Privacy
- One interpretation: the outcome of the computation κ is unlikely to provide much more information than under ε-differential privacy, but it is possible.
- Key difference: approximate differential privacy does NOT require that Range(κ(D1)) = Range(κ(D2))
- The guarantees of (ε, δ)-differential privacy are not as strong as those of ε-differential privacy, but less noise is required to achieve them.
Source: Dwork et al. 2006

Achieving Approximate Differential Privacy
Key differences from the Laplace mechanism:
- The sensitivity Δf is defined using the L2 norm instead of the L1 norm
- Gaussian noise is used instead of Laplacian noise
Source: Differentially Private Recommender Systems (McSherry and Mironov)
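A sketch of the Gaussian mechanism, using the textbook calibration σ = Δ₂f · √(2 ln(1.25/δ)) / ε (the standard constant from Dwork et al., not necessarily the exact one McSherry and Mironov use):

```python
import numpy as np

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta, rng=None):
    """Release true_value + N(0, sigma^2) noise, with sigma calibrated to the
    L2 sensitivity; gives (epsilon, delta)-differential privacy for epsilon < 1."""
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma, size=np.shape(true_value))
```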

Differential Privacy for Netflix Queries
- What level of granularity to consider? What does it mean for databases D1 and D2 to differ in at most one element?
- One user (an entire rating history) is present in D1 but not in D2
- One rating (a single cell) is present in D1 but not in D2
- Issue 1: Given the query “how would user i rate movie j?”, consider K(D − u[i]): how can it possibly be accurate?
- Issue 2: If “differing in at most one element” is taken over cells, then what privacy guarantees are made for a user with many data points?

Netflix Predictions – High Level
- Q(i,j) – “How would user i rate movie j?”
- The predicted rating typically depends on:
  - The global average rating over all movies and all users
  - The average rating given by user i
  - The average rating of movie j
  - The ratings user i gave to similar movies
  - The ratings similar users gave to movie j
- Intuition: Δf may be small for many of these queries (a simple baseline built from the first three signals is sketched below)
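As a concrete illustration of the first three signals, here is a common bias-based baseline (my formulation; the slides do not commit to a specific formula):

```python
def baseline_prediction(g_avg, user_avg, movie_avg, i, j):
    """Predict user i's rating of movie j from global, user, and movie averages:
    start from the global average and add how user i and movie j each
    deviate from it."""
    return g_avg + (user_avg[i] - g_avg) + (movie_avg[j] - g_avg)
```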

Personal Rating Scale
- For Alice, a rating of 3 might mean the movie was really terrible.
- For Bob, the same rating might mean that the movie was excellent.
- How do we tell the difference?

How do we tell if two users are similar?
- Pearson’s Correlation is one metric for the similarity of users i and j (a sketch follows below):
  - Consider all movies rated by both users
  - Negative value whenever i likes a movie that j dislikes
  - Positive value whenever i and j agree
- We can use similar metrics to measure the similarity between two movies.
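The correlation formula on the slide was an image; the standard Pearson correlation restricted to co-rated movies looks like this (a sketch, assuming each user's ratings are stored as a dict from movie id to rating):

```python
import math

def pearson_similarity(ratings_i, ratings_j):
    """Pearson correlation of two users over the movies both have rated.
    ratings_i, ratings_j: dicts mapping movie id -> rating."""
    common = ratings_i.keys() & ratings_j.keys()
    if len(common) < 2:
        return 0.0
    mean_i = sum(ratings_i[m] for m in common) / len(common)
    mean_j = sum(ratings_j[m] for m in common) / len(common)
    cov = sum((ratings_i[m] - mean_i) * (ratings_j[m] - mean_j) for m in common)
    var_i = sum((ratings_i[m] - mean_i) ** 2 for m in common)
    var_j = sum((ratings_j[m] - mean_j) ** 2 for m in common)
    if var_i == 0 or var_j == 0:
        return 0.0
    return cov / math.sqrt(var_i * var_j)
```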

Netflix Predictions Example
- Collaborative Filtering
- Find the k nearest neighbors of user i who have rated movie j, using Pearson’s Correlation as the similarity measure
- Predicted rating: combine the ratings of the k most similar users, weighted by their similarity to user i (see the sketch below)
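The prediction formula itself was an image. A common form consistent with the bullets above predicts $\hat{r}_{ij} = \bar{r}_i + \sum_{u} \mathrm{sim}(i,u)\,(r_{uj} - \bar{r}_u) \,/\, \sum_{u} |\mathrm{sim}(i,u)|$, summing over the k most similar users u who rated movie j. A sketch (names are mine):

```python
def knn_prediction(i, j, ratings, user_avg, sim, k):
    """Predict user i's rating of movie j from the k users most similar
    to i (according to `sim`) among those who have rated movie j.

    ratings: dict user -> dict movie -> rating."""
    candidates = [u for u in ratings if u != i and j in ratings[u]]
    neighbors = sorted(candidates, key=lambda u: sim(i, u), reverse=True)[:k]
    weight_sum = sum(abs(sim(i, u)) for u in neighbors)
    if weight_sum == 0:
        return user_avg[i]  # no informative neighbors: fall back to i's average
    offset = sum(sim(i, u) * (ratings[u][j] - user_avg[u]) for u in neighbors)
    return user_avg[i] + offset / weight_sum
```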

Netflix Prediction Sensitivity Example
- Pretend the query Q(i,j) includes user i’s own rating history
- At most one of the neighbors’ ratings changes, and the range of ratings is 4 (from 1 to 5), so the L1 sensitivity of the prediction is Δp = 4/k
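Worked out for a simplified, unweighted average of the k neighbors' ratings (the similarity-weighted case behaves analogously):

```latex
\Delta p
  = \max \left| \frac{1}{k}\sum_{u \in N_k} r_{uj}
      - \frac{1}{k}\sum_{u \in N_k} r'_{uj} \right|
  \le \frac{\max_u |r_{uj} - r'_{uj}|}{k}
  \le \frac{5 - 1}{k} = \frac{4}{k}
```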

Similarity of Two Movies
- Let U be the set of all users who have rated both movies i and j; the similarity of i and j is then computed from the ratings of the users in U (one standard choice follows below)
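The formula on this slide was an image. One standard choice, mirroring the Pearson-style similarity above, correlates the two movies' ratings over the users in U (a reconstruction, not necessarily the slide's exact formula):

```latex
\mathrm{sim}(i, j) =
  \frac{\sum_{u \in U} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}
       {\sqrt{\sum_{u \in U} (r_{ui} - \bar{r}_i)^2}\,
        \sqrt{\sum_{u \in U} (r_{uj} - \bar{r}_j)^2}}
```

where $\bar{r}_i$ denotes the average rating of movie i over the users in U.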

K-Nearest Users or K-Nearest Movies?
- Find the k most similar users to i that have also rated movie j?
- Find the k most similar movies to j that user i has rated?
- Either way, after some pre-computation, we need to be able to find the k nearest users/movies quickly!

Covariance Matrix
- Movie-Movie Covariance Matrix: an M x M matrix in which Cov[i][j] measures the similarity between movies i and j (M ≈ 17,000)
- User-User Covariance Matrix? An N x N matrix measuring the similarity between users; N ≈ 500,000, so this is difficult to pre-compute or even store in memory!

Useful Statistics
- Gavg: the global average rating
- Mavg(j): the average rating for movie j
- User ratings, re-centered about the average and clamped to [−B, B] with B = 1 (see the sketch below)
Source: Differentially Private Recommender Systems (McSherry and Mironov)
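A sketch of these statistics computed from (user, movie, rating) triples, with B = 1 clamping. The data layout is my assumption, and the ratings are re-centered here about the movie average only; McSherry and Mironov apply further centering steps that are omitted:

```python
import numpy as np

def compute_statistics(triples, B=1.0):
    """triples: list of (user, movie, rating).
    Returns the global average, per-movie averages, and per-(user, movie)
    ratings re-centered about the movie average and clamped to [-B, B]."""
    g_avg = sum(r for _, _, r in triples) / len(triples)
    sums, counts = {}, {}
    for _, m, r in triples:
        sums[m] = sums.get(m, 0.0) + r
        counts[m] = counts.get(m, 0) + 1
    m_avg = {m: sums[m] / counts[m] for m in sums}
    clamped = {(u, m): float(np.clip(r - m_avg[m], -B, B)) for u, m, r in triples}
    return g_avg, m_avg, clamped
```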

What do we need to make predictions?
For a large class of prediction algorithms it suffices to have:
- Gavg
- Mavg – for each movie
- The average movie rating of each user
- The movie-movie covariance matrix (COV)

Differentially Private Recommender Systems (High Level)
To respect approximate differential privacy, publish:
- Gavg + NOISE
- Mavg + NOISE
- COV + NOISE
- Don’t publish average ratings for users
- ΔGavg and ΔMavg are very small, so these averages can be published with little noise (see the sketch below)
- ΔCOV requires more care.
Source: Differentially Private Recommender Systems (McSherry and Mironov)
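A sketch of the first two releases, publishing noisy sums and counts and taking their ratio. Granularity here is one rating added or removed, and the split of the privacy budget across the releases is elided (the paper's exact noise distributions and accounting differ):

```python
import numpy as np

def publish_noisy_averages(triples, epsilon, rng=None):
    """Publish Gavg and per-movie Mavg from Laplace-noised sums and counts.
    One added/removed rating changes a sum by at most 5 (the largest
    rating) and a count by at most 1."""
    rng = rng or np.random.default_rng()

    def noisy(value, sensitivity):
        return value + rng.laplace(0.0, sensitivity / epsilon)

    g_avg = noisy(sum(r for _, _, r in triples), 5.0) / max(noisy(len(triples), 1.0), 1.0)

    by_movie = {}
    for _, m, r in triples:
        s, c = by_movie.get(m, (0.0, 0))
        by_movie[m] = (s + r, c + 1)
    m_avg = {m: noisy(s, 5.0) / max(noisy(c, 1.0), 1.0)
             for m, (s, c) in by_movie.items()}
    return g_avg, m_avg
```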

Covariance Matrix Tricks
- Trick 1: Center and clamp all ratings around the averages. Using clamped ratings reduces the sensitivity of the function.

Covariance Matrix Tricks
- Trick 2: Carefully weight the contribution of each user to reduce the sensitivity of the function. Users who have rated more movies are assigned lower weight (see the sketch below).
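A much-simplified sketch of the weighting idea (the exact weights and the noise added afterwards in McSherry and Mironov differ): each user contributes a rank-one update to the covariance matrix, scaled down by the number of movies they rated, so no single user can move the matrix by more than a bounded amount.

```python
import numpy as np

def weighted_covariance(user_vectors, num_movies):
    """user_vectors: dict user -> dict movie index -> adjusted rating in [-1, 1]
    (centered and clamped as on the previous slides).
    Down-weighting each user's rank-one contribution by the number of
    movies they rated bounds the effect any one user has on the matrix."""
    cov = np.zeros((num_movies, num_movies))
    for v in user_vectors.values():
        x = np.zeros(num_movies)
        for m, r in v.items():
            x[m] = r
        cov += np.outer(x, x) / len(v)  # prolific raters get lower weight
    return cov
```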

Publishing the Covariance Matrix
- Theorem*: Pick the noise magnitude according to the sensitivity of the clamped, weighted covariance matrix; then publishing COV + NOISE preserves approximate differential privacy.
* Some details about stabilized averages are being swept under the rug

Useful Statistics for Predictions
- The sum of all the ratings in the dataset
- The total number of ratings in the dataset
- The sum of all the ratings for movie j
- The number of ratings for movie j
(Gavg and Mavg(j) are the corresponding ratios of sums to counts.)

Covariance Matrix
- The rating vectors are adjusted:
- Centered about individual averages
- Clamped (to reduce noise)

Results
Source: Differentially Private Recommender Systems (McSherry and Mironov)

Note About Results
- Granularity: one rating present in D1 but not in D2
- Accuracy is much lower when the granularity is one user present in D1 but not in D2
- Intuition: Given the query Q(i,j), the database D − u[i] gives us no history about user i.
- Approximate Differential Privacy:
- Gaussian noise added according to the L2 sensitivity
- Clamped ratings (B = 1) to further reduce noise

Questions
- Could the results be improved if the analysis were tailored to more specific prediction algorithms?
- What if new movies are added? Can the movie-movie covariance matrix be updated without compromising the privacy guarantees?
- How strong are the privacy guarantees given that there are potentially many data points clustered around a single user?