A Utility-Theoretic Approach to Privacy and Personalization. Andreas Krause, Carnegie Mellon University; work performed during an internship at Microsoft Research.

A Utility-Theoretic Approach to Privacy and Personalization. Andreas Krause, Carnegie Mellon University (work performed during an internship at Microsoft Research). Joint work with Eric Horvitz, Microsoft Research. 23rd Conference on Artificial Intelligence | July 16, 2008

2 Value of private information for enhancing search
Personalized web search is a prediction problem: "Which page is user X most likely interested in for query Q?"
The more information we have about a user, the better the service that can be provided.
But users are reluctant to share private information (or don't want search engines to log their data).
We apply utility-theoretic methods to optimize the tradeoff: getting the biggest "bang" for the "personal data buck".

3 Utility theoretic approach
Sharing personal information (topic interests, search history, IP address, etc.):
Net benefit to user = Utility of knowing - Sensitivity of sharing

4 Utility theoretic approach
Sharing more information might decrease the net benefit:
Net benefit to user = Utility of knowing - Sensitivity of sharing

5 Maximizing the net benefit
How can we find the optimal tradeoff, maximizing net benefit?
[Plot: net benefit as a function of the amount of information shared, ranging from "share no information" to "share much information"]

6 Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / weekday?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V, compute utility U(A) and cost C(A). Find A maximizing U(A) while minimizing C(A).

7 Estimating utility U(A) of sharing data
Ideally: how does knowing A help increase the relevance of the displayed results? Very hard to estimate from data.
Proxy [Mei and Church '06; Dou et al. '07]: click entropy! Learn a probabilistic model for P(C | Q, A) = P(click | query, attributes) and define
U(A) = H(C | Q) - H(C | Q, A)
(entropy before revealing attributes minus entropy after revealing attributes). E.g.: A = {X1, X3}, U(A) = 1.3.
[Graphical model: search goal C, query Q, attributes X1 = Age, X2 = Gender, X3 = Country]

8 Click entropy example
U(A) = expected click entropy reduction from knowing A.
Query: "sports". Click entropy over pages: H = 2.6; after revealing Country = USA: H = 1.7. Entropy reduction: 0.9.
[Figure: page-click frequency histograms before and after revealing Country; graphical model as on the previous slide]
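The entropy-reduction computation on this slide can be sketched on a toy search log. The attribute names, sites, and numbers here are illustrative stand-ins, not the paper's data:

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def conditional_click_entropy(log, attrs):
    """H(C | Q, attrs): click entropy averaged over (query, attribute) contexts."""
    groups = defaultdict(Counter)
    for rec in log:
        key = (rec["query"],) + tuple(rec[a] for a in attrs)
        groups[key][rec["click"]] += 1
    n = len(log)
    return sum(sum(g.values()) / n * entropy(g.values()) for g in groups.values())

# Toy log: revealing Country fully disambiguates clicks for this query.
log = [
    {"query": "sports", "country": "US", "click": "espn.com"},
    {"query": "sports", "country": "US", "click": "espn.com"},
    {"query": "sports", "country": "UK", "click": "bbc.co.uk/sport"},
    {"query": "sports", "country": "UK", "click": "bbc.co.uk/sport"},
]

prior = conditional_click_entropy(log, [])               # H(C | Q) = 1.0 bit
posterior = conditional_click_entropy(log, ["country"])  # H(C | Q, Country) = 0.0
print("U({country}) =", prior - posterior)               # entropy reduction: 1.0
```

On real logs the reduction is of course partial, as in the slide's 2.6 → 1.7 example.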

9 Study of Value of Personal Data
Estimate click entropy from volunteer search log data: ~15,000 users; only frequent queries (≥ 30 users); ~250,000 queries in total during 2006.
Example: consider topics of prior visits, V = {topic_arts, topic_kids}. Query: "cars", prior entropy: 4.55. U({topic_arts}) = 0.40; U({topic_kids}) = 0.41.
How does U(A) increase as we pick more attributes A?

10 Diminishing returns for click entropy
The more attributes we add, the less we gain in utility.
Theorem: Click entropy U(A) is submodular!*
[Plot: utility (entropy reduction) vs. number of private attributes, greedily chosen; attribute names prefixed A* are search-activity attributes, T* are topic interests]
*See store for details

11 Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / weekday?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V, compute utility U(A) and cost C(A). Find A maximizing U(A) while minimizing C(A).

12 Getting a handle on cost Identifiability: “Will they know it’s me?” Sensitivity: “I don’t feel comfortable sharing this!”

13 Identifiability cost
Intuition: the more attributes we already know, the more identifying it is to add another.
Goal: avoid identifiability; for example, k-anonymity [Sweeney '02], and others.
[Illustration: Age, Gender, Occupation]

14 Identifiability cost
Predict person Y from attributes A. Example: P(Y | gender = female, country = US).
Define a "loss" function [c.f. Lebanon et al.]: the worst-case probability of detection.
[Figure: user-frequency histograms. Flat distribution: good, predicting the user is hard. Peaked distribution: bad, predicting the user is easy!]
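One plausible reading of "worst-case probability of detection" is the largest max-posterior probability of any user within any shared-attribute setting. A minimal sketch on made-up records (an assumed formalization, not the paper's exact loss):

```python
from collections import Counter, defaultdict

def identifiability_cost(records, attrs):
    """Worst case over attribute settings a of max_y P(Y = y | A = a),
    estimated from empirical record counts."""
    buckets = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[a] for a in attrs)
        buckets[key][rec["user"]] += 1
    worst = 0.0
    for bucket in buckets.values():
        total = sum(bucket.values())
        worst = max(worst, max(bucket.values()) / total)
    return worst

# Hypothetical records: u3 becomes uniquely identifiable once gender is added.
records = [
    {"user": "u1", "gender": "F", "country": "US"},
    {"user": "u2", "gender": "F", "country": "US"},
    {"user": "u3", "gender": "M", "country": "US"},
]

print(identifiability_cost(records, ["country"]))            # 1/3: all users match
print(identifiability_cost(records, ["gender", "country"]))  # 1.0: u3 is unique
```

This also illustrates the intuition on the previous slide: adding an attribute can only shrink the matching buckets, so the cost never decreases.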

15 Identifiability cost
The more attributes we add, the larger the increase in cost: accelerating cost.
Theorem: Identifiability cost C(A) is supermodular!*
[Plot: identifiability cost vs. number of private attributes, greedily chosen]
*See store for details

16 Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / weekday?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V, compute utility U(A) and cost C(A). Find A maximizing U(A) while minimizing C(A).

17 Trading off utility and cost
Want: A* = argmax_A F(A), where F(A) = U(A) - λ C(A) (utility minus cost, with tradeoff parameter λ).
U(A) is submodular and C(A) is supermodular, so F(A) is submodular (but non-monotonic). Maximization is NP-hard (and the space is large: 2^29 subsets).
Optimizing the value of private information is a submodular problem! Can use algorithms for maximizing submodular functions: Goldengorin et al. (branch and bound), Feige et al. (approximation algorithm), (lazy) greedy forward selection, …
Can efficiently get a provably near-optimal tradeoff!
[Plot: utility - cost as attributes are added by (lazy) greedy forward selection]
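Greedy forward selection on F(A) = U(A) - λ C(A) can be sketched as follows. The U and C here are toy stand-ins (concave and convex in |A|, hence submodular and supermodular), and plain greedy is only a heuristic for this non-monotone objective; the slide's cited algorithms give the actual guarantees:

```python
def greedy_tradeoff(attrs, U, C, lam):
    """Greedy forward selection for F(A) = U(A) - lam * C(A).
    Adds the attribute with the largest marginal gain in F until no
    candidate improves the objective."""
    A = frozenset()
    while True:
        def gain(x):
            B = A | {x}
            return (U(B) - lam * C(B)) - (U(A) - lam * C(A))
        candidates = [x for x in attrs if x not in A]
        if not candidates:
            return A
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            return A
        A = A | {best}

# Toy stand-ins for the log-estimated quantities (depend only on |A|):
U = lambda A: len(A) ** 0.5          # diminishing-returns utility
C = lambda A: 0.1 * len(A) ** 2      # accelerating identifiability cost

A_star = greedy_tradeoff({"ATLV", "AWDY", "AWHR", "TSPT", "TGAM"}, U, C, lam=1.0)
print(len(A_star))  # 2: the third attribute's cost increase outweighs its utility
```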

18 Finding the "sweet spot"
Which λ should we choose? The tradeoff curve is purely based on log data; what do users prefer?
Want: A* = argmax U(A) - λ C(A).
[Plot: utility U(A) vs. cost C(A) as λ varies, from λ = 0 ("ignore cost") through λ = 1 to λ = 10 ("ignore utility"); the sweet spot: maximal utility at maximal privacy]
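The tradeoff curve on this slide comes from re-solving the optimization for several λ values. A sketch using exact enumeration (feasible only for a tiny attribute set, unlike the paper's 2^29 space) and the same toy U and C shapes assumed above:

```python
from itertools import combinations

def best_subset(attrs, U, C, lam):
    """Exact argmax of F(A) = U(A) - lam * C(A) by enumerating all subsets."""
    subsets = (frozenset(c) for r in range(len(attrs) + 1)
               for c in combinations(attrs, r))
    return max(subsets, key=lambda A: U(A) - lam * C(A))

# Toy stand-ins for the log-estimated quantities (depend only on |A|):
U = lambda A: len(A) ** 0.5      # diminishing-returns utility
C = lambda A: 0.1 * len(A) ** 2  # accelerating identifiability cost

attrs = {"ATLV", "AWDY", "AWHR", "TSPT"}
# lam = 0 is "ignore cost"; a large lam is effectively "ignore utility".
for lam in (0.0, 1.0, 10.0):
    A = best_subset(attrs, U, C, lam)
    print(lam, len(A), round(U(A), 2), round(C(A), 2))
```

Sweeping λ traces out (C(A), U(A)) points from "share everything" down to "share nothing"; the survey on the following slides is what pins down which point users actually prefer.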

19 Survey for eliciting cost
Microsoft internal online survey, distributed internationally. N = 1451 responses from 35 countries (80% US). Incentive: 1 Zune™ digital music player.

20 Identifiability vs sensitivity

21 Sensitivity vs utility

22 Seeking a common currency
Sensitivity acts as a common currency to estimate the utility-privacy tradeoff.
[Figure: sensitivity increases with location granularity: Region, Country, State, City, Zip, Address]

23 Calibrating the tradeoff
Can use survey data to calibrate the utility-privacy tradeoff: F(A) = U(A) - λ C(A), best fit for λ = 5.12. User preferences map into the sweet spot!
[Plots: entropy reduction required at each location granularity (region, country, state, city, zip), survey data (median) vs. identifiability cost from search logs; and utility (entropy reduction) vs. cost (maxprob) curves for λ = 1, 10, 100]

24 Understanding Sensitivities: “I don’t feel comfortable sharing this!”

25 Attribute sensitivities
We incorporate sensitivity into our cost function by calibration. Significant differences between topics!

26 Comparison with heuristics
Optimized solution: repeated visit / query, workday / working hour, top-level domain, avg. queries per day, topic: sports, topic: games.
The optimized tradeoff outperforms naïve selection heuristics: search statistics (ATLV, AWDY, AWHR, AFRQ), all topic interests, IP address bytes 1&2, full IP address.
[Bar chart: utility U(A), cost C(A), and net benefit F(A), in bits of information, for each selection]

27 Summary
Use of private information by online services cast as an optimization problem (with user permission / awareness).
Utility (click entropy) is submodular; privacy cost (identifiability) is supermodular.
Theoretical and algorithmic tools let us efficiently find a provably near-optimal tradeoff.
The tradeoff can be calibrated using user preferences.
Promising results on search logs and survey data!

28 Diminishing returns for click entropy
Submodularity: for A ⊆ B and s ∉ B, U(A ∪ {s}) - U(A) ≥ U(B ∪ {s}) - U(B).
Intuition: starting from selection A = {}, adding a new feature X1 helps a lot (large improvement); starting from selection B = {X2, X3}, adding X1 doesn't help much (small improvement).
Theorem [based on Krause, Guestrin '05]: Click entropy reduction is submodular!*
[Graphical model: search goal C with attributes X1 = Age, X2 = Gender, X3 = Country]
*See store for details
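The diminishing-returns inequality can be checked numerically on a toy click distribution (two binary attributes, uniform over four samples). Note the theorem guarantees submodularity only under its conditional-independence assumption; this sketch merely verifies the inequality exhaustively for one small example:

```python
import math
from itertools import combinations
from collections import Counter, defaultdict

# Empirical joint distribution over (click, X1, X2), uniform over these tuples.
samples = [("espn", 0, 0), ("espn", 0, 1), ("bbc", 1, 0), ("cnn", 1, 1)]

def cond_entropy(attr_idx):
    """H(C | attributes at positions attr_idx), in bits."""
    groups = defaultdict(Counter)
    for s in samples:
        groups[tuple(s[i] for i in attr_idx)][s[0]] += 1
    n = len(samples)
    h = 0.0
    for g in groups.values():
        t = sum(g.values())
        h += t / n * -sum(c / t * math.log2(c / t) for c in g.values())
    return h

def U(A):
    """Click entropy reduction from revealing attribute set A."""
    return cond_entropy(()) - cond_entropy(tuple(sorted(A)))

# Check U(A ∪ {s}) - U(A) >= U(B ∪ {s}) - U(B) for all A ⊆ B, s ∉ B.
V = {1, 2}
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
ok = all(
    U(A | {s}) - U(A) >= U(B | {s}) - U(B) - 1e-12
    for B in subsets
    for A in subsets if A <= B
    for s in V - B
)
print(ok)  # True for this toy distribution
```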