SplitX: High-Performance Private Analytics
Ruichuan Chen (Bell Labs / Alcatel-Lucent), Istemi Ekin Akkus (MPI-SWS), Paul Francis (MPI-SWS)

Data analytics is important
- Evaluate system performance
- Understand user behavior
- Discover statistical patterns

Data exposure has become a major concern
- Third-party trackers
- Smartphone apps

User-owned and operated
Data exposure has to be brought under control!
User-owned and operated principle:
- Personal data should be stored on a local host under the user's control.

Motivation and problem
How to make aggregate queries over distributed private user data while still preserving user privacy?
[Diagram: data, analyst]

Outline
- Related work
- SplitX system
  - Key insights
  - System design
  - Performance comparison
  - Implementation & deployment
- Conclusion

A general approach
Based on differential privacy.
Differential privacy adds noise to the output of a computation (i.e., a query).
- Hides the presence or absence of a user.
[Diagram: analyst, query module (adds noise), database, data]

Previous systems
- Servers aggregate answers without seeing individual user data.
- Differentially private noise is added to the aggregate result.
[Diagram: data, servers, analyst]
Akkus et al., CCS'12; Chen et al., NSDI'12; Dwork et al., EUROCRYPT'06; Hardt et al., CCS'12; Rastogi et al., SIGMOD'10; Shi et al., NDSS'11

Primary technical problems
- Scale poorly
  - Require public-key operations or something even more expensive.
  - Akkus et al., CCS'12; Chen et al., NSDI'12; Dwork et al., EUROCRYPT'06; Rastogi et al., SIGMOD'10; Shi et al., NDSS'11
- Suffer from answer pollution
  - Even a single malicious user can substantially distort the aggregate result through a single answer.
  - Hardt et al., CCS'12; Rastogi et al., SIGMOD'10; Shi et al., NDSS'11

SplitX
A high-performance private analytics system
- 2 to 3 orders of magnitude more efficient in bandwidth
- 3 to 5 orders of magnitude more efficient in computation
- Resistant to answer pollution

Components & assumptions
- Analysts are potentially malicious (they may try to violate user privacy).
- Clients are user devices. Clients are potentially malicious (they may try to distort the final results).
- Servers (1 aggregator and 2 mixes) are honest but curious:
  1) They follow the specified protocol.
  2) They try to exploit additional information that can be learned in so doing.
[Diagram: data, clients, servers, analyst]

Key insights: XOR encryption
How to achieve high performance?
Client wants to send M to aggregator
- Client splits M, and sends split messages to aggregator via mixes
- Aggregator joins split messages to recreate M
[Diagram: the client generates a random R and sends R through one mix and M ⊕ R through the other; the aggregator XORs the two split messages to recreate M]

Key insights: XOR encryption
How to achieve high performance?
For clarity, a single message M drawn from client to aggregator denotes that the client sends two split messages of M to the aggregator via Mix1 and Mix2.
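
The splitting idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `split_message` and `join_messages` are hypothetical, and messages are treated as raw bytes. The random pad R goes to one mix and M XOR R to the other, so neither mix alone learns anything about M.

```python
import secrets
from typing import Tuple


def split_message(message: bytes) -> Tuple[bytes, bytes]:
    """Split M into two shares: a random pad R and M XOR R (hypothetical helper)."""
    pad = secrets.token_bytes(len(message))               # R, same length as M
    masked = bytes(m ^ r for m, r in zip(message, pad))   # M XOR R
    return pad, masked


def join_messages(share_a: bytes, share_b: bytes) -> bytes:
    """Recreate M at the aggregator by XORing the two shares."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))


message = b"\x04"                                  # example payload (assumption)
share_for_mix1, share_for_mix2 = split_message(message)
recovered = join_messages(share_for_mix1, share_for_mix2)
```

Only XOR and random-byte generation are involved, which is why this is so much cheaper than the public-key operations used by earlier systems.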

Key insights: query buckets
How to limit answer pollution?
Solution:
- Ensure that a client cannot arbitrarily manipulate answers.
- Divide the answer's value range into buckets.
- Enforce a binary answer in each bucket.

Key insights: query buckets
Query: "SELECT age FROM splitx"
- 4 buckets: 0~19, 20~39, 40~59, and ≥60.
- Answers: a '1' or '0' per bucket. A 30-year-old answers 0, 1, 0, 0.
- Answers are encoded in a bit-vector.
- An answer from a malicious client cannot substantially distort the query result!
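
As a rough sketch of the bucket encoding on this slide (the names `AGE_BUCKETS` and `encode_answer` are hypothetical, not taken from the SplitX code), a client-side helper might map a value to a one-bit-per-bucket vector like this:

```python
# Buckets from the slide: 0~19, 20~39, 40~59, and >=60 (None means no upper bound).
AGE_BUCKETS = [(0, 19), (20, 39), (40, 59), (60, None)]


def encode_answer(value, buckets=AGE_BUCKETS):
    """Return a bit-vector with a 1 in every bucket that contains `value`."""
    bits = []
    for low, high in buckets:
        in_bucket = value >= low and (high is None or value <= high)
        bits.append(1 if in_bucket else 0)
    return bits


answer = encode_answer(30)   # the 30-year-old from the slide
```

Because every position holds a single bit, even a client that lies can shift each bucket's count by at most one.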

System design
1) Query publish/subscribe
- Analyst publishes its queries
- Client subscribes to an analyst's queries
2) Query answering
- Client answers queries
- Mixes add differentially private noise
- Mixes shuffle answers
- Aggregator generates query results

1) Query publish/subscribe
[Diagram: client, Mix1, Mix2, aggregator, and analyst; labels: "Query1, Query2, …" and "Analyst ID"]

1) Query publish/subscribe
Query example: age distribution among male users?
- QID: …
- SQL: SELECT age FROM splitx WHERE gender='male'
- Buckets: 0~19, 20~39, 40~59, and ≥60
- DP parameter (ε): …
- T_end: …:59:59PM on Aug 16, …

2) Query answering
- Client answers queries
- Mixes add differentially private noise
- Mixes shuffle answers
- Aggregator generates query results

Step 1: client answers queries
Client executes the query over its local data and generates an answer
- '1' or '0' per bucket
- Encoded as a bit-vector

Step 1: client answers queries
Client splits its answer and sends the split answers, together with the query ID, to the two mixes, one split answer per mix.
[Diagram: client, Mix1, Mix2, aggregator, and analyst; label: "QID, answer"]
Each mix knows which query a client answered. Privacy violation!

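Step 2: mixes add DP noise
- Each mix individually adds some random bit-vectors as the differentially private noise.
- How many bit-vectors are needed? [Formula not recoverable from the transcript; it depends on c, the number of clients queried, and ε, the DP parameter.]
[Diagram: the split answers held at each mix (e.g., …… 0111 …… and …… 0101 ……), extended with random bit-vectors as noise]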
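
A minimal sketch of this step, assuming each mix simply appends uniformly random bit-vectors to the split answers it holds. The function names are hypothetical, and the number of noise vectors `n` is passed in as a parameter because the slide's formula for it did not survive in this transcript.

```python
import secrets


def random_bit_vector(num_buckets):
    """One noise row: an independent, uniformly random bit per bucket."""
    return [secrets.randbits(1) for _ in range(num_buckets)]


def add_noise(split_answers, n, num_buckets):
    """Return the c client split answers followed by n random noise rows."""
    noise = [random_bit_vector(num_buckets) for _ in range(n)]
    return list(split_answers) + noise


# Example: c = 2 split answers over 4 buckets, with n = 3 noise rows (n is an assumption).
client_split_answers = [[0, 1, 1, 1], [1, 0, 0, 1]]
noisy_rows = add_noise(client_split_answers, n=3, num_buckets=4)
```

When the aggregator later XORs a noise row from one mix with the matching noise row from the other, the result is again a uniformly random bit per bucket, since the XOR of two independent uniform bits is uniform.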

Step 3: mixes shuffle split answers
- Each mix maintains c+n split answers (c from clients plus n noise bit-vectors).
- Mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way.
[Diagram: the split answers at each mix before and after the shuffle]
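
One way to picture the synchronized shuffle is for both mixes to derive the same per-column permutations from a shared seed. The shared seed is an assumption made for this sketch, as is the name `shuffle_columns`; Python's `random.Random` is used only for illustration and is not cryptographically secure.

```python
import random


def shuffle_columns(rows, shared_seed):
    """Permute each column (bucket) independently; same seed gives same permutations."""
    rng = random.Random(shared_seed)        # illustration only, not a secure PRNG
    num_rows, num_cols = len(rows), len(rows[0])
    shuffled = [[0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        order = list(range(num_rows))
        rng.shuffle(order)                  # a fresh permutation for this column
        for new_row, old_row in enumerate(order):
            shuffled[new_row][col] = rows[old_row][col]
    return shuffled


# Both mixes hold the same number of rows and use the same (hypothetical) seed.
mix1_rows = [[0, 1, 1, 1], [1, 0, 0, 1], [0, 0, 1, 0]]
mix2_rows = [[0, 1, 0, 1], [1, 1, 0, 1], [1, 0, 1, 0]]
mix1_shuffled = shuffle_columns(mix1_rows, shared_seed=2013)
mix2_shuffled = shuffle_columns(mix2_rows, shared_seed=2013)
```

Because the two mixes apply identical permutations, matching bit positions still line up when the aggregator joins them, while the link between the bits of any one client's answer is broken across columns.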

Mixes transmit shuffled answers
Each mix transmits its c+n shuffled split answers to the aggregator.
[Diagram: Mix1 and Mix2 each send c+n shuffled split answers to the aggregator; client and analyst also shown]

Step 4: aggregator generates query result
- Join each bit position in the two split answer arrays.
- Sum up the values for each bucket.
- Obtain the noisy count for each bucket.
[Diagram: Mix1 …… 0100 …… ⊕ Mix2 …… 0001 …… = Agg …… 0101 ……]
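
The join-and-sum step can be sketched as follows. The function name `aggregate` is hypothetical, and the sketch stops at the noisy per-bucket counts named on the slide, without any further post-processing.

```python
def aggregate(mix1_rows, mix2_rows):
    """XOR matching bit positions from the two mixes, then sum each bucket column."""
    num_buckets = len(mix1_rows[0])
    counts = [0] * num_buckets
    for row1, row2 in zip(mix1_rows, mix2_rows):
        for bucket in range(num_buckets):
            counts[bucket] += row1[bucket] ^ row2[bucket]   # joined bit
    return counts


# The slide's example: 0100 XOR 0001 = 0101.
noisy_counts = aggregate([[0, 1, 0, 0]], [[0, 0, 0, 1]])
```

The aggregator only ever sees joined bits in shuffled positions, so it learns the per-bucket totals (client answers plus noise) rather than any individual client's bit-vector.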

Privacy issue at the mixes
Client splits the answer and sends the split answers with the query ID to the two mixes.
- Each mix knows which query a specific client answered!
[Diagram: client, Mix1, Mix2, aggregator, and analyst; label: "QID, answer"]

Solution: double-splitting
[Diagram: client, Mix1, Mix2, aggregator, and analyst; label: "QID, answer"]

Duplicate answer detection
A client can answer a query many times!
- How to detect and remove duplicate answers?
- Triple-splitting is needed (see Section 5 in the paper).

Computational overhead
Three to five orders of magnitude more efficient in computation than previous systems
- Compared against PDDP [NSDI'12] and Akkus et al. [CCS'12]
- "A" is the number of buckets that a client reports

Implementation
Client side
- Google Chrome extension
- Captures webpages browsed, searches made, extensions installed
Server side (mix + aggregator)
- Web services on Jetty
- RPCs defined in Thrift language

Deployment
Query results from a 416-client deployment
- Most visited websites: google, facebook, youtube
- Most used apps: gmail, youtube, google drive
- 91% of clients made ≤50 searches / day
- 70% of clients visited >50 webpages / day
- 97% of clients visited ≤100 websites / day

Conclusion
SplitX: a high-performance private analytics system
- Orders of magnitude more efficient than previous systems
- Resistant to answer pollution
Key insights
- XOR-based encryption
- Query buckets