SplitX: High-Performance Private Analytics. Ruichuan Chen (Bell Labs / Alcatel-Lucent), Istemi Ekin Akkus (MPI-SWS), Paul Francis (MPI-SWS)
Data analytics is important: evaluate system performance, understand user behavior, discover statistical patterns.
Data exposure has become a major concern: third-party trackers, smart-phone apps.
Data exposure has to be brought under control! The user-owned and operated principle: personal data should be stored in a local host under the user’s control.
Motivation and problem: How to make aggregate queries over distributed private user data while still preserving user privacy?
Outline: Related work; SplitX system; Key insights; System design; Performance comparison; Implementation & deployment; Conclusion.
A general approach: based on differential privacy. Differential privacy adds noise to the output of a computation (i.e., a query) to hide the presence or absence of a user. Diagram: the analyst queries a database through a query module that adds noise.
Previous systems: Servers aggregate answers without seeing individual user data. Differentially private noise is added to the aggregate result. (Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11)
Primary technical problems. 1) Scale poorly: require public-key operations or something even more expensive (Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11). 2) Suffer from answer pollution: even a single malicious user can substantially distort the aggregate result through a single answer (Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11).
SplitX: a high-performance private analytics system. 2 to 3 orders of magnitude more efficient in bandwidth; 3 to 5 orders of magnitude more efficient in computation; resistant to answer pollution.
Components & assumptions. Components: clients, servers (1 aggregator and 2 mixes), and analysts. Analysts are potentially malicious (violating user privacy). Clients are user devices and are potentially malicious (distorting the final results). Servers are honest but curious: 1) they follow the specified protocol; 2) they try to exploit additional information that can be learned in so doing.
Key insights: XOR encryption. How to achieve high performance? The client wants to send M to the aggregator. The client generates a random string R, splits M into the two messages R and M ⊕ R, and sends one split message via Mix1 and the other via Mix2. The aggregator joins the split messages to recreate M: R ⊕ (M ⊕ R) = M.
Key insights: XOR encryption. Notation: for clarity, M in the diagrams denotes that the client sends two split messages of M to the aggregator via Mix1 and Mix2.
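The splitting and joining described above can be sketched in a few lines of Python. This is a minimal illustration, not the SplitX implementation: the function names `split` and `join` are hypothetical, and the use of `secrets.token_bytes` as the source of R is an assumption.

```python
import secrets
from typing import Tuple

def split(message: bytes) -> Tuple[bytes, bytes]:
    """Split M into two shares: a random pad R and M XOR R (hypothetical helper)."""
    r = secrets.token_bytes(len(message))               # fresh random R, same length as M
    masked = bytes(m ^ k for m, k in zip(message, r))   # M XOR R
    return r, masked

def join(share1: bytes, share2: bytes) -> bytes:
    """Recreate M by XOR-ing the two shares, as the aggregator would."""
    return bytes(a ^ b for a, b in zip(share1, share2))

m = b"answer"
to_mix1, to_mix2 = split(m)         # one share per mix
recovered = join(to_mix1, to_mix2)  # equals m
```

Each share on its own is a uniformly random string, so a single mix learns nothing about M; only a party holding both shares can recreate it. The cost is one XOR per byte, with no public-key operation.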
Key insights: query buckets. How to limit answer pollution? Solution: ensure that a client cannot arbitrarily manipulate answers. Divide the answer’s value range into buckets, and enforce a binary answer in each bucket.
Key insights: query buckets. Query: “SELECT age FROM splitx”. 4 buckets: 0~19, 20~39, 40~59, and ≥60. Answers: a ‘1’ or ‘0’ per bucket, encoded in a bit-vector. For example, a 30-year-old client answers 0, 1, 0, 0. An answer from a malicious client cannot substantially distort the query result!
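As a rough sketch of the bucket encoding for this example query, the Python below maps an age to a 4-bit answer. The names `BUCKET_LOWER_BOUNDS` and `encode_age` are made up for illustration; only the bucket boundaries come from the slide.

```python
# Lower bounds of the four buckets from the slide: 0~19, 20~39, 40~59, >=60.
BUCKET_LOWER_BOUNDS = [0, 20, 40, 60]

def encode_age(age: int) -> list:
    """Return a bit-vector with a 1 in the bucket that contains `age` (hypothetical helper)."""
    bits = [0] * len(BUCKET_LOWER_BOUNDS)
    # Walk the bounds from highest to lowest and mark the first one that age reaches.
    for i in reversed(range(len(BUCKET_LOWER_BOUNDS))):
        if age >= BUCKET_LOWER_BOUNDS[i]:
            bits[i] = 1
            break
    return bits

print(encode_age(30))  # [0, 1, 0, 0]
```

Because every bucket carries a single bit, a client can change each bucket's count by at most 1, whatever it reports.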
System design. 1) Query publish/subscribe: the analyst publishes its queries; a client subscribes to an analyst’s queries. 2) Query answering: the client answers queries; the mixes add differentially private noise; the mixes shuffle answers; the aggregator generates query results.
1) Query publish/subscribe. Diagram (client, Mix1, Mix2, aggregator, analyst): the analyst publishes Query1, Query2, …; the client subscribes with an analyst ID and receives Query1, Query2, …
1) Query publish/subscribe. Query example: age distribution among male users? A query carries a QID; the SQL (SELECT age FROM splitx WHERE gender=‘male’); the buckets (0~19, 20~39, 40~59, and ≥60); the DP parameter (ε); and an end time T end (…:59:59PM on Aug 16, …).
2) Query answering: the client answers queries; the mixes add differentially private noise; the mixes shuffle answers; the aggregator generates query results.
Step 1: client answers queries. The client executes the query over its local data and generates an answer: a ‘1’ or ‘0’ per bucket, encoded as a bit-vector.
Step 1: client answers queries. The client splits its answer and sends the split answers, each with the query ID, to the two mixes, respectively. Problem: each mix knows which query a client answered. Privacy violation!
Step 2: mixes add DP noise. Each mix individually adds some random bit-vectors as differentially private noise. How many bit-vectors are needed? The number, n, depends on c (the number of clients queried) and ε (the DP parameter). Diagram: each mix's list of split answers (e.g., 0111 at one mix and 0101 at the other), extended with random bit-vectors as noise.
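The noise step might look like the following Python sketch. It assumes that n has already been computed from c and ε (the formula is not reproduced here) and that each mix appends its n noise vectors after the c client splits; `add_noise` is a hypothetical name.

```python
import secrets

def add_noise(split_answers, n, num_buckets):
    """Append n uniformly random bit-vectors to one mix's split answers (hypothetical helper)."""
    noise = [[secrets.randbits(1) for _ in range(num_buckets)] for _ in range(n)]
    return split_answers + noise

# One client split at each mix (0111 and 0101, as in the diagram), plus n = 3 noise vectors.
mix1 = add_noise([[0, 1, 1, 1]], n=3, num_buckets=4)
mix2 = add_noise([[0, 1, 0, 1]], n=3, num_buckets=4)
```

The XOR of two independent uniform bits is itself a uniform bit, so when the aggregator later joins the noise entries of the two mixes, each one contributes an unbiased coin flip per bucket, and neither mix alone knows the value of that coin.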
Step 3: mixes shuffle split answers. Each mix maintains c+n split answers. The mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way, so that the two splits of each bit stay aligned.
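A synchronized per-column shuffle can be sketched as below. How the two mixes agree on the permutations is not described on the slide; the shared seed used here is purely an assumption for illustration, and Python's `random.Random` is not a cryptographically secure generator.

```python
import random

def shuffle_columns(split_answers, shared_seed):
    """Permute each column (bucket) independently; the seed is assumed shared by both mixes."""
    rng = random.Random(shared_seed)
    columns = [list(col) for col in zip(*split_answers)]  # one list per bucket
    for col in columns:
        rng.shuffle(col)  # same seed and call order give the same permutation at both mixes
    return [list(row) for row in zip(*columns)]

def joined_bucket_sums(a, b):
    """XOR two split-answer arrays bit by bit and sum each bucket."""
    return [sum(x ^ y for x, y in zip(col_a, col_b))
            for col_a, col_b in zip(zip(*a), zip(*b))]

mix1 = [[0, 1, 1, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
mix2 = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 1, 1, 0]]
shuffled1 = shuffle_columns(mix1, shared_seed=42)
shuffled2 = shuffle_columns(mix2, shared_seed=42)
```

Since both mixes apply the same permutation to a given column, the joined bucket sums are unchanged, while the bits of any one answer are scattered across different rows and can no longer be linked to each other.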
Mixes transmit shuffled answers. Each mix transmits its c+n shuffled split answers to the aggregator.
Step 4: aggregator generates query result. Join (XOR) each bit position in the two split answer arrays; for example, 0100 from one mix and 0001 from the other join to 0101. Sum up the values for each bucket. Obtain the noisy count for each bucket.
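The join-and-sum step can be illustrated with a short Python sketch. The function name `aggregate` and the second row of sample data are invented for the example; the first row reuses the 0100 / 0001 pair from the slide.

```python
def aggregate(from_mix1, from_mix2):
    """XOR the two split-answer arrays position by position, then sum each bucket."""
    joined = [[a ^ b for a, b in zip(row1, row2)]
              for row1, row2 in zip(from_mix1, from_mix2)]
    return [sum(col) for col in zip(*joined)]

from_mix1 = [[0, 1, 0, 0], [1, 1, 0, 1]]
from_mix2 = [[0, 0, 0, 1], [1, 0, 0, 1]]
counts = aggregate(from_mix1, from_mix2)
print(counts)  # [0, 2, 0, 1]
```

The first pair joins to 0101 and the second to 0100, giving per-bucket counts of 0, 2, 0, and 1. In the real protocol these counts include the contribution of the noise entries added by the mixes.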
Privacy issue at the mixes. The client splits the answer and sends the split answers with the query ID to the two mixes, so each mix knows which query a specific client answered!
Solution: double-splitting. Diagram: client, Mix1, Mix2, aggregator, analyst; message labeled “QID, answer”.
Duplicate answer detection. A client can answer a query many times! How to detect and remove duplicate answers? Triple-splitting is needed; see Section 5 in the paper.
Computational overhead. Three to five orders of magnitude more efficient in computation than previous systems: PDDP [NSDI’12] and Akkus et al. [CCS’12]. In the comparison, “A” is the number of buckets that a client reports.
Implementation. Client side: a Google Chrome extension that captures webpages browsed, searches made, and extensions installed. Server side (mix + aggregator): web services on Jetty, with RPCs defined in the Thrift language.
Deployment. Query results from a 416-client deployment. Most visited websites: google, facebook, youtube. Most used apps: gmail, youtube, google drive. 91% of clients made ≤50 searches / day. 70% of clients visited >50 webpages / day. 97% of clients visited ≤100 websites / day.
Conclusion. SplitX: a high-performance private analytics system. Orders of magnitude more efficient than previous systems; resistant to answer pollution. Key insights: XOR-based encryption and query buckets.