
1 © 2015 A. Haeberlen NETS 212: Scalable and Cloud Computing 1 University of Pennsylvania Special topics December 3, 2015

2 © 2015 A. Haeberlen Announcements Please complete the course evaluation form! Your feedback will help me improve NETS212 and will benefit next year's class Second midterm on Tuesday next week Comparable to first midterm (open-book, closed-Google,...) Covers all material up to, and including, today's lecture Location: Heilmeier Hall; time: 4:30-6:00pm as usual Project: Any questions? You should have a first running prototype by the end of this week If you're 'stuck' on something or would like feedback on your design, please see me (or one of the TAs) during office hours!! 2 University of Pennsylvania

3 © 2015 A. Haeberlen Second midterm For example, you should be able to: Implement a simple graph algorithm in (iterative) MapReduce Explain algorithms like adsorption, PageRank, SSSP,... Make and justify tradeoffs between different designs, e.g., SQL vs MapReduce, unstructured vs structured overlays,... Identify common vulnerabilities and suggest defenses Choose a suitable representation for various types of data Write simple queries, e.g., in SQL Example questions at the end of this lecture 3 University of Pennsylvania

4 © 2015 A. Haeberlen Reminder: Facebook award The teams with the best PennBook applications will receive an award (sponsored by Facebook) Criteria: Architecture/design, supported features, stability, performance, security, scalability, deployment Award #1: Backpacks (+contents); #2: Duffel bags (+contents) Winners will be announced on the course web page 4 University of Pennsylvania

5 © 2015 A. Haeberlen Administrative domains Cloud services are distributed systems with multiple administrative domains Alice controls the software running on the cloud, Bob controls the cloud hardware Neither has full control or full information They have to cooperate to make the system work 5 University of Pennsylvania [Diagram: Alice, Bob, and Alice's customers]

6 © 2015 A. Haeberlen MAD (multiple administrative domains) distributed system System + control are distributed 6 University of Pennsylvania In general, systems can have many domains Example: Interdomain routing system in the Internet

7 © 2015 A. Haeberlen Examples of MAD distributed systems The Internet P2P systems: Skype, BitTorrent,... Cloud platforms: EC2, Azure,... MMORPGs: World of Warcraft,... CDNs: Akamai Download Manager Interbank networks Telephone systems Social networks: Facebook,... Increasing trend towards MAD-style systems! 7 University of Pennsylvania

8 © 2015 A. Haeberlen Today's menu Accountability If some nodes in our distributed system (e.g., cloud) have been compromised, how can we tell? Case study: Detecting novel and unknown cheats in Counterstrike 8 University of Pennsylvania Differential privacy How can we answer questions about sensitive data without (accidentally) compromising someone's privacy? Example: Netflix disaster Goal: Provable privacy guarantees

9 © 2015 A. Haeberlen Outline 9 University of Pennsylvania Accountability Differential privacy

10 © 2015 A. Haeberlen Motivation: Data sharing In Internet systems, we would sometimes like to share data with someone in another domain Examples: Suspicious network traffic, spam, traffic statistics, attack traces, usage statistics, clicks on advertisements,... Some of this data may be related to customers 10 University of Pennsylvania A A B B C C Database with sensitive information

11 © 2015 A. Haeberlen Challenge: Privacy Sharing is difficult due to privacy concerns "Data is like plutonium - you can't touch it" Goal: Share a limited amount of information, but still protect customer privacy Just enough to be useful; not enough to reveal secrets Model: Adversary learns all shared information Pessimistic (but fairly safe) assumption Running example: Database contains network traces; adversary wants to know whether Andreas has cancer or not 11 University of Pennsylvania A A B B C C

12 © 2015 A. Haeberlen Anonymization is not enough Idea: Let's share anonymized data Example: "Tell me all the requests sent from your domain to cancer.com today" Known to be insufficient to protect privacy Example: Netflix deanonymization [Narayanan/Shmatikov] Example: Thelma Arnold and the AOL search dataset 12 University of Pennsylvania

Fraction of the U.S. population uniquely identified by gender plus:
                                    5-digit ZIP code    County
  Year of birth                          0.2%            0%
  Year and month of birth                4.2%            0.2%
  Year, month, and day of birth         63.3%           14.8%

Table 1 from: P. Golle, "Revisiting the Uniqueness of Simple Demographics in the U.S. Population", WPES 2006. Results based on the 2000 census.

13 © 2015 A. Haeberlen Example: Netflix prize dataset In 2006, Netflix released anonymized movie ratings from ~500,000 customers Goal: Hold a competition for better movie recommendation algorithms This data was 'de-anonymized' by two researchers from UT Austin... by correlating the private/anonymized Netflix ratings with the public ratings from the Internet Movie Database Result: Privacy breach User may have rated a lot more movies on Netflix than on IMDb, e.g., ones that reflect on his sexual preferences or religious/political views! 13 University of Pennsylvania

14 © 2015 A. Haeberlen Aggregation is not enough Idea: Let's share only aggregate information Example: "How many different IPs made requests to cancer.com from this network today?" Problem: Outside information Adversary might already know that 317 persons in the network have cancer, but may not be sure about Andreas If the answer is 318, he knows that Andreas has cancer Idea: Add some noise to the answer "317 people have cancer, plus or minus 2" What can the (Bayesian) adversary conclude from this? 14 University of Pennsylvania

15 © 2015 A. Haeberlen Differential privacy Idea: Quantify how much more the adversary can learn if some individual X allows his data to be included in the database 15 University of Pennsylvania [Diagram: X asks "Should I allow my data to be included?" The querier asks "How many requests to cancer.com today?" against the database with and without X's data; the difference between the two answers is masked with Laplace noise, so the (Bayesian) adversary cannot tell whether X is in the dataset.]
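To make the mechanism concrete, here is a minimal Python sketch of a Laplace-noised counting query (our illustration; the function names, data, and ε value are not from the actual course tools):

    import math
    import random

    def laplace_noise(scale):
        # Inverse-CDF sampling from Laplace(0, scale).
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def noisy_count(rows, predicate, epsilon):
        # A counting query changes by at most 1 when one person's row is
        # added or removed, so Laplace noise with scale 1/epsilon gives
        # epsilon-differential privacy.
        true_answer = sum(1 for r in rows if predicate(r))
        return true_answer + laplace_noise(1.0 / epsilon)

    requests = [("alice", "cancer.com"), ("bob", "cnn.com"), ("carol", "cancer.com")]
    print(noisy_count(requests, lambda r: r[1] == "cancer.com", epsilon=0.1))

With and without any one person's row, the two output distributions differ by at most a factor of e^ε - exactly the indistinguishability the diagram illustrates.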

16 © 2015 A. Haeberlen Privacy budgets What if the adversary can ask more than one question? Repeat the same question: Obtain multiple samples from the distribution, increase confidence Different questions: Can correlate answers Idea: Assign a 'privacy budget' to each querier Represents how much privacy the owner of the box is willing to give up (e.g., ε = 0.01) Cost of each query is deducted from the budget Querier can continue asking until budget is exhausted Number of questions depends on how private each of them is 16 University of Pennsylvania
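A sketch of the budget bookkeeping (the class and names below are our own illustration, not a specific system's API; it relies on sequential composition, under which the ε costs of successive queries simply add up):

    class PrivacyBudget:
        def __init__(self, total):
            self.remaining = total

        def charge(self, epsilon):
            # Deduct each query's cost; refuse once the budget is gone.
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted - query refused")
            self.remaining -= epsilon

    budget = PrivacyBudget(total=0.01)
    budget.charge(0.001)     # a fairly private query
    budget.charge(0.004)     # a more revealing one
    print(budget.remaining)  # ~0.005; any query costing more is now refused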

17 © 2015 A. Haeberlen What is so special about diff. privacy? We do NOT have to assume anything about: adversary's outside knowledge adversary's goals adversary's inability to learn answers to queries Nevertheless we can ensure that: Even if the adversary can ask anything he wants, he can be only a little more confident that Andreas has cancer than he was before Extremely strong guarantee! Important for convincing customers that it is 'okay' to share their data in this way 17 University of Pennsylvania

18 © 2015 A. Haeberlen Recap: What we have so far Goal: Share (some) data while provably protecting privacy Approach: Differential privacy Next: Technical part 18 University of Pennsylvania A A B B C C

19 © 2015 A. Haeberlen The importance of automation What if someone asks the following query: "If Andreas is in the database and has cancer, return 1,000,000; otherwise 0" How do we know... whether it is okay to answer this (given our bound ε)? and, if so, how much noise we need to add? Analysis can be done manually... Example: McSherry/Mironov [KDD'09] on Netflix data... but this does not scale! Each database owner would have to hire a 'privacy expert' Analysis is nontrivial - what if the expert makes a mistake? 19 University of Pennsylvania

20 © 2015 A. Haeberlen Sensitivity, and why it matters What is the difference between these queries? Answer: Their sensitivity If we add or remove one person's data from the database, how much can the answer change (at most)? Why does the sensitivity matter for diff.priv.? The higher the sensitivity, the more noise we need to add If too high (depending on ε), we can't answer at all 20 University of Pennsylvania 1) How many people looked at cancer.com today? 2) How many requests went to cancer.com today? 3) If Andreas is in the database, return 1,000,000; else 0
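A brute-force way to see the difference between these queries (a sanity check on one toy database of our own; true sensitivity must hold for all databases, which is why real tools infer it statically):

    def max_change_dropping_one(query, db):
        # How much can the answer change if a single row is removed
        # from THIS database?
        base = query(db)
        return max(abs(base - query(db[:i] + db[i + 1:])) for i in range(len(db)))

    db = [("alice", "cancer.com"), ("bob", "cancer.com"), ("bob", "cnn.com")]
    people = lambda d: len({who for who, site in d if site == "cancer.com"})
    requests = lambda d: sum(1 for _, site in d if site == "cancer.com")
    print(max_change_dropping_one(people, db))    # 1: query 1 has sensitivity 1
    print(max_change_dropping_one(requests, db))  # 1 here, but one person can send
                                                  # many requests, so query 2's
                                                  # per-person sensitivity is unbounded

Query 3 is the extreme case: one person's presence swings the answer by 1,000,000, so the noise required to hide it would render every answer useless.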

21 © 2015 A. Haeberlen Intuition behind the type system Suppose we have a function f(x)=2x+7 What is its sensitivity? Intuitively 2: changing the input by 1 changes the output by 2 21 University of Pennsylvania f(x) = x has sensitivity 1; f(x) = 7 has sensitivity 0; f(x) = 2*x has sensitivity 2*1; f(x) = 2*x + 7 has sensitivity 2*1 + 0 = 2

22 © 2015 A. Haeberlen A type system for inferring sensitivity We can use a type system to infer sensitivity 22 University of Pennsylvania Typing rule: Γ, x :k τ ⊢ e : τ' implies Γ ⊢ λx.e : τ ⊸k τ'. If we know that the value of x is used k times in e... ...then the function λx.e is k-sensitive in x. Example: y(x) = 3*x + 4
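The same intuition can be coded as a tiny sensitivity checker (a toy of our own, not the real Fuzz type checker, which handles a full language):

    def sensitivity(expr):
        # expr is a nested tuple: ("x",) | ("const", c) |
        # ("scale", c, e) | ("add", e1, e2)
        tag = expr[0]
        if tag == "x":
            return 1                                        # f(x) = x
        if tag == "const":
            return 0                                        # f(x) = c
        if tag == "scale":                                  # c * e
            return abs(expr[1]) * sensitivity(expr[2])
        if tag == "add":                                    # e1 + e2
            return sensitivity(expr[1]) + sensitivity(expr[2])

    # The slide's example y(x) = 3*x + 4:
    y = ("add", ("scale", 3, ("x",)), ("const", 4))
    print(sensitivity(y))   # 3: y is 3-sensitive in x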

23 © 2015 A. Haeberlen This works for an entire language! The full set of typing rules can be built into a simple functional programming language If program typechecks, we have a proof that running it won't compromise privacy 23 University of Pennsylvania

24 © 2015 A. Haeberlen Putting everything together Queries are written in a special programming language Runtime ensures that results are noised appropriately Result is still useful, despite the noise 24 University of Pennsylvania [Diagram: the querier sends the query below to a machine with the language runtime and a database of traffic statistics; the true answer 17,182 plus noise (scale ~100) comes back as 17,145.]

    query(db:database) {
      num = 0;
      foreach x ∈ db
        if (x is DNS lookup for 'bot*.com') then num++;
      return num;
    }

25 © 2015 A. Haeberlen Fuzz protects private information What if an adversary asks for private information? Perhaps a network has been compromised by a hacker The answer reveals almost nothing! Information 'drowns' in the noise 25 University of Pennsylvania [Diagram: the true answer 1 plus noise (scale ~100) comes back to the adversary as e.g. -47.]

    query(db:database) {
      foreach x ∈ db
        if (x is DNS lookup for 'cancer.com' by Andreas) then return 1;
      return 0;
    }

26 © 2015 A. Haeberlen Problem: Query over distributed data Scenario: Multiple players with databases Example: Airline reservations and medical records Goal: Distributed query over all databases Example: "How many persons recently travelled to Elbonia and were diagnosed with Malaria?" 26 University of Pennsylvania
Travel agent (Who → Where): Alice → UK, Bob → Elbonia, Charlie → Brazil, Doris → Elbonia, Emil → France
Physicians (Who → What): Bob → Malaria, Charlie → Depression, Fannie → Flu, George → Hiccup; Alice → Diarrhea, Doris → Malaria, Hank → Broken rib

27 © 2015 A. Haeberlen Challenge #1: Privacy Can we give the querier the full data? Absolutely not! Privacy nightmare; illegal (HIPAA) Differential privacy could help! Solutions exist: PINQ, Fuzz, Airavat... But they assume a single database! 27 University of Pennsylvania [Diagram: the travel agent's and physicians' tables from the previous slide; the querier should only receive a noised aggregate: "Approx. 203.2 patients were in Nigeria and had malaria"]

28 © 2015 A. Haeberlen Challenge #2: Distribution 28 University of Pennsylvania Idea #1: Give all data to a trusted party ("Trusty Tim") - but what if we don't have a trusted party? Idea #2: Use secure multiparty computation (an MPC circuit) - most queries would take years to run Idea #3: Use PDDP [NSDI'12] (queries via a proxy) - only works for certain types of queries (not including joins)

29 © 2015 A. Haeberlen Our approach Many practical queries are not 'full' JOINs No actual cross product Just matching up rows from different databases Morally, many of them are set intersections { people who went to Elbonia } ∩ { people who had malaria } And we know how to do those privately! Example: Freedman's protocol (Eurocrypt'04) Original protocol doesn't offer differential privacy - but we found a way to extend it 29 University of Pennsylvania SELECT COUNT(p) FROM travel JOIN medical WHERE travel.dest='Elbonia' AND medical.diagnosis='Malaria'

30 © 2015 A. Haeberlen DJoin from 10,000 feet DJoin processes JOIN queries over private data Data may be distributed across multiple curators DJoin offers (computational) differential privacy Querier learns only the noised result, and nothing else Curators only learn (noised) intermediate results 30 University of Pennsylvania [Diagram: the querier sends "How many people went to Elbonia and had malaria?" through a "privacy firewall" spanning the travel agent's and physicians' databases, and gets back the noised result 202.2±3.]

31 © 2015 A. Haeberlen Outline Problem: Queries over distributed databases Challenge: Join queries Approach: Set intersections Our solution: DJoin Background: Homomorphic crypto, PSI-CA Making PSI-CA differentially private Denoise-combine-renoise (DCR) The DJoin system Evaluation Summary 31 University of Pennsylvania NEXT

32 © 2015 A. Haeberlen Background: Homomorphic crypto We use a homomorphic cryptosystem Some operations can be done on ciphertexts - specifically, addition, and multiplication by a plain-text constant For example, the Paillier cryptosystem has this property Consequence: Can evaluate an encrypted polynomial on a plain-text value, and get encrypted result 32 University of Pennsylvania Homomorphic addition: Enc(3) + Enc(4) = Enc(7), since 3 + 4 = 7. Homomorphic multiplication by a plain-text constant: Enc(3) * 5 = Enc(15), since 3 * 5 = 15.
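A toy Paillier implementation makes the two operations tangible (7-bit primes purely to show the algebra; a real deployment needs ~2048-bit primes and a vetted library):

    import math
    import random

    def toy_paillier_keys(p=101, q=113):
        n = p * q
        lam = math.lcm(p - 1, q - 1)
        mu = pow(lam, -1, n)          # valid because we pick g = n + 1
        return n, lam, mu

    def enc(n, m):
        r = random.randrange(1, n)    # should be coprime to n; fine for a demo
        return pow(1 + n, m, n * n) * pow(r, n, n * n) % (n * n)

    def dec(n, lam, mu, c):
        return (pow(c, lam, n * n) - 1) // n * mu % n

    n, lam, mu = toy_paillier_keys()
    c3, c4 = enc(n, 3), enc(n, 4)
    print(dec(n, lam, mu, c3 * c4 % (n * n)))  # 7: adding under encryption
    print(dec(n, lam, mu, pow(c3, 5, n * n)))  # 15: multiplying by plaintext 5

Multiplying ciphertexts adds the plaintexts, and raising a ciphertext to a plain-text constant multiplies the plaintext by it - exactly the operations that evaluating an encrypted polynomial requires.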

33 © 2015 A. Haeberlen PSI-CA without differential privacy Private set-intersection cardinality [FNP04] Alice and Bob have sets A and B, and want to compute |A ∩ B| without revealing their set elements to each other Alice makes a polynomial P whose roots are the elements of A Alice encrypts the coefficients and sends them to Bob Bob evaluates P on elements of B and returns results to Alice Alice decrypts the results and counts the number of zeroes 33 University of Pennsylvania Example: A = {42, 17, 3} and B = {3, 8, 17, 22}. Alice sends the encrypted coefficients of P(x) = 4(x-42)(x-17)(x-3) = 4x^3 - 248x^2 + 3564x - 8568, i.e. Enc(4), Enc(-248), Enc(3564), Enc(-8568); Bob returns Enc(6120), Enc(0), Enc(0), Enc(-7600), which Alice decrypts to 6120, 0, 0, -7600. Two zeroes, so |A ∩ B| = 2.
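The slide's numbers can be checked with plain integer arithmetic (encryption elided here; in the protocol Bob performs the same evaluation homomorphically on the encrypted coefficients):

    def poly_from_roots(roots, lead=1):
        # Multiply out lead * prod(x - r); coefficients lowest-degree first.
        coeffs = [lead]
        for r in roots:
            shifted = [0] + coeffs                      # x * p(x)
            scaled = [r * c for c in coeffs] + [0]      # r * p(x)
            coeffs = [a - b for a, b in zip(shifted, scaled)]
        return coeffs

    def eval_poly(coeffs, x):
        return sum(c * x ** i for i, c in enumerate(coeffs))

    coeffs = poly_from_roots([42, 17, 3], lead=4)
    print(coeffs)                       # [-8568, 3564, -248, 4]
    for b in [8, 3, 17, 22]:            # Bob's set
        print(b, eval_poly(coeffs, b))  # 6120, 0, 0, -7600: zeroes mark A ∩ B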

34 © 2015 A. Haeberlen Challenges This protocol is not differentially private because: Alice learns the exact (un-noised) size of |A ∩ B| Alice and Bob learn the size of each other's set What can we do? 34 University of Pennsylvania [Diagram: the PSI-CA exchange from the previous slide - encrypted coefficients of P(x) = 4(x-42)(x-17)(x-3) from Alice to Bob, decrypted results 6120, 0, 0, -7600 back to Alice.]

35 © 2015 A. Haeberlen BN-PSI-CA Idea #1: Add noise to the result Bob can encrypt a few extra zeroes & add them to the result Problem: Bob can't remove zeroes (encrypted!) Add C+n zeroes, where n is noise (drawn from the 'right' distribution) and C is a well-known 'offset'? Problem: Alice can tell how much noise has been added ("Two extra results! Bob added two zeroes!") Solution: Add C+n zeroes and C-n random numbers, so Alice always sees exactly 2C extra results 35 University of Pennsylvania [Diagram: Bob appends extra encrypted zeroes E(0), E(0) and random values such as E(89) to the four real results before returning them.]

36 © 2015 A. Haeberlen BN-PSI-CA Idea #2: Add noise to the polynomial... so that Bob can't infer the size of Alice's set Works similarly to the above: Alice pads P with extra factors, e.g. (x-892827)(x+123819), whose roots match no real set element 36 University of Pennsylvania
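A plaintext sketch of Bob's padding step from Idea #1 (in the real protocol every value is Paillier-encrypted, and C and the noise distribution come from the privacy analysis):

    import random

    def bob_pads(real_results, C, n):
        # n: Bob's integer noise, drawn from the 'right' distribution
        # (cf. the Laplace sketch earlier) and clamped to [-C, C].
        padded = (list(real_results)
                  + [0] * (C + n)                                          # extra zeroes
                  + [random.randrange(1, 2 ** 32) for _ in range(C - n)])  # decoys
        random.shuffle(padded)
        return padded  # always len(real_results) + 2C values, whatever n is

    padded = bob_pads([6120, 0, 0, -7600], C=5, n=2)
    print(padded.count(0) - 5)  # |A ∩ B| + n = 2 + 2: Alice learns a noised count

Because Bob always appends exactly 2C extra values, neither the result length nor the decoys reveal n; Alice subtracts the public offset C from her zero count and is left with the intersection size plus noise.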

37 © 2015 A. Haeberlen Denoise-Combine-Renoise (DCR) Some queries need more than one BN-PSI-CA Example: Disjunctive predicate on select Need to apply De Morgan's law Problem: Noise adds up 37 University of Pennsylvania Example: SELECT |X.a| FROM X,Y WHERE X.a=Y.a OR X.b=Y.b Answer = |X.a ∩ Y.a| + |X.b ∩ Y.b| - |X.ab ∩ Y.ab| Each term is a BN-PSI-CA result of the form (count + noise), so three noise terms accumulate in the final answer.

38 © 2015 A. Haeberlen Denoise-Combine-Renoise (DCR) Idea: Remove noise temporarily Each player can remember how much noise they added We can use secure multi-party computation (MPC) Based on an algorithm by Dwork et al. [EUROCRYPT 2006] MPC is extremely expensive - but adding a few numbers is ok 38 University of Pennsylvania [Diagram: Alice and Bob feed the noised terms |X.a ∩ Y.a|, |X.b ∩ Y.b|, |X.ab ∩ Y.ab| and their remembered noise into a small MPC circuit, which outputs |X.a ∩ Y.a| + |X.b ∩ Y.b| - |X.ab ∩ Y.ab| with a single fresh noise term.]
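In miniature, with made-up numbers (in DJoin the denoising sum runs inside the MPC circuit, so nobody ever sees the denoised value in the clear):

    # Noised BN-PSI-CA outputs and the noise each party remembers adding:
    noisy = {"a": 317 + 4, "b": 120 - 2, "ab": 95 + 1}
    my_noise = {"a": 4, "b": -2, "ab": 1}

    combined = noisy["a"] + noisy["b"] - noisy["ab"]          # noise terms pile up
    denoised = combined - (my_noise["a"] + my_noise["b"] - my_noise["ab"])
    renoised = denoised + 3                                   # one fresh Laplace draw
    print(denoised, renoised)                                 # 342 345

The final answer carries a single fresh noise term instead of three accumulated ones.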

39 © 2015 A. Haeberlen The DJoin system We built a system called DJoin that uses BN-PSI-CA and DCR to answer queries DJoin solves several other challenges: Rewrites SQL-like queries in terms of BN-PSI-CAs Manages a 'privacy budget' to prevent queriers from issuing too many queries Performs a sensitivity analysis to determine how much noise is necessary for a given query Uses an encoding to support certain types of JOINs that are not equivalent to intersections Manages the local databases on each node Secures query execution against malicious queriers 39 University of Pennsylvania (Some of these are covered in this lecture, the rest in the paper.)

40 © 2015 A. Haeberlen Query rewriting 40 University of Pennsylvania Example: SELECT COUNT(A.id) FROM A,B WHERE A.diag='malaria' AND (A.ssn=B.ssn OR A.id=B.id) [Diagram: the rewritten query plan. The selection σ(diag='malaria') is "pushed through" the join so it runs locally on A; a distributed join is not supported directly, so the disjunction (A.ssn=B.ssn OR A.id=B.id) becomes set operations on the projections π(ssn) and π(id), combined with |A ∪ B| = |A| + |B| - |A ∩ B| via DCR. The resulting plan contains only local operations, BN-PSI-CAs, and DCR - it will work in DJoin!]

41 © 2015 A. Haeberlen Limitations This works if the WHERE clause contains arbitrary operations on individual databases, conjunctions and disjunctions of equalities across databases (A.x=B.y AND (A.z=B.z OR A.c=B.q)), and/or certain inequalities and numeric comparisons We have a set of rewrite rules in the paper http://www.cis.upenn.edu/~ahae/papers/djoin-osdi2012.pdf Some JOINs cannot be supported Some because they wouldn't be differentially private Others because we don't know how to efficiently encode them as set intersections Example: Check whether a string from one database appears as a substring in the other 41 University of Pennsylvania

42 © 2015 A. Haeberlen Outline Problem: Queries over distributed databases Challenge: Join queries Approach: Set intersections Our solution: DJoin Background: Homomorphic crypto, PSI-CA Making PSI-CA differentially private Denoise-combine-renoise (DCR) The DJoin system Evaluation Summary 42 University of Pennsylvania NEXT

43 © 2015 A. Haeberlen Evaluation: Questions What kinds of queries can you run? How long do these queries take? How well does BN-PSI-CA scale with the size of the database? Can you parallelize BN-PSI-CA? How expensive is the DCR step? I will present a sample of our results 43 University of Pennsylvania

44 © 2015 A. Haeberlen Experimental setup We built a DJoin prototype MySQL for local databases, FairplayMP for SMC BN-PSI-CA based on Kissner/Song + Paillier cryptosystem Various optimizations to speed up BN-PSI-CA (see paper) Supports joins with more than two parties Experiments on five normal cluster machines Xeon E5530 2.4GHz, 12GB memory, Gbit Ethernet Database: 15,000 rows of synthetic data 44 University of Pennsylvania

45 © 2015 A. Haeberlen Evaluation: BN-PSI-CA runtime How long does BN-PSI-CA take? Almost linear in database size: O(|S1| + |S2| ln ln |S1|) Embarrassingly parallel; speedup 3.98 with four cores Nontrivial computation cost (minutes) - not suitable for interactive use 45 University of Pennsylvania [Graph: computation time in minutes (0-120) vs. number of elements in each party's set (5,000-30,000).]

46 © 2015 A. Haeberlen Evaluation: Expressivity What kinds of queries can you write? SQL-like syntax for usability Full SQL allowed for local operations Number of BN-PSI-CAs depends on complexity of the query 46 University of Pennsylvania
Query (number of BN-PSI-CAs):
Q1: SELECT NOISY COUNT(A.x) FROM A,B WHERE A.x=B.y (1)
Q2: SELECT NOISY COUNT(A.x) FROM A,B WHERE A.x=B.x AND (A.y!=B.y) (2)
Q3: SELECT NOISY COUNT(A.x) FROM A,B WHERE A.x=B.y AND (A.z="x" OR B.p="y") (2)
Q4: SELECT NOISY COUNT(A.x) FROM A,B WHERE A.x=B.x OR A.y=B.y (3)
Q5: SELECT NOISY COUNT(A.x) FROM A,B WHERE A.x LIKE "%xyz%" AND A.w=B.w AND (B.y+B.z>10) AND (A.y>B.y) (8)

47 © 2015 A. Haeberlen Evaluation: Turnaround time How long do these queries take to run? For databases with 15,000 rows, between 1 and 8 hours Could be sped up with additional cores/machines For comparison: A naive implementation in MPC takes 40 seconds for 8 rows (!) and scales quadratically with #rows 47 University of Pennsylvania [Graph: completion time in minutes (0-500) for queries Q1-Q5.]

48 © 2015 A. Haeberlen Summary: DJoin A differentially private query processor for distributed databases First practical solution that supports joins (with some restrictions) Based on two novel primitives BN-PSI-CA: Differentially private set intersection cardinality DCR: Denoise-combine-renoise Not fast enough for interactive use, but may be sufficient for offline data analysis 48 University of Pennsylvania More information: http://privacy.cis.upenn.edu/

49 © 2015 A. Haeberlen Outline 49 University of Pennsylvania Accountability Differential privacy

50 © 2015 A. Haeberlen Scenario: Multiplayer game Alice decides to play a game of Counterstrike with Bob and Charlie 50 University of Pennsylvania Alice Bob Charlie Network I'd like to play a game

51 © 2015 A. Haeberlen What Alice sees 51 University of Pennsylvania [Movie: gameplay footage from Alice's point of view]

52 © 2015 A. Haeberlen Could Bob be cheating? In Counterstrike, ammunition is local state Bob can manipulate the counter and prevent it from decrementing Such cheats (and many others) do exist, and are being used 52 University of Pennsylvania [Diagram: Alice, Bob, and Charlie connected over the network; Bob's ammo counter (35, 36, 37) never decreases]

53 © 2015 A. Haeberlen This talk Cheating is a serious problem in itself Multi-billion-dollar industry But we address a more general problem: Alice relies on software that runs on a third-party machine Examples: Competitive system (auction), federated system... How does Alice know if the software is running as intended? 53 University of Pennsylvania [Diagram: Alice - network - Bob's machine running the software] It is not (just) about cheating!

54 © 2015 A. Haeberlen Goal: Accountability We want Alice to be able to Detect when the remote machine is faulty Obtain evidence of the fault that would convince a third party Challenges: Alice and Bob may not trust each other Possibility of intentional misbehavior (example: cheating) Neither Alice nor Bob may understand how the software works Binary only - no specification of the correct behavior 54 University of Pennsylvania Network Alice Bob Software

55 © 2015 A. Haeberlen Outline Problem: Detecting faults on remote machines Example: Cheating in multiplayer games Solution: Accountable Virtual Machines Evaluation Using earlier example (cheating in Counterstrike) Summary 55 University of Pennsylvania NEXT

56 © 2015 A. Haeberlen Overview Bob runs Alice's software image in an AVM AVM maintains a log of network in-/outputs Alice can check this log with a reference image AVM correct: Reference image can produce same network outputs when started in same state and given same inputs AVM faulty: Otherwise 56 University of Pennsylvania [Diagram: Alice's virtual machine image runs on Bob's machine inside an Accountable Virtual Machine (AVM) on top of an Accountable Virtual Machine Monitor (AVMM), which keeps a log.] Open questions: What if Bob manipulates the log? How can Alice find this execution, if it exists? (Note: Alice must trust her own reference image.)

57 © 2015 A. Haeberlen Tamper-evident logging Message log is tamper-evident [SOSP'07] Log is structured as a hash chain Messages contain signed authenticators Result: Alice can either... ...detect that the log has been tampered with, or ...get a complete log with all the observable messages 57 University of Pennsylvania [Diagram: the AVM's log on the AVMM, e.g. 471: SEND(Charlie, Moving left), 472: RECV(Alice, Got medipack), 473: SEND(Charlie, Got ammo), 474: SEND(Alice, Firing).]
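A minimal hash-chain log in Python (a sketch; the real system additionally signs each authenticator and exchanges it with the communication partner):

    import hashlib

    def append(log, entry):
        # Each record chains the hash of its predecessor, so altering any
        # earlier entry changes every later hash - which others already
        # hold in signed form.
        prev = log[-1][1] if log else b"\x00" * 32
        h = hashlib.sha256(prev + entry.encode()).digest()
        log.append((entry, h))

    log = []
    for e in ["471: SEND(Charlie, Moving left)",
              "472: RECV(Alice, Got medipack)",
              "473: SEND(Charlie, Got ammo)"]:
        append(log, e)
    # Rewriting entry 471 now invalidates the hashes of 472 and 473,
    # so tampering is either detected or forces Bob to fork his log.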

58 © 2015 A. Haeberlen Execution logging How does Alice know whether the log matches a correct execution of her software image? Idea: AVMM can specify an execution AVMM additionally logs all nondeterministic inputs AVM correct: Can replay inputs to get execution AVM faulty: Replay inevitably (!) fails 58 University of Pennsylvania [Diagram: alongside the SEND/RECV entries, the AVMM records nondeterministic events such as 470: Got network interrupt and 473: Mouse button clicked.]
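A toy illustration of why replay exposes manipulated state (the 'game' and log format below are made up; a real AVMM replays a full virtual machine):

    def step(ammo, event):
        # Deterministic reference behavior: firing decrements the counter.
        if event == "mouse_click":
            ammo -= 1
            return ammo, f"SEND(Alice, Fire; ammo={ammo})"
        return ammo, f"SEND(Alice, {event})"

    def audit(log, start_ammo=3):
        # Replay the logged nondeterministic inputs through the reference
        # implementation; every logged output must be reproduced exactly.
        ammo = start_ammo
        for event, claimed in log:
            ammo, out = step(ammo, event)
            if out != claimed:
                return "FAULT: replay diverges at " + claimed
        return "OK"

    honest = [("mouse_click", "SEND(Alice, Fire; ammo=2)")]
    cheater = [("mouse_click", "SEND(Alice, Fire; ammo=3)")]  # counter never drops
    print(audit(honest), audit(cheater))  # OK, then FAULT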

59 © 2015 A. Haeberlen Auditing and replay 59 University of Pennsylvania [Diagram: Alice downloads Bob's log over the network and replays it on her own AVMM against her reference image; Bob's modification produced extra messages (372, 373: SEND(Alice, Firing)) that the replay cannot reproduce, and the divergence serves as evidence.]

60 © 2015 A. Haeberlen AVM properties Strong accountability Detects faults Produces evidence No false positives Works for arbitrary, unmodified binaries Nondeterministic events can be captured by AVM Monitor Alice does not have to trust Bob, the AVMM, or any software that runs on Bob's machine If Bob tampers with the log, Alice can detect this If Bob's AVM is faulty, ANY log Bob could produce would inevitably cause a divergence during replay 60 University of Pennsylvania If it runs in a VM, it will work

61 © 2015 A. Haeberlen Outline Problem: Detecting faults on remote machines Example: Cheating in multiplayer games Solution: Accountable Virtual Machines Evaluation Using earlier example (cheating in Counterstrike) Summary 61 University of Pennsylvania NEXT

62 © 2015 A. Haeberlen Methodology We built a prototype AVMM Based on logging/replay engine in VMware Workstation 6.5.1 Extended with tamper-evident logging and auditing Evaluation: Cheat detection in games Setup models competition / LAN party Three players playing Counterstrike 1.6 Nehalem machines (i7 860) Windows XP SP3 62 University of Pennsylvania

63 © 2015 A. Haeberlen Evaluation topics Effectiveness against real cheats Overhead Disk space (for the log) Time (auditing, replay) Network bandwidth (for authenticators) Computation (signatures) Latency (signatures) Impact on game performance Online auditing Spot checking tradeoffs Using a different application: MySQL on Linux 63 University of Pennsylvania Please refer to the paper for additional results!

64 © 2015 A. Haeberlen AVMs can detect real cheats If the cheat needs to be installed in the AVM to be effective, the AVM can trivially detect it Reason: Event timing + control flow change Examined 26 real cheats from the Internet; all detectable 64 University of Pennsylvania [Diagram: for replay, Bob's log pairs each event (e.g. 96: Mouse button clicked, 97: SEND(Alice, Fire@(3,9))) with timing metadata - branch counts (BC) and instruction pointers (EIP); a cheat changes both the messages (Fire@(3,9) vs. Fire@(2,7)) and the BC/EIP values, so replay diverges.]

65 © 2015 A. Haeberlen AVMs can detect real cheats Couldn't cheaters adapt their cheats? There are three types of cheats: 1. Detection impossible (Example: Collusion) 2. Detection not guaranteed, but evasion technically difficult 3. Detection guaranteed (≈15% of the cheats in our sample) 65 University of Pennsylvania [Diagram: to evade detection, a cheat would have to forge a log with plausible injected events (e.g. synthetic mouse movements leading up to Fire@(2,7)) together with consistent BC/EIP timing values - shown as question marks - which is technically difficult.]

66 © 2015 A. Haeberlen Impact on frame rate Frame rate is ~13% lower than on bare hardware 137fps is still a lot! 60-80fps generally recommended 11% due to logging; additional cost for accountability is small 66 University of Pennsylvania [Graph: average frame rate for bare hardware (158fps), VMware without logging, VMware with logging (-11%), AVMM without crypto, and the full AVMM (-13% vs. bare hardware). Settings: no fps cap, window mode, 800x600, software rendering; different machines with different players.]

67 © 2015 A. Haeberlen Cost of auditing When auditing a player after a one-hour game, How big is the log we have to download? How much time is needed for replay? 67 University of Pennsylvania [Graph: average log growth in MB/minute for VMware vs. the AVMM; accountability raises it to ~8 MB per minute, or 2.47 MB per minute compressed - about 148 MB for a one-hour game. Replay takes roughly as long as the game itself (~1 hour).]

68 © 2015 A. Haeberlen Online auditing Idea: Stream logs to auditors during the game Result: Detection within seconds after fault occurs Replay can utilize unused cores; frame rate penalty is low 68 University of Pennsylvania [Graph: average frame rate with no online auditing, one audit per player, and two audits per player; Alice, Bob, and Charlie each run the game, logging, and replay concurrently.]

69 © 2015 A. Haeberlen Summary Accountable Virtual Machines (AVMs) offer strong accountability for unmodified binaries Useful when relying on software executing on remote machines: Federated system, multiplayer games,... No trusted components required AVMs are practical Prototype implementation based on VMware Workstation Evaluation: Cheat detection in Counterstrike 69 University of Pennsylvania More information: http://accountability.cis.upenn.edu/

70 © 2015 A. Haeberlen 70 University of Pennsylvania I hope you liked NETS212! Please don't forget to complete your course evaluations

