
1 CIS 455/555 (Internet and Web Systems): Special topics
A. Haeberlen, University of Pennsylvania, April 22, 2016

2 Current projects
- Accountability and evidence: If some nodes in our distributed system (e.g., a cloud) have been compromised, how can we tell? Case study: detecting novel and unknown cheats in Counterstrike.
- Differential privacy: How can we answer questions about sensitive data without (accidentally) compromising someone's privacy? Example: the Netflix disaster. Goal: provable privacy guarantees.

3 Current projects
- Network forensics: If you discover that some part of your cloud has been broken into, how can you tell what to clean up? Approach: secure provenance.
- Resilient cloud: If part of your cloud service gets compromised, can you limit the damage and perform automatic recovery? Approach: communities of trust.

4 For more information
- Accountability and evidence: http://accountability.cis.upenn.edu/
- Differential privacy: http://privacy.cis.upenn.edu/
- Network forensics: http://snp.cis.upenn.edu/
- Resilient cloud: http://sound.cis.upenn.edu/

5 Outline
- Accountability & evidence
- Differential privacy
- Network forensics
- Resilient cloud

6 Scenario: Multiplayer game
Alice decides to play a game of Counterstrike with Bob and Charlie.
[Diagram: Alice, Bob, and Charlie connected via the network; Alice: "I'd like to play a game"]

7 What Alice sees
[Video: gameplay footage from Alice's perspective]

8 Could Bob be cheating?
- In Counterstrike, ammunition is local state: Bob can manipulate the counter and prevent it from decrementing.
- Such cheats (and many others) do exist, and are being used.
[Diagram: Alice, Bob, and Charlie on the network; Bob manipulates his ammo counter (35/36/37)]
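
The cheat the slide describes exploits client-authoritative state. A minimal Python sketch of the idea (the class and message names are hypothetical, not actual Counterstrike code): honest clients decrement the counter, and a memory patch simply removes the decrement.

    class Weapon:
        """Client-side weapon state; the server trusts whatever we report."""

        def __init__(self, ammo=35):
            self.ammo = ammo

        def fire(self):
            if self.ammo <= 0:
                return None
            self.ammo -= 1                       # honest clients decrement
            return "SEND(server, Firing)"

    # A cheat patches the method in memory so the counter never decreases:
    Weapon.fire = lambda self: "SEND(server, Firing)"   # unlimited ammo

    gun = Weapon(ammo=1)
    print(gun.fire(), gun.fire(), gun.ammo)   # fires forever, ammo stays at 1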

9 The state of the art
- Anti-cheating techniques: VAC2, PunkBuster, ...
  - Can only find known types of cheats
  - Privacy concerns: memory scanning, sending screenshots, ...
- Want to know more? Hoglund/McGraw: Exploiting Online Games (Addison-Wesley)

10 This is not (just) about cheating!
- Cheating is a serious problem in itself: a multi-billion-dollar industry.
- But there is a more general problem: Alice relies on software that runs on a third-party machine. This is true of the cloud (and many other types of systems).
- How does Alice know whether the software is running as intended?
[Diagram: Alice connected via the network to software running on Bob's machine]

11 What do we do in the 'offline' world?
- Misbehavior isn't unique to computers:
  - What if you run a company and somebody embezzles money?
  - What if someone misreports their income on their tax return?
  - What if a professor spends his grant money on a Ferrari?
- In the 'offline' world, these aren't huge issues. Why?

12 Learning from the 'offline' world
- In the 'offline' world, we rely on accountability.
- Accountability can:
  - detect faults,
  - identify the faulty person/entity, and
  - convince others of the fault (by producing evidence).
- Can we apply this approach to distributed systems?
[Diagram: auditors check a tamper-evident record against the expected behavior, yielding incriminating evidence]

13 Goal: Accountability
- We want Alice to be able to:
  - detect when the remote machine is faulty, and
  - obtain evidence of the fault that would convince a third party.
- Challenges:
  - Alice has no idea how Bob's cheat might work (if he has one). How do you look for something when you don't know what it looks like?
  - Neither Alice nor Bob may understand how the software works: binary only, no specification of the correct behavior.

14 Accountable Virtual Machines
- Bob runs Alice's software image in an Accountable Virtual Machine (AVM), on top of an Accountable Virtual Machine Monitor (AVMM).
- The AVM maintains a log of network inputs and outputs.
- Alice can check this log against a reference image.
- Alice can detect a large class of cheats, without having to know how the cheat works, and Bob cannot prevent detection! Conversely, if Bob has not cheated, he can demonstrate this.

15 Tamper-evident logging
- The message log is tamper-evident [SOSP'07]:
  - The log is structured as a hash chain.
  - Messages contain signed authenticators.
- Result: Alice can either detect that the log has been tampered with, or get a complete log with all the observable messages.
[Log excerpt: 471: SEND(Charlie, Moving left); 472: RECV(Alice, Got medipack); 473: SEND(Charlie, Got ammo); 474: SEND(Alice, Firing)]
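
A minimal Python sketch of such a hash-chained log. It illustrates only the chaining; a real AVMM would additionally sign each chain hash to produce the authenticators mentioned above, and the entry format here is invented for illustration.

    import hashlib

    class TamperEvidentLog:
        """Hash-chained log: each entry commits to the entire prefix,
        so truncating or rewriting history changes every later hash."""

        def __init__(self):
            self.entries = []          # list of (seq, event, chain_hash)
            self.top = b"\x00" * 32    # hash of the empty log

        def append(self, event: str) -> bytes:
            seq = len(self.entries)
            self.top = hashlib.sha256(
                self.top + seq.to_bytes(8, "big") + event.encode()
            ).digest()
            self.entries.append((seq, event, self.top))
            return self.top            # would be signed -> authenticator

        def verify(self) -> bool:
            h = b"\x00" * 32
            for seq, event, stored in self.entries:
                h = hashlib.sha256(
                    h + seq.to_bytes(8, "big") + event.encode()
                ).digest()
                if h != stored:
                    return False       # tampering detected
            return True

    log = TamperEvidentLog()
    log.append("SEND(Charlie, Moving left)")
    log.append("RECV(Alice, Got medipack)")
    # Rewriting an old entry breaks the chain:
    log.entries[0] = (0, "SEND(Charlie, Moving right)", log.entries[0][2])
    print(log.verify())   # False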

16 Execution logging
- How does Alice know whether the log matches a correct execution of her software image?
- Idea: the AVMM can specify an execution. The AVMM additionally logs all nondeterministic inputs.
  - AVM correct: replaying the inputs reproduces the execution.
  - AVM faulty: replay inevitably (!) fails.
[Log excerpt with nondeterministic events interleaved: 469: SEND(Charlie, Moving left); 470: Got network interrupt; 471: RECV(Alice, Got medipack); 472: SEND(Charlie, Got ammo); 473: Mouse button clicked; 474: SEND(Alice, Firing)]
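
The detection argument rests on deterministic replay: once all nondeterministic inputs are logged, a correct execution is exactly reproducible. A sketch of the idea in Python, assuming the software can be modeled as a deterministic event handler (the interfaces are hypothetical):

    def run(handler, state, inputs):
        """Execute a deterministic handler over a sequence of logged
        nondeterministic inputs, collecting the outputs it produces."""
        outputs = []
        for event in inputs:
            state, out = handler(state, event)
            outputs.append(out)
        return state, outputs

    def audit(reference_handler, init_state, logged_inputs, logged_outputs):
        """Replay Bob's log against Alice's reference image. Any deviation
        in Bob's execution shows up as a mismatch, i.e., as evidence."""
        _, replayed = run(reference_handler, init_state, logged_inputs)
        return replayed == logged_outputs   # False => fault detected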

17 Auditing and replay
[Diagram: Alice audits Bob's log by replaying it on her own AVMM. Bob's machine runs an 'unlimited ammo' patch; his log reads: 366: Mouse moved left; 367: SEND(Alice, Got medipack); 368: Mouse button clicked; 369-373: SEND(Alice, Firing). The replayed execution diverges from the logged one, and the divergence is evidence of the cheat.]

18 But what about my frame rate?
- The frame rate is ~13% lower than on bare hardware, but 137 fps is still a lot: 60-80 fps is generally recommended.
- About 11% of the overhead is due to logging; the additional cost of accountability is small.
[Bar chart: average frame rate for bare hardware (158 fps), VMware without and with logging, AVMM without crypto, and full AVMM (13% below bare hardware; logging alone costs 11%). Setup: no fps cap, window mode, 800x600, software rendering, different machines with different players]

19 Summary: Accountability & evidence
- Nodes in a distributed system can misbehave: compromised by an attacker, manipulated by the operator, ...
- Approach: accountability
  - Reliably detect when misbehavior occurs.
  - Diagnose the misbehavior (identify which nodes are affected).
  - Provide irrefutable evidence of the misbehavior. The owner of the affected node(s) may deny everything, or may not even be aware that the misbehavior is occurring!
- Challenge: detection and evidence. How do you prove that something did (or did not) happen in a distributed system, especially if nodes are telling lies or actively covering their traces?
- Ongoing research, e.g., PeerReview, AVM, ...: provable detection guarantees, cryptographic evidence.

20 Outline
- Accountability & evidence
- Differential privacy
- Network forensics
- Resilient cloud

21 Opportunity: 'Big Data'
- Data is accumulating in systems every day. What could we do with all this data? Medical research, traffic planning, better movie recommendations, fewer ads, less congested networks, ...
- But what about privacy? It is very hard to protect (remember Netflix!), which leads to reluctance to share data.
- It seems we can either guarantee privacy or work with the data, but not both.
[Diagram: data owner's database, data analysts]

22 Challenge: Privacy
- Sharing is difficult due to privacy concerns: "Data is like plutonium - you can't touch it."
- Goal: share a limited amount of information, but still protect customer privacy. Just enough to be useful; not enough to reveal secrets.
- Model: the adversary learns all shared information. A pessimistic (but fairly safe) assumption.
- Running example: the database contains network traces; the adversary wants to know whether Andreas has cancer or not.

23 Anonymization is not enough
- Idea: let's share anonymized data. Example: "Tell me all the requests sent from your domain to cancer.com today."
- This is known to be insufficient to protect privacy:
  - Example: Netflix de-anonymization [Narayanan/Shmatikov]
  - Example: the AOL search dataset
- Fraction of the U.S. population uniquely identified by simple demographics (Table 1 from P. Golle, "Revisiting the Uniqueness of Simple Demographics in the U.S. Population", WPES 2006; results based on the 2000 census):

                        Year of birth   Year and month   Year, month, and day
      5-digit ZIP code       0.2%            4.2%               63.3%
      County                 0%              0.2%               14.8%

24 Privacy is hard!

25 Example: Netflix prize dataset
- In 2006, Netflix released anonymized movie ratings from ~500,000 customers. Goal: hold a competition for better movie recommendation algorithms.
- This data was 'de-anonymized' by two researchers from UT Austin, by correlating the private/anonymized Netflix ratings with the public ratings from the Internet Movie Database.
- Result: a privacy breach. A user may have rated a lot more movies on Netflix than on IMDb, e.g., ones that reflect on his sexual preferences or religious/political views!

26 Aggregation is not enough
- Idea: let's share only aggregate information. Example: "How many different IPs made requests to cancer.com from this network today?"
- Problem: outside information. The adversary might already know that 317 persons in the network have cancer, but may not be sure about Andreas. If the answer is 318, he knows that Andreas has cancer.
- Idea: add some noise to the answer: "317 people have cancer, plus or minus 2." What can the (Bayesian) adversary conclude from this?

27 Differential privacy
- Idea: quantify how much more the adversary can learn if some individual X allows his data to be included in the database.
[Diagram: X asks "Should I allow my data to be included?"; the query "How many requests to cancer.com today?" is answered with Laplace noise added, and the answer distributions with and without X's data differ only slightly]
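
A minimal Python sketch of the Laplace mechanism for a counting query. A count has sensitivity 1 (one person's data changes it by at most 1), so Laplace noise with scale 1/ε gives ε-differential privacy; the sampling trick (difference of two exponentials) and the function names are illustrative choices, not taken from the slides.

    import random

    def laplace(scale: float) -> float:
        """Sample Laplace(0, scale) as the difference of two exponentials."""
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def noisy_count(db, predicate, epsilon=0.1):
        """Counting query with the Laplace mechanism (sensitivity 1)."""
        true_count = sum(1 for row in db if predicate(row))
        return true_count + laplace(1.0 / epsilon)

    # e.g., "317 people have cancer, plus or minus some noise":
    db = ["cancer"] * 317 + ["healthy"] * 1000
    print(noisy_count(db, lambda row: row == "cancer", epsilon=0.1))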

28 Privacy budgets
- What if the adversary can ask more than one question?
  - Repeating the same question: obtain multiple samples from the distribution, increase confidence.
  - Different questions: answers can be correlated.
- Idea: assign a 'privacy budget' to each querier.
  - It represents how much privacy the owner of the box is willing to give up (e.g., ε = 0.1).
  - The cost of each query is deducted from the budget; the querier can continue asking until the budget is exhausted.
  - The number of questions depends on how private each of them is.
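
A sketch of the budget bookkeeping under sequential composition (the epsilons of successive queries add up); the class and the numbers are illustrative:

    class PrivacyBudget:
        """Track cumulative privacy loss for one querier; once the
        budget is spent, no further queries are answered."""

        def __init__(self, total_epsilon=0.1):
            self.remaining = total_epsilon

        def charge(self, query_epsilon: float):
            if query_epsilon > self.remaining:
                raise PermissionError("privacy budget exhausted")
            self.remaining -= query_epsilon

    budget = PrivacyBudget(total_epsilon=0.1)
    budget.charge(0.05)    # first query allowed
    budget.charge(0.05)    # second query exhausts the budget
    # budget.charge(0.01)  # would now raise PermissionError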

29 What is so special about diff. privacy?
- We do NOT have to assume anything about:
  - the adversary's outside knowledge,
  - the adversary's goals, or
  - the adversary's inability to learn answers to queries.
- Nevertheless, we can ensure that, even if the adversary can ask anything he wants, he can be at most a little more confident that Andreas has cancer than he was before.
- This is an extremely strong guarantee! It is important for convincing customers that it is 'okay' to share their data in this way.

30 Recap: What we have so far
- Goal: share (some) data while provably protecting privacy.
- Approach: differential privacy.

31 The importance of automation
- What if someone asks the following query: "If Andreas is in the database and has cancer, return 1,000,000; otherwise 0" (or some more complex equivalent)?
- How do we know whether it is okay to answer this (given our bound ε), and, if so, how much noise we need to add?
- The analysis can be done manually (example: McSherry/Mironov [KDD'09] on the Netflix data), but this does not scale: each database owner would have to hire a 'privacy expert', and the analysis is nontrivial. What if the expert makes a mistake?

32 Sensitivity, and why it matters
- What is the difference between these queries?
  1) How many people looked at cancer.com today?
  2) How many requests went to cancer.com today?
  3) If Andreas is in the database, return 1,000,000; else 0.
- Answer: their sensitivity. If we add or remove one person's data from the database, how much can the answer change (at most)?
- Why does sensitivity matter for differential privacy? The higher the sensitivity, the more noise we need to add; if it is too high (depending on ε), we can't answer at all.
- But how can we tell how sensitive a query is? (See the sketch below.)
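
A sketch of how a runtime would scale the noise to the query's sensitivity, with the slide's three queries annotated. The per-person cap for query 2 is an assumed parameter: without clipping, that query's sensitivity is unbounded.

    import random

    def laplace(scale):
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def noisy_answer(true_value, sensitivity, epsilon):
        """The noise scale is sensitivity / epsilon, so high-sensitivity
        queries get proportionally more noise."""
        return true_value + laplace(sensitivity / epsilon)

    eps = 0.1
    print(noisy_answer(317, 1, eps))         # query 1: one person changes the count by at most 1
    print(noisy_answer(12345, 50, eps))      # query 2: sensitivity = per-person request cap (assumed 50)
    print(noisy_answer(1_000_000, 1_000_000, eps))  # query 3: the answer drowns in noise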

33 Intuition behind the type system
- Suppose we have a function f(x) = 2x + 7. What is its sensitivity to changes in the input variable x? Intuitively 2: changing the input by 1 changes the output by 2.
- Building blocks:
  - f(x) = x        has sensitivity 1 in x
  - f(x) = 7        has sensitivity 0 in x
  - f(x) = 2*x      has sensitivity 2*1 in x
  - f(x) = 2*x + 7  has sensitivity 2*1 + 0 in x

34 A type system for inferring sensitivity
- We can use a type system to infer sensitivity.
- The abstraction rule (reconstructed below): if we know that the value of x is used k times in e, then the function λx.e is k-sensitive in x.
- Example: y(x) = 3*x + 4.
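
In display form, a reconstruction of the rule (roughly following the Fuzz type system of Reed and Pierce; the exact notation in the published rules differs in details):

    \[
      \frac{\Gamma,\; x :_{k} \tau \;\vdash\; e : \tau'}
           {\Gamma \;\vdash\; \lambda x.\, e \;:\; !_{k}\,\tau \multimap \tau'}
    \]
    % If the body e is k-sensitive in x, then \lambda x.e is a
    % k-sensitive function from \tau to \tau'.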

35 This works for an entire language!
- The full set of typing rules can be built into a simple functional programming language.
- If a program typechecks, we have a proof that running it won't compromise privacy.

36 Putting everything together
- Queries are written in a special programming language.
- The runtime ensures that results are noised appropriately.
- The result is still useful, despite the noise.

      query(db : database) {
        num = 0;
        foreach x in db
          if (x is DNS lookup for 'bot*.com') then
            num++;
        return num;
      }

[Diagram: the querier sends this query to a machine with the language runtime and a database of traffic statistics; the true count is 17,182, noise of magnitude ~100 is added, and the querier receives 17,145]

37 Fuzz protects private information
- What if an adversary asks for private information? (Perhaps the network has been compromised by a hacker.)
- The answer reveals almost nothing: the information 'drowns' in the noise.

      query(db : database) {
        foreach x in db
          if (x is DNS lookup for 'cancer.com' by Andreas) then
            return 1;
        return 0;
      }

[Diagram: the true answer is 1, noise of magnitude ~100 is added, and the adversary receives -47]

38 Mission accomplished?

39 Problem: Side-channel attacks
- Problem: the adversary cannot just observe the result, but also:
  - how long the query takes,
  - whether there are any run-time exceptions,
  - the effect on the privacy budget, ...

      query(db : database) {
        foreach x in db
          if (x is DNS lookup for 'cancer.com' by Andreas) then
            delay for 1 second;
        return 0;
      }

- The adversary receives 0 + noise either way, but measures the response time: ~0.1 s means the record is not present, ~1.2 s means it is present. The adversary can learn the private information with certainty: a blatant violation of the differential privacy guarantees!
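
One way to close the timing channel is to make each per-record step take a fixed, predictable amount of time, substituting a default value when the computation runs long or throws. The Python sketch below conveys the idea only; it is not Fuzz's actual enforcement mechanism, and the padding shown cannot retroactively hide a step that already overran its budget (a real enforcer has to cut the step off preemptively).

    import time

    def fixed_time_step(f, record, default, budget=0.001):
        """Evaluate one per-record step in (nominally) constant time.
        Fast results are padded up to the budget; slow or crashing
        results are replaced by a default, so neither timing nor
        exceptions depend on the data."""
        start = time.monotonic()
        try:
            result = f(record)
        except Exception:
            result = default       # exceptions must not be observable
        if time.monotonic() - start >= budget:
            result = default       # over budget: discard the real result
        # pad so every record takes the same wall-clock time
        time.sleep(max(0.0, budget - (time.monotonic() - start)))
        return result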

40 How serious is the problem?
- Several types of side channels are possible: garbage collection, runtime exceptions, the privacy budget, ...
- Demonstrated attacks on published systems:
  - PINQ (McSherry, SIGMOD 2009)
  - Airavat (Roy et al., NSDI 2010)
- Existing defenses are not enough. They are typically based on reducing the amount of leakage, but even leaking a single bit of private information may have catastrophic consequences!

41 Problem: Query over distributed data
- Scenario: multiple players, each with their own database. Example: airline reservations and medical records.
- Goal: a distributed query over all databases. Example: "How many persons recently travelled to Nigeria and were diagnosed with malaria?"

      Travel agent         Physicians
      Who      Where       Who      What          Who    What
      Alice    London      Bob      Malaria       Alice  Diarrhea
      Bob      Abuja       Charlie  Depression    Doris  Malaria
      Charlie  Vegas       Fannie   Flu           Hank   Broken rib
      Doris    Lagos       George   Hiccup
      Emil     Moscow

42 Challenge #1: Privacy
- Can we give the querier the full data? Absolutely not: a privacy nightmare, and illegal (HIPAA).
- How about an aggregate answer? Perhaps a total count: "2 patients were in Nigeria and had malaria."
- Not enough: the querier might have some outside information! ("I know Bob and Doris were there, and I know Bob had malaria!")
[Tables as on slide 41]

43 Differential privacy helps, but...
- Differential privacy helps! Idea: give only an approximate answer (add some 'noise'): "203.2 ± 3 patients were in Nigeria and had malaria." This is OK if the numbers are big and the noise is small. [Complex mathematical details omitted]
- There are some systems that offer differential privacy: PINQ, Fuzz, Airavat, ...
- But there is a catch!
[Tables as on slide 41]

44 Challenge #2: Distribution
- Existing solutions assume a single database (or a specially formed query)!
- We can't necessarily gather the data in a single place, and we can't apply them separately to each database either: we need to 'join' the databases to match up individuals.

      Who      Where    What
      Alice    London   Diarrhea
      Bob      Abuja    Malaria
      Charlie  Vegas    Depression
      Doris    Lagos    Malaria
      Emil     Moscow   ---
      Fannie   ---      Flu
      George   ---      Hiccup
      Hank     ---      Broken rib

[Diagram: the joined table is queried and noise is added before the querier sees the result]

45 DJoin from 10,000 feet
- DJoin processes JOIN queries over private data; the data may be distributed across multiple curators.
- DJoin guarantees differential privacy:
  - The querier learns only the (noised) result, and nothing else.
  - The curators only learn (noised) intermediate results.
[Diagram: the querier asks "How many people went to Nigeria and had malaria?" across a 'privacy firewall'; the result is 202.2 ± 3]
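
DJoin's actual protocol uses cryptographic techniques so that no single party ever sees both tables. Purely to pin down what the query is supposed to compute, here is its semantics sketched in Python with a hypothetical trusted combiner (the table contents follow the slides; the list of Nigerian cities is an assumption):

    import random

    def laplace(scale):
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def djoin_count_semantics(travel, medical, epsilon=0.1):
        """What the querier is supposed to learn, and nothing more.
        NOTE: this trusted-combiner version is for illustration only;
        DJoin computes the same answer without any party seeing both
        tables."""
        matches = sum(
            1 for who, where in travel.items()
            if where in ("Abuja", "Lagos")           # cities in Nigeria
            and medical.get(who) == "Malaria"
        )
        return matches + laplace(1.0 / epsilon)      # join count, sensitivity 1

    travel = {"Alice": "London", "Bob": "Abuja", "Charlie": "Vegas",
              "Doris": "Lagos", "Emil": "Moscow"}
    medical = {"Bob": "Malaria", "Charlie": "Depression", "Alice": "Diarrhea",
               "Doris": "Malaria", "Fannie": "Flu"}
    print(djoin_count_semantics(travel, medical))    # about 2, plus noise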

46 Ongoing work: Fuzz
- Fuzz: a fully automated system for sharing private information across domains.
  - A special programming language, type system, and inference algorithm for query sensitivity.
  - Protection against side-channel attacks.
- DJoin: an extension of Fuzz that works even when the data itself is distributed!
  - No single entity learns the entire data set!
  - Strong differential privacy guarantees!

47 I hope you liked CIS 455/555! Please don't forget to complete your course evaluations.

