Presentation on theme: "Security and Deduplication in the Cloud"— Presentation transcript:

1 Security and Deduplication in the Cloud
Danny Harnik - IBM Haifa Research Labs

2 What is Deduplication?
Deduplication: storing only a single copy of redundant data.
Applied at the file or block level.
Major savings in backup environments (more than 90% in common business scenarios), which is why it has been called the "most impactful storage technology". The savings are so high because successive backups of the same systems are largely identical.
April 2008: IBM acquires Diligent. July 2009: EMC acquires Data Domain. July 2010: Dell acquires Ocarina.

3 How are files deduped?
Fingerprint each file using a hash function. Common hashes used: SHA-1, SHA-256, others.
Store an index of all the hashes already in the system.
For a new file: compute its hash and look it up in the index table. If the hash is new, add it to the index; if the hash is already known, store the file as a pointer to the existing data. A minimal sketch of this logic follows.
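A minimal sketch of file-level deduplication with a hash index, for illustration only; the class and method names are hypothetical, not any product's API.

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.index = {}    # hash -> stored data (in a real system: a pointer to disk)
        self.owners = {}   # hash -> owners referencing the same physical copy

    def put(self, owner, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()   # fingerprint the file
        if digest not in self.index:                # new hash: store the data once
            self.index[digest] = data
        # known hash: only register another owner, no second copy is written
        self.owners.setdefault(digest, []).append(owner)
        return digest

store = DedupStore()
h1 = store.put("alice", b"same backup image")
h2 = store.put("bob",   b"same backup image")
assert h1 == h2 and len(store.index) == 1   # one physical copy, two owners
```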

4 Client-side deduplication
Saves bandwidth as well as storage. Also known as "source-based dedupe" or "WAN deduplication".
The client computes the hash and sends it to the server. If the hash is new, the server asks the client for the file (upload the data); otherwise (dedupe) the upload is skipped and the client is registered as another owner of the file. The exchange is sketched below.
[Slide diagram: the client hashes "Let it be.mp3" to 2fd4e1 and sends only the hash; the server finds 2fd4e1 in its index, so no upload is needed.]
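A hedged sketch of the client-side (source-based) dedup exchange: the client sends only the hash, and the file bytes cross the wire only if the server has never seen them. The Server.offer/accept names are illustrative, not any real service's protocol.

```python
import hashlib

class Server:
    def __init__(self):
        self.index = {}    # hash -> data
        self.owners = {}   # hash -> owners

    def offer(self, owner, digest) -> bool:
        """Client announces a hash; returns True iff an upload is needed."""
        self.owners.setdefault(digest, []).append(owner)
        return digest not in self.index

    def accept(self, digest, data):
        self.index[digest] = data

def client_upload(server, owner, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    if server.offer(owner, digest):          # server has no copy: send the bytes
        server.accept(digest, data)
        return "uploaded"
    return "deduped (bandwidth saved)"       # server already has it: skip the upload

srv = Server()
print(client_upload(srv, "alice", b"Let it be.mp3 contents"))  # uploaded
print(client_upload(srv, "bob",   b"Let it be.mp3 contents"))  # deduped
```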

5 Deduplication and privacy
Our attacks are relevant to the following setting:
Client-side deduplication.
Cross-user deduplication: if two or more users store the same file, only a single copy is stored.
Checking whether these features hold: (1) download a popular file and back it up; (2) use two user accounts and run some checks.
The cost of communication at S3 is about the same as 1-2 months of storage.

6 Cloud storage and deduplication
Cloud storage services are gaining popularity, and online file backup and synchronization is huge, so there is a lot to gain from deduplication.
Services that use (or have used) cross-user client-side deduplication: Mozy, Dropbox, Memopal, MP3Tunes.

7 Deduplication and privacy I
Harnik, Pinkas & Shulman-Peleg, IEEE Security & Privacy.
The client learns whether an object is already in the system: a narrow "peep hole" into the contents of other users.
Discussed attacks and partial solutions: searching for illegal content, the "salary attack", and a covert channel.
Several ways to prevent these attacks: encrypt, or dedupe on the server side only; dedupe only on long files; noisy dedupe...
Note: being able to ask only once is true only in file-based dedup; block-based dedup allows asking multiple times.

8 Deduplication and privacy II
Halevi, Harnik, Pinkas & Shulman-Peleg, ACM CCS 2011.
A more direct attack. Starting point: suppose I get the hash value of your file...

9 The attack
The attacker obtains the hash of the victim's file and signs up for the service with their own account.
The attacker then attempts to upload some file, but swaps in the hash value of the victim's file. The victim's file is now registered to the attacker, who simply downloads it. A toy illustration follows.
[Slide diagram: the attacker's client hashes "any file" to e3b890 but sends the victim's hash 2fd4e1 instead; the server matches 2fd4e1 in its index and grants the attacker access to "Let it be.mp3".]
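A sketch of the hash-swap attack against the toy client-side dedup logic above: the attacker never has the file, only its hash, yet ends up registered as an owner and can then download it. Purely illustrative; no real service's API is implied.

```python
import hashlib

index, owners = {}, {}

def offer(owner, digest) -> bool:            # server-side dedup check
    owners.setdefault(digest, []).append(owner)
    return digest not in index               # True -> please upload the data

def download(owner, digest):                 # server returns data to any registered owner
    return index[digest] if owner in owners.get(digest, []) else None

# The victim uploads normally.
victim_file = b"confidential salary spreadsheet"
victim_hash = hashlib.sha256(victim_file).hexdigest()
if offer("victim", victim_hash):
    index[victim_hash] = victim_file

# The attacker obtained victim_hash out of band and swaps it into the upload step.
offer("attacker", victim_hash)               # returns False: "already stored", attacker registered
print(download("attacker", victim_hash))     # server hands over the victim's file
```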

10 Obtaining the hash
Hash used for other services: the hash does not reveal "anything" about the file and is not meant to be secret.
Malicious software: it is easier to exfiltrate a small signature undetected than the file itself. The same holds for a break-in at the server side.
CDN attack (just one example of this attack): Alice sends all her friends the hash of a movie, and the friends can download it from the server. The server essentially serves as a Content Distribution Network (CDN), which might break its cost structure if it planned on serving only a few restore operations.

11 Swapping the hash
[Dorrendorf & Pinkas 2011]: implemented the attacks against two major storage services. One service uses SHA-256 to identify files; another uses a 160-bit hash value which was not identified.
Dropship (April 2011): an implementation of the CDN attack over Dropbox, "written in Python. Allow you to download to your Dropbox any file, which description we got in JSON format (similar as description propagated in .torrent files)."
[Mulazzani, Schrittwieser, Leithner, Huber & Weippl 2011]: implemented the attack on Dropbox (USENIX Security 2011).
A non-issue in upcoming cloud storage standards.

12 SOLUTIONS!

13 Naïve solutions
Use a non-standard hash, e.g. Hash("service name" | file).
But all clients must know the hash function, so this does not help in most scenarios (CDN, malicious software, etc.).

14 Better naïve solutions
Use a challenge-response phase: for every upload, the server picks a random nonce and asks the client to compute Hash(nonce | file). This requires the client to actually have the file. A small sketch follows.
But the server, too, must now retrieve the file from secondary storage and compute the hash.
Alternative: pre-compute Hash(nonce | file) and store it together with the hash. But this is back to the root cause of the problem: a short value represents the file entirely.
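A minimal sketch of the nonce-based challenge-response described on this slide. The function names are illustrative; the downside noted above is visible here, since server_verify must re-read the whole stored file to recompute the hash.

```python
import hashlib, os

def challenge() -> bytes:
    return os.urandom(16)                         # fresh random nonce per upload attempt

def client_response(nonce: bytes, file_bytes: bytes) -> str:
    return hashlib.sha256(nonce + file_bytes).hexdigest()

def server_verify(nonce: bytes, response: str, stored_file_bytes: bytes) -> bool:
    # The server must fetch the full file from storage and redo the hash.
    return response == hashlib.sha256(nonce + stored_file_bytes).hexdigest()

f = b"some already-stored file"
nonce = challenge()
assert server_verify(nonce, client_response(nonce, f), f)          # honest client passes
assert not server_verify(nonce, client_response(nonce, b"x"), f)   # hash-only attacker fails
```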

15 Proofs of Ownership (POWs)
The server preprocesses the file and stores some short information per file (a few bytes only).
Proof stage: a challenge-response, run only during file upload. The honest client has access to the file; the server has access only to the preprocessed information and cannot retrieve files from secondary storage. The proof must be bandwidth efficient, and the client's computation should be efficient in both time and memory.
Security definition: a malicious client may have partial knowledge of the file (the file has k bits of min-entropy from its point of view) and may receive additional information from accomplices (m bits). If k - m is larger than the security parameter, then the proof fails with high probability. A symbolic restatement appears below.
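One way to write the slide's security requirement in symbols, using its own notation (k bits of min-entropy about the file from the malicious client's point of view, m bits of accomplice data, security parameter s). The probability bound is a paraphrase of the slide, not the paper's formal definition.

```latex
\[
  k - m > s
  \quad\Longrightarrow\quad
  \Pr\bigl[\text{malicious client passes the proof}\bigr] \le \mathrm{negl}(s).
\]
```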

16 Proofs of Retrievability (PORs)
Role reversal: the server proves to the client that it actually stores the client's file. Strong, extraction-based definition (we use a relaxed notion).
State-of-the-art solutions all send a pre-processed file to the server, e.g. [NR05], [JK07], [SW08], [DVW09]. This cannot be done in our setting.
In general, a POR without preprocessing is a good POW. Our first solution is a Merkle-tree-based POR.

17 Solution – first attempt
[Slide figure: a Merkle tree built over the blocks of the file.]

18 Solution – first attempt
Preprocessing: the server stores only the root of the tree.
[Slide figure: the same Merkle tree over the file, with the root marked as the only value the server keeps.]

19 Solution – first attempt
Proof: the server asks the client to present the paths to t random leaves. Very efficient.
A client which knows only a p fraction of the file succeeds with probability < p^t.
[Slide figure: the Merkle tree with the t challenged leaves and their root paths marked.]

20 Problem and solution
This does not suffice when the min-entropy is low (e.g. the attacker already knows 90% of the file).
Solution: apply the tree to an erasure coding of the file. This satisfies the security requirements of both POW and POR.
Efficient encoding? One must pay with either large memory or multiple disk accesses, which is bad for large files.
[Slide figure: the file is first erasure-coded, and the Merkle tree is built over the encoding.]

21 Problem and solution
A client which knows a large fraction of the blocks (say, 95%) can pass the test with reasonable probability (0.95^10 ≈ 0.6).
Solution: build the tree over an erasure coding of the file and apply the protocol to the new tree.
[Slide figure: Merkle tree over the erasure-coded file.]

22 First solution: erasure code & Merkle tree
Erasure code property: knowledge of, say, 50% of the encoding suffices to recover the original file.
Therefore an attacker who is missing even a single byte of the file cannot know more than 50% of the encoding, and fails each Merkle-tree query with probability at least 50%. The cheating probability is now 2^-L (for L challenged leaves).

23 Protocols with small space
Limit the solution to use an L-byte buffer for all of the computation, for example L = 64MB.
Relaxed security guarantee: can only tolerate up to L bytes of accomplice data.

24 Second protocol: hash to small space
First hash the file down to a buffer of L bytes, then construct the Merkle tree over the buffer. The reducer uses pairwise-independent hashing.
Security: the POW will fail (w.h.p.) any adversary that has at least k bits of min-entropy about the file and receives fewer than min(L, k - s) bits from an accomplice.
[Slide figure: File -> Reducer -> reduced file -> Merkle tree.]

25 Is this efficient enough?
Still not really practical: for file size M and buffer size L, the reducer requires Ω(M·L) time. We want to push it further down...

26 Third protocol: Reduce and Mix
In the reducer, XOR each block into a constant number of random locations of the buffer; this runs in O(M+L) time. Then add a mixing phase. A toy sketch of the reduce step follows.
Hypothesis: reduce + mix forms a good code. Security is defined against a generalized block-fixing source distribution.
[Slide figure: File -> Reducer -> reduced file -> Mixer -> reduced & mixed file -> Merkle tree.]
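A hedged sketch of the reduce step only: every input block is XORed into a small constant number of pseudo-random positions of an L-byte buffer, so one pass over an M-byte file costs O(M + L). The way positions are chosen here (SHA-256 over a seed and the block index) is an illustrative stand-in, not the construction from the paper, and the subsequent mixing phase is omitted.

```python
import hashlib

BLOCK = 64      # bytes per block
COPIES = 4      # constant number of target locations per block

def reduce_file(data: bytes, L: int, seed: bytes = b"demo-seed") -> bytearray:
    buf = bytearray(L)
    nblocks = L // BLOCK
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK].ljust(BLOCK, b"\0")
        for j in range(COPIES):
            # pick a pseudo-random target block inside the buffer
            h = hashlib.sha256(seed + i.to_bytes(8, "big") + bytes([j])).digest()
            off = (int.from_bytes(h[:8], "big") % nblocks) * BLOCK
            for b in range(BLOCK):               # XOR the block into that location
                buf[off + b] ^= block[b]
    return buf

reduced = reduce_file(b"x" * 1_000_000, L=64 * 1024)   # 1 MB file -> 64 KB buffer
print(len(reduced))
```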

27 Performance of the different phases of the low space PoW
The reading and SHA phases happen even if our solution is not used. The reducer is the part of our solution whose cost depends on the size of the file; the mixing and Merkle-tree phases do not depend on the file size.

28 When is it worth the effort?

29 Summary
Identified security implications of client-side deduplication.
Introduced POWs to enable secure client-side deduplication in the cloud.
The challenge: offer meaningful privacy guarantees with a limited toll on resources.

30 Background: Merkle Hash Trees
A method of committing to (by hashing together) n values x_1, ..., x_n, such that the result is a single hash value, and for any x_i it is possible to prove that it appeared in the original list using a proof of length O(log n).
[Slide figure: a tree over leaves a b c d e f g h with internal nodes v_00 = H(a,b), v_01 = H(c,d), v_10 = H(e,f), v_11 = H(g,h), v_0 = H(v_00, v_01), v_1 = H(v_10, v_11), and root v = H(v_0, v_1).]

31 Verifying that a value appears in the set
Provide the leaf and the siblings of the nodes on the path from the leaf to the root (O(log n) values). A runnable sketch of the commitment and verification follows.
[Slide figure: the same tree, with the challenged leaf and the sibling nodes along its root path highlighted.]
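A self-contained sketch of the Merkle-tree commitment and its O(log n) membership proof from the last two slides. The hashing details (SHA-256, simple concatenation) are illustrative choices, and the code assumes the number of leaves is a power of two.

```python
import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def build_tree(leaves):
    level = [H(x) for x in leaves]
    tree = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                          # tree[-1][0] is the root (the committed value)

def prove(tree, idx):
    """Return the siblings along the path from leaf idx to the root."""
    path = []
    for level in tree[:-1]:
        path.append(level[idx ^ 1])      # the other child of the same parent
        idx //= 2
    return path

def verify(root, leaf, idx, path):
    node = H(leaf)
    for sibling in path:
        node = H(node + sibling) if idx % 2 == 0 else H(sibling + node)
        idx //= 2
    return node == root

leaves = [bytes([c]) for c in b"abcdefgh"]        # the a..h leaves from slide 30
tree = build_tree(leaves)
root = tree[-1][0]
proof = prove(tree, 5)                            # prove that leaf 'f' is in the set
print(verify(root, leaves[5], 5, proof))          # True
print(verify(root, b"z", 5, proof))               # False
```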

