
1 Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin

2 Salus’ goal
Usage:
– Provide remote disks to users (Amazon EBS)
Scalability:
– Thousands of machines
Robustness:
– Tolerate disk/memory corruption, CPU errors, …
Good performance

3 Scalability and robustness
(diagram: operating system vs. distributed protocol)
BigTable: 1 corruption per 5 TB of data?

4 Challenge: Parallelism vs Consistency
(diagram: clients, a metadata server, and storage servers; infrequent metadata transfer, parallel data transfer)
Data is replicated for durability and availability.
State-of-the-art architecture: GFS/Bigtable, HDFS/HBase, WAS, …

5 Challenges
Write in parallel and in order
Eliminate single points of failure
– Prevent a single node from corrupting data
– Read safely from one node
Do not increase replication cost

6 Write in parallel and in order
(diagram: clients, metadata server, data servers)

7 Write in parallel and in order
(diagram: write 1 and write 2) Write 2 is committed but write 1 is not. That is not allowed for a block store.

8 Prevent a single node from corrupting data
(diagram: clients, metadata server, data servers)

9 Prevent a single node from corrupting data
Tasks of computation nodes:
– Data forwarding, garbage collection, etc.
Examples of computation nodes:
– Tablet server (Bigtable), Region server (HBase), … (WAS)

10 Read safely from one node
A read is executed on one node:
– Maximize parallelism
– Minimize latency
If that node experiences corruption, …

11 Do not increase replication cost
Industrial systems:
– Write to f+1 nodes and read from one node
BFT systems:
– Write to 3f+1 nodes and read from 2f+1 nodes

12 Salus’ approach
Start from a scalable architecture (Bigtable/HBase).
Ensure robustness techniques do not hurt scalability.

13 Salus’ key ideas
Pipelined commit
– Guarantee ordering despite parallel writes
Active storage
– Prevent a computation node from corrupting data
End-to-end verification
– Read safely from one node

14 Salus’ key ideas
(diagram: clients and metadata server, annotated with pipelined commit, active storage, and end-to-end verification)

15 Pipelined commit
Goal: barrier semantics
– A request can be marked as a barrier.
– All previous requests must be executed before it.
Naïve solution:
– The client blocks at a barrier: parallelism is lost.
This is a weaker version of a distributed transaction.
– Well-known solution: two-phase commit (2PC)

16 Pipelined commit – 2PC
(diagram, prepare phase: the client sends batch i and batch i+1 to the servers; the servers in each batch report "prepared" to that batch's leader, and each leader also coordinates with the previous batch's leader)

17 Pipelined commit – 2PC
(diagram, commit phase: once batch i-1 is committed, the leader of batch i tells its servers to commit and then reports "batch i committed" to the leader of batch i+1)
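
To make the chaining concrete, here is a minimal sketch in Python. The class and method names (BatchLeader, on_prepared, on_prev_batch_committed) are made up for illustration and are not Salus' actual interfaces; the point is only that batches prepare in parallel, while each batch commits only after all of its servers have prepared and the previous batch has committed.

```python
# Illustrative sketch only (hypothetical names, not the Salus implementation).
# Batches prepare in parallel; commit order is enforced by chaining each
# batch's commit on the previous batch's commit notification.

class BatchLeader:
    def __init__(self, batch_id, servers):
        self.batch_id = batch_id
        self.servers = servers                 # servers storing this batch's writes
        self.prepared = set()                  # servers that acknowledged PREPARE
        self.prev_committed = (batch_id == 0)  # first batch has no predecessor
        self.committed = False
        self.next_leader = None                # leader of batch_id + 1, if any

    def on_prepared(self, server):
        """Phase 1: a server reports that it has durably prepared the batch."""
        self.prepared.add(server)
        self._try_commit()

    def on_prev_batch_committed(self):
        """Notification from the leader of the previous batch."""
        self.prev_committed = True
        self._try_commit()

    def _try_commit(self):
        """Phase 2: commit once all servers prepared and the predecessor committed."""
        if self.committed or not self.prev_committed:
            return
        if self.prepared != set(self.servers):
            return
        self.committed = True
        for s in self.servers:
            s.commit(self.batch_id)
        if self.next_leader is not None:
            self.next_leader.on_prev_batch_committed()
```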

18 Pipelined commit – challenge
Is 2PC slow?
– Additional network messages? Disk is the bottleneck.
– Additional disk write? Let's eliminate that.
– Challenge: deciding whether to commit a write after recovery.
(diagram: writes 1, 2, and 3; write 2 is prepared. Should it be committed? Both cases are possible.)
Salus' solution: ask other nodes.
– Has anyone committed 3 or larger? If not, is 1 committed?
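
The questions on this slide can be read as a small decision procedure. The sketch below is a simplified reading with a hypothetical function name; it deliberately leaves the remaining case to the full recovery protocol rather than guessing.

```python
# Simplified reading of the recovery questions (hypothetical helper; the real
# protocol has more detail).  peer_highest_committed holds, for each reachable
# node, the highest write number that node knows to be committed.

def recovery_decision(prepared_id, peer_highest_committed):
    # "Has anyone committed 3 or larger?"  A later committed write implies the
    # prepared write must also have committed, so commit it.
    if any(h > prepared_id for h in peer_highest_committed):
        return "commit"
    # "If not, is 1 committed?"  If even the predecessor never committed, the
    # prepared write cannot have committed either, so it is safe to discard.
    if all(h < prepared_id - 1 for h in peer_highest_committed):
        return "discard"
    # Predecessor committed but nothing later did: this is exactly the case the
    # slide poses as a question; the full protocol resolves it.
    return "undetermined"
```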

19 Active Storage
Goal: a single node cannot corrupt data.
Well-known solution: replication
– Problem: replication cost vs. availability
Salus' solution: use f+1 replicas
– Require unanimous consent of the whole quorum.
– If one replica fails, replace the whole quorum.

20 Active Storage
(diagram: a single computation node in front of the storage nodes)

21 Active Storage
(diagram: computation nodes in front of the storage nodes)
Unanimous consent:
– All updates must be approved by all f+1 computation nodes.
Additional benefit: reduced network bandwidth usage.
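
As a rough illustration of the unanimous-consent rule (hypothetical interfaces, not Salus' code): every one of the f+1 computation nodes must certify an update before it reaches the storage nodes, so a single faulty computation node can stall progress but cannot corrupt data on its own.

```python
# Illustrative write path under unanimous consent (hypothetical interfaces).

def apply_update(update, computation_nodes, storage_nodes):
    # All f+1 computation nodes independently check and certify the update.
    approvals = [node.check_and_certify(update) for node in computation_nodes]
    if not all(approvals):
        # Disagreement means some computation node is faulty, but we may not
        # know which one: replace the whole quorum instead of guessing.
        raise RuntimeError("no unanimous consent; replace the computation quorum")
    # Only a unanimously certified update is written to the storage nodes.
    for s in storage_nodes:
        s.write(update)
```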

22 Active Storage
(diagram: computation nodes in front of the storage nodes)
What if one computation node fails?
– Problem: we may not know which one is faulty.
Replace the whole quorum.

23 Active Storage
What if one computation node fails?
– Problem: we may not know which one is faulty.
Replace the whole quorum.
– The new quorum must agree on the current state.

24 Active Storage
Does it provide BFT with f+1 replication? No.
During recovery, it may accept stale state if:
– the client fails;
– at least one storage node provides stale state; and
– all other storage nodes are unavailable.
2f+1 replicas can eliminate this case:
– Is it worth adding f replicas to eliminate it?

25 End-to-end verification
Goal: read safely from one node
– The client should be able to verify the reply.
– If it is corrupted, the client retries on another node.
Well-known solution: Merkle tree
– Problem: scalability
Salus' solution:
– Single writer
– Distribute the tree among the servers

26 End-to-end verification
(diagram: the Merkle tree is partitioned across servers 1–4)
The client maintains the top of the tree. The client does not need to store anything persistently; it can rebuild the top tree from the servers.
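
A minimal sketch of how a client could check a single-node read against its cached top-of-tree hashes; the function names and proof format here are illustrative, not the paper's exact scheme. The client holds the root hash of each server's subtree, and each read comes back with a hash path inside that subtree.

```python
# Illustrative Merkle verification for a read served by one node
# (hypothetical proof format; not the exact Salus scheme).

import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_read(block: bytes, proof, subtree_root: bytes) -> bool:
    """Verify a block against the client's cached root for that server's subtree.

    proof: list of (sibling_hash, sibling_is_left) pairs, from the leaf up.
    """
    node = sha(block)
    for sibling_hash, sibling_is_left in proof:
        node = sha(sibling_hash + node) if sibling_is_left else sha(node + sibling_hash)
    # On mismatch the client rejects the reply and retries another replica.
    return node == subtree_root
```

Because the client only needs these per-server root hashes, it can rebuild its top tree after a crash by fetching the subtree roots back from the servers, which matches the claim on this slide that the client stores nothing persistently.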

27 Recovery
Pipelined commit
– How to ensure write order after recovery?
Active storage
– How to agree on the current state?
End-to-end verification
– How to rebuild the Merkle tree when the client recovers?

28 Discussion – why HBase?
It's a popular architecture:
– Bigtable: Google
– HBase: Facebook, Yahoo, …
– Windows Azure Storage: Microsoft
It's open source.
Why two layers?
– Necessary if the storage layer is append-only.
Why an append-only storage layer?
– Better random write performance
– Easy to scale

29 Discussion – multiple writers?

30 Lessons Strong checking makes debugging easier.

31 Evaluation

34 Challenge: Combining robustness and scalability
Scalable systems (GFS/Bigtable, HDFS/HBase, WAS, Spanner, FDS, …)
Strong protections (end-to-end checks, BFT, Depot, …)
Combining them is challenging.

