
1 SecondSite: Disaster Tolerance as a Service. Shriram Rajagopalan, Brendan Cully, Ryan O'Connor, Andrew Warfield

2 Failures in a Datacenter

3 Tolerating Failures in a Datacenter. The initial idea behind Remus was to tolerate datacenter-level failures.

4 Can a Whole Datacenter Fail? Yes! It's a "Disaster"!

5 Disasters. (Illustrative image courtesy of TangoPango, Flickr.) "Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap." - Om Malik, GigaOM. "Truck driver in Texas kills all the websites you really use" … Southlake FD found that he had low blood sugar. - valleywag.com

6 Disasters.. "Water-main break cripples Dallas County computers, operations. The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays, keeping some prisoners in jail longer than normal." - Dallas Morning News, Jun

7 Disasters..

8 More Fodder Back Home. "An explosion … near our server bank … electrical box containing 580 fiber cables. The electrical box … was covered in asbestos … mandated the wearing of hazmat suits. … Worse yet, the dynamic rerouting, which is the hallmark of the internet, … did not function. In other words, the perfect storm. Oh well. S*it happens." - Dan Empfield, Slowswitch.com, a Gossamer Threads customer.

9 Disaster Recovery – The Old-Fashioned Way
- Storage replication between a primary and backup site.
- Manually restore physical servers from backup images.
- Data loss and long outage periods.
- Expensive hardware: storage arrays, replicators, etc.

10 State of the Art Disaster Recovery
[Diagram: protected site and recovery site, each running VirtualCenter with Site Recovery Manager; datastore groups mirrored by array replication. When the VMs in the protected site become unavailable, the replicated VMs are powered on at the recovery site.]
Source: VMware Site Recovery Manager – Technical Overview

11 Problems with Existing Solutions
- Data loss & service disruption (RPO ~15 min, RTO ~ a few hours).
- Complicated recovery planning (e.g. service A needs to be up before B, etc.).
- Application-level recovery.
Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

12 Disaster Tolerance as a Service? Our Vision

13 Overview
- A Case for Commoditizing Disaster Tolerance
- SecondSite – System Design
- Evaluation & Experiences

14 Primary & Backup Sites – 5 ms RTT

15 Failover & Failback without Outage
[Diagram: initially the primary site is Vancouver and the backup site is Kamloops; after failover, Kamloops becomes the primary and Vancouver the backup.]
- Complete state recovery (CPU, disk, memory, network).
- No application-level recovery.

16 Main Contributions
- Remus (NSDI '08): checkpoint-based state replication; fully transparent HA; recovery consistency; no application-level recovery.
- RemusDB (VLDB '11): optimize server latency; reduce replication bandwidth by up to 80% using page delta compression (see the sketch below); disk read tracking.
- SecondSite (VEE '12): failover arbitration in the wide area; stateful network failover over the wide area.
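The page delta compression mentioned for RemusDB avoids resending whole dirty pages at every checkpoint by transmitting only the bytes that changed since the last copy sent to the backup. Below is a minimal sketch of that idea; the run-based encoding and function names are illustrative assumptions, not RemusDB's actual wire format.

```python
# Minimal sketch of checkpoint page delta compression (illustrative only;
# the encoding below is an assumption, not RemusDB's actual format).
PAGE_SIZE = 4096

def encode_delta(old: bytes, new: bytes) -> list:
    """Return a list of (offset, changed_bytes) runs for a dirty page."""
    runs, i = [], 0
    while i < PAGE_SIZE:
        if old[i] != new[i]:
            start = i
            while i < PAGE_SIZE and old[i] != new[i]:
                i += 1
            runs.append((start, new[start:i]))
        else:
            i += 1
    return runs

def apply_delta(old: bytes, runs: list) -> bytes:
    """Rebuild the new page on the backup from the previous copy plus runs."""
    page = bytearray(old)
    for offset, data in runs:
        page[offset:offset + len(data)] = data
    return bytes(page)

# Example: only a few bytes of the page changed between checkpoints,
# so the delta is far smaller than retransmitting all 4096 bytes.
old_page = bytes(PAGE_SIZE)
new_page = bytearray(old_page)
new_page[100:104] = b"\x01\x02\x03\x04"
delta = encode_delta(old_page, bytes(new_page))
assert apply_delta(old_page, delta) == bytes(new_page)
```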

17 Contributions..

18 Failure Detection in Remus
[Diagram: primary and backup hosts on a LAN, each with NIC1 (external network) and NIC2; checkpoints flow over a dedicated replication link.]
- A pair of independent, dedicated NICs carries replication traffic.
- The backup declares the primary failed only if it cannot reach the primary via NIC1 and NIC2, and it can reach the external network via NIC1.
- Failure of the replication link alone results in backup shutdown.
- Split brain occurs only when both NICs/links fail.
(A sketch of this decision logic follows below.)
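The backup's decision rule can be captured in a few lines. The sketch below only illustrates the slide's logic; the reachability flags are assumed to come from probes over the two NICs (with NIC2 assumed to carry replication), and this is not Remus's actual failure detector.

```python
# Minimal sketch of the backup host's failover decision as described on the
# slide. Which link carries replication (assumed NIC2 here) is an assumption.
def backup_decision(primary_via_nic1: bool,
                    primary_via_nic2: bool,
                    external_via_nic1: bool) -> str:
    if not primary_via_nic1 and not primary_via_nic2 and external_via_nic1:
        return "declare-primary-failed"   # safe to promote the backup
    if primary_via_nic1 and not primary_via_nic2:
        return "shutdown-backup"          # replication link alone has failed
    return "keep-replicating"

# Example: primary unreachable on both NICs but the external network is
# still reachable, so the backup takes over.
assert backup_decision(False, False, True) == "declare-primary-failed"
```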

19 Failure Detection in Wide Area Deployments
- Cannot distinguish between link and node failure.
- Higher chance of split brain, as the network is no longer reliable.
[Diagram: the LAN setup stretched across the Internet; the primary and backup datacenters are connected by a wide-area replication channel.]

20 Failover Arbitration
- Local quorum of simple reachability detectors ("stewards").
- Stewards can be placed on third-party clouds.
- Google App Engine implementation with ~100 LoC.
- The provider/user could have other, more sophisticated implementations.
(A minimal steward sketch follows below.)
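A steward is essentially a reachability detector that each site can poll. The sketch below is a minimal stand-alone HTTP version for illustration; the endpoint name, poll protocol, and 10-second liveness window are assumptions, and the actual SecondSite stewards were small Google App Engine apps rather than this code.

```python
# Minimal sketch of a steward: it records when it last heard from each site
# and reports that back to whoever polls it.
import json, time
from http.server import BaseHTTPRequestHandler, HTTPServer

LIVENESS_WINDOW = 10.0          # seconds; assumed failure-detection timeout
last_poll = {}                  # site name -> last poll timestamp

class Steward(BaseHTTPRequestHandler):
    def do_GET(self):
        # A site polls with /poll?site=primary or /poll?site=backup.
        if self.path.startswith("/poll?site="):
            site = self.path.split("=", 1)[1]
            last_poll[site] = time.time()
        # Reply with every site this steward has heard from recently.
        now = time.time()
        alive = [s for s, t in last_poll.items() if now - t < LIVENESS_WINDOW]
        body = json.dumps({"reachable_sites": alive}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Steward).serve_forever()
```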

21 Failover Arbitration.. Stewards
[Diagram: the primary and backup sites, connected by the replication stream, each run quorum logic and poll the same a-priori agreed set of five stewards.]
- Primary: "I need a majority to stay alive."
- Backup: "I need an exclusive majority to fail over."
(A sketch of this quorum logic follows below.)
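The two quorum rules on the slide can be sketched as follows, assuming each steward replies with the set of sites it has heard from recently (matching the steward sketch above). The data shapes are illustrative, not the paper's protocol.

```python
# Minimal sketch of the arbitration rules: the primary stays alive only while
# it can reach a majority of stewards, and the backup fails over only when it
# holds an *exclusive* majority (none of the stewards it reaches have recently
# heard from the primary).
def primary_may_stay_alive(steward_replies: list, total_stewards: int) -> bool:
    """steward_replies: one dict per steward the primary managed to poll."""
    return len(steward_replies) > total_stewards // 2

def backup_may_failover(steward_replies: list, total_stewards: int) -> bool:
    """Exclusive majority: a majority of stewards answer the backup and none
    of them report the primary as recently reachable."""
    exclusive = [r for r in steward_replies
                 if "primary" not in r.get("reachable_sites", [])]
    return len(exclusive) > total_stewards // 2

# Example with five stewards: the backup reaches three, and none of the three
# has heard from the primary within the liveness window -> fail over.
replies = [{"reachable_sites": ["backup"]}] * 3
assert backup_may_failover(replies, total_stewards=5)
```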

22 Network Failover without Service Interruption
- Remus (LAN): gratuitous ARP from the backup host.
- SecondSite (WAN/Internet): BGP route update from the backup datacenter.
- Needs support from the upstream ISP(s) at both datacenters.
- IP migration achieved through BGP multi-homing.
(A sketch of the failover-time route switch follows below.)
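One way to realize the BGP-based IP migration is for both sites to advertise the same service prefix, with the standby site AS-path prepending so that traffic prefers the active site; on failover the backup re-announces the prefix without prepending. The sketch below only illustrates that idea: the `bgp_announce` helper, prefix, and AS number are hypothetical, and in practice this is router or BGP-speaker configuration rather than Python.

```python
# Minimal sketch of IP migration via BGP multi-homing with AS-path prepending.
SERVICE_PREFIX = "203.0.113.0/24"   # illustrative service prefix
LOCAL_AS = 64512                    # illustrative private AS number

def bgp_announce(prefix: str, as_path_prepend: int) -> None:
    """Hypothetical hook that pushes an announcement to the site's BGP
    speaker; prepending our AS lengthens the path and makes it less preferred."""
    path = " ".join([str(LOCAL_AS)] * (1 + as_path_prepend))
    print(f"announce route {prefix} as-path [{path}]")

def become_active_site() -> None:
    # Advertise the shortest path so the Internet routes traffic to this site.
    bgp_announce(SERVICE_PREFIX, as_path_prepend=0)

def become_standby_site() -> None:
    # Keep advertising the prefix, but prepend so it stays less preferred.
    bgp_announce(SERVICE_PREFIX, as_path_prepend=3)

# On failover, the arbitration logic at the backup calls become_active_site().
```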

23 Network Failover without Service Interruption..
[Diagram: VMs at the Vancouver (primary) and Kamloops (backup) sites, each in a stub AS advertising the same /24 prefix to BCNet (AS-271) via BGP multi-homing. Traffic is normally routed to the primary site; as-path prepending keeps the backup's announcement less preferred, so re-announcing on failover re-routes traffic to the backup site. Replication runs between the sites.]

24 Overview
- A Case for Commoditizing Disaster Tolerance
- SecondSite – System Design
- Evaluation & Experiences

25 Evaluation
[Speech bubbles: "I want periodic failovers with no downtime!" "Did you run regression tests?" "Failover works!!" "More than one failure? I will have to restart HA!"]

26 Restarting HA
- Need to resynchronize storage.
- Avoiding service downtime requires online resynchronization.
- Leverage DRBD: only resynchronizes blocks that have changed.
- Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.
(A sketch of checkpoint-based disk buffering follows below.)
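Checkpoint-based asynchronous disk replication can be pictured as the backup buffering an epoch's writes and applying them only when the corresponding checkpoint commits, so the backup disk always reflects a complete checkpoint. The sketch below is illustrative only; the class and method names are assumptions, not the actual Remus/DRBD integration.

```python
# Minimal sketch of checkpoint-based asynchronous disk replication on the
# backup side.
class BackupDiskReplica:
    def __init__(self, disk):
        self.disk = disk                 # e.g. an open block-device/file object
        self.pending = []                # writes for the in-flight epoch

    def receive_write(self, offset: int, data: bytes) -> None:
        """Buffer a write streamed from the primary during an epoch."""
        self.pending.append((offset, data))

    def checkpoint_commit(self) -> None:
        """The whole checkpoint arrived: make its writes durable, in order."""
        for offset, data in self.pending:
            self.disk.seek(offset)
            self.disk.write(data)
        self.disk.flush()
        self.pending.clear()

    def checkpoint_abort(self) -> None:
        """Primary failed mid-epoch: discard the incomplete epoch's writes."""
        self.pending.clear()
```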

27 Regression Tests
- Synthetic workloads to stress-test the replication pipeline.
- Failovers every 90 minutes.
- Discovered some interesting corner cases: page-table corruptions in memory checkpoints; write-after-write I/O ordering in disk replication.
(A sketch of such a test loop follows below.)
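A regression harness in the spirit of this slide might look like the sketch below; every function it calls is a hypothetical placeholder for site-specific tooling, not part of SecondSite itself.

```python
# Minimal sketch of a regression loop: run a synthetic workload, force a
# failover every 90 minutes, then verify that the surviving VM's state is
# consistent before re-protecting it.
import time

FAILOVER_INTERVAL = 90 * 60          # seconds

def regression_loop(start_workload, trigger_failover,
                    verify_consistency, resync_and_restart_ha):
    start_workload()                  # e.g. OLTP / SPECweb load generators
    while True:
        time.sleep(FAILOVER_INTERVAL)
        trigger_failover()            # kill the current primary site
        assert verify_consistency()   # catches e.g. checkpoint or I/O-ordering bugs
        resync_and_restart_ha()       # online resync, then protect again
```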

28 SecondSite – The Complete Picture
- Service downtime includes the failure-detection timeout (10 s).
- The failure-detection timeout is configurable.
- Workload: 4 VMs x 100 clients/VM.

29 Replication Bandwidth Consumption (4 VMs x 100 clients/VM)

30 Demo. Expect a real disaster (conference demos are not a good idea!)

31 Application Throughput vs. Replication Latency (SPECweb with 100 clients, Kamloops)

32 Resource Utilization vs. Application Load
[Plots: Domain-0 CPU utilization; bandwidth usage on the replication channel. Cost of HA as a function of application load (OLTP with 100 clients).]

33 Resynchronization Delays vs. Outage Period (OLTP workload)

34 Setup Workflow – Recovery Site. The user creates a recovery plan, which is associated with one or more protection groups. Source: VMware Site Recovery Manager – Technical Overview

35 Recovery Plan
[Diagram: recovery-plan steps such as VM shutdown, high-priority VM shutdown, prepare storage, then high-, normal-, and low-priority VM recovery.]
Source: VMware Site Recovery Manager – Technical Overview

