
Presentation transcript: FATE and DESTINI - Haryadi Gunawi, UC Berkeley

1  Haryadi Gunawi, UC Berkeley

2  • Research background  • FATE and DESTINI

3  Local storage, Storage Servers

4  Laptop, Cloud Storage (e.g. Google)

5  • Internet-service FS: GoogleFS, HadoopFS, CloudStore, ...
• Custom storage: Facebook Haystack Photo Store, Microsoft StarTrack (Map Apps), Amazon S3, EBS, ...
• Key-value store: Cassandra, Voldemort, ...
• Structured storage: Yahoo! PNUTS, Google BigTable, HBase, ...
"This is not just data. It's my life. And I would be sick if I lost it" [CNN '10]
+ Replication + Scale-up + Migration + ...
Reliable? Highly available?

6  Cloudy with a chance of failure

7  • Research background  • FATE and DESTINI  • Motivation  • FATE  • DESTINI  • Evaluation. Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, Dhruba Borthakur

8  • Cloud  • Thousands of commodity machines  • "Rare (HW) failures become frequent" [Hamilton]  • Failure recovery  • "… has to come from the software" [Dean]  • "… must be a first-class op" [Ramakrishnan et al.]

9  • Google Chubby (lock service)  • Four occasions of data loss  • Whole-system down  • More problems after injecting multiple failures (randomly)  • Google BigTable (key-value store)  • Chubby bug affects BigTable availability  • More details?  • How often? Other problems and implications?

10  • Open-source cloud projects  • Very valuable bug/issue repository!  • HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)  • HDFS JIRA study  • 1300 issues, 4 years (April 2006 to July 2010)  • Select recovery problems due to hardware failures  • 91 recovery bugs/issues
Implications and counts:
Data loss: 13
Unavailability: 48
Corruption: 19
Misc.: 10

11  • Testing is not advanced enough [Google]  • Failure model: multiple, diverse failures  • Recovery is under-specified [Hamilton]  • Lots of custom recovery  • Implementation is complex  • Need two advancements:  • Exercise complex failure modes  • Write specifications and test the implementation

12  FATE (Failure Testing Service) injects failures X1 and X2 into the cloud software; DESTINI (Declarative Testing Specifications) asks: does the behavior violate the specs?

13  • Research background  • FATE and DESTINI  • Motivation  • FATE  • Architecture  • Failure exploration  • DESTINI  • Evaluation

14  HadoopFS (HDFS) Write Protocol. The client C asks the master M for a block allocation (Alloc Req), sets up a pipeline to nodes 1, 2, 3 (Setup Stage), and then streams the block (Data Transfer Stage). With no failures the pipeline stays 1, 2, 3. Setup recovery: recreate a fresh pipeline (e.g. 1, 2, 4). Data transfer recovery: continue on the surviving nodes (1, 2). Crashes X1 and X2 injected at different stages trigger these different recovery paths.

15  • Failures  • Anytime: different stages → different recovery  • Anywhere: N2 crash, and then N3  • Any type: bad disks, partitioned nodes/racks  • FATE  • Systematically exercise multiple, diverse failures  • How? Need to "remember" failures – via failure IDs

16  • Abstraction of I/O failures: building failure IDs  • Intercept every I/O  • Inject possible failures  • Ex: crash, network partition, disk failure (LSE/corruption)
Example failure ID (#25): I/O information = OutputStream.read() in BlockReceiver.java, network I/O from N3 to N2, message "Data Ack"; failure = crash after the I/O. (Note: FIDs are abbreviated A, B, C, ...)
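As a rough sketch only (not the actual FATE code), a failure ID can be thought of as a hash over the static I/O point plus its domain-specific context; the class and field names below are hypothetical.

```java
// Hypothetical sketch: a failure ID built from the static I/O point plus
// domain-specific context, so FATE can "remember" which failures it has
// already exercised across runs.
import java.util.Objects;

public final class FailureId {
    final String sourceLocation; // e.g. "OutputStream.read() in BlockReceiver.java"
    final String ioType;         // e.g. "NET_READ"
    final String sourceNode;     // e.g. "Node3"
    final String targetNode;     // e.g. "Node2"
    final String message;        // e.g. "Data Ack"
    final String failureMode;    // e.g. "CRASH_AFTER"

    public FailureId(String loc, String io, String src, String dst,
                     String msg, String mode) {
        sourceLocation = loc; ioType = io; sourceNode = src;
        targetNode = dst; message = msg; failureMode = mode;
    }

    // Identical I/O points with identical context map to the same ID.
    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return sourceLocation.equals(f.sourceLocation) && ioType.equals(f.ioType)
            && sourceNode.equals(f.sourceNode) && targetNode.equals(f.targetNode)
            && message.equals(f.message) && failureMode.equals(f.failureMode);
    }
    @Override public int hashCode() {
        return Objects.hash(sourceLocation, ioType, sourceNode,
                            targetNode, message, failureMode);
    }
    @Override public String toString() {
        return failureMode + "@" + sourceLocation
             + " (" + sourceNode + "->" + targetNode + ", \"" + message + "\")";
    }
}
```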

17  Architecture: the target system (e.g. Hadoop FS, running on the Java SDK) is instrumented with an AspectJ "failure surface". Every intercepted I/O sends its I/O info to a failure server, which decides fail / no fail. A workload driver keeps re-running the workload while new FIDs are discovered: while (new FIDs) { hdfs.write() }.
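A minimal sketch of what such an AspectJ failure surface could look like; the pointcut, the Decision enum, and the failure-server stub are assumptions for illustration, not FATE's actual code.

```java
// Hypothetical AspectJ-style failure surface: intercept I/O calls in the
// target system, report the I/O point to the failure server, and inject a
// failure (here, an IOException) if the server says so.
import java.io.IOException;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class FailureSurface {
    // Stand-in for the real failure server, which runs as a separate service.
    enum Decision { NO_FAIL, FAIL_BEFORE, FAIL_AFTER }
    static Decision askFailureServer(String failureId) {
        // In FATE this would carry the full I/O info to the server; stubbed here.
        return Decision.NO_FAIL;
    }

    // Illustrative pointcut: network writes reached from Hadoop code.
    @Around("call(* java.io.OutputStream.write(..)) && within(org.apache.hadoop..*)")
    public Object interceptIo(ProceedingJoinPoint jp) throws Throwable {
        String fid = "NET_WRITE@" + jp.getStaticPart().getSourceLocation();
        Decision d = askFailureServer(fid);
        if (d == Decision.FAIL_BEFORE)
            throw new IOException("FATE-injected failure before I/O: " + fid);
        Object result = jp.proceed();   // perform the real I/O
        if (d == Decision.FAIL_AFTER)
            throw new IOException("FATE-injected failure after I/O: " + fid);
        return result;
    }
}
```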

18  Exploring failure IDs A, B, C in the write pipeline (M, C, nodes 1-3). With 1 failure per run, each ID gets its own experiment: Exp #1: A, Exp #2: B, Exp #3: C. With 2 failures per run, the pairs AB, AC, and BC are also exercised.
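A toy sketch of this brute-force enumeration, assuming a fixed set of already-discovered FIDs (the real driver keeps re-running the workload while new FIDs appear):

```java
// Hypothetical brute-force exploration of failure-ID pairs (2 failures/run).
import java.util.Arrays;
import java.util.List;

public class BruteForceExplorer {
    public static void main(String[] args) {
        List<String> fids = Arrays.asList("A", "B", "C"); // discovered failure IDs
        int exp = 0;
        for (int i = 0; i < fids.size(); i++) {
            for (int j = i + 1; j < fids.size(); j++) {
                // Each experiment re-runs the workload with this pair of
                // failures armed at the failure server (arming is stubbed out).
                System.out.printf("Exp #%d: inject %s then %s%n",
                                  ++exp, fids.get(i), fids.get(j));
            }
        }
        // 3 FIDs -> 3 pair experiments: AB, AC, BC (as on the slide).
    }
}
```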

19  • Research background  • FATE and DESTINI  • Motivation  • FATE  • Architecture  • Failure exploration challenge and solution  • DESTINI  • Evaluation

20  • Exercised over 40,000 unique combinations of 1, 2, and 3 failures per run  • 80 hours of testing time!  • New challenge: combinatorial explosion  • Need smarter exploration strategies. (Figure: failure points A and B on nodes 1, 2, 3, i.e. A1, B1, A2, B2, A3, B3, combined at 2 failures per run.)

21  • Properties of multiple failures  • Pairwise dependent failures  • Pairwise independent failures  • Goal: exercise distinct recovery behaviors  • Key: some failures result in similar recovery  • Result: > 10x faster, and found the same bugs

22  • Failure dependency graph  • Inject single failures first  • Record subsequent dependent IDs  • Ex: X depends on A
FID → subsequent FIDs: A → X; B → X; C → X, Y; D → X, Y
• Brute-force: AX, BX, CX, DX, CY, DY
• Recovery clustering  • Two clusters: {X} and {X, Y}  • Only exercise distinct clusters  • Pick a failure ID that triggers the recovery cluster  • Result: AX, CX, CY
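A small illustrative sketch (using the dependency data from this slide, not FATE's implementation) of the pruning: cluster first failures by the set of subsequent FIDs they expose and keep one representative per cluster.

```java
// Hypothetical sketch of dependent-failure pruning via recovery clustering.
import java.util.*;

public class DependentPruning {
    public static void main(String[] args) {
        // Dependency graph from the slide: A -> {X}, B -> {X}, C -> {X,Y}, D -> {X,Y}
        Map<String, Set<String>> subsequent = new LinkedHashMap<>();
        subsequent.put("A", Set.of("X"));
        subsequent.put("B", Set.of("X"));
        subsequent.put("C", Set.of("X", "Y"));
        subsequent.put("D", Set.of("X", "Y"));

        // Keep one representative first failure per distinct subsequent-FID set.
        Map<Set<String>, String> representative = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : subsequent.entrySet())
            representative.putIfAbsent(e.getValue(), e.getKey());

        // Exercise each representative with each failure in its cluster.
        for (Map.Entry<Set<String>, String> e : representative.entrySet())
            for (String second : e.getKey())
                System.out.println("Run: " + e.getValue() + second);
        // Prints AX plus CX and CY (order may vary), instead of the brute-force
        // AX, BX, CX, DX, CY, DY.
    }
}
```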

23  • Independent combinations: FP² × N(N-1) experiments  • With FP = 2 and N = 3, total = 24  • Symmetric code: just pick two nodes, so N(N-1) reduces to 2, giving FP² × 2. (Figure: failure points A and B on nodes 1, 2, 3.)

24  • FP² bottleneck  • With FP = 4, total = 16  • A real example has FP = 15  • Recovery clustering: cluster failure points A and B if fail(A) == fail(B)  • Reduces FP² to clustered-FP²  • E.g. 15 FPs reduce to 8 clustered FPs. (Figure: failure points A, B, C, D on nodes 1 and 2, before and after clustering.)
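A tiny worked calculation of these counts, assuming the formulas exactly as stated on slides 23 and 24 (FP = independent failure points, N = nodes); the 15 → 8 clustering figure is taken from the slide.

```java
// Worked example of how pruning shrinks the number of failure combinations.
public class ExperimentCount {
    public static void main(String[] args) {
        int fp = 2, n = 3;
        int bruteForce = fp * fp * n * (n - 1); // FP^2 x N(N-1) = 24
        int symmetric  = fp * fp * 2;           // symmetric code: pick two nodes = 8
        System.out.println(bruteForce + " -> " + symmetric);

        // The FP^2 factor is the remaining bottleneck: a real protocol with
        // FP = 15 gives 225 pairs, while clustering failure points with
        // identical fail() behavior (15 -> 8) leaves 64 pairs.
        System.out.println(15 * 15 + " -> " + 8 * 8);
    }
}
```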

25  • Exercise multiple, diverse failures  • Via the failure ID abstraction  • Challenge: combinatorial explosion of multiple failures  • Smart failure exploration strategies  • > 10x improvement  • Built on top of the failure ID abstraction  • Current limitations  • No I/O reordering  • Manual workload setup

26  • Research background  • FATE and DESTINI  • Motivation  • FATE  • DESTINI  • Overview  • Building specifications  • Evaluation

27  • Is the system correct under failures?  • Need to write specifications. "[It is] great to document (in a spec) the HDFS write protocol..., but we shouldn't spend too much time on it; ... a formal spec may be overkill for a protocol we plan to deprecate imminently." (Figure: specs checked against the implementation under injected failures X1, X2.)

28  • How to write specifications?  • Developer friendly (clear, concise, easy)  • Existing approaches:  • Unit tests (ugly, bloated, not formal)  • Others too verbose and long  • Declarative relational logic language (Datalog)  • Key: easy to express logical relations

29  • How to write specs?  • Violations  • Expectations  • Facts  • How to write recovery specs?  • "... recovery is under-specified" [Hamilton]  • Precise failure events  • Precise check timings  • How to test the implementation?  • Interpose I/O calls (lightweight)  • Deduce expectations and facts from I/O events

30  • Research background  • FATE and DESTINI  • Motivation  • FATE  • DESTINI  • Overview  • Building specifications  • Evaluation

31  violationTable(…) :- expectationTable(…), NOT-IN actualTable(…)
"Throw a violation if an expectation is different from the actual behavior"
Datalog syntax: head() :- predicates(), …; where ":-" means derivation and "," means AND.

32  "Replicas should exist in surviving nodes."
incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Scenario: node 3 crashes (X) during data transfer. expectedNodes(Block, Node) = {(B, Node1), (B, Node2)}; actualNodes(Block, Node) = {(B, Node1), (B, Node2)}; incorrectNodes(Block, Node) is empty, so no violation.

33  Same rule, but now the block survives only on Node1: expectedNodes = {(B, Node1), (B, Node2)}, actualNodes = {(B, Node1)}, so incorrectNodes = {(B, Node2)}, a violation.
incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

34  • Ex: which nodes should have the block?  • Deduce expectations from I/O events. The client asks the master via getBlockPipe(…) ("give me 3 nodes for B") and receives [Node1, Node2, Node3], so rule #2 fills the expectation table: expectedNodes(Block, Node) = {(B, Node1), (B, Node2), (B, Node3)}.
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);

35  When FATE crashes a node, delete it from the expectations: after Node3 crashes (X), expectedNodes drops (B, Node3).
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N);

36  Different stages → different recovery behaviors (data transfer recovery vs. setup stage recovery), so the deletion needs a precise failure event:
DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == "Data Transfer";
Rules so far:
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)
#2: expectedNodes(B, N) :- getBlockPipe(B, N)
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == "DataTransfer"
#4: writeStage(B, "DataTransfer") :- writeStage(B, "Setup"), nodesCnt(Nc), acksCnt(Ac), Nc == Ac
#5: nodesCnt(B, CNT) :- pipeNodes(B, N)
#6: pipeNodes(B, N) :- getBlockPipe(B, N)
#7: acksCnt(B, CNT) :- setupAcks(B, P, "OK")
#8: setupAcks(B, P, A) :- setupAck(B, P, A)

37  • Ex: which nodes actually store the valid blocks?  • Deduced from disk I/O events in 8 Datalog rules. Here actualNodes(Block, Node) = {(B, Node1)}, which rule #1 (incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)) compares against the expectations.

38  • Recovery ≠ invariant  • If recovery is ongoing, invariants are violated  • Don't want false alarms  • Need precise check timings  • Ex: upon block completion:
#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);
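Purely for illustration (DESTINI evaluates these as Datalog rules over interposed I/O events, not as Java), the same check can be read as plain set operations evaluated only once the block completes; the concrete node names below come from the running example.

```java
// Illustrative reading of the incorrectNodes check as set operations.
import java.util.HashSet;
import java.util.Set;

public class IncorrectNodesCheck {
    public static void main(String[] args) {
        // expectedNodes(B, N): from getBlockPipe events, minus nodes that
        // FATE crashed during the data-transfer stage.
        Set<String> expectedNodes = new HashSet<>(Set.of("Node1", "Node2", "Node3"));
        expectedNodes.remove("Node3"); // fateCrashNode(Node3) in data transfer

        // actualNodes(B, N): from disk I/O events (block and meta files).
        Set<String> actualNodes = new HashSet<>(Set.of("Node1"));

        // Check only at the precise timing: upon block completion.
        boolean blockComplete = true;  // completeBlock(B)
        if (blockComplete) {
            Set<String> incorrectNodes = new HashSet<>(expectedNodes);
            incorrectNodes.removeAll(actualNodes); // expected NOT-IN actual
            if (!incorrectNodes.isEmpty())
                System.out.println("Violation: replicas missing on " + incorrectNodes);
        }
    }
}
```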

39  The complete rule set (categories from the slide: failure event (crash), expectation, actual, facts, I/O events):
r1: incorrectNodes(B, N) :- cnpComplete(B), expectedNodes(B, N), NOT-IN actualNodes(B, N);
r2: pipeNodes(B, Pos, N) :- getBlkPipe(UFile, B, Gs, Pos, N);
r3: expectedNodes(B, N) :- getBlkPipe(UFile, B, Gs, Pos, N);
r4: DEL expectedNodes(B, N) :- fateCrashNode(N), pipeStage(B, Stg), Stg == 2, expectedNodes(B, N);
r5: setupAcks(B, Pos, Ack) :- cdpSetupAck(B, Pos, Ack);
r6: goodAcksCnt(B, COUNT) :- setupAcks(B, Pos, Ack), Ack == 'OK';
r7: nodesCnt(B, COUNT) :- pipeNodes(B, _, N);
r8: pipeStage(B, Stg) :- nodesCnt(NCnt), goodAcksCnt(ACnt), NCnt == ACnt, Stg := 2;
r9: blkGenStamp(B, Gs) :- dnpNextGenStamp(B, Gs);
r10: blkGenStamp(B, Gs) :- cnpGetBlkPipe(UFile, B, Gs, _, _);
r11: diskFiles(N, File) :- fsCreate(N, File);
r12: diskFiles(N, Dst) :- fsRename(N, Src, Dst), diskFiles(N, Src, Type);
r13: DEL diskFiles(N, Src) :- fsRename(N, Src, Dst), diskFiles(N, Src, Type);
r14: fileTypes(N, File, Type) :- diskFiles(N, File), Type := Util.getType(File);
r15: blkMetas(N, B, Gs) :- fileTypes(N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16: actualNodes(B, N) :- blkMetas(N, B, Gs), blkGenStamp(B, Gs);

40  • The spec: "something is wrong"  • Why? Where is the bug?  • Let's write more detailed specs

41  • First analysis  • Client's pipeline excludes Node2, why?  • Maybe the client gets a bad ack for Node2: errBadAck(N) :- dataAck(N, "Error"), liveNodes(N)  • Second analysis  • Client gets a bad ack for Node2, why?  • Maybe Node1 could not communicate with Node2: errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, "Terminated"), liveNodes(TgtN)  • We catch the bug!  • Node2 cannot talk to Node3 (which crashed)  • Node2 terminates all connections (including to Node1!)  • Node1 thinks Node2 is dead

42  • More detailed specs  • Catch bugs closer to the source and earlier in time  • Views: global, client's, datanode's

43  • Design patterns  • Add detailed specs  • Refine existing specs  • Write specs from different views (global, client, datanode)  • Incorporate diverse failures (crashes, network partitions)  • Express different violations (data loss, unavailability)

44  • Research background  • FATE and DESTINI  • Motivation  • FATE  • DESTINI  • Evaluation

45  • Implementation complexity  • ~6000 LOC in Java  • Target 3 popular cloud systems  • HadoopFS (primary target)  • Underlying storage for Hadoop/MapReduce  • ZooKeeper  • Distributed synchronization service  • Cassandra  • Distributed key-value store  • Recovery bugs  • Found 22 new HDFS bugs (confirmed)  • Data loss, unavailability bugs  • Reproduced 51 old bugs

46  "If multiple racks are available (reachable), a block should be stored in a minimum of two racks."
errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor(_);
"Throw a violation if a block is only stored in one rack, but the rack is connected to another rack."
Scenario: FATE injects rack partitioning; all replicas of block B land in Rack #1. The replication monitor sees #replicas = 3, does not check locations, and never migrates B to Rack #2: an availability bug! Derived facts: rackCnt(B, 1), blkRacks(B, R1), connected(R1, R2) → errorSingleRack(B).
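Again just as an illustration in plain Java (DESTINI expresses this as the Datalog rule above), using the facts from this example:

```java
// Illustrative single-rack availability check over the example facts.
import java.util.Map;
import java.util.Set;

public class SingleRackCheck {
    public static void main(String[] args) {
        Map<String, Set<String>> blkRacks  = Map.of("B",  Set.of("R1")); // block -> racks holding it
        Map<String, Set<String>> connected = Map.of("R1", Set.of("R2")); // rack -> reachable racks

        // errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), ...
        for (Map.Entry<String, Set<String>> e : blkRacks.entrySet()) {
            Set<String> racks = e.getValue();
            if (racks.size() != 1) continue;                   // rackCnt(B, Cnt), Cnt == 1
            String rack = racks.iterator().next();             // blkRacks(B, R)
            boolean otherRackReachable =
                !connected.getOrDefault(rack, Set.of()).isEmpty(); // connected(R, Rb)
            if (otherRackReachable)
                System.out.println("errorSingleRack(" + e.getKey() + "): availability bug");
        }
    }
}
```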

47  • Reduce #experiments by an order of magnitude  • Each experiment = 4-9 seconds  • Found the same number of bugs (by experience). (Chart: #experiments, brute force vs. pruned, for write + 2 crashes, append + 2 crashes, write + 3 crashes, and append + 3 crashes.)

48  • Compared to other related work (number of checks, lines per check):
D3S [NSDI '08]: 10 checks, 53 lines/check
Pip [NSDI '06]: 44 checks, 43 lines/check
WiDS [NSDI '07]: 15 checks, 22 lines/check
P2 Monitor [EuroSys '06]: 11 checks, 12 lines/check
DESTINI: 74 checks, 5 lines/check

49  • Cloud systems  • All good, but must manage failures  • Performance, reliability, and availability depend on failure recovery  • FATE and DESTINI  • Explore multiple, diverse failures systematically  • Facilitate declarative recovery specifications  • A unified framework  • Real-world adoption in progress

50  • Research background  • FATE and DESTINI  • Thanks! Questions?

