
1 Data-Centric Reconfiguration with Network-Attached Disks
Alex Shraer (Technion)
Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR), I. Keidar (Technion)

2 Preview
The setting: data-centric replicated storage
– Simple network-attached storage nodes
Our contributions:
1. First distributed reconfigurable R/W storage; allows adding/removing storage nodes dynamically
2. Asynchronous vs. consensus-based reconfiguration

3 Enterprise Storage Systems
– Highly reliable, customized hardware
– Controllers, I/O ports may become a bottleneck
– Expensive
– Usually not extensible
  – Different solutions for different scales
  – Example (HP): high end – XP (1152 disks), mid range – EVA (324 disks)

4 Alternative – Distributed Storage
– Made up of many storage nodes
– Unreliable, cheap hardware
– Failures are the norm, not an exception
– Challenges:
  – Achieving reliability and consistency
  – Supporting reconfigurations

5 Distributed Storage Architecture
– Unpredictable network delays (asynchrony)
(Figure: dynamic, fault-prone storage clients issue reads and writes over a LAN/WAN to fault-prone storage nodes in cloud storage)

6 A Case for Data-Centric Replication
– Client-side code runs the replication logic (a "not-so-thin" client)
  – Communicates with multiple storage nodes
– Simple ("thin") storage nodes (servers)
  – Can be network-attached disks: not necessarily PCs with disks, do not run application-specific code, less fault-prone components
  – Simply respond to client requests → high throughput
  – Do not communicate with each other (if storage nodes communicate, their failures are likely to be correlated!)
  – Oblivious to where other replicas of each object are stored → scalable; the same storage node can be used for many replication sets
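To make the data-centric pattern concrete, here is a minimal sketch of client-side majority-quorum replication in the style of ABD (which slide 15 cites for the R/W objects). The class and method names are assumptions for illustration, not DynaDisk's actual code, and the in-memory StorageNode stands in for a network-attached disk:

```python
# Minimal sketch (assumed names, not DynaDisk's code) of data-centric
# replication: the client runs the quorum logic; storage nodes only
# store timestamped values and answer requests -- they never talk to
# each other.

class StorageNode:
    """A 'thin' storage node: holds a (timestamp, value) pair per key."""
    def __init__(self):
        self.store = {}                          # key -> (timestamp, value)

    def get(self, key):
        return self.store.get(key, ((0, 0), None))

    def put(self, key, ts, value):
        if ts > self.store.get(key, ((0, 0), None))[0]:
            self.store[key] = (ts, value)        # keep only the newest timestamp

class Client:
    """A 'not-so-thin' client: ABD-style reads and writes.  For brevity
    it contacts all nodes; the real protocol needs only a majority."""
    def __init__(self, nodes, cid):
        self.nodes, self.cid = nodes, cid

    def write(self, key, value):
        # Phase 1: learn the highest timestamp, then pick a larger one.
        ts, _ = max((n.get(key) for n in self.nodes), key=lambda r: r[0])
        # Phase 2: store the new (timestamp, value) at the nodes.
        for n in self.nodes:
            n.put(key, (ts[0] + 1, self.cid), value)

    def read(self, key):
        # Read, take the newest value, then write it back so that later
        # readers cannot miss it (ABD's write-back phase).
        ts, val = max((n.get(key) for n in self.nodes), key=lambda r: r[0])
        for n in self.nodes:
            n.put(key, ts, val)
        return val

nodes = [StorageNode() for _ in range(3)]
Client(nodes, cid=1).write("x", "Spain")
print(Client(nodes, cid=2).read("x"))            # -> Spain
```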

7 Real Systems Are Dynamic
– The challenge: maintain consistency, reliability, availability
(Figure: storage nodes A–E connected over a LAN/WAN; reconfig {–A, –B} removes two nodes, while reconfig {–C, +F,…, +I} replaces C with new nodes F, G, H, I)

8 Pitfall of Naïve Reconfiguration
(Figure: starting from configuration {A, B, C, D}, one client issues reconfig {+E} and another issues reconfig {–D}; because the reconfig {+E} message to some replicas is delayed, some replicas adopt {A, B, C, D, E} while others adopt {A, B, C})

9 Pitfall of Naïve Reconfiguration (cont.)
(Figure: a client that believes the configuration is {A, B, C} writes x = "Spain" with timestamp 2 to a majority of {A, B, C}; a client that believes the configuration is {A, B, C, D, E} reads x from a majority of {A, B, C, D, E} that still holds x = "Italy" with timestamp 1, and returns "Italy". The two majorities do not intersect – split brain!)

10 Reconfiguration Option 1: Centralized
– Can be automatic
  – E.g., Ursa Minor [Abd-El-Malek et al., FAST 05]
– Downtime
  – Most solutions stop R/W while reconfiguring
– Single point of failure
  – What if the manager crashes while changing the system?
(Slide shows an example maintenance notice: "Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am. Virtually Yours, Moshe Barak")

11 Reconfiguration Option 2: Distributed Agreement
– Servers agree on the next configuration
  – Previous solutions are not data-centric
– No downtime
– In theory, might never terminate [FLP85]
– In practice, we have partial synchrony, so it usually works

12 Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, S., PODC09]
– Distributed & completely asynchronous
– No downtime
– Always terminates
– Not data-centric

13 In this work: DynaDisk – dynamic data-centric R/W storage
1. First distributed data-centric solution
  – No downtime
2. Tunable reconfiguration method
  – Modular design: coordination is separate from data
  – Allows easily setting/comparing the coordination method
  – Consensus-based vs. asynchronous reconfiguration
3. Many shared objects
  – Running a protocol instance per object is too costly
  – Transferring all state at once might be infeasible
  – Our solution: incremental state transfer
4. Built with an external (weak) location service
  – We formally state the requirements from such a service

14 Location Service
– Used in practice, ignored in theory
– We formalize the weak external service as an oracle:
  – oracle.query() returns some "legal" configuration
  – If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
– This guarantee alone is not enough to solve reconfiguration
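As an illustration only (the interface and class names here are assumptions, not from the talk), the oracle's weak guarantee can be written down as a small Python interface:

```python
# Hypothetical sketch of the weak location-service oracle.  query() may
# return stale configurations; the only promise is eventual accuracy
# once reconfigurations stop -- too weak to solve reconfiguration alone.

from abc import ABC, abstractmethod

class LocationService(ABC):
    @abstractmethod
    def query(self) -> frozenset:
        """Return some 'legal' configuration (a set of storage-node ids).
        If reconfigurations stop and query() is invoked infinitely many
        times, it eventually returns the last system configuration."""

class SimpleLocationService(LocationService):
    """Toy implementation: completed reconfigurations record the
    configuration they installed; query() returns the latest record."""
    def __init__(self, initial: frozenset):
        self._latest = initial

    def record(self, config: frozenset) -> None:
        self._latest = config          # called after a reconfig completes

    def query(self) -> frozenset:
        return self._latest
```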

15 The Coordination Module in DynaDisk
– Storage devices in a configuration conf = {+A, +B, +C} hold the distributed R/W objects (x, y, z), updated similarly to ABD
– Each device also holds a "next config" slot, initially ⊥; together these slots implement a distributed "weak snapshot" object
– API: update(set of changes) → OK; scan() → set of updates
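The weak snapshot API on this slide can be captured as a small interface. This is a sketch with assumed Python names; concrete consensus-based and asynchronous implementations are sketched after slides 16 and 18:

```python
# Sketch of the per-configuration weak snapshot object from the slide.
# update() proposes a set of configuration changes; scan() returns the
# set of updates observed so far.

from typing import FrozenSet, Set

Change = str                           # e.g. "+D" or "-C"

class WeakSnapshot:
    def update(self, changes: FrozenSet[Change]) -> None:
        """Propose a set of configuration changes (returns OK)."""
        raise NotImplementedError

    def scan(self) -> Set[FrozenSet[Change]]:
        """Return a set of proposed updates.  May be empty, but any two
        non-empty results (at any clients) must intersect -- the
        'non-empty intersection' property of slide 17."""
        raise NotImplementedError
```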

16 Coordination with Consensus
(Figure: two clients invoke reconfig({–C}) and reconfig({+D}); consensus among the devices A, B, C decides on +D, which is installed in the next-config slots)
– update: propose the set of changes to consensus; the decided update becomes the next config
– scan: read & write back the next config from a majority
– Every scan returns +D or ⊥
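A possible reduction, sketched under assumed names: with a single-shot consensus object per configuration (DynaDisk uses Active Disk Paxos, per slide 20), update proposes and scan returns the decided update, so all non-empty scans trivially agree:

```python
# Sketch: weak snapshot on top of single-shot consensus.  All non-empty
# scans return the same decided update, so they trivially intersect.
# The toy Consensus below stands in for Active Disk Paxos.

from typing import FrozenSet, Optional, Set

class Consensus:
    """Toy single-shot consensus: the first proposal wins."""
    def __init__(self):
        self._decided: Optional[FrozenSet[str]] = None

    def propose(self, value: FrozenSet[str]) -> FrozenSet[str]:
        if self._decided is None:
            self._decided = value
        return self._decided

    def decided(self) -> Optional[FrozenSet[str]]:
        return self._decided

class ConsensusWeakSnapshot:
    def __init__(self):
        self._cons = Consensus()

    def update(self, changes: FrozenSet[str]) -> None:
        self._cons.propose(changes)      # decided value = next config

    def scan(self) -> Set[FrozenSet[str]]:
        # In DynaDisk this is a read & write-back of the next-config
        # slot from a majority of the devices.
        d = self._cons.decided()
        return set() if d is None else {d}
```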

17 Weak Snapshot – Weaker than Consensus
– No need to agree on the next configuration, as long as each process obtains a set of possible next configurations, and all such sets intersect
  – Intersection allows clients to converge and again use a single config
– Non-empty intersection property of weak snapshot: every two non-empty sets returned by scan() intersect
– Example (Client 1's scan vs. Client 2's scan): {+D} vs. {+D} is what consensus would give; {–C} vs. {+D, –C} is also allowed by weak snapshot (the sets intersect); {+D} vs. {–C} is not allowed (disjoint)
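As a quick illustration (a toy check with an assumed representation of scans as Python sets of frozensets), the property on this slide is easy to state as a predicate over any two clients' scan results:

```python
# The non-empty intersection property of weak snapshot, as a predicate
# over the scan results observed by any two clients.

def intersection_property(scan1: set, scan2: set) -> bool:
    """Each scan is a set of updates (frozensets of changes)."""
    if not scan1 or not scan2:
        return True                    # empty scans are unconstrained
    return bool(scan1 & scan2)

s_cons_1, s_cons_2 = {frozenset({"+D"})}, {frozenset({"+D"})}
s_weak_1, s_weak_2 = {frozenset({"-C"})}, {frozenset({"+D"}), frozenset({"-C"})}
s_bad_1, s_bad_2 = {frozenset({"+D"})}, {frozenset({"-C"})}

assert intersection_property(s_cons_1, s_cons_2)       # what consensus gives
assert intersection_property(s_weak_1, s_weak_2)       # allowed by weak snapshot
assert not intersection_property(s_bad_1, s_bad_2)     # forbidden: disjoint
```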

18 Coordination without Consensus
(Figure: clients invoke reconfig({–C}) and reconfig({+D}) concurrently; each device holds a vector of proposal slots, initially ⊥. The message labels include CAS({–C}, ⊥, 0) – an attempt to compare-and-swap {–C} into slot 0 – then CAS({–C}, ⊥, 1) after +D has taken slot 0, and WRITE({–C}, 0) → OK)
– update: install the proposal into the first free slot via CAS, retrying on the next slot if another proposal got there first
– scan: read & write back the proposals from a majority (twice)
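The following toy, single-device model (assumed names; the real protocol runs its CAS and reads against a majority of devices, and scans twice) shows why slotted CAS yields the intersection property without consensus: every update tries slot 0 first, so any non-empty scan contains the slot-0 winner:

```python
# Toy, single-device model of the consensus-free weak snapshot.
# DynaDisk runs these operations against a majority of the devices.

import threading
from typing import FrozenSet, List, Optional, Set

class AsyncWeakSnapshot:
    def __init__(self, capacity: int = 16):
        self._slots: List[Optional[FrozenSet[str]]] = [None] * capacity
        self._lock = threading.Lock()

    def _cas(self, i: int, new: FrozenSet[str]) -> bool:
        """Model of the device's atomic compare-and-swap on slot i."""
        with self._lock:
            if self._slots[i] is None:       # CAS(new, ⊥, i)
                self._slots[i] = new
                return True
            return False

    def update(self, changes: FrozenSet[str]) -> None:
        for i in range(len(self._slots)):    # slot 0 first, then slot 1, ...
            if self._cas(i, changes):
                return

    def scan(self) -> Set[FrozenSet[str]]:
        # Updates always try slot 0 first, so if any slot is filled then
        # slot 0 is filled; hence every non-empty scan contains the
        # slot-0 winner, and any two non-empty scans intersect.
        with self._lock:
            return {s for s in self._slots if s is not None}
```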

19 Tracking Evolving Config's
– With consensus: agree on the next configuration, so configurations form a chain
– Without consensus: usually a chain, sometimes a DAG
(Figure: starting from {A, B, C}, one client's weak snapshot scan() returns {+D} and it moves to {A, B, C, D}; another client's scan() returns {+D, –C} and it moves to {A, B, D}. All non-empty scans intersect, so the inconsistent updates are found and merged, converging on {A, B, D})
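One way to read this slide as code (a sketch with assumed helper names, not the DynaDisk implementation): a client chases the current configuration by repeatedly scanning the weak snapshot of the configuration it is in and applying the union of all updates it sees, until it reaches a configuration whose scan is empty:

```python
# Sketch: chasing the evolving configuration through weak snapshots.
# snapshots maps each configuration to its weak-snapshot object.

from typing import Dict, FrozenSet

Config = FrozenSet[str]

def apply_changes(config: Config, changes: FrozenSet[str]) -> Config:
    """Apply updates like '+D' / '-C' to a configuration."""
    added = {c[1:] for c in changes if c.startswith("+")}
    removed = {c[1:] for c in changes if c.startswith("-")}
    return frozenset((set(config) | added) - removed)

def current_config(start: Config,
                   snapshots: Dict[Config, "AsyncWeakSnapshot"]) -> Config:
    config = start
    while True:
        updates = snapshots[config].scan()
        if not updates:
            return config                # no newer config installed here
        # Merge possibly divergent branches of the DAG: because all
        # non-empty scans intersect, applying the union converges.
        merged = frozenset().union(*updates)
        config = apply_changes(config, merged)
```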

20 Consensus-based vs. Asynchronous Coordination
– Two implementations of weak snapshots:
  – Asynchronous
  – Partially synchronous (consensus-based): Active Disk Paxos [Chockler, Malkhi, 2005] with exponential backoff for leader election
– Unlike asynchronous coordination, consensus-based coordination might not terminate [FLP85]
– Storage overhead (per storage device and configuration):
  – Asynchronous: a vector of updates, vector size ≤ min(#reconfigs, #members in config)
  – Consensus-based: 4 integers and the chosen update

21 Strong progress guarantees are not for free
– Asynchronous (no consensus) coordination has a significant negative effect on R/W latency while reconfigurations execute
– Consensus-based coordination gives slightly better, and much more predictable, reconfig latency when many reconfigs execute simultaneously
– With no reconfigurations running, the two perform the same

22 Future & Ongoing Work
– Combine asynchronous and partially synchronous coordination
– Consider other weak snapshot implementations
  – E.g., using randomized consensus
– Use weak snapshots to reconfigure other services
  – Not just for R/W

23 Summary
DynaDisk – dynamic data-centric R/W storage:
– First decentralized solution
– No downtime
– Supports many objects, provides incremental reconfiguration
– Uses one coordination object per configuration (not per object)
– Tunable reconfiguration method
  – We implemented asynchronous and consensus-based coordination
  – Many other implementations of weak snapshots are possible
– Asynchronous coordination in practice:
  – Works in more circumstances → more robust
  – But at a cost: significantly affects ongoing R/W ops

