Evaluating a Defragmented DHT Filesystem
Jeff Pang
Phil Gibbons, Michael Kaminsky, Haifeng Yu, Srinivasan Seshan
Intel Research Pittsburgh, CMU
Problem Summary
TRADITIONAL DISTRIBUTED HASH TABLE (DHT)
- Each server is responsible for a pseudo-random range of the ID space
- Objects are given pseudo-random IDs
Problem Summary
DEFRAGMENTED DHT
- Each server is responsible for a dynamically balanced range of the ID space
- Objects are given contiguous IDs
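To make the contrast concrete, here is a minimal Python sketch of the two placement policies. This is an illustration under our own assumptions, not the system's actual ID-assignment code; in particular, the per-directory counter table `next_id` is hypothetical.

```python
import hashlib

def traditional_object_id(path: str) -> int:
    """Traditional DHT: hash the object name into a pseudo-random
    160-bit ID, scattering related files across the ID space."""
    return int.from_bytes(hashlib.sha1(path.encode()).digest(), "big")

def defragmented_object_id(path: str, next_id: dict) -> int:
    """Defragmented DHT (illustrative only): give files in the same
    directory contiguous IDs so they cluster on few servers.
    `next_id` is a hypothetical per-directory counter, not the real
    assignment mechanism."""
    directory = path.rsplit("/", 1)[0]
    # Derive a per-directory base region, then hand out sequential offsets.
    base = int.from_bytes(hashlib.sha1(directory.encode()).digest()[:4], "big") << 128
    offset = next_id.get(directory, 0)
    next_id[directory] = offset + 1
    return base + offset
```

With pseudo-random IDs, the files in one directory land on many servers; with contiguous IDs, they fall into one or a few server ranges.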
Motivation
- Better availability: you depend on fewer servers when accessing your files
- Better end-to-end performance: you perform fewer DHT lookups when accessing your files
Availability Setup
- Evaluated via simulation: ~250 nodes with 1.5 Mbps each
- Faultload: PlanetLab failure trace (2003), which includes one 40-node failure event
- Workload: Harvard NFS trace (2003), primarily home directories used by researchers
- Compared:
  - Traditional DHT: data placed using consistent hashing
  - Defragmented DHT: data placed contiguously and load-balanced dynamically (via Mercury)
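For reference, a minimal sketch of the consistent-hashing placement used in the traditional baseline; details such as virtual nodes and replication are omitted here.

```python
import bisect
import hashlib

def _sha1_int(x: str) -> int:
    return int.from_bytes(hashlib.sha1(x.encode()).digest(), "big")

class ConsistentHashRing:
    """Minimal consistent hashing: each server owns the arc of ID
    space ending at its own hash, so an object is placed on the first
    server whose hash is at or after the object's ID (wrapping)."""
    def __init__(self, servers):
        self._ring = sorted((_sha1_int(s), s) for s in servers)
        self._hashes = [h for h, _ in self._ring]

    def server_for(self, object_id: int) -> str:
        i = bisect.bisect_left(self._hashes, object_id) % len(self._ring)
        return self._ring[i][1]
```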
Availability Setup
- Metric: failure rate of user "tasks"
- Task(i, m) = a sequence of accesses with interarrival threshold i and maximum duration m (see the sketch below)
  - Example: Task(1sec, 5min) = a sequence of accesses spaced no more than 1 second apart that lasts no more than 5 minutes
- Idea: capture a notion of a "useful unit of work"
- It is not clear which parameter values are right, so we evaluated many variations
(Figure: timeline showing how accesses are grouped into Task(1sec, 5min) units)
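As one concrete reading of this definition, the following sketch groups a sorted list of access timestamps into Task(i, m) units. The thresholds are in seconds; the function is our illustration, not the simulator's code.

```python
def split_into_tasks(times, i_thresh=1.0, max_len=300.0):
    """Group sorted access timestamps (seconds) into Task(i, m) units:
    successive accesses <= i_thresh apart, task duration <= max_len.
    Defaults mirror Task(1sec, 5min)."""
    tasks, cur = [], []
    for t in times:
        # Start a new task if the gap or the total duration is exceeded.
        if cur and (t - cur[-1] > i_thresh or t - cur[0] > max_len):
            tasks.append(cur)
            cur = []
        cur.append(t)
    if cur:
        tasks.append(cur)
    return tasks

# e.g. split_into_tasks([0.0, 0.5, 3.0]) -> [[0.0, 0.5], [3.0]]
```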
Availability Results
- Failure rate over 5 trials; lower is better
- Note the log scale; missing bars indicate 0 failures
- Explanation: user tasks access 10-20x fewer nodes in the defragmented design
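A back-of-envelope model (our assumption, not the paper's analysis) shows why touching fewer nodes matters: if node failures were independent, a task fails whenever any node it touches is down.

```python
def task_failure_rate(p_node_up: float, nodes_touched: int) -> float:
    """A task fails if any touched node is down, assuming independent
    node failures with per-node availability p_node_up."""
    return 1.0 - p_node_up ** nodes_touched

# With 99% per-node availability, a 10-20x reduction in nodes touched
# turns roughly 33% task failures into roughly 2%:
print(task_failure_rate(0.99, 40))  # ~0.33
print(task_failure_rate(0.99, 2))   # ~0.02
```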
Performance Setup
- Deployed a real implementation: virtual nodes with 1.5 Mbps each (Emulab)
- Global end-to-end latencies taken from MIT King measurements
- Workload: Harvard NFS trace
- Compared: Traditional vs. Defragmented implementations
  - Use the Symphony and Mercury DHTs, respectively
  - Both use TCP for data transport
  - Both employ a lookup cache that remembers recently contacted nodes and their DHT ranges
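A sketch of what such a lookup cache might look like; the entry format and LRU eviction policy here are our assumptions. A lookup whose key falls inside a cached node's range can contact that node directly and skip overlay routing.

```python
from collections import OrderedDict

class LookupCache:
    """Remembers recently contacted nodes and the DHT ID ranges they
    owned; a hit lets a lookup bypass overlay routing entirely."""
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.entries = OrderedDict()  # node -> (lo, hi) ID range

    def remember(self, node, lo, hi):
        self.entries[node] = (lo, hi)
        self.entries.move_to_end(node)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recent

    def lookup(self, key):
        for node, (lo, hi) in self.entries.items():
            if lo <= key < hi:
                return node  # cache hit: contact node directly
        return None          # miss: fall back to DHT routing
```

Contiguous placement makes this cache far more effective: one cached range covers many of a user's files.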
Performance Setup
- Metric: speedup of Task(1sec, infinity)
  - Example: task t takes 200 ms in Traditional and 100 ms in Defragmented, so speedup(t) = 200/100 = 2
- Idea: capture the speedup of each unit of work, independent of user think time
- Note: a 1-second interarrival threshold is conservative because it yields longer tasks; Defragmented does better with shorter tasks (next slide)
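The metric itself is just a ratio of per-task completion times; a one-line helper for clarity:

```python
def speedup(t_traditional_ms: float, t_defragmented_ms: float) -> float:
    """Per-task speedup: ratio of completion times. A value > 1 means
    the defragmented design finished the task faster."""
    return t_traditional_ms / t_defragmented_ms

assert speedup(200, 100) == 2.0  # the example from the slide
```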
Performance Setup
- Accesses within a task may or may not be interdependent
  - Task = (A, B, …): an application may read A and then, depending on the contents of A, read B; or it may read A and B regardless of contents
- We replay the trace to capture both extremes:
  - Sequential: each access must complete before the next starts (best case for Defragmented)
  - Parallel: all accesses in a task are submitted in parallel (best case for Traditional) [caveat: limited to 15 outstanding]
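The two replay modes might be driven roughly like this (a sketch; `do_access` is a stand-in for issuing the actual NFS operation):

```python
import concurrent.futures

def replay_sequential(task, do_access):
    """Sequential replay: each access completes before the next
    starts, modeling fully dependent accesses."""
    for access in task:
        do_access(access)

def replay_parallel(task, do_access, max_outstanding=15):
    """Parallel replay: all accesses in a task issued concurrently,
    capped at 15 outstanding requests as in the evaluation."""
    with concurrent.futures.ThreadPoolExecutor(max_outstanding) as pool:
        list(pool.map(do_access, task))
```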
Performance Results
Other Factors: TCP Slow Start
- Most tasks are small
Overhead
- The defragmented design is not free: we want to maintain load balance
- Dynamic load balancing requires data migration
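A toy sketch of why balancing costs migration bytes; the real system uses Mercury's balancing algorithm, and the midpoint split and load threshold here are purely illustrative.

```python
def rebalance(ranges, loads, threshold=2.0):
    """Toy dynamic load balancing over adjacent, sorted ranges: when a
    node is more than `threshold` times as loaded as its successor,
    shift the range boundary so half its data migrates to the lighter
    node. Every byte moved is migration overhead."""
    total_migrated = 0
    for i in range(len(ranges) - 1):
        if loads[i] > threshold * loads[i + 1]:
            lo, hi = ranges[i]
            mid = (lo + hi) // 2
            moved = loads[i] // 2           # bytes handed off
            loads[i] -= moved
            loads[i + 1] += moved
            ranges[i] = (lo, mid)
            ranges[i + 1] = (mid, ranges[i + 1][1])
            total_migrated += moved
    return total_migrated
```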
Conclusions
- Defragmented DHT Filesystem benefits:
  - Reduces task failures by an order of magnitude
  - Speeds up tasks by %
- Overhead might be reasonable: 1 byte written = 1.5 bytes transferred
- Key assumptions:
  - Most tasks are small to medium sized (file systems, web, etc. -- not streaming)
  - Wide-area end-to-end latencies are tolerable
Tommy Maddox Slides
Load Balance
Lookup Traffic
Availability Breakdown
Performance Breakdown
Performance Breakdown 2
- With parallel playback, Defragmented suffers on the small number of very long tasks
- (Ignore: an artifact of the emulation topology)
Maximum Overhead
Other Workloads