Prophecy: Using History for High-Throughput Fault Tolerance


1 Prophecy: Using History for High-Throughput Fault Tolerance
Siddhartha Sen Joint work with Wyatt Lloyd and Mike Freedman Princeton University

2 Non-crash failures happen
Model as Byzantine (malicious), assuming independent failures. Non-crash failures are hard to model; the Byzantine model is one way to capture them.

3 Mask Byzantine faults
Given this failure model, we then try to mask faults so the client doesn't see them. [Diagram: clients → service]

4 Mask Byzantine faults
Replication masks faults, but throughput suffers. [Diagram: clients → replicated service]

5 Mask Byzantine faults
The replicated service should also provide linearizability (strong consistency). [Diagram: clients → replicated service]

6 Byzantine fault tolerance (BFT)
Problems with existing BFT systems:
- Low throughput
- Modifies clients
- Long-lived sessions

7 Prophecy
High throughput + good consistency. No free lunch:
- Read-mostly workloads
- Slightly weakened consistency

8 Byzantine fault tolerance (BFT)
- Low throughput
- Modifies clients
- Long-lived sessions
D-Prophecy and Prophecy address these problems.

9 Traditional BFT reads
Every replica executes the read against the application, and the client checks that the responses agree. [Diagram: clients → replica group; Agree?]

10 A cache solution
Put a cache in front of the application at each replica. [Diagram: clients → replica group (cache + application); Agree?]

11 A cache solution
Problems: huge cache, invalidation. The cache can become as large as the application's disk. [Diagram: clients → replica group; Agree?]

12 A compact cache
Cache request/response pairs:
req1 → resp1
req2 → resp2
req3 → resp3
[Diagram: clients → replica group]

13 A compact cache
Cache sketches of requests and responses instead of full values:
sketch(req1) → sketch(resp1)
sketch(req2) → sketch(resp2)
sketch(req3) → sketch(resp3)
[Diagram: clients → replica group]
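The compact cache of sketches can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: the choice of hash function and the 8-byte truncation are assumptions.

```python
import hashlib

def sketch(data: bytes) -> bytes:
    # Compact digest standing in for the sketch function
    # (truncated SHA-256 here; the real hash choice is an assumption).
    return hashlib.sha256(data).digest()[:8]

# The compact cache maps sketch(request) -> sketch(response),
# so its size is independent of how large requests and responses are.
cache: dict = {}

def record(request: bytes, response: bytes) -> None:
    # Remember the (sketched) response observed for this request.
    cache[sketch(request)] = sketch(response)

record(b"GET /index.html", b"<html>...</html>")
```

Because only fixed-size digests are stored, the cache stays small no matter how large the cached webpages are.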

14 A sketcher
Each replica runs a sketcher in front of the application. That solves the big-cache problem; to solve the invalidation problem, we make sure to get the full response from one replica. [Diagram: clients → replica group (sketcher + application)]

15 A sketcher
[Diagram: clients → replica group; the sketchers hold sketches, one replica returns the full webpage]

16 Executing a read
Fast, load-balanced reads. Suppose the responses don't match: how could this happen? [Diagram: clients → replica group; sketch vs. webpage; Agree?]

17 Executing a read
[Diagram: clients → replica group; sketch vs. webpage; Agree?]

18 Executing a read
The sketcher acts as a key-value store; the replica group is a replicated state machine. [Diagram: clients → replica group]

19 Executing a read
Maintain a fresh cache. [Diagram: clients → replica group; sketch vs. webpage; Agree?]
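The read path above can be sketched as follows. This is a hypothetical simplification (one randomly chosen replica per fast read, and a majority vote standing in for full BFT agreement), not the system's exact protocol:

```python
import hashlib
import random

def sketch(data: bytes) -> bytes:
    # Stand-in sketch function (truncated SHA-256; an assumption).
    return hashlib.sha256(data).digest()[:8]

cache = {}  # sketch(request) -> sketch(response)

def fast_read(request, replicas):
    # Fast path: ask ONE replica, and accept its response only if its
    # sketch matches the cached sketch.
    replica = random.choice(replicas)
    response = replica(request)
    if cache.get(sketch(request)) == sketch(response):
        return response  # fast, load-balanced read
    # Slow path: full agreement across all replicas
    # (majority vote here, a simplified stand-in for BFT agreement).
    responses = [r(request) for r in replicas]
    agreed = max(set(responses), key=responses.count)
    cache[sketch(request)] = sketch(agreed)  # maintain a fresh cache
    return agreed
```

A mismatch triggers the slow path and refreshes the stored sketch, which is how the cache stays fresh without an explicit invalidation protocol.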

20 Did we achieve linearizability?
NO!

21 Executing a read
[Diagram: clients → replica group; sketch vs. webpage]

22 Executing a read
[Diagram: clients → replica group; sketch vs. webpage; Agree?]

23 Executing a read
Fast reads may be stale. [Diagram: clients → replica group]

24 Load balancing
Pr(k stale) = g^k. [Diagram: clients → replica group; Agree?]
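Reading the formula off the slide: if each fast read independently lands on a stale replica with probability g (the independence assumption and the interpretation of g are ours), then k consecutive fast reads are all stale with probability g^k:

```python
def prob_k_stale(g: float, k: int) -> float:
    # Probability that k consecutive load-balanced fast reads are ALL
    # stale, assuming each is independently stale with probability g.
    return g ** k

# Even a modest g shrinks quickly: with g = 0.1, three consecutive
# fast reads are all stale only about one time in a thousand.
```

This is why load-balancing reads across replicas drives the chance of repeated staleness toward zero.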

25 D-Prophecy vs. BFT
Traditional BFT: each replica executes the read; linearizability.
D-Prophecy: one replica executes the read; "delay-once" linearizability.
We call this system D-Prophecy.

26 Byzantine fault tolerance (BFT)
- Low throughput
- Modifies clients
- Long-lived sessions
We've solved the throughput problem for read requests at the cost of slightly weakened consistency, but the other two problems remain, and they are bad for Internet services.

27 Key-exchange overhead
With sessions of only 8 requests, throughput drops to 1/10 of its maximum. [Chart: key-exchange overhead; 11%, 3%]

28 Internet services
Applying the current design to Internet services leaves two problems: high per-session overhead and modified clients. Both can be solved by placing a proxy between the clients and the replica group. [Diagram: clients → replica group]

29 A proxy solution
Consolidate the per-replica sketchers into a single proxy sketcher. [Diagram: clients → proxy sketcher → replica group]

30 A proxy solution
The sketcher must be fail-stop. [Diagram: clients → trusted sketcher → replica group]

31 A proxy solution
The sketcher must be fail-stop, but we already trust middleboxes, and the sketcher is small and simple: our implementation required fewer than 3000 LOC. [Diagram: clients → trusted sketcher → replica group]

32 Prophecy: executing a read
Fast, load-balanced reads. [Diagram: a client sends request q to the trusted sketcher, which checks the replica's response against the stored sketch s(q)]

33 Prophecy
Fast reads may be stale. [Diagram: clients → trusted sketcher → replica group]

34 Delay-once linearizability
No assumptions about organization of system state or consistency model of replica group

35 Delay-once linearizability
Read-after-write property. Example history: W, R, W, W, R, R, W, R. No assumptions about the organization of system state or the consistency model of the replica group.


37 Example application
Uploading embarrassing photos:
1. Remove colleagues from ACL
2. Upload photos
3. (Refresh)
Weak consistency may reorder these updates; delay-once preserves their order. (Eventual consistency: updates may appear in different orders at different replicas.)
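A tiny illustration of why order preservation matters here. The model is hypothetical: a delay-once read sees some prefix of the ordered write history, while a weakly consistent read may see the writes applied in either order.

```python
from itertools import permutations

writes = ["remove colleagues from ACL", "upload photos"]

# Delay-once: a (possibly stale) read reflects a PREFIX of the
# ordered write history -- never a reordering.
delay_once_views = [writes[:i] for i in range(len(writes) + 1)]

# Weak/eventual consistency: a replica might apply the writes in
# either order, so a read could observe the photos while the ACL
# change has not yet been applied.
weak_views = [list(p)[:i] for p in permutations(writes)
              for i in range(len(writes) + 1)]

dangerous = ["upload photos"]  # photos visible, ACL not yet updated
assert dangerous not in delay_once_views
assert dangerous in weak_views
```

Under delay-once the colleagues may briefly see an older page, but never the photos without the ACL change.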

38 Byzantine fault tolerance (BFT)
- Low throughput
- Modifies clients
- Long-lived sessions
D-Prophecy and Prophecy address these problems.

39 Implementation
Modified PBFT (stable and complete, competitive with Zyzzyva et al.); C++ with Tamer async I/O.
- Sketcher: 2000 LOC
- PBFT library: 1140 LOC
- PBFT client: 1000 LOC

40 Evaluation
- Prophecy vs. proxied-PBFT (proxied systems)
- D-Prophecy vs. PBFT (non-proxied systems)

41 Evaluation
Prophecy vs. proxied-PBFT (proxied systems). We will study:
- Performance on "null" workloads
- Performance with a real replicated service
- Where the system bottlenecks, and how to scale

42 Basic setup
[Diagram: 100 clients → concurrent sketcher → PBFT replica group]

43 Fraction of failed fast reads
Alexa top sites: < 15%. A transition ratio of 1 means every read is preceded by a write; a transition ratio of 0 means reads are only preceded by other reads.

44 Small benefit on null reads

45 Apache webserver setup
[Diagram: clients → sketcher → replica group running Apache]

46 Large benefit on real workload
Concurrent reads of a 1-byte webpage. [Chart: 3.7x and 2.0x throughput improvements]

47 Benefit grows with work
94s (Apache) Null workloads are misleading!

48 Benefit grows with work

49 Single sketcher bottlenecks
Concurrent reads of a 1-byte webpage

50 Scaling out
The first tier does load balancing; the second tier does consistent hashing. This is no more complex than how you'd scale out and partition a system today.
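The slide only names the technique; as a sketch of what the second tier might look like, here is a minimal consistent-hash ring (the node names, virtual-node count, and hash choice are all made up for illustration):

```python
import hashlib
from bisect import bisect

class Ring:
    """Minimal consistent-hash ring for partitioning sketch state
    across second-tier sketchers (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring for smoother balance.
        self._ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash owns the key.
        i = bisect(self._points, self._h(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["sketcher-a", "sketcher-b", "sketcher-c"])
```

Adding or removing a sketcher moves only the keys adjacent to its ring points, which is what makes scale-out incremental.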

51 Scales linearly with replicas
Concurrent reads of a 1-byte webpage

52 Summary
Prophecy is good for Internet services: fast, load-balanced reads. D-Prophecy is good for traditional services. Prophecy scales linearly while PBFT stays flat.
Limitations:
- Read-mostly workloads (measurement study corroborates)
- Delay-once linearizability (useful for many apps)

53 Thank You

54 Additional slides

55 Transitions Prophecy good for read-mostly workloads
Are transitions rare in practice?

56 Measurement study Alexa top sites
Access main page every 20 sec for 24 hrs

57 Mostly static content

58 Mostly static content
[Chart: < 15% of transitions differ]

59 Dynamic content
Rabin fingerprinting on transitions: 43% differ by a single contiguous change ("change" = insertion/deletion/substitution, i.e. edit distance 1). Sampling 4000 of these, over half were due to:
- Load-balancing directives
- Random IDs in links and function parameters
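Fingerprinting detects such localized changes by hashing sliding windows of the page. The polynomial rolling hash below is a simplified stand-in for Rabin fingerprinting (the constants are arbitrary; the study used real Rabin fingerprints):

```python
BASE, MOD = 257, (1 << 61) - 1  # arbitrary constants for illustration

def window_fingerprints(data: bytes, w: int) -> list:
    """Fingerprint every w-byte window of `data` in O(len(data)) time
    by rolling the hash forward one byte at a time."""
    if len(data) < w:
        return []
    fp = 0
    for b in data[:w]:
        fp = (fp * BASE + b) % MOD
    out = [fp]
    top = pow(BASE, w - 1, MOD)  # weight of the byte leaving the window
    for i in range(w, len(data)):
        fp = ((fp - data[i - w] * top) * BASE + data[i]) % MOD
        out.append(fp)
    return out
```

Two page versions that differ by one contiguous edit share almost all of their window fingerprints, which is how single-change transitions can be isolated cheaply.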

