1 Probabilistically Consistent
Indranil Gupta (Indy)
Department of Computer Science, UIUC
indy@illinois.edu
FuDiCo 2015
DPRG: http://dprg.cs.uiuc.edu

2 Joint Work With
Muntasir Rahman (graduating PhD student)
Luke Leslie, Lewis Tseng
Mayank Pundir (MS, now at Facebook)
Work funded by Air Force Research Labs/AFOSR, National Science Foundation, Google, Yahoo!, and Microsoft

3 Hard Choices in Extensible Distributed Systems
Users of extensible distributed systems want timeliness and correctness guarantees.
But these are at odds with unpredictability: network delays and failures.
The research community and industry often translate this into hard choices in systems design.
Examples:
1. CAP Theorem: choose between consistency and availability (or latency), i.e., either relational databases or eventually consistent NoSQL stores (maybe a convergence now?)
2. Always get 100% answers in computation engines (batch or stream), i.e., use checkpointing

4 Hard Choices… Can in Fact Be Probabilistic Choices!
Many of these hard choices are in fact probabilistic choices. One of the earliest examples: pbcast/Bimodal Multicast.
Examples:
1. CAP Theorem: we derive a probabilistic CAP theorem that defines an achievable boundary between consistency and latency in any database system. We use it to incorporate probabilistic consistency and latency SLAs into Cassandra and Riak.
2. Always get 100% answers in computation engines (batch or stream): in many systems, checkpointing results in 8-31x higher execution time! We show that in systems like distributed graph processing engines we can avoid checkpointing altogether and instead take a reactive approach: upon failure, reactively scrounge state (which is naturally replicated) and still achieve very high accuracy (95-99%).

5 Key-value/NoSQL Storage Systems
Key-value/NoSQL stores: a $3.4B sector by 2018.
Distributed storage in the cloud:
Netflix: video position (Cassandra)
Amazon: shopping cart (DynamoDB)
And many others.
Necessary API operations: get(key) and put(key, value), plus some extended operations, e.g., "CQL" in Cassandra (see the sketch below).
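To make the core API concrete, here is a minimal in-memory sketch of the get/put interface; it is an illustration only, not the Cassandra or Riak client API.

```python
# Minimal sketch of the get(key)/put(key, value) interface named above.
# Illustrative in-memory stand-in; real stores replicate and persist data.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store (or overwrite) the value for a key."""
        self._data[key] = value

    def get(self, key):
        """Return the value for a key, or None if the key is absent."""
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42:cart", ["book", "headphones"])  # e.g., a shopping-cart entry
print(store.get("user:42:cart"))
```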

6 Key-value/NoSQL Storage: Fast and Fresh
Cloud clients expect both:
Latency: low latency for all operations (reads/writes). A 500 ms latency increase at Google.com costs a 20% drop in revenue; each extra ms translates to roughly $4M of revenue loss. Long latency leads to user cognitive drift.
Consistency: a read returns the value of one of the latest writes. Freshness of data means accurate tracking and higher user satisfaction.
Most KV stores offer only weak consistency (eventual consistency).
Eventual consistency: if writes stop, all replicas converge, eventually.

7 Hard vs. Soft Partitions
The CAP Theorem considers hard partitions.
However, soft partitions may happen inside a data center: periods of elevated message delays, periods of elevated loss rates.
Soft partitions are more frequent than hard partitions.
[Figure: a hard partition separates Data-center 1 (America) from Data-center 2 (Europe); congestion at ToR/core switches inside a data center causes a soft partition.]

8 Our Work: From Impossibility to Possibility
C → Probabilistic C (Consistency)
A → Probabilistic A (Latency)
P → Probabilistic P (Partition Model)
A probabilistic CAP theorem, and a system that validates how close we are to the achievable envelope.
(The goal is not another consistency model, or NoSQL vs. New/Yes SQL.)

9 Probabilistic CAP
[Figure: timeline with writes W(1), W(2) and a read R(1), illustrating the freshness window t_c.]
A read is t_c-fresh if it returns the value of a write that starts at most t_c time before the read.
Probabilistic Consistency (p_ic, t_c): p_ic is the likelihood that a read is NOT t_c-fresh.
Probabilistic Latency (p_ua, t_a): p_ua is the likelihood that a read does NOT return an answer within t_a time units.
Probabilistic Partition (α, t_p): α is the likelihood that a random path (client → server → client) has message delay exceeding t_p time units.
PCAP Theorem: it is impossible to achieve both Probabilistic Consistency and Probabilistic Latency under Probabilistic Partitions if: t_c + t_a < t_p and p_ua + p_ic < α.
A bad network means high (α, t_p); to get better consistency, lower (p_ic, t_c); to get better latency, lower (p_ua, t_a).
Special case: the original CAP theorem has α = 1 and t_p = ∞.
Full proof in our arXiv paper: http://arxiv.org/abs/1509.02464
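As a quick sanity check of the impossibility condition, the sketch below encodes it directly; the function name and example numbers are mine, only the inequality is from the slide.

```python
# Sketch: test whether a (consistency, latency) SLA pair is ruled out by the PCAP theorem
# under a partition model (alpha, t_p). Times are in seconds; names follow the slide notation.

def sla_possible_under_pcap(p_ic, t_c, p_ua, t_a, alpha, t_p):
    """Return False exactly when the PCAP theorem rules the combination out,
    i.e., when both t_c + t_a < t_p and p_ua + p_ic < alpha hold."""
    ruled_out = (t_c + t_a < t_p) and (p_ua + p_ic < alpha)
    return not ruled_out

# Harsh network: 50% of client-server-client paths exceed 20 ms.
print(sla_possible_under_pcap(p_ic=0.1, t_c=0.005, p_ua=0.1, t_a=0.010, alpha=0.5, t_p=0.020))  # False
# Relaxing the probabilities moves the pair back inside the achievable envelope.
print(sla_possible_under_pcap(p_ic=0.3, t_c=0.005, p_ua=0.3, t_a=0.010, alpha=0.5, t_p=0.020))  # True
```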

10 Towards Probabilistic SLAs
Latency SLA (similar to latency SLAs already existing in industry):
Meet a desired probability that the client receives the operation's result within the timeout.
Maximize the freshness probability within a given freshness interval.
Example: the Amazon shopping cart. Amazon does not want to lose customers due to high latency, so only 10% of operations may take longer than 300 ms.
SLA: (p_ua, t_a) = (0.1, 300 ms).
Minimize staleness (customers should not lose items): minimize p_ic, given t_c.

11 Towards Probabilistic SLAs (2)
Consistency SLA: the goal is to
Meet a desired freshness probability (given a freshness interval).
Maximize the probability that the client receives the operation's result within the timeout.
Example: Google search / Twitter search. Users should receive "recent" data in search results, so only 10% of results may be more than 5 minutes stale.
SLA: (p_ic, t_c) = (0.1, 5 min).
Minimize response time (fast response to the query): minimize p_ua, given t_a.
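A minimal sketch of how these two SLA flavors might be written down; the field names and the concrete "given" t_c/t_a values filled in below are hypothetical, only the (0.1, 300 ms) and (0.1, 5 min) pairs come from the slides.

```python
# Sketch of the two PCAP SLA flavors. Field names mirror the slide notation; times in seconds.
from dataclasses import dataclass

@dataclass
class LatencySLA:
    p_ua: float  # max allowed probability of missing the latency deadline
    t_a: float   # latency deadline
    t_c: float   # given freshness interval; the system then minimizes p_ic

@dataclass
class ConsistencySLA:
    p_ic: float  # max allowed probability of a stale (non t_c-fresh) read
    t_c: float   # freshness interval
    t_a: float   # given timeout; the system then minimizes p_ua

shopping_cart_sla = LatencySLA(p_ua=0.1, t_a=0.300, t_c=0.5)       # Amazon-style example
search_sla        = ConsistencySLA(p_ic=0.1, t_c=300.0, t_a=0.1)   # Google/Twitter-style example
```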

12 Meeting These SLAs: PCAP Systems

Increased knob     | Latency    | Consistency
Read delay         | Degrades   | Improves
Read repair rate   | Unaffected | Improves
Consistency level  | Degrades   | Improves

The PCAP system sits on top of the KV store (Cassandra, Riak) and continuously adapts these control knobs, via adaptive control, to always satisfy the PCAP SLA (see the sketch below).
System assumptions: the client sends a query to a coordinator server, which forwards it to the replicas (answers follow the reverse path); background mechanisms exist to bring stale replicas up to date.
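The control loop below is an illustrative sketch of this adaptive idea, not the PCAP implementation; the knob (an artificial read delay), step size, and measurement hooks are assumptions.

```python
# Illustrative adaptive control loop: measure the consistency metric periodically and
# nudge one knob (an artificial read delay) up or down to track the SLA target.
import time

def pcap_control_loop(measure_p_ic, set_read_delay, target_p_ic,
                      step_ms=1.0, period_s=5.0):
    read_delay_ms = 0.0
    while True:
        p_ic = measure_p_ic()  # observed fraction of reads that are not t_c-fresh
        if p_ic > target_p_ic:
            # Too many stale reads: more read delay improves consistency (at a latency cost).
            read_delay_ms += step_ms
        else:
            # SLA met: back off toward lower latency.
            read_delay_ms = max(0.0, read_delay_ms - step_ms)
        set_read_delay(read_delay_ms)
        time.sleep(period_s)
```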

13 Meeting Consistency SLA for PCAP Cassandra (p_ic = 0.135)
Consistency always stays below the target SLA.
Setup: a 9-server Emulab cluster; each server has 4 Xeon + 12 GB RAM, connected by 100 Mbps Ethernet.
Workload: YCSB with 144 client threads.
Network delay: log-normal distribution [Benson 2010], with mean latency 3 ms, 4 ms, or 5 ms.
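For context, a tiny sketch of how one might draw log-normal delays with a given mean when emulating such a network; the sigma value here is my assumption, since the slide reports only the means.

```python
# Draw log-normal message delays (in ms) with a target mean, for delay emulation.
import numpy as np

def lognormal_delays(mean_ms, sigma=0.5, n=10000, seed=0):
    # For a log-normal, mean = exp(mu + sigma^2 / 2); solve for mu given the target mean.
    mu = np.log(mean_ms) - sigma**2 / 2
    rng = np.random.default_rng(seed)
    return rng.lognormal(mean=mu, sigma=sigma, size=n)

delays = lognormal_delays(mean_ms=3.0)
print(delays.mean())  # close to 3 ms
```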

14 Meeting Consistency SLA for PCAP Cassandra (p_ic = 0.135)
Optimal envelopes under different network conditions (based on the PCAP theorems).
The PCAP system satisfies the SLA and stays close to the optimal envelope.

15 Geo-Distributed PCAP
[Plot: the WAN delay distribution jumps from N(20, sqrt(2)) to N(22, sqrt(2.2)).]
The latency SLA is met both before and after the jump; consistency degrades after the delay jump.
The PCAP controller converges fast, both initially and after the delay jump, with reduced oscillation compared to a multiplicative controller.

16 Related Work
Pileus/Tuba [Doug Terry et al.]: utility-based SLAs with a focus on the wide area; could be used underneath our PCAP system (instead of our SLAs).
Consistency metrics: PBS [Peter Bailis et al.] considers the write end time (we consider the write start time), so it may not be able to define consistency for some read-write pairs (PCAP accommodates all combinations); it can be used in the PCAP system.
Approximate answers: Hadoop [ApproxHadoop], querying [BlinkDB], Bimodal Multicast.

17 PCAP Summary
The CAP Theorem motivated the NoSQL revolution, but applications need freshness plus fast responses under soft partitions.
We proposed probabilistic models for C, A, and P, and a probabilistic CAP theorem that generalizes the classical CAP theorem.
The PCAP system satisfies latency/consistency SLAs and has been integrated into the Apache Cassandra and Riak KV stores.
Riak has expressed interest in incorporating these changes into their mainline code.

18 Distributed Graph Processing and Checkpointing
Checkpointing: proactively save state to persistent storage; if there is a failure, recover from the checkpoint. Full recovery, but at a cost.
Used by: PowerGraph [Gonzalez et al., OSDI 2012], Giraph [Apache Giraph], Distributed GraphLab [Low et al., VLDB 2012], Hama [Seo et al., CloudCom 2010].
(A generic sketch of this pattern follows.)
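Roughly, checkpoint-based fault tolerance in a synchronous (BSP-style) graph engine looks like the sketch below; this is a generic illustration, not the code of any of the systems listed, and the storage interface is assumed.

```python
# Generic sketch of checkpoint-based fault tolerance in a BSP-style graph engine.
# 'storage' is an assumed object with has_checkpoint/latest_superstep/load/save methods.
def run_with_checkpoints(graph, superstep_fn, num_supersteps, checkpoint_every, storage):
    # Resume from the most recent checkpoint if one exists, otherwise start fresh.
    start = storage.latest_superstep() if storage.has_checkpoint() else 0
    state = storage.load(start) if start > 0 else graph.initial_state()
    for step in range(start, num_supersteps):
        state = superstep_fn(graph, state)      # one synchronous superstep
        if (step + 1) % checkpoint_every == 0:
            storage.save(step + 1, state)       # proactive save: the source of the 8-31x overhead
    return state
```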

19 Checkpointing Bad

Graph dataset | Vertex count | Edge count
CA-Road       | 1.96 M       | 2.77 M
Twitter       | 41.65 M      | 1.47 B
UK Web        | 105.9 M      | 3.74 B

[Plot: with checkpointing enabled, execution time is 8x to 31x higher on these datasets.]

20 Users Already Don't (Use or Like) Checkpointing
"While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing." [Ching et al. (Giraph @ Facebook), VLDB 2015]
"Existing graph systems only support checkpoint-based fault tolerance, which most users leave disabled due to performance overhead." [Gonzalez et al. (GraphX), OSDI 2014]
"The choice of interval must balance the cost of constructing the checkpoint with the computation lost since the last checkpoint in the event of a failure." [Low et al. (GraphLab), VLDB 2012]
"Better performance can be obtained by balancing fault tolerance costs against that of a job restart." [Low et al. (GraphLab), VLDB 2012]

21 Our Approach: Zorro
No checkpointing: the common case is fast.
When a failure occurs, opportunistically scrounge state from the surviving servers and continue the computation.
This exploits the natural replication in distributed graph processing systems: a vertex's data is present at its neighbor vertices, and since each vertex is assigned to one server, its neighbors are likely on other servers.
We get very high accuracy (95%+). (A toy sketch of the recovery idea follows.)
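The sketch below is a toy reconstruction of the scrounging idea; the cached_neighbor_values() interface and the default reinitialization value are assumptions, not Zorro's actual code.

```python
# Toy sketch: after a server fails, rebuild its vertices' values from copies cached at
# neighboring vertices on surviving servers (natural replication).

def recover_lost_state(lost_vertices, surviving_servers):
    """Best-effort recovery of values for vertices owned by the failed server.

    Each surviving server is assumed to expose cached_neighbor_values():
    a dict {vertex_id: last_seen_value} for remote neighbors it has communicated with.
    Vertices with no surviving copy are reinitialized, which is where the small
    accuracy loss comes from.
    """
    recovered = {}
    for server in surviving_servers:
        for vertex, value in server.cached_neighbor_values().items():
            if vertex in lost_vertices and vertex not in recovered:
                recovered[vertex] = value
    for vertex in lost_vertices - recovered.keys():
        recovered[vertex] = 0.0  # default initial value (e.g., PageRank-style restart)
    return recovered
```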

22 Natural Replication => Can Retrieve a Lot of State
[Plot: fraction of lost state that can be retrieved from survivors: 92-95% for PowerGraph, 87-91% for LFGraph.]

23 Natural Replication => Low Inaccuracy
[Plot: inaccuracy after failure recovery: 2% for PowerGraph, 3% for LFGraph.]

24 Natural Replication => Low Inaccuracy

Algorithm                     | PowerGraph | LFGraph
PageRank                      | 2 %        | 3 %
Single-Source Shortest Paths  | 0.0025 %   | 0.06 %
Connected Components          | 1.6 %      | 2.15 %
K-Core                        | 0.0054 %   | 1.4 %
Graph Coloring*               | 5.02 %     | NA
Group-Source Shortest Paths*  | 0.84 %     | NA
Triangle Count*               | 0 %        | NA
Approximate Diameter*         | 0 %        | NA

25 Takeaways
Impossibility theorems and 100% correct answers are great, but they entail inflexibility in design (NoSQL or SQL) and high overhead (checkpointing).
It is important to explore probabilistic tradeoffs and achievable envelopes: this leads to more flexibility in design.
Other applicable areas: stream processing, machine learning.
DPRG: http://dprg.cs.uiuc.edu

26 Plug: MOOC on "Cloud Computing Concepts"
Free course on Coursera. It ran Feb-Apr 2015 with 120K+ students; the next run is Spring 2016.
It covers the distributed systems and algorithms used in cloud computing.
Free and open to everyone: https://www.coursera.org/course/cloudcomputing (or search Google for "Cloud Computing Course" and click the first link).

27 Backup Slides

28 PCAP Consistency Metric Is More Generic Than PBS
PCAP: a read is t_c-fresh if it returns the value of a write that starts at most t_c time before the read starts; W(1) and R(1) can overlap.
PBS: a read is t_c-fresh if it returns the value of a write that starts at most t_c time before the read ends; W(1) and R(1) cannot overlap.
[Figure: two timelines with writes W(1), W(2) and a read R(1), showing the t_c window measured from the read's start (PCAP) vs. the read's end (PBS).]
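To make the difference concrete, the toy helpers below encode the two freshness definitions; the helper names and example times are mine, not taken from the PCAP or PBS papers.

```python
# Toy encoding of the two t_c-freshness definitions (times in seconds).

def tc_fresh_pcap(write_start, read_start, t_c):
    """PCAP: the write may start at most t_c before the read STARTS (overlap with the read is fine)."""
    return read_start - write_start <= t_c

def tc_fresh_pbs(write_start, read_end, t_c):
    """PBS-style: the write may start at most t_c before the read ENDS."""
    return read_end - write_start <= t_c

# Write begins at 9.9 s; the read runs from 10.0 s to 10.5 s; freshness window t_c = 0.2 s.
print(tc_fresh_pcap(write_start=9.9, read_start=10.0, t_c=0.2))  # True:  0.1 s <= 0.2 s
print(tc_fresh_pbs(write_start=9.9, read_end=10.5, t_c=0.2))     # False: 0.6 s >  0.2 s
```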

29 GeoPCAP: 2 Key Techniques
A client read arrives with an SLA. Each local data center contributes a probabilistic model (Prob C_1, L_1; Prob C_2, L_2; Prob C_3, L_3). (1) Probabilistic composition rules, together with a probabilistic WAN model, combine these into a composed model (Prob C_C, L_C), which is compared against the client's SLA.
Given a client C or L SLA:
QUICKEST: at least one DC satisfies the SLA.
ALL: each DC satisfies the SLA.
(A sketch of the two composition semantics follows.)
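The sketch below illustrates the QUICKEST vs ALL semantics for satisfaction probabilities; assuming independent data centers and using these particular formulas is my simplification, not GeoPCAP's composition rules.

```python
# Compose per-DC SLA-satisfaction probabilities under the two semantics, assuming independence.
import math

def compose_all(p_satisfy):
    """ALL: every DC must satisfy the SLA."""
    return math.prod(p_satisfy)

def compose_quickest(p_satisfy):
    """QUICKEST: at least one DC satisfies the SLA."""
    return 1.0 - math.prod(1.0 - p for p in p_satisfy)

dcs = [0.90, 0.95, 0.85]      # example per-DC probabilities of meeting the SLA
print(compose_all(dcs))       # ~0.727
print(compose_quickest(dcs))  # ~0.99925
```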

30 CAP Theorem => NoSQL Revolution
Conjectured by [Brewer 00], proved by [Gilbert, Lynch 02]; it kicked off the NoSQL revolution.
Abadi's PACELC: if there is a partition (P), choose availability (A) or consistency (C); else, choose latency (L) or consistency (C).
[Diagram: the CAP triangle of Consistency, Availability (Latency), and Partition-tolerance, with RDBMSs (non-replicated) at Consistency + Availability; Cassandra, Riak, Dynamo, Voldemort at Availability + Partition-tolerance; HBase, HyperTable, BigTable, Spanner at Consistency + Partition-tolerance.]


