Event Based Systems Time and synchronization (II), CAP theorem and ZooKeeper Dr. Emanuel Onica Faculty of Computer Science, Alexandru Ioan Cuza University of Iaşi

Contents
Time and synchronization (part II) – vector clocks
CAP Theorem
BASE vs ACID
ZooKeeper

Vector Clocks Recap: Problem with Lamport timestamps - if A→B then timestamp(A) < timestamp(B), but if timestamp(A) < timestamp(B), not necessarily A→B

Vector Clocks
Using multiple time tags per process helps in determining the exact happens-before relation from the tags:
each process uses a vector of N values, where N = the number of processes, initialized with 0
the i-th element of the vector clock is the clock of the i-th process
the vector is sent along with the messages
when a process sends a message, it increments only its own clock in the vector
when a process receives a message, it increments its own clock in the local vector and sets each of the remaining clocks to max(clock value in the local vector, clock value in the received vector)
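A minimal sketch of these update rules in Java (the class and method names are illustrative, not from any library; processes are assumed to be indexed from 0 to N-1):

// Minimal vector clock sketch: one instance per process.
class VectorClock {
    private final int[] clock;  // clock[i] = logical clock of process i
    private final int myId;     // index of the owning process

    VectorClock(int numProcesses, int myId) {
        this.clock = new int[numProcesses];  // all entries start at 0
        this.myId = myId;
    }

    // On a local event or a message send: increment only our own entry
    // and return a copy to attach to the outgoing message.
    synchronized int[] tick() {
        clock[myId]++;
        return clock.clone();
    }

    // On message receive: increment our own entry and take the element-wise
    // max of the local vector and the vector carried by the message.
    synchronized void onReceive(int[] received) {
        clock[myId]++;
        for (int i = 0; i < clock.length; i++) {
            if (i != myId) {
                clock[i] = Math.max(clock[i], received[i]);
            }
        }
    }
}

A process would call tick() right before sending a message and attach the returned copy to it, and call onReceive() with the vector found in every incoming message.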

Vector Clocks example (a sequence of figure-only slides illustrating the vector clock updates step by step; the figures are not included in the transcript)

Vector Clocks – Causality determination
We can establish that there is a causality (happened-before) relation between two events E1 and E2 if V(E1) < V(E2), where V is the associated vector clock.
V(E1) < V(E2) if all clock values in V(E1) are <= the corresponding clock values in V(E2), and there exists at least one value in V(E1) that is < the corresponding clock value in V(E2).
If !(V(E1) < V(E2)) and !(V(E2) < V(E1)), we can label the events as concurrent.
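The comparison can be written directly from this definition; a small illustrative Java sketch (the method names are not from any library):

// True iff a happened before b: every entry of a is <= the corresponding
// entry of b, and at least one entry is strictly smaller.
static boolean happenedBefore(int[] a, int[] b) {
    boolean strictlySmaller = false;
    for (int i = 0; i < a.length; i++) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictlySmaller = true;
    }
    return strictlySmaller;
}

// Two events are concurrent when neither happened before the other.
static boolean concurrent(int[] a, int[] b) {
    return !happenedBefore(a, b) && !happenedBefore(b, a);
}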

The CAP Theorem
First formulated by Eric Brewer as a conjecture (PODC 2000). Formal proof by Seth Gilbert and Nancy Lynch (2002).
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
Consistency (all nodes see the same data at the same time)
Availability (every request receives a success or failure response in a timely manner)
Partition tolerance (the system continues to work despite arbitrary partitioning due to network failures)

The CAP Theorem Figure source: Concurrent Programming for Scalable Web Architectures, Benjamin Erb

The CAP Theorem – the requirements
Consistency – why do we need it?
Distributed services are used by many (up to millions of) users
Concurrent reads and writes take place in the distributed system
We need a unitary view across all sessions and across all data replicas
Example: An online shop stores stock info on a distributed storage system. When a user buys a product, the stock info must be updated on every storage replica and be visible to all users across all sessions.

The CAP Theorem – the requirements
Availability – why do we need it?
Reads and writes should be completed in a timely and reliable fashion, to offer the desired level of QoS
Service Level Agreements (SLAs) are often set in commercial environments to establish the parameters of functionality that must be provided
A ½ second delay per page load results in a 20% drop in traffic and revenue (Google, Web 2.0 Conference, 2006)
Amazon tests simulating artificial delays indicated losses of 6M dollars per each ms of delay (2009)

The CAP Theorem – the requirements
Partition tolerance – why do we need it?
Multiple data centers used by the same distributed service can be partitioned due to various failures:
Power outages
DNS timeouts
Cable failures
and others ...
Failures, or more generally faults, in distributed systems are the norm rather than the exception
Let’s say a rack server in a data center has one downtime in 3 years
Try to figure out how often the data center fails on average if it has 100 servers
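A rough back-of-the-envelope answer, assuming server failures are independent and evenly spread over time: 100 servers × 1 failure / 3 years ≈ 33 failures per year, i.e., roughly one server failure in the data center every 11 days.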

BASE vs ACID
ACID – the traditional guarantee set in RDBMS
Atomicity – a transaction either succeeds, changing the database, or fails with a rollback
Consistency – there is no invalid state in a sequence of states caused by transactions
Isolation – each transaction is isolated from the rest, preventing conflicts between concurrent transactions
Durability – each transaction commit is persistent across failures
Doesn’t work for distributed storage systems – no coverage for partition tolerance (and this is mandatory)

BASE vs ACID
BASE – Basically Available, Soft-state, Eventual consistency
Basically Available – ensures the availability requirement in the CAP theorem
Soft-state – no strong consistency is provided, and the state of the system might be stale at certain times
Eventual consistency – eventually the state of the distributed system converges to a consistent view

BASE vs ACID
The BASE model is mostly used in distributed key-value stores, where availability is typically favored over consistency (NoSQL):
Apache Cassandra (originally used by Facebook)
Dynamo DB (Amazon)
Voldemort (LinkedIn)
There are exceptions: HBase (inspired by Google’s Bigtable) – favors consistency over availability

ZooKeeper – Distributed Coordination
Basic idea: the distributed systems world is like a Zoo, and the beasts should be kept on a leash
Multiple instances of distributed applications (the same or different apps) often require synchronization for proper interaction
ZooKeeper is a coordination service where:
the apps being coordinated are distributed
the service itself is also distributed
Article: ZooKeeper: Wait-free coordination for Internet-scale systems (P. Hunt et al. – USENIX 2010)

ZooKeeper – Distributed Coordination
What do we mean by apps requiring synchronization? Various things (it’s a Zoo ...):
Detecting group membership validity
Leader election protocols
Mutually exclusive access to shared resources
and others ...
What do we mean by ZooKeeper providing coordination?
The ZooKeeper service does not offer complex server-side primitives such as the ones above
The ZooKeeper service offers a coordination kernel, exposing an API that can be used by clients to implement whatever primitives they require

ZooKeeper – Guarantees
The ZooKeeper coordination kernel provides several guarantees:
1. It is wait-free
Let’s stop a bit ... What does this mean (in general)?
lock-freedom – at least one system component makes progress (system-wide throughput is guaranteed, but some components can starve)
wait-freedom – all system components make progress (no individual starvation)
2. It guarantees FIFO ordering for all operations of each client
3. It guarantees linearizable writes

ZooKeeper – Guarantees (Linearizability)
Let’s stop a bit ... What does linearizability mean (in general)?
An operation typically has an invocation and a response phase – looked at atomically, the invocation and response are indivisible, but in reality this is not exactly the case ...
An execution of operations is linearizable if:
Invocations of operations and the responses to them can be reordered, without changing the system behavior, into a sequence of atomic executions of operations (a sequential history)
The sequential history obtained is semantically correct
If an operation’s response completes in the original order before another operation starts, it still completes before it in the reordering

ZooKeeper – Guarantees (Linearizability)
Example (threads): T1.lock(); T2.lock(); T1.fail; T2.ok;
Let’s reorder ...
a) T1.lock(); T1.fail; T2.lock(); T2.ok; – it is sequential ... but not semantically correct (T1 should not fail to acquire a free lock)
b) T2.lock(); T2.ok; T1.lock(); T1.fail; – sequential and semantically correct
We have b) => the original history is linearizable.
Back to ZK, recap: the coordination kernel ensures that the history of application write operations is linearizable (not the reads).

ZooKeeper – Fundamentals
High level overview of the service: Figure source: ZooKeeper Documentation
The service interface is exposed to clients through a client library API. Multiple servers offer the same coordination service in a distributed fashion (a single system image); among them a leader is defined as part of the internal ZK protocol that ensures consistency.

ZooKeeper – Fundamentals
The main abstraction offered by the ZooKeeper client library is a hierarchy of znodes, organized similarly to a file system: Figure source: ZooKeeper: Wait-free coordination for Internet-scale systems (P. Hunt et al. – USENIX 2010)
Applications can create and delete znodes and change the data content of limited size stored in them (1 MB by default), e.g., to set configuration parameters that are used in the distributed environment.

ZooKeeper – Fundamentals
Two types of nodes:
Regular – created and deleted explicitly by apps
Ephemeral – deletion can be performed automatically when the session during which the creation occurred terminates
Nodes can be created with the same base name but with a sequential flag set, for which the ZK service appends a monotonically increasing number. What’s this good for?
the same client application (same code) that creates a node to store configuration (e.g., a pub/sub broker) may run multiple times in a distributed fashion
obviously we don’t want to overwrite an existing config node
maybe we also need to organize a queue of nodes based on order
other applications (various algorithms)
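For illustration, a minimal sketch using the standard ZooKeeper Java client to create such a node (the connection string, parent path and data are placeholder assumptions; the parent znode /app is assumed to exist):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequentialNodeExample {
    public static void main(String[] args) throws Exception {
        // Connect and wait for the session to be established (placeholder address).
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Ephemeral + sequential: ZK appends a monotonically increasing counter
        // and removes the node automatically when the session ends.
        String name = zk.create("/app/broker-", "config-data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Created: " + name);  // e.g. /app/broker-0000000003

        zk.close();
    }
}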

ZooKeeper – Fundamentals
Client sessions:
applications connect to the ZK service and execute operations as part of a session
sessions are ended explicitly by clients, or when a session timeout occurs (the ZK server does not receive anything from the client for a while)
Probably the most important ZooKeeper feature: watches
permit application clients to receive notifications about change events on znodes without polling
normally set through flags on read-type operations
one-time triggers: once a watch is triggered by an event, it is removed; to be notified again, the watch must be set again
associated with a session (unregistered once the session ends)
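A short sketch of a one-time watch with the Java client, assuming zk is an already-connected ZooKeeper handle (the /ready path is a placeholder):

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchExample {
    // Registers a one-time watch on /ready: it fires for the next creation,
    // deletion or data change, then it must be registered again.
    static void watchReadyOnce(ZooKeeper zk) throws Exception {
        Watcher watcher = event ->
                System.out.println("Event on " + event.getPath() + ": " + event.getType());

        // exists() also works when the node does not exist yet.
        Stat stat = zk.exists("/ready", watcher);
        System.out.println("/ready is currently " + (stat == null ? "absent" : "present"));
    }
}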

ZooKeeper – Client API (generic form)
create (path, data, flags)
Creates a znode at the specified path, filled with the specified data and with the type (regular or ephemeral, sequential or not) specified in the flags. The method returns the node name.
delete (path, version)
Deletes the znode at the specified path if the node has the specified version (optional, use -1 to ignore).
exists (path, watch)
Checks if a node exists at the specified path, and optionally sets a watch that triggers a notification when the node is created, deleted, or has new data set on it.

ZooKeeper – Client API (generic form)
getData (path, watch)
Returns the data at the specified path if it exists, and optionally sets a watch that triggers when new data is set on the znode at the path or the znode is deleted.
setData (path, data, version)
Sets the specified data at the specified path if the znode exists, optionally only if the node has the specified version. Returns a structure containing various information about the znode.
getChildren (path, watch)
Returns the set of names of the children of the znode at the specified path, and optionally sets a watch that triggers when either a child is created or deleted at the path, or the node at the path is itself deleted.
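As an illustration of the version parameter, a sketch of an optimistic read-modify-write on a hypothetical /app/config znode, assuming zk is an already-connected ZooKeeper handle:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdateExample {
    static void appendSuffix(ZooKeeper zk) throws Exception {
        // Read the current data and remember the version we observed.
        Stat stat = new Stat();
        byte[] current = zk.getData("/app/config", false, stat);

        byte[] updated = (new String(current) + ";updated").getBytes();
        try {
            // Succeeds only if nobody changed the znode since our read.
            zk.setData("/app/config", updated, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            System.out.println("Concurrent update detected; retry the read-modify-write");
        }
    }
}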

ZooKeeper – Client API (generic form)
sync (path)
ZooKeeper offers a single system image to all connected clients (all clients have the same view of the znodes no matter which server they are connected to); however, depending on the server a client is connected to, some updates might not have been processed yet when the client executes a read operation; sync waits for all pending updates to propagate to the server where the client is connected; the path parameter is simply ignored.
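A sketch of the typical sync-then-read pattern with the Java client (in the Java API sync itself is asynchronous, so the sketch waits for its callback; zk is assumed to be an already-connected handle and /app/config a placeholder znode):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class SyncThenReadExample {
    static byte[] readLatest(ZooKeeper zk) throws Exception {
        // Ask the connected server to catch up with the leader before reading.
        CountDownLatch done = new CountDownLatch(1);
        zk.sync("/app/config", (rc, path, ctx) -> done.countDown(), null);
        done.await();

        // This read now observes at least the updates committed before the sync.
        return zk.getData("/app/config", false, null);
    }
}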

ZooKeeper – Client API
All operations of read and write type (not sync) have two forms:
synchronous
executes a single operation and blocks until it is finalized
does not permit other concurrent tasks
asynchronous
sets a callback for the invoked operation, which is triggered when the operation completes
does permit concurrent tasks
an order guarantee is preserved for asynchronous callbacks, based on their invocation order
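A sketch contrasting the two forms for getData (zk is assumed to be an already-connected handle, /app/config and /app/other are placeholder znodes; the asynchronous callbacks are delivered in invocation order):

import org.apache.zookeeper.AsyncCallback.DataCallback;
import org.apache.zookeeper.ZooKeeper;

public class SyncVsAsyncExample {
    static void readBothWays(ZooKeeper zk) throws Exception {
        // Synchronous form: blocks the calling thread until the result arrives.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println("sync read: " + new String(data));

        // Asynchronous form: returns immediately, the callback fires on completion.
        DataCallback cb = (rc, path, ctx, bytes, stat) ->
                System.out.println("async read of " + path + ", result code " + rc);
        zk.getData("/app/config", false, cb, "ctx-1");
        zk.getData("/app/other", false, cb, "ctx-2");  // callbacks arrive in this order
    }
}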

ZooKeeper – Example scenario
Context (a use of ZK, not ZK itself):
Distributed applications that have a leader among them, responsible for their coordination
While the leader changes the system configuration, none of the apps should start using the configuration being changed
If the leader dies before finishing the configuration change, none of the apps should use the unfinished configuration

ZooKeeper – Example scenario
How it’s done (using ZK):
The leader designates a node at the /ready path as a flag for a ready-to-use configuration, monitored by the other apps
While changing the configuration, the leader deletes the /ready node, and creates it back when finished
FIFO ordering guarantees that apps are notified of the /ready node creation only after the configuration change is finished
Looks ok ... ... or not?
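A condensed sketch of the follower side of this pattern, assuming zk is an already-connected handle and a hypothetical loadConfiguration() helper that reads the configuration znodes; because watches are one-time triggers, the watch is re-registered every time it fires:

import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReadyFlagFollower {
    private final ZooKeeper zk;

    ReadyFlagFollower(ZooKeeper zk) {
        this.zk = zk;
    }

    // Watch /ready and (re)load the configuration whenever it (re)appears.
    void watchReady() throws Exception {
        Stat stat = zk.exists("/ready", event -> {
            try {
                if (event.getType() == EventType.NodeCreated) {
                    loadConfiguration();   // the configuration is complete again
                }
                watchReady();              // one-time trigger: register the watch again
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        if (stat != null) {
            loadConfiguration();           // /ready already present: config usable now
        }
    }

    // Hypothetical helper: read the actual configuration znodes.
    private void loadConfiguration() throws Exception {
        byte[] cfg = zk.getData("/app/config", false, null);  // placeholder config znode
        System.out.println("Loaded configuration: " + new String(cfg));
    }
}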

ZooKeeper – Example scenario
Q: What if one app sees /ready just before it is deleted and starts reading the configuration while it is being changed?
A: The app will be notified when /ready is deleted, so it knows a new configuration is being set up and the old one is invalid. It just needs to re-set the (one-time triggered) watch on /ready to find out when it is created again.
Q: What if one app is notified of a configuration change (node /ready deleted), but the app is slow, and before it sets a new watch the node is already created and deleted again?
A: The target of the app’s action is reading/using an actually valid state of the configuration (which can be, and is, the latest one). Missing previous valid versions should not be critical.