1 Event Based Systems Time and synchronization (II), CAP theorem and ZooKeeper
Dr. Emanuel Onica Faculty of Computer Science, Alexandru Ioan Cuza University of Iaşi

2 Contents
- Time and synchronization (part II) – vector clocks
- CAP Theorem
- BASE vs ACID
- ZooKeeper

3 Vector Clocks Recap – the problem with Lamport timestamps: if A→B then timestamp(A) < timestamp(B), but if timestamp(A) < timestamp(B) it does not necessarily follow that A→B.

4 Vector clocks Using multiple time tags per process helps in determining the exact happens-before relation from the tags:
- each process uses a vector of N values, where N = the number of processes, initialized with 0
- the i-th element of the vector clock is the clock of the i-th process
- the vector is sent along with the messages
- when a process sends a message, it increments only its own clock in the vector
- when a process receives a message, it increments its own clock in the local vector and sets each of the remaining clocks to max(clock value in the local vector, clock value in the received vector)
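To make these update rules concrete, here is a minimal sketch (not from the slides) of a vector clock in Java; the class and method names (VectorClock, onSend, onReceive) are illustrative:

```java
import java.util.Arrays;

// Minimal vector clock sketch: one instance per process.
public class VectorClock {
    private final int[] clock;   // clock[i] = logical clock of process i
    private final int myId;      // index of the owning process

    public VectorClock(int numProcesses, int myId) {
        this.clock = new int[numProcesses];  // all entries start at 0
        this.myId = myId;
    }

    // Before sending a message: increment only our own entry,
    // then attach a copy of the vector to the message.
    public int[] onSend() {
        clock[myId]++;
        return Arrays.copyOf(clock, clock.length);
    }

    // On receiving a message carrying the sender's vector:
    // increment our own entry and set every other entry to the
    // maximum of the local value and the received value.
    public void onReceive(int[] received) {
        clock[myId]++;
        for (int i = 0; i < clock.length; i++) {
            if (i != myId) {
                clock[i] = Math.max(clock[i], received[i]);
            }
        }
    }

    public int[] snapshot() {
        return Arrays.copyOf(clock, clock.length);
    }
}
```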

5 Vector Clocks example

6 Vector Clocks example

7 Vector Clocks example

8 Vector Clocks example

9 Vector Clocks – Causality determination
We can establish that there is a causality (happened-before) relation between two events E1 and E2 if V(E1) < V(E2), where V is the associated vector clock. V(E1) < V(E2) if all clock values in V(E1) are <= the corresponding clock values in V(E2), and there exists at least one value in V(E1) that is < the corresponding clock value in V(E2). If !(V(E1) < V(E2)) and !(V(E2) < V(E1)), we can label the events as concurrent.
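A small sketch of this comparison rule, assuming vector clocks are represented as int arrays of equal length (method names are illustrative):

```java
public final class Causality {

    // Returns true if v1 < v2: every entry of v1 is <= the corresponding
    // entry of v2, and at least one entry is strictly smaller.
    public static boolean happensBefore(int[] v1, int[] v2) {
        boolean strictlySmaller = false;
        for (int i = 0; i < v1.length; i++) {
            if (v1[i] > v2[i]) {
                return false;
            }
            if (v1[i] < v2[i]) {
                strictlySmaller = true;
            }
        }
        return strictlySmaller;
    }

    // Concurrent if neither happens before the other.
    public static boolean concurrent(int[] v1, int[] v2) {
        return !happensBefore(v1, v2) && !happensBefore(v2, v1);
    }
}
```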

10 The CAP Theorem First formulated by Eric Brewer as a conjecture (PODC 2000). Formal proof by Seth Gilbert and Nancy Lynch (2002). It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (every request receives a success or failure response in a timely manner)
- Partition tolerance (the system continues to work despite arbitrary partitioning due to network failures)

11 The CAP Theorem Figure source: Concurrent Programming for Scalable Web Architectures, Benjamin Erb

12 The CAP Theorem – the requirements
Consistency – why do we need it? Distributed services are used by many (up to millions of) users, and concurrent reads and writes take place in the distributed system. We need a unitary view across all sessions and across all data replicas. Example: an online shop stores stock info on a distributed storage system. When a user buys a product, the stock info must be updated on every storage replica and be visible to all users across all sessions.

13 The CAP Theorem – the requirements
Availability – why do we need it? Reads and writes should be completed in a timely, reliable fashion, to offer the desired level of QoS. Service Level Agreements (SLAs) are often set in commercial environments to establish the parameters of functionality that must be provided.
- A ½ second delay per page load results in a 20% drop in traffic and revenue (Google, Web 2.0 Conference, 2006)
- Amazon tests simulating artificial delays showed losses of 6M dollars per each ms of delay (2009)

14 The CAP Theorem – the requirements
Partition tolerance – why do we need it? Multiple data centers used by the same distributed service can be partitioned due to various failures:
- power outages
- DNS timeouts
- cable failures
- and others ...
Failures, or more generally faults, in distributed systems are the norm rather than the exception. Say a rack server in a data center has one downtime event every 3 years. Try to figure out how often the data center fails on average if it has 100 servers.
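(A rough back-of-the-envelope hint, assuming independent failures and the 3-year figure above: 100 servers × 1/3 failures per server-year ≈ 33 failures per year, i.e., some server in the data center is down roughly every 11 days.)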

15 BASE vs ACID ACID – the traditional guarantee set in RDBMS
Atomicity – a transaction either succeeds, changing the database, or fails with a rollback
Consistency – there is no invalid state in the sequence of states caused by transactions
Isolation – each transaction is isolated from the rest, preventing conflicts between concurrent transactions
Durability – each committed transaction is persistent across failures
This doesn't work for distributed storage systems – there is no coverage for partition tolerance (and this is mandatory).

16 BASE vs ACID BASE – Basically Available, Soft-state, Eventual consistency
Basically Available – ensures the availability requirement in the CAP theorem
Soft-state – no strong consistency is provided, and the state of the system might be stale at certain times
Eventual consistency – eventually the state of the distributed system converges to a consistent view

17 BASE vs ACID The BASE model is mostly used in distributed key-value stores, where availability is typically favored over consistency (NoSQL):
- Apache Cassandra (originally used by Facebook)
- DynamoDB (Amazon)
- Voldemort (LinkedIn)
There are exceptions: HBase (inspired by Google's Bigtable) favors consistency over availability.

18 ZooKeeper – Distributed Coordination
Basic idea: the distributed systems world is like a Zoo, and the beasts should be kept on a leash. Multiple instances of distributed applications (the same or different apps) often require synchronization for proper interaction. ZooKeeper is a coordination service where:
- the apps being coordinated are distributed
- the service itself is also distributed
Article: ZooKeeper: Wait-free coordination for Internet-scale systems (P. Hunt et al. – USENIX 2010)

19 ZooKeeper – Distributed Coordination
What do we mean by apps requiring synchronization? Various things (it's a Zoo ...):
- detecting group membership validity
- leader election protocols
- mutually exclusive access to shared resources
- and others ...
What do we mean by ZooKeeper providing coordination? The ZooKeeper service does not offer complex server-side primitives such as the above for synchronization. Instead, it offers a coordination kernel, exposing an API that clients can use to implement whatever primitives they require.

20 ZooKeeper – Guarantees
The ZooKeeper coordination kernel provides several guarantees:
1. It is wait-free. Let's stop a bit ... What does this mean (in general)?
- lock-freedom – at least one system component makes progress (system-wide throughput is guaranteed, but some components can starve)
- wait-freedom – all system components make progress (no individual starvation)
2. It guarantees FIFO ordering for all operations issued by a given client.
3. It guarantees linearizable writes.
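As a generic (non-ZooKeeper) illustration of the distinction, here is a sketch of a counter using a compare-and-swap retry loop in Java; it is lock-free but not wait-free, since an unlucky thread may retry indefinitely:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: at least one thread always succeeds its CAS,
    // so the system as a whole makes progress, but a particular thread
    // can in principle keep losing the race (no wait-freedom guarantee).
    public int increment() {
        while (true) {
            int current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current + 1;
            }
        }
    }
}
```

(On platforms where AtomicInteger.getAndIncrement() maps to a hardware fetch-and-add instruction, that call would be a wait-free counterpart: every thread completes in a bounded number of steps.)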

21 ZooKeeper – Guarantees (Linearizability)
Let's stop a bit ... What does linearizability mean (in general)? An operation typically has an invocation and a response phase – viewed atomically, the invocation and response are indivisible, but in reality this is not exactly the case ... An execution of operations is linearizable if:
- the invocations of operations and the responses to them can be reordered, without changing the system behavior, into a sequence of atomic executions of the operations (a sequential history)
- the sequential history obtained is semantically correct
- if an operation's response completes before another operation starts in the original order, it still completes before it in the reordering

22 ZooKeeper – Guarantees (Linearizability)
Example (threads): T1.lock(); T2.lock(); T1.fail; T2.ok;
Let's reorder ...
a) T1.lock(); T1.fail; T2.lock(); T2.ok; – it is sequential ... but not semantically correct
b) T2.lock(); T2.ok; T1.lock(); T1.fail; – sequential ... and semantically correct
We have b) => the original history is linearizable.
Back to ZK, recap: the coordination kernel ensures that the history of application write operations is linearizable (not the reads).

23 ZooKeeper – Fundamentals
High level overview of the service: Figure source: ZooKeeper Documentation. The service interface is exposed to clients through a client library API. Multiple servers offer the same coordination service in a distributed fashion (single system image); among them a leader is designated as part of the internal ZK protocol that ensures consistency.

24 ZooKeeper – Fundamentals
The main abstraction offered by the ZooKeeper client library is a hierarchy of znodes, organized similarly to a file system: Figure source: ZooKeeper: Wait-free coordination for Internet-scale systems (P. Hunt et al. – USENIX 2010). Applications can create and delete znodes and change their limited-size data content (1MB by default) to set configuration parameters used in the distributed environment.

25 ZooKeeper – Fundamentals
Two types of nodes:
- Regular – created and deleted explicitly by apps
- Ephemeral – deleted automatically when the session during which the creation occurred terminates
Nodes can be created with the same base name but with a sequential flag set, for which the ZK service appends a monotonically increasing number. What's this good for?
- the same client application (same code) that creates a node to store configuration (e.g., a pub/sub broker) may be run multiple times in a distributed fashion – obviously we don't want to overwrite an existing config node
- maybe we also need to organize a queue of nodes based on order
- other applications (various algorithms)
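A minimal sketch of creating such a node with the Apache ZooKeeper Java client (the connection string, timeout and /brokers paths are placeholder values):

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequentialNodeExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});

        // Parent znode for the example (regular/persistent).
        if (zk.exists("/brokers", false) == null) {
            zk.create("/brokers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral + sequential: ZooKeeper appends a monotonically
        // increasing counter (e.g. /brokers/broker-0000000003), and the
        // node is deleted automatically when this session ends.
        String name = zk.create("/brokers/broker-", "config-data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Created: " + name);

        // The appended numbers also give a natural ordering, usable for
        // queues of nodes or leader election among the registered members.
        List<String> members = zk.getChildren("/brokers", false);
        System.out.println("Current members: " + members);

        zk.close();
    }
}
```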

26 ZooKeeper – Fundamentals
Client sessions:
- applications connect to the ZK service and execute operations as part of a session
- sessions are ended explicitly by clients, or when a session timeout occurs (the ZK server does not receive anything from the client for a while)
Probably the most important ZooKeeper feature: watches
- permit application clients to receive notifications about change events on znodes without polling
- normally set through flags on read-type operations
- one-time triggers: once a watch is triggered by an event, it is removed; to be notified again, the watch must be set again
- associated with a session (unregistered once the session ends)
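A small sketch of setting and re-setting a watch with the Apache ZooKeeper Java client (the /config path is a placeholder), including re-registration since watches are one-time triggers:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchExample implements Watcher {
    private final ZooKeeper zk;

    public WatchExample(ZooKeeper zk) throws Exception {
        this.zk = zk;
        // exists() with a watcher: we are notified once when /config
        // is created, deleted, or its data changes.
        zk.exists("/config", this);
    }

    @Override
    public void process(WatchedEvent event) {
        System.out.println("Event on " + event.getPath() + ": " + event.getType());
        try {
            // Watches are one-time triggers: re-register to keep
            // receiving notifications for later changes.
            zk.exists("/config", this);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```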

27 ZooKeeper – Client API (generic form)
create (path, data, flags)
Creates a znode at the specified path, filled with the specified data, with the type (regular or ephemeral, sequential or not) given by the flags. Returns the node name.
delete (path, version)
Deletes the znode at the specified path if the node has the specified version (optional; use -1 to ignore it).
exists (path, watch)
Checks whether a node exists at the specified path, and optionally sets a watch that triggers a notification when the node is created or deleted, or when new data is set on it.

28 ZooKeeper – Client API (generic form)
getData (path, watch)
Returns the data at the specified path if the znode exists, and optionally sets a watch that triggers when new data is set on the znode at the path or the znode is deleted.
setData (path, data, version)
Sets the specified data at the specified path if the znode exists, optionally only if the node has the specified version. Returns a structure containing various information about the znode.
getChildren (path, watch)
Returns the set of names of the children of the znode at the specified path, and optionally sets a watch that triggers when a child is created or deleted at the path, or when the node at the path itself is deleted.
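To make the generic forms on these two slides concrete, here is a brief sketch using the Apache ZooKeeper Java client (connection string and /app paths are placeholders); note that in the Java API getData also fills in a Stat structure and setData returns one:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});

        // create(path, data, flags) -> returns the created node name
        zk.create("/app", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists(path, watch) -> Stat, or null if the node does not exist
        Stat stat = zk.exists("/app/config", false);

        // getData(path, watch, stat) -> the data stored in the znode
        byte[] data = zk.getData("/app/config", false, stat);
        System.out.println(new String(data));

        // setData(path, data, version) -> conditional update: fails if the
        // znode's version changed since we read it (-1 ignores the check)
        zk.setData("/app/config", "v2".getBytes(), stat.getVersion());

        // getChildren(path, watch) -> names of the child znodes
        System.out.println(zk.getChildren("/app", false));

        // delete(path, version) -> -1 ignores the version check
        zk.delete("/app/config", -1);

        zk.close();
    }
}
```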

29 ZooKeeper – Client API (generic form)
sync (path)
ZooKeeper offers a single system image to all connected clients (all clients have the same view of the znodes, no matter which server they are connected to). Depending on the server a client is connected to, some updates might not yet have been processed there when the client executes a read operation. sync waits for all pending updates to propagate to the server the client is connected to; the path parameter is simply ignored.
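A minimal sketch (placeholder connection string and path), assuming the Apache ZooKeeper Java client, where sync is exposed as an asynchronous call; because of FIFO client ordering, the read issued right after it is served only after the connected server has caught up:

```java
import org.apache.zookeeper.AsyncCallback.VoidCallback;
import org.apache.zookeeper.ZooKeeper;

public class SyncExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});

        // sync(): the callback fires once the server this client is
        // connected to has applied all updates pending from the leader.
        zk.sync("/app/config", new VoidCallback() {
            @Override
            public void processResult(int rc, String path, Object ctx) {
                System.out.println("sync completed with code " + rc);
            }
        }, null);

        // This read is queued after the sync, so it sees the latest state.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```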

30 ZooKeeper – Client API All operations of read and write type (not sync) have two forms:
- synchronous: executes a single operation and blocks until it is finalized; does not permit other concurrent tasks
- asynchronous: sets a callback for the invoked operation, which is triggered when the operation completes; does permit concurrent tasks; an ordering guarantee is preserved for asynchronous callbacks based on their invocation order
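A minimal sketch of the asynchronous form, using the Java client's callback variant of getData (the path is a placeholder; the callback class is from org.apache.zookeeper.AsyncCallback):

```java
import org.apache.zookeeper.AsyncCallback.DataCallback;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class AsyncReadExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});

        DataCallback cb = new DataCallback() {
            @Override
            public void processResult(int rc, String path, Object ctx,
                                      byte[] data, Stat stat) {
                // rc is the result code (0 = OK); callbacks for operations
                // issued by the same client fire in invocation order.
                System.out.println("Read " + path + " -> "
                        + (data == null ? "null" : new String(data)));
            }
        };

        // Returns immediately; the callback fires on completion, so the
        // client can keep issuing other operations in the meantime.
        zk.getData("/app/config", false, cb, null);

        Thread.sleep(1000);  // crude wait for the callback in this sketch
        zk.close();
    }
}
```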

31 ZooKeeper – Example scenario
Context (a use of ZK, not ZK itself): distributed applications that have a leader among them responsible for their coordination.
- While the leader changes the system configuration, none of the apps should start using the configuration being changed.
- If the leader dies before finishing the configuration change, none of the apps should use the unfinished configuration.

32 ZooKeeper – Example scenario
How it's done (using ZK):
- The leader designates a /ready path node as a flag for a ready-to-use configuration, monitored by the other apps.
- While changing the configuration, the leader deletes the /ready node, and creates it back when finished.
- FIFO ordering guarantees that apps are notified of the /ready node creation only after the configuration is finished.
Looks ok ... ... or not?
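A rough sketch of how a follower app could monitor /ready with the Java client, following the scheme above (the reloadConfiguration method is a placeholder for reading the actual configuration znodes):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ConfigFollower implements Watcher {
    private final ZooKeeper zk;

    public ConfigFollower(ZooKeeper zk) throws Exception {
        this.zk = zk;
        checkReady();
    }

    private void checkReady() throws Exception {
        // exists() also (re-)registers the one-time watch on /ready.
        if (zk.exists("/ready", this) != null) {
            reloadConfiguration();   // /ready present: configuration is usable
        }
        // If /ready is absent, the leader is still changing the
        // configuration; we simply wait for the creation notification.
    }

    @Override
    public void process(WatchedEvent event) {
        try {
            if (event.getType() == EventType.NodeCreated
                    || event.getType() == EventType.NodeDeleted) {
                checkReady();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void reloadConfiguration() {
        // Placeholder: getData() on the configuration znodes goes here.
    }
}
```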

33 ZooKeeper – Example scenario
Q: What if one app sees /ready just before it is deleted and starts reading the configuration while it is being changed?
A: The app will be notified when /ready is deleted, so it knows a new configuration is being set up and the old one is invalid. It just needs to reset the (one-time triggered) watch on /ready to find out when it is created again.
Q: What if one app is notified of a configuration change (node /ready deleted), but the app is slow and, before it sets a new watch, the node is created and deleted again?
A: The target of the app's action is reading/using an actual valid state of the configuration (which can be, and is, the latest). Missing previous valid versions should not be critical.

