HUG – India Meet November 28, 2009 Noida Apache ZooKeeper Aby Abraham

- 2 - Agenda
»Overview of ZooKeeper
»Use cases and examples
»Essential internals
»Future directions
»Related work and references

- 3 - What is ZooKeeper?
»A scalable, distributed, open-source coordination service for distributed applications
»Provides a simple set of primitives for implementing higher-level services such as synchronization, configuration maintenance, consensus, leader election, group membership and naming in a distributed system

- 4 - Why Use ZooKeeper?
»If you love having sleepless nights debugging distributed synchronization problems, please ignore the rest of the presentation
»Difficulty of implementing distributed services
› Complex distributed algorithms are notoriously difficult to implement correctly
› Prone to race conditions and deadlocks. And distributed deadlocks are the worst!
› Different implementations lead to management complexity when the applications are deployed
»Other programming models using distributed locks or State Machine Replication are difficult to use correctly
»ZooKeeper solves these problems for us by providing a simple and already familiar programming model
»ZooKeeper provides reusable code libraries for common use cases – very easy to use

- 5 - Who uses ZooKeeper?
»DeepDyve – does search for research and provides access to high-quality content using advanced search technologies. ZK is used to manage server state, control index deployment and a myriad of other tasks
»Katta – serves distributed Lucene indexes in a grid environment. ZK is used for node, master and index management in the grid
»101tec – does consulting in the area of enterprise distributed systems. Uses ZK to manage a system built out of Hadoop, Katta, Oracle batch jobs and a web component
»HBase – an open-source distributed column-oriented database on Hadoop. Uses ZK for master election, server lease management, bootstrapping, and coordination between servers
»Rackspace – the Email & Apps team uses ZK to coordinate sharding, handle responsibility changes, and do distributed locking
»Yahoo! – ZK is used for a myriad of services inside Yahoo! for leader election, configuration management, cluster management, load balancing, sharding, locking, work queues, group membership, etc.
»Going to be included in the Cloudera Distribution of Hadoop
»The Hadoop NameNode is a candidate
»More in the pipeline

- 6 - Design Goals
»Reliability
»Availability
»Concurrency
»Performance
»Simplicity

- 7 - What does it look like?
[Diagram: the ZooKeeper service as an ensemble of servers, one of them the leader, with clients connected to the servers]
»Data is stored in-memory on all the servers
»A leader is elected at start-up
»Followers service clients; all updates go through the leader
»Update responses are sent once a majority of servers have persisted the change

- 8 - Basic Concepts
»Allows distributed processes to coordinate with each other through a shared hierarchical namespace, organized similarly to a standard file system
»The namespace consists of data registers called znodes
»Provides a very simple API and programming model
»The API is similar to that of a file system, but not identical
»Data is kept in memory – high throughput and low latency
»Provides strictly ordered updates and accesses
»Provides certain guarantees for the operations, on top of which higher-level concepts can be built
»Supports additional features such as change notification (watches), ephemeral nodes and conditional updates

- 9 - Data Model
[Example namespace tree from the slide: / with children such as services, users, apps, locks and servers; deeper nodes such as myservice, read-1, stupidname1 and stupidname2]
»Hierarchical data model, much like a standard distributed file system
»Nodes are known as znodes and are identified by a path
»A znode can have data associated with it, as well as children. Data is small – on the order of kilobytes
»znodes are versioned
»Data is read and written in its entirety
»znodes can be ephemeral – they exist only as long as the session that created them is active
»Watches can be set on znodes
»Automatic generation of (sequential) node names

ZooKeeper API
»String create (path, data, acl, flags)
»void delete (path, expectedVersion)
»Stat setData (path, data, expectedVersion)
»byte[] getData (path, watch)
»Stat exists (path, watch)
»String[] getChildren (path, watch)
»void sync (path)
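The signatures above correspond almost one-to-one to the Java client binding. The following is a minimal, hedged sketch of how they are used from Java; the connection string, znode path and payloads are illustrative assumptions rather than anything from the original slides.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // create(path, data, acl, flags): returns the path actually created
        String path = zk.create("/demo", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists(path, watch) returns metadata (Stat); getData(path, watch) returns the payload
        Stat stat = zk.exists(path, false);
        byte[] data = zk.getData(path, false, stat);

        // setData(path, data, expectedVersion): applied only if the version still matches
        zk.setData(path, "v2".getBytes(), stat.getVersion());

        // getChildren(path, watch): list the child znodes
        List<String> children = zk.getChildren("/", false);
        System.out.println(children + " " + new String(data));

        // delete(path, expectedVersion): -1 matches any version
        zk.delete(path, -1);
        zk.close();
    }
}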

ZooKeeper Session
»A ZK client establishes a connection to the ZK service using a language binding (Java, C, Perl, Python, REST)
»A list of servers is provided – the client retries until the connection is (re)established
»When a client gets a handle to the ZK service, ZK creates a ZK session, represented as a 64-bit number
»If the client reconnects to a different server within the session timeout, the session remains the same
»The session is kept alive by periodic PING requests from the client library
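A hedged Java sketch of establishing a session follows; the ensemble addresses and the session timeout are placeholders. The client library picks a server from the list and reconnects to another one on failure, keeping the same session id as long as it reconnects within the session timeout.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo {
    public static void main(String[] args) throws Exception {
        String ensemble = "zk1:2181,zk2:2181,zk3:2181";
        int sessionTimeoutMs = 30000;

        ZooKeeper zk = new ZooKeeper(ensemble, sessionTimeoutMs, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // The default watcher receives connection state changes
                System.out.println("Session state: " + event.getState());
            }
        });

        // The 64-bit session id assigned by the service
        System.out.println("Session id: " + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}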

Ephemeral Nodes, Watches
»Ephemeral nodes
› Present as long as the session that created them is active
› Cannot have child nodes
»Watches
› "Tell me when something changes", e.g. configuration data
› One-time trigger. Must be reset by the client if it is interested in future notifications
› Not a full-fledged notification system – more like a hint. The client should verify the state after receiving the watch event
› Ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event
› Default watcher – notified of state changes in the client (connection loss, session expiry, …)
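A sketch of the one-shot watch pattern described above, assuming an already-connected handle: read the znode and register a watch in the same call, and inside the handler verify the current state by reading again, which also re-registers the watch. The class name and path handling are illustrative.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchLoop {
    private final ZooKeeper zk;
    private final String path;

    WatchLoop(ZooKeeper zk, String path) { this.zk = zk; this.path = path; }

    void readAndWatch() throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        // Passing a Watcher registers a one-time trigger for this znode
        byte[] data = zk.getData(path, this::onChange, stat);
        System.out.println("version " + stat.getVersion() + ": " + new String(data));
    }

    private void onChange(WatchedEvent event) {
        // The event only says that something changed; verify the state by
        // reading again, which also sets the next watch.
        try {
            readAndWatch();
        } catch (KeeperException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}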

Guarantees
»Since its goal is to be a basis for the construction of more complicated services such as synchronization, ZooKeeper provides a set of guarantees
»Sequential Consistency – updates from a client will be applied in the order that they were sent
»Atomicity – updates either succeed or fail; no partial results
»Single System Image – a client will see the same view of the service regardless of the server it connects to
»Reliability – once an update has been applied, it will persist from that time forward until a client overwrites it
»Timeliness – the clients' view of the system is guaranteed to be up-to-date within a certain time bound

Performance

Use cases
»Use cases inside Yahoo!
› Leader Election
› Group Membership
› Work Queues
› Configuration Management
› Cluster Management
› Load Balancing
› Sharding
»Use cases in HBase
› Leader Election
› Configuration Management – store bootstrap information
› Group Membership – discover tablet servers and finalize tablet server death
› To be done: store schema information and ACLs

Example - Leader Election
[Diagram: the znode /services/myservice/leader, which contains the leader's name/details]
Leader election algorithm – when exactly one of N service providers has to be active:
1. getData ("/services/myservice/leader", true)
2. If successful, follow the leader described in the data and exit
3. create ("/services/myservice/leader", hostname, EPHEMERAL)
4. If successful, lead and exit
5. Go to step 1
Note: if you want M processes out of a set of N processes to be active, the algorithm can be modified to do so

Example – Configuration management
[Diagram: clients with watchers connected to the ZooKeeper service, which holds the configuration data]
»Configuration data is stored in znodes
»Clients set watches on them
»Clients are notified when the configuration is updated
»Each client resets the watch, reads the latest configuration and takes appropriate action
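A hedged Java sketch of this pattern; the znode path and the applyConfiguration hook are assumptions made for illustration.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher {
    private static final String CONFIG_PATH = "/myapp/config";  // illustrative path
    private final ZooKeeper zk;

    public ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    public void start() throws KeeperException, InterruptedException {
        loadConfig();
    }

    private void loadConfig() throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        // Read the configuration and (re)set a watch in the same call
        byte[] raw = zk.getData(CONFIG_PATH, event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    loadConfig();  // re-read and re-register the watch
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, stat);
        applyConfiguration(new String(raw));  // hypothetical hook into the application
    }

    private void applyConfiguration(String config) {
        System.out.println("New configuration: " + config);
    }
}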

Recipes - Locks
Lock – fully distributed locks that are globally synchronous, meaning at any snapshot in time no two clients think they hold the same lock.
General idea: lock = parent node, processes = children (using the SEQUENCE and EPHEMERAL flags)
lock () {
  1. id = create ("…/locks/x-", SEQUENCE|EPHEMERAL)
  2. getChildren ("…/locks/", false)
  3. if id is the first child, exit (got the lock)
  4. if (!exists (next lowest sequence number, true)) go to step 2
  5. else wait for notification, then go to step 2
}
release () {
  delete the node created in step 1
}
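A Java sketch of the same recipe, assuming an already-connected handle and an existing parent znode such as /locks/myresource: each contender creates an EPHEMERAL_SEQUENTIAL child, the lowest sequence number holds the lock, and each waiter watches only the node immediately below its own.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DistributedLock {
    private final ZooKeeper zk;
    private final String lockDir;   // e.g. "/locks/myresource" (illustrative)
    private String myNode;

    public DistributedLock(ZooKeeper zk, String lockDir) {
        this.zk = zk;
        this.lockDir = lockDir;
    }

    public void lock() throws Exception {
        // Step 1: create our contender node
        myNode = zk.create(lockDir + "/x-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = myNode.substring(lockDir.length() + 1);

        while (true) {
            // Step 2: list all contenders
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);

            // Step 3: the lowest sequence number owns the lock
            int myIndex = children.indexOf(myName);
            if (myIndex == 0) {
                return;
            }

            // Steps 4-5: watch only the node just below ours, then retry
            String previous = lockDir + "/" + children.get(myIndex - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(previous, event -> gone.countDown()) != null) {
                gone.await();
            }
        }
    }

    public void release() throws Exception {
        zk.delete(myNode, -1);  // the node is ephemeral, so a crash releases it too
    }
}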

Recipes - Barrier
Barrier – a primitive that enables a group of processes to synchronize the beginning and the end of a computation
General idea: barrier = parent node, participants = children
enter () {
  create new node for process p
  wait until child list size == barrier size
}
leave () {
  delete the node for process p
  wait until child list size == 0
}
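A minimal Java sketch of the barrier recipe, assuming a connected handle and an existing barrier znode; the participant name and barrier size are supplied by the caller and are illustrative.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Barrier {
    private final ZooKeeper zk;
    private final String barrierPath;  // e.g. "/barriers/compute" (illustrative)
    private final int size;
    private final String name;
    private final Object mutex = new Object();

    public Barrier(ZooKeeper zk, String barrierPath, int size, String name) {
        this.zk = zk;
        this.barrierPath = barrierPath;
        this.size = size;
        this.name = name;
    }

    public void enter() throws Exception {
        zk.create(barrierPath + "/" + name, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // Wait until every participant has registered
        waitForChildren(count -> count >= size);
    }

    public void leave() throws Exception {
        zk.delete(barrierPath + "/" + name, -1);
        // Wait until every participant has left
        waitForChildren(count -> count == 0);
    }

    private void waitForChildren(java.util.function.IntPredicate done) throws Exception {
        while (true) {
            synchronized (mutex) {
                // The watcher wakes us up whenever the child list changes
                List<String> children = zk.getChildren(barrierPath, event -> {
                    synchronized (mutex) { mutex.notifyAll(); }
                });
                if (done.test(children.size())) {
                    return;
                }
                mutex.wait();
            }
        }
    }
}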

Recipes – Producer Consumer Queues
Producer Consumer Queue – a distributed data structure that a group of processes use to generate and consume items. Producer processes create new elements and add them to the queue. Consumer processes remove elements from the list and process them.
General idea: queue = parent node, items = children (using the SEQUENCE flag)
produce (int i) {
  create new node with SEQUENCE flag and value i
}
int consume () {
  return the value of the child with the smallest counter
}
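A Java sketch of the queue recipe under the same assumptions (connected handle, existing queue znode); the consumer shown here returns null on an empty queue instead of blocking, which keeps the example short.

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleQueue {
    private final ZooKeeper zk;
    private final String queuePath;   // e.g. "/queues/tasks" (illustrative)

    public SimpleQueue(ZooKeeper zk, String queuePath) {
        this.zk = zk;
        this.queuePath = queuePath;
    }

    public void produce(int value) throws Exception {
        byte[] data = ByteBuffer.allocate(4).putInt(value).array();
        // SEQUENTIAL gives each element a monotonically increasing suffix
        zk.create(queuePath + "/item-", data,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    public Integer consume() throws Exception {
        while (true) {
            List<String> children = zk.getChildren(queuePath, false);
            if (children.isEmpty()) {
                return null;  // empty queue; a real consumer would watch and wait
            }
            Collections.sort(children);            // smallest counter first
            String head = queuePath + "/" + children.get(0);
            try {
                byte[] data = zk.getData(head, false, null);
                zk.delete(head, -1);               // claim the element
                return ByteBuffer.wrap(data).getInt();
            } catch (KeeperException.NoNodeException e) {
                // Another consumer took it first; try the next element
            }
        }
    }
}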

Leader Election in Perl
use Net::ZooKeeper qw(:node_flags :acls);

my $zkh = Net::ZooKeeper->new('localhost:7000');
my $req_path = "/app/leader";
my $stat = $zkh->stat();
my $watch = $zkh->watch();

my $data = $zkh->get($req_path, 'stat' => $stat, 'watch' => $watch);
if (defined $data) {
    # someone else is the leader
    # parse the data, which contains the leader's address
} else {
    my $path = $zkh->create($req_path, "hostname:info",
                            'flags' => ZOO_EPHEMERAL,
                            'acl'   => ZOO_OPEN_ACL_UNSAFE);
    if (defined $path) {
        # we are the leader
    } else {
        $data = $zkh->get($req_path, 'stat' => $stat, 'watch' => $watch);
        # someone else is the leader
        # parse the data, which contains the leader's address
    }
}

Essential Internals
› Leader + followers; 2f+1 nodes can tolerate the failure of f nodes
› Consistency model – a completely ordered history of updates; all updates go through the leader
› Replication – no single point of failure
› All replicas can accept requests
› If the leader fails, a new one is elected
› It is a system designed for few writes and many reads
› Consistency via consensus – well-known approaches are the Paxos algorithm, State Machine Replication, etc.
› These are notoriously difficult; SMR is very difficult if your application does not fit that model
› ZooKeeper uses the ZooKeeper Atomic Broadcast protocol (ZAB)
› ZAB is very similar to multi-Paxos, but the differences are real
› The implementation builds upon the FIFO property of TCP streams

Future Directions
»Support for WAN – cross-colo operation
»Improve the usability
»Partitioned ZooKeeper servers
»Performance enhancements

References
»Algorithms
› Paxos, multi-Paxos algorithms
› State Machine Replication model
› Atomic Broadcast
»Related projects
› Chubby lock service from Google