ZHT
Tonglin Li

Acknowledgements
I'd like to thank Dr. Ioan Raicu for his support and guidance, and Raman Verma, Xi Duan, and Hui Jin for their help. This work was published in the HPDC/SIGMETRICS 2011 poster session.

Background
Data: files
Metadata: data about files
Distributed storage systems

State-of-the-art metadata management
Relational-DB-based metadata: heavyweight
Centralized metadata management: communication bottleneck, fragile, not scalable
Typical parallel file system: GPFS by IBM

Proposed work: a new DHT for metadata management
What is a DHT? Why a DHT?
Fully distributed: no centralized bottleneck
High performance: high aggregate I/O
Fault tolerant
But existing DHTs are not fast enough: slow, heavy, and high latency.

Related work: DHTs

Architecture   Topology                                                          Routing time (hops)
Chord          Ring                                                              log(N)
CAN            Virtual multidimensional Cartesian coordinate space on a multi-torus   O(d * n^(1/d))
Pastry         Hypercube                                                         O(log N)
Tapestry       Hypercube                                                         O(log_B N)
Cycloid        Cube-connected-cycle graph                                        O(d)
Kademlia       Ring                                                              log(N)
Memcached      Ring                                                              2
C-MPI          Ring                                                              log(N)
Dynamo         Ring                                                              0

Practical assumptions for HEC (high-end computing)
Reliable hardware
Fast network interconnects
Non-existent node "churn"
Batch oriented: steady amount of resources

Overview of design
Zero-hop: every node holds the full member list, so a key is resolved to its home server directly, with no intermediate routing hops.
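As a rough illustration of zero-hop addressing (not the actual ZHT code; the member list, SHA-1 hash, and modulo placement below are assumptions of this sketch), a client can map a key straight to its server:

```python
import hashlib

# Hypothetical static member list; in ZHT every node knows all members up front.
MEMBERS = ["node-%03d" % i for i in range(64)]

def home_server(key: str) -> str:
    """Map a key to its server in one step: hash the key and index the
    member list directly, so no intermediate routing hops are needed."""
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return MEMBERS[digest % len(MEMBERS)]

# A put or get then needs exactly one network round-trip, to home_server(key).
print(home_server("/fusionfs/metadata/file001"))
```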

Implementation: persistence
Database or key-value store?
Database (BerkeleyDB, MySQL): transactions, complex queries
Key-value store (Kyoto Cabinet, CouchDB, HBase): small, simple, fast, flexible
Log recording and playback: bootstrapping the system plays back all log records to reload the metadata.
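A minimal sketch of the log-record-and-playback idea, assuming a simple JSON-lines operation log; the file name, record format, and in-memory dict are illustrative, not ZHT's actual persistence layer:

```python
import json, os

LOG_PATH = "zht_oplog.jsonl"  # hypothetical append-only operation log

def record(op: str, key: str, value=None, log_path=LOG_PATH):
    """Append every mutating operation to the log before acknowledging it."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")

def bootstrap(log_path=LOG_PATH) -> dict:
    """On startup, play back all log records to rebuild the in-memory metadata."""
    store = {}
    if not os.path.exists(log_path):
        return store
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "insert":
                store[rec["key"]] = rec["value"]
            elif rec["op"] == "remove":
                store.pop(rec["key"], None)
    return store
```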

Implementation: failure handling
Insert
  If an attempt fails: send the record to the closest replica
  Mark this record as the primary copy
  Restore it to the original node when the system reboots
Lookup
  If an attempt fails: try the next replica, until all replicas have been tried
Remove
  Mark the record as removed (do not physically delete it)
  Restore state to the original node when the system reboots
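The retry policy above could look roughly like this sketch, where send() and the ordered replica list are hypothetical stand-ins for ZHT's network layer and replica placement:

```python
class SendError(Exception):
    """Raised by the (hypothetical) network layer when a node is unreachable."""
    pass

def insert(key, value, replicas, send):
    """Try the primary first; on failure, fall back to the closest replica
    and mark that copy as the (temporary) primary."""
    for node in replicas:
        try:
            send(node, ("insert", key, value, {"primary": True}))
            return node  # this node now holds the primary copy
        except SendError:
            continue     # node unreachable: try the next-closest replica
    raise SendError("all replicas failed for key %r" % key)

def lookup(key, replicas, send):
    """Try each replica in turn until one answers."""
    for node in replicas:
        try:
            return send(node, ("lookup", key))
        except SendError:
            continue
    raise SendError("no replica answered for key %r" % key)
```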

Membership management
Static member list: justified by reliable hardware and non-existent node "churn"; if a node quits, it never comes back
Consistent hashing: removing a node barely changes the hash mapping
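A compact consistent-hashing sketch (the MD5 hash and virtual-node count are illustrative choices, not ZHT's): removing a member only remaps the keys that member owned, leaving the rest of the mapping untouched.

```python
import bisect, hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, members, vnodes=100):
        # Place several virtual points per member on the ring to balance load.
        self._ring = sorted((_h(f"{m}#{i}"), m) for m in members for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from hash(key) to the first member point."""
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, member: str):
        """Dropping a member only reassigns the keys it used to own."""
        self._ring = [(h, m) for h, m in self._ring if m != member]
        self._keys = [h for h, _ in self._ring]
```

This is why a static member list pairs well with consistent hashing: departures are rare, and when they happen the hash map barely moves.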

Replication update
Server-side replication
Asynchronous updates
Sequential updates among replicas: P -> R1, R1 -> R2, R2 -> R3
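A sketch of the idea described on this slide, with send_to() and apply_locally() as hypothetical placeholders: the primary acknowledges the client immediately, and the replicas are updated one after another in the background.

```python
import threading

# Sketch of server-side, asynchronous, sequential replication (P -> R1 -> R2 -> R3).
# In ZHT each replica would forward to its successor; this single-process sketch
# models that ordering by walking the chain serially.

def replicate_chain(update, chain, send_to):
    """Push the update to R1, then R2, then R3, strictly in order."""
    for replica in chain:
        send_to(replica, update)

def handle_put(update, chain, apply_locally, send_to):
    """The primary applies the update and acks the client right away; the
    replica chain is updated asynchronously in a background thread."""
    apply_locally(update)
    threading.Thread(target=replicate_chain, args=(update, chain, send_to),
                     daemon=True).start()
    return "ack"  # the client does not wait for the replicas
```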

Performance evaluation
Hardware: SiCortex SC5832, 4 GB RAM per node, 5,832 cores
OS: CentOS 5.0 (Linux)
Batch execution system: SLURM

Throughput
Ideal throughput: T_i = (measured single-node throughput) x (number of nodes)
Measured aggregate throughput: T_a = sum of the measured throughputs of all nodes
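In symbols (the efficiency ratio E is added here for clarity and is not on the slide):

```latex
T_i = t_{\text{single}} \times N
\qquad
T_a = \sum_{k=1}^{N} t_k
\qquad
E = \frac{T_a}{T_i}
```

Here t_single is the throughput measured on one node in isolation, t_k is the throughput measured on node k when all N nodes run together, and E is the resulting scaling efficiency.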

Ideal vs. measured throughput

TCP vs. UDP

ZHT vs. C-MPI

Replication overhead

Replication overhead (continued)

Future work
Comprehensive fault tolerance
Dynamic membership management
More protocol support (MPI, ...)
Merge with FusionFS
Data-aware job scheduling
Many optimizations
Larger-scale evaluation (Blue Gene/P, etc.)

Conclusion
ZHT offers a good solution for a distributed key-value store. It is:
Lightweight: uses less than 10 MB of memory per node
Scalable: scales near-linearly up to 5,000 cores
Very fast: 100,000 operations per second
Low latency: about 10 ms
Wide range of uses: open source

Questions?