Using Simulation to Explore Distributed Key-Value Stores for Extreme-Scale System Services

Extreme scale
Lack of decomposition for insight
Many services have centralized designs
Impacts of service architectures → an open question

Modular component design for composable services
Explore the design space for HPC services
Evaluate the impacts of different design choices

A taxonomy for classifying HPC system services
A simulation tool to explore Distributed Key-Value Store (KVS) design choices for large-scale system services
An evaluation of KVS design choices for extreme-scale systems using both synthetic and real workload traces

Introduction & Motivation
Key-Value Store Taxonomy
Key-Value Store Simulation
Evaluation
Conclusions & Future Work

Introduction & Motivation
Key-Value Store Taxonomy
Key-Value Store Simulation
Evaluation
Conclusions & Future Work

Job Launch, Resource Management Systems
System Monitoring
I/O Forwarding, File Systems
Function Call Shipping
Key-Value Stores

Scalability
Dynamicity
Fault Tolerance
Consistency

Large volume of data and state information
Distributed NoSQL data stores used as building blocks
Examples:
– Resource management (job, node status info)
– Monitoring (system activity logs)
– File systems (metadata)
– SLURM++, MATRIX [1], FusionFS [2]

[1] K. Wang, I. Raicu. “Paving the Road to Exascale through Many Task Computing”, Doctor Showcase, IEEE/ACM Supercomputing 2012 (SC12)
[2] D. Zhao, I. Raicu. “Distributed File Systems for Exascale Computing”, Doctor Showcase, IEEE/ACM Supercomputing 2012 (SC12)

Introduction & Motivation
Key-Value Store Taxonomy
Key-Value Store Simulation
Evaluation
Conclusions & Future Work

Decomposition
Categorization
Suggestion
Implication

Service model: functionality
Data model: distribution and management of data
Network model: dictates how the components are connected
Recovery model: how to deal with component failures
Consistency model: how rapidly data modifications propagate
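As a rough illustration, the taxonomy can be captured as a small data structure used to tag a service design before simulating it. The sketch below is a minimal Python encoding; the enum members mirror the categories on this and the following slides, while the class, field names, and the ZHT-like example are assumptions for illustration, not the paper's code.

```python
# Hedged sketch: the taxonomy as a data structure. The enum members mirror the
# categories named on this and the following slides; the class/field names and
# the ZHT-like example are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from enum import Enum

class DataModel(Enum):
    CENTRALIZED = "centralized"
    DISTRIBUTED_PARTITIONED = "distributed with partition"

class NetworkModel(Enum):
    AGGREGATION_TREE = "aggregation tree"
    FULLY_CONNECTED = "fully connected"
    PARTIALLY_CONNECTED = "partially connected (partial knowledge)"

class RecoveryModel(Enum):
    FAIL_OVER = "fail-over"
    CONSECUTIVE_REPLICAS = "consecutive replicas"

class ConsistencyModel(Enum):
    STRONG = "strong"
    EVENTUAL = "eventual"

@dataclass
class ServiceDesign:
    name: str
    service: str                     # service model: free-form functionality
    data: DataModel
    network: NetworkModel
    recovery: RecoveryModel
    consistency: ConsistencyModel

# Example instance, roughly matching the comparison table two slides below.
zht_like = ServiceDesign("ZHT-like KVS", "distributed hash table",
                         DataModel.DISTRIBUTED_PARTITIONED,
                         NetworkModel.FULLY_CONNECTED,
                         RecoveryModel.CONSECUTIVE_REPLICAS,
                         ConsistencyModel.EVENTUAL)
```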

Data model: centralized
Network model: aggregation tree
Recovery model: fail-over
Consistency model: strong

Data model: distributed with partition
Network model: fully-connected, partial knowledge
Recovery model: consecutive replicas
Consistency model: strong, eventual

              Voldemort           Pastry                ZHT
Data          distributed         distributed           distributed
Network       fully-connected     partially-connected   fully-connected
Recovery      n-way replication   n-way replication     n-way replication
Consistency   eventual            strong                eventual

Introduction & Motivation
Key-Value Store Taxonomy
Key-Value Store Simulation
Evaluation
Conclusions & Future Work

Discrete event simulation
– PeerSim
– Evaluated others: OMNeT++, OverSim, SimPy
Configurable number of servers and clients
Different architectures
Two parallel queues in a server (a toy sketch follows below)
– Communication queue (send/receive requests)
– Processing queue (process request locally)
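A minimal sketch of the two-queue server model described above, assuming fixed per-message and per-request service times; it illustrates the queueing idea only and is not the actual PeerSim-based simulator.

```python
# Toy discrete-event sketch of one KVS server with a communication queue and a
# processing queue, loosely following the model on this slide. Service times
# and the arrival pattern are assumed values, not the PeerSim implementation.
import heapq

COMM_TIME = 0.2   # assumed cost to send or receive one message (ms)
PROC_TIME = 0.5   # assumed cost to process one request locally (ms)

class Server:
    def __init__(self):
        self.comm_free_at = 0.0   # when the communication queue is next idle
        self.proc_free_at = 0.0   # when the processing queue is next idle

    def handle(self, arrival):
        # Receive: the request waits for the communication queue, then is read.
        recv_done = max(arrival, self.comm_free_at) + COMM_TIME
        self.comm_free_at = recv_done
        # Process: the request waits for the processing queue, then is served.
        proc_done = max(recv_done, self.proc_free_at) + PROC_TIME
        self.proc_free_at = proc_done
        # Reply: the response goes back through the communication queue.
        reply_done = max(proc_done, self.comm_free_at) + COMM_TIME
        self.comm_free_at = reply_done
        return reply_done

def simulate(arrivals):
    """Return the latency of every request handled by a single server."""
    server, events, latencies = Server(), list(arrivals), []
    heapq.heapify(events)                   # event list ordered by arrival time
    while events:
        t = heapq.heappop(events)           # next request arrival
        latencies.append(server.handle(t) - t)
    return latencies

if __name__ == "__main__":
    # 10 clients, each issuing one request, arrivals 0.1 ms apart.
    print(simulate(i * 0.1 for i in range(10)))
```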

The time to resolve a query locally (t_LR) and the time to resolve a remote query (t_RR) are given by:

t_LR = CS + SR + LP + SS + CR

For fully connected:
t_RR = t_LR + 2 × (SS + SR)

For partially connected:
t_RR = t_LR + 2k × (SS + SR)

where k is the number of hops to find the predecessor
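The latency model above fits in a few lines of code. In the sketch below, the interpretation of the terms (CS = client send, SR = server receive, LP = local processing, SS = server send, CR = client receive) and the log2(N) hop count for the partially connected overlay are assumptions for illustration, not values taken from the paper.

```python
# Sketch of the per-query latency model on this slide; term meanings and
# numeric values are assumptions, not the paper's measured parameters.
import math

def t_local(cs, sr, lp, ss, cr):
    """t_LR = CS + SR + LP + SS + CR"""
    return cs + sr + lp + ss + cr

def t_remote_fully_connected(t_lr, ss, sr):
    """t_RR = t_LR + 2 * (SS + SR): one extra server-to-server round trip."""
    return t_lr + 2 * (ss + sr)

def t_remote_partially_connected(t_lr, ss, sr, k):
    """t_RR = t_LR + 2k * (SS + SR): k routing hops to find the predecessor."""
    return t_lr + 2 * k * (ss + sr)

if __name__ == "__main__":
    cs = sr = ss = cr = 0.2   # assumed per-message costs (ms)
    lp = 0.5                  # assumed local processing cost (ms)
    t_lr = t_local(cs, sr, lp, ss, cr)
    n = 1 << 20               # 1M servers
    k = math.log2(n)          # assumed Chord-like O(log N) routing
    print(t_lr,
          t_remote_fully_connected(t_lr, ss, sr),
          t_remote_partially_connected(t_lr, ss, sr, k))
```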

Defines what to do when a node fails
How a node's state recovers when it rejoins after a failure

[Figures: a six-server ring (s0–s5), each server also holding the first replica of its predecessor's data and the second replica of the server before that; on failure, the EM notifies the servers, which re-replicate s0's data as its first and second replicas go down; when s0 rejoins, it gets its data back and the now-redundant replicas are removed]
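A hedged sketch of the consecutive-replica idea: each server's data also lives on its next successors on the ring, and requests for a failed server's keys fail over to the first live successor. The server names, replica count, and hashing below are illustrative, not the simulator's recovery code.

```python
# Illustration of consecutive replicas on a ring with simple fail-over.
import hashlib

SERVERS = [f"s{i}" for i in range(6)]   # s0..s5, as in the slide's figure
N_REPLICAS = 3                          # primary + two consecutive replicas
failed = set()

def ring_position(key: str) -> int:
    """Hash a key onto the ring of servers."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % len(SERVERS)

def replica_set(key: str):
    """Primary plus its consecutive successors on the ring."""
    start = ring_position(key)
    return [SERVERS[(start + i) % len(SERVERS)] for i in range(N_REPLICAS)]

def route(key: str) -> str:
    """Return the first live replica responsible for the key."""
    for server in replica_set(key):
        if server not in failed:
            return server
    raise RuntimeError("all replicas of this key have failed")

if __name__ == "__main__":
    print(replica_set("job-1234"), "->", route("job-1234"))
    failed.add(route("job-1234"))        # the primary goes down
    print("after failure ->", route("job-1234"))
```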

Strong Consistency
– Every replica observes every update in the same order
– Clients send requests to a dedicated server (the primary replica)
Eventual Consistency
– Requests are sent to a randomly chosen replica (the coordinator)
– Three key parameters: N, R, W, satisfying R + W > N (a quorum sketch follows below)
– Uses Dynamo [G. DeCandia, 2007] version clocks to track different versions of data and detect conflicts
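A minimal sketch of N/R/W quorum reads and writes under eventual consistency. A global counter stands in for Dynamo's version clocks, and the replica selection and conflict handling are simplified assumptions, not the simulator's logic.

```python
# Sketch of N/R/W quorum reads and writes; simplified for illustration.
import random

N, R, W = 3, 2, 2
assert R + W > N            # read and write quorums are guaranteed to overlap

replicas = [dict() for _ in range(N)]   # each replica maps key -> (version, value)
clock = 0                               # stand-in for per-key version clocks

def put(key, value):
    """Write through a coordinator to any W replicas."""
    global clock
    clock += 1
    for i in random.sample(range(N), W):
        replicas[i][key] = (clock, value)

def get(key):
    """Read from R replicas and return the value with the newest version."""
    answers = [replicas[i].get(key, (0, None)) for i in random.sample(range(N), R)]
    return max(answers)[1]   # a real store would surface conflicting versions

if __name__ == "__main__":
    put("node-17/status", "idle")
    put("node-17/status", "busy")
    print(get("node-17/status"))   # "busy": the read quorum must overlap the latest write
```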

Introduction & Motivation
Key-Value Store Taxonomy
Key-Value Store Simulation
Evaluation
Conclusions & Future Work

Evaluate the overheads
– Different architectures, focusing on the distributed ones
– Different models
Light-weight simulations:
– Largest experiments: 25 GB RAM, 40 min walltime
Workloads (a toy generator is sketched below)
– Synthetic workload with a 64-bit key space
– Real workload traces from 3 representative system services: job launch, system monitoring, and I/O forwarding
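For concreteness, a toy synthetic-workload generator in the spirit of this setup: uniformly random 64-bit keys with a configurable get/put mix. The mix, request count, and record format are assumptions, not the actual traces or generator used in the paper.

```python
# Toy synthetic-workload generator; parameters are illustrative assumptions.
import random

def synthetic_workload(num_requests, put_fraction=0.5, seed=0):
    rng = random.Random(seed)
    for _ in range(num_requests):
        key = rng.getrandbits(64)                       # uniform over the 64-bit key space
        op = "put" if rng.random() < put_fraction else "get"
        yield op, key

if __name__ == "__main__":
    for op, key in synthetic_workload(5):
        print(op, hex(key))
```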

Validate against ZHT [1] (left) and Voldemort (right)
– ZHT: BG/P up to 8K nodes (32K cores)
– Voldemort: PRObE Kodiak cluster up to 800 nodes

[1] T. Li, X. Zhou, K. Brandstatter, D. Zhao, K. Wang, A. Rajendran, Z. Zhang, I. Raicu. “ZHT: A Light-weight Reliable Persistent Dynamic Scalable Zero-hop Distributed Hash Table”, IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2013

Partial connectivity → higher latency due to the additional routing
Fully-connected topology → faster response (twice as fast at extreme scale)

Adding replicas always involves overhead
Replicas have a larger impact on fully-connected than on partially-connected topologies

Higher failure frequency introduces more overhead, but the dominant factor remains the client request-processing messages

Eventual consistency has more overhead than strong consistency

[Plots: fully connected (left), partially connected (right)]
For job launch and I/O forwarding, eventual consistency performs worse → both the request type and the key are almost uniformly random (URD)
For monitoring, eventual consistency works better → all requests are “put”

ZHT (distributed key/value storage)
– DKVS implementation
MATRIX (runtime system)
– DKVS is used to keep task metadata
SLURM++ (job management system)
– DKVS is used to store task & resource information
FusionFS (distributed file system)
– DKVS is used to maintain file/directory metadata

Introduction & Motivation
Key-Value Store Taxonomy
Key-Value Store Simulation
Evaluation
Conclusions & Future Work

A taxonomy for classifying HPC system services
A simulation tool to explore KVS design choices for large-scale system services
An evaluation of KVS design choices for extreme-scale systems using both synthetic and real workload traces

Key-value stores are a building block
A service taxonomy is important
A simulation framework to study services
Distributed architectures are needed
Replication adds overhead
Fully-connected topology is good
– As long as the request-processing message dominates
Consistency tradeoffs
– [Diagram: a consistency spectrum from weak to eventual to strong, trading write intensity/availability against read intensity/performance]

Extend the simulator to cover more of the taxonomy
Explore other recovery models
– Log-based
– Information dispersal algorithms
Explore other consistency models
Explore using DKVS in the development of:
– A general building-block library
– A distributed monitoring system service
– A distributed message queue system

DOE contract: DE-FC02-06ER25750
Part of NSF award: CNS (PRObE)
Collaboration with the FusionFS project under NSF grant: NSF
BG/P resource from ANL
Thanks to Tonglin Li, Dongfang Zhao, Hakan Akkan

More information:
Contact:
Questions?

Service simulation
– Peer-to-peer network simulation
– Telephony simulations
– Simulation of consistency
– Problem: they do not focus on HPC or do not combine the distributed features
Taxonomy
– Investigation of distributed hash tables, and an algorithm taxonomy
– Grid computing workflow taxonomies
– Problem: none of them drive features in a simulation