Curator: Self-Managing Storage for Enterprise Clusters


Curator: Self-Managing Storage for Enterprise Clusters
I. Cano*, S. Aiyar, V. Arora, M. Bhattacharyya, A. Chaganti, C. Cheah, B. Chun, K. Gupta, V. Khot and A. Krishnamurthy*
Nutanix Inc.  /  *University of Washington
NSDI '17

Enterprises have been increasingly relying on cluster storage systems
- Transparent access to storage
- Scale up storage by buying more resources

Cluster storage systems embody significant functionality to support the needs of enterprise clusters
- Automatic replication and recovery
- Seamless integration of SSDs and HDDs
- Snapshotting and reclamation of unnecessary data
- Space-saving transformations
- …

Much of their functionality can be performed in the background to keep the I/O path as simple as possible.
What is the appropriate systems support for engineering these background tasks?

Curator
- Framework and systems support for building background tasks
- Extensible, flexible and scalable framework
  - Borrow ideas from big data analytics but use them inside of a storage system
- Synchronization between background tasks and foreground I/O
  - Co-design background and foreground components
- Heterogeneity across and within clusters over time
  - Use Reinforcement Learning to deal with heterogeneity

Distributed Key-Value Store
- Metadata for the entire storage system is stored in the key-value store
- Foreground I/O and background tasks coordinate using the key-value store
- The key-value store supports replication and consistency using Paxos
[Diagram: the I/O SERVICE (foreground) and STORAGE MGMT (background) both sit atop the KEY-VALUE STORE]
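A minimal sketch of how foreground I/O and background tasks might coordinate through the shared metadata store. The `kv_store` interface (a `get` returning a value plus version, and `compare_and_swap`) is hypothetical; the slides only say that the store provides Paxos-based replication and consistency.

```python
# Hypothetical coordination through the replicated metadata store: both the
# foreground I/O service and background tasks read-modify-write entries,
# retrying when the other side updates a key first.

class CasConflict(Exception):
    """Raised when another writer updated the key first."""

def update_metadata(kv_store, key, mutate):
    """Read-modify-write a metadata entry, retrying on CAS conflicts."""
    while True:
        value, version = kv_store.get(key)      # current value + version stamp
        new_value = mutate(value)               # apply the caller's change
        try:
            kv_store.compare_and_swap(key, new_value, expected_version=version)
            return new_value
        except CasConflict:
            continue                            # another writer won the race; retry
```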

Data Structures and Metadata Maps
- Data is stored in units called extents
- Extents are grouped together and stored as extent groups on physical devices
[Diagram: a file's ranges 0-8K, 8-16K, 16-24K, 24-32K and 32-40K map onto extents 33, 44 and 56]

Extent Id Map (eid -> egid):
  33 -> 1984
  44 -> …
  56 -> …

Extent Group Id Map (egid -> physical location):
  1984 -> disk1, disk2
  …

- Multiple levels of redirection simplify data sharing across files and help minimize map updates
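For concreteness, a sketch of the two-level mapping using the identifiers shown on the slide; the `locate` helper is an illustrative assumption, and the egids for extents 44 and 56 are not given on the slide, so they are left out.

```python
# Sketch of the two-level metadata maps (illustrative values from the slide).
# Sharing an extent between two files only requires pointing both at the same
# eid; moving an extent group only requires updating the Extent Group Id Map.

extent_id_map = {          # eid  -> egid
    33: 1984,
    # 44: ..., 56: ...     (values not shown on the slide)
}

extent_group_id_map = {    # egid -> physical locations (replicas)
    1984: ["disk1", "disk2"],
}

def locate(eid):
    """Resolve an extent id to the disks holding its extent group."""
    egid = extent_id_map[eid]
    return egid, extent_group_id_map[egid]

print(locate(33))   # (1984, ['disk1', 'disk2'])
```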

MapReduce Framework
- Globally distributed maps processed using MapReduce
- System-wide knowledge of metadata used to perform various self-managing tasks
[Diagram: STORAGE MGMT uses MapReduce over the KEY-VALUE STORE, alongside the I/O SERVICE]

Example: Tiering
- Move cold data from fast (SSD) to slow storage (HDD, cloud)
- Identify cold data using a MapReduce job
  - Modified Time (mtime): Extent Group Id Map
  - Access Time (atime): Extent Group Id Access Map

Example: Tiering
- Is egid 120 "cold"? In the metadata ring, its mtime is owned by Node A and its atime by Node D
- Maps are globally distributed → not a local decision
- Use MapReduce to perform a "join"
[Diagram: four-node metadata ring; Curator runs as Master on Node B and as Slaves on Nodes A, C and D]

Example: Tiering
Map phase:
- Scan both metadata maps
- Emit egid -> mtime or atime
- Partition using egid
Reduce phase:
- Reduce based on egid
- Generate tuples (egid, mtime, atime)
- Sort locally and identify the cold egroups
[Diagram: Curator (A) emits 120 -> mtime and Curator (B) emits 120 -> atime; the reducer on Curator (C) joins them into (120, mtime, atime)]
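A rough sketch of this map/reduce join, assuming simplified in-memory scans of the two metadata maps; the coldness threshold and function names are assumptions, not the production Curator framework.

```python
import time

COLD_AFTER_SECS = 30 * 24 * 3600   # illustrative coldness threshold (assumption)

def map_phase(mtime_records, atime_records):
    """Scan both metadata maps and emit egid -> ('mtime'|'atime', timestamp)."""
    for egid, mtime in mtime_records:      # from the Extent Group Id Map
        yield egid, ("mtime", mtime)
    for egid, atime in atime_records:      # from the Extent Group Id Access Map
        yield egid, ("atime", atime)

def reduce_phase(emitted, now=None):
    """Join the emitted records on egid and return the egroups that look cold."""
    now = time.time() if now is None else now
    joined = {}
    for egid, (kind, value) in emitted:    # records are partitioned by egid
        joined.setdefault(egid, {})[kind] = value
    cold = []
    for egid, times in joined.items():
        last_touch = max(times.get("mtime", 0), times.get("atime", 0))
        if now - last_touch > COLD_AFTER_SECS:
            cold.append((egid, times.get("mtime"), times.get("atime")))
    return sorted(cold)
```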

Other tasks implemented using Curator
- Disk Failure
- Disk Balancing
- Fault Tolerance
- Garbage Collection
- Data Removal
- Compression
- Erasure Coding
- Deduplication
- Snapshot Tree Reduction

Challenges
- Extensible, flexible and scalable framework
- Synchronization between background tasks and foreground I/O
- Heterogeneity across and within clusters over time

Co-Design Background Tasks and Foreground I/O
- I/O Service provides an extended low-level API
- Storage Mgmt. only gives hints to the I/O Service to act on the data
- Background tasks are batched and throttled
- Soft updates on metadata to achieve lock-free execution
[Diagram: STORAGE MGMT sends Tasks to the I/O SERVICE; both share the KEY-VALUE STORE]
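A toy sketch of the batching and throttling idea: the background component enqueues hints and the I/O service drains them in small, rate-limited batches so foreground I/O keeps priority. All class, method and parameter names here are hypothetical, not the Curator API.

```python
import collections
import time

class HintQueue:
    """Hypothetical hint queue between Storage Mgmt. (producer) and the
    I/O service (consumer). Defaults are illustrative only."""

    def __init__(self, batch_size=32, max_batches_per_sec=4):
        self.pending = collections.deque()
        self.batch_size = batch_size
        self.min_interval = 1.0 / max_batches_per_sec

    def add_hint(self, hint):
        # e.g. ("migrate_cold", egid) or ("compress", egid)
        self.pending.append(hint)

    def drain(self, apply_hint):
        while self.pending:
            batch = [self.pending.popleft()
                     for _ in range(min(self.batch_size, len(self.pending)))]
            for hint in batch:
                apply_hint(hint)           # the I/O service decides how/whether to act
            time.sleep(self.min_interval)  # throttle between batches
```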

Evaluation
Clusters in-the-wild:
- ~50 clusters over a period of 2.5 months
Key results:
- Recovery: 95% of clusters have at most 0.1% of under-replicated data
- Garbage: 90% of clusters have at most 2% garbage
More results in the paper

Tiering
- Goal: maximize SSD effectiveness for both reads and writes
- Curator uses a threshold to achieve some "desired" SSD occupancy
- 50% of clusters have 75% SSD usage
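As a concrete illustration, this is roughly what such a threshold heuristic might look like; the occupancy target and function names are assumptions, not Curator's actual policy.

```python
# Hypothetical threshold heuristic: schedule the tiering task once SSD usage
# exceeds the desired occupancy.

SSD_OCCUPANCY_TARGET = 0.75   # illustrative target, an assumption

def maybe_schedule_tiering(ssd_used_bytes, ssd_capacity_bytes, schedule_task):
    usage = ssd_used_bytes / ssd_capacity_bytes
    if usage > SSD_OCCUPANCY_TARGET:
        schedule_task("tiering")    # down-migrate cold egroups to HDD/cloud
        return True
    return False
```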

Challenges
- Extensible, flexible and scalable framework
- Synchronization between background tasks and foreground I/O
- Heterogeneity across and within clusters over time

Machine Learning: Reinforcement Learning
- Heterogeneity across cluster deployments
  - Resources
  - Workloads
- Dynamic changes within individual clusters
  - Changing workloads
  - Different loads over time
- Threshold-based heuristics are sub-optimal
- ML to the rescue! Reinforcement Learning

Reinforcement Learning 101
[Diagram: the agent observes state s_t and reward r_t, takes action a_t, and the environment returns the next state s_{t+1} and reward r_{t+1}]
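A minimal tabular Q-learning loop illustrating this agent/environment interaction; the hyperparameters are arbitrary and this is not the learning algorithm Curator actually uses.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1       # arbitrary illustrative hyperparameters
Q = defaultdict(float)                      # (state, action) -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:           # explore occasionally
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def learn(state, action, reward, next_state, actions):
    """One Q-learning update after observing (s_t, a_t, r_{t+1}, s_{t+1})."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```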

Agent Deployment
How is the agent deployed?
- No prior knowledge: needs to learn from scratch
- Some prior knowledge: keeps on learning while deployed

Use Case: Tiering
- State = (cpu, memory, ssd, iops, riops, wiops)
- Actions = { run, not run }
- Reward = -latency
- Leverage threshold-based heuristics to "bootstrap" our agent
- Build a dataset from real traces
  - ~40 clusters in-the-wild
  - ~32K examples: State-Action-Reward tuples
- Pre-train two models, one for each action
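A sketch of this bootstrapping step: fit one reward model per action from State-Action-Reward traces collected under the threshold heuristic, then pick the action with the higher predicted reward. The use of scikit-learn linear regression and the exact feature encoding are assumptions for illustration, not the models from the paper.

```python
from sklearn.linear_model import LinearRegression

STATE_FIELDS = ["cpu", "memory", "ssd", "iops", "riops", "wiops"]
ACTIONS = ["run", "not_run"]

def pretrain(traces):
    """traces: list of (state_dict, action, reward) tuples, with reward = -latency.
    Fits one reward model per action (assumes both actions appear in the traces)."""
    models = {}
    for action in ACTIONS:
        X = [[s[f] for f in STATE_FIELDS] for s, a, _ in traces if a == action]
        y = [r for _, a, r in traces if a == action]
        models[action] = LinearRegression().fit(X, y)
    return models

def decide(models, state):
    """Pick the action whose model predicts the higher reward (i.e. lower latency)."""
    x = [[state[f] for f in STATE_FIELDS]]
    return max(ACTIONS, key=lambda a: models[a].predict(x)[0])
```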

Evaluation

Metric     | Avg Improvement
-----------|----------------
Latency    | 12%
SSD reads  | 16%

Conclusions
- Curator: framework and systems support for building background tasks
- Borrowed ideas from big data analytics but used them inside of a storage system
  - Distributed key-value store for metadata
  - MapReduce framework to process metadata
- Co-designed background tasks and foreground I/O
  - I/O Service provides an extended low-level API
  - Storage Mgmt. only gives hints to the I/O Service to act on actual data
- Used Reinforcement Learning to deal with heterogeneity across and within clusters over time
  - Results on Tiering showed up to ~20% latency improvements