Trends in Scalable Storage System Design and Implementation March 15, 2012 Prof. Matthew O’Keefe Department of Electrical and Computer Engineering University of Minnesota

Organization of Talk History: work in the 1990s and 2000s Scalable File Systems ◦ Google File System ◦ Hadoop File System ◦ Sorrento, Ceph, Ward Swarms Comment on FLASH Questions…

Prior Work Part I: Generating Parallel Code Detecting parallelism in scientific codes, generating efficient parallel code ◦ Historically, this had been done on a loop-by-loop basis ◦ Distributed memory parallel computers required more aggressive optimization ◦ Parallel programming is still a lot like assembly language programming ◦ The scope of code to analyze and optimize increases as parallelism increases ◦ What’s needed is a way to express the problem solution at a much higher level, from which efficient code can be generated ◦ Leverage design patterns and translation technologies to reduce the semantic gap

Exploiting Problem Structure

Prior Work Part II: Making Storage Systems Go Faster and Scale More

Storage System Scalability/Speed Storage interface standards lacked the ability to scale in both speed and connectivity Industry countered this with a new standard: Fibre Channel Allowed shared disks, but system software like file systems and volume managers was not built to exploit this Started the Global File System project at the U. of Minnesota in 1995 to counter this Started Sistina Software to commercialize GFS and LVM, sold to Red Hat in 2003 Worked for Red Hat for 1.5 years, then (being a glutton for punishment) started another company (Alvarri) to do cloud backup in 2006 Alvarri sold to Quest Software in late 2008; its dedupe engine is part of Quest’s backup product So two commercial products developed and still shipping

Making Storage Systems Faster and More Scalable GFS pioneered several interesting techniques for cluster file systems: ◦ no central metadata server ◦ distributed journals for performance, fast recovery ◦ first Distributed Lock Manager for Linux — now used in other cluster projects in Linux Implemented POSIX IO ◦ Assumption at the time was: POSIX is all there is, have to implement that ◦ Kind of naïve, assumed that it had to be possible ◦ UNIX/Windows view of files as a linear stream of bytes which can be read/written anywhere in the file by multiple processors ◦ Large files, small files, millions of files, directory tree structure, synchronous write/read semantics, etc. all make POSIX difficult to implement

Why POSIX File Systems Are Hard They’re in the kernel and tightly integrated with complex kernel subsystems like virtual memory Byte-granularity, coherency, randomness Users expect them to be extremely fast, reliable, and resilient Add parallel clients and large storage networks (e.g., Lustre or Panasas) and things get even harder POSIX IO was the emphasis for parallel HPC IO (1999 through 2010); only recently has the HPC community begun re-thinking this Web/cloud has already moved on

Meanwhile: Google File System and its Clone (Hadoop) Google and others (Hadoop) went a different direction: change the interface from POSIX IO to something inherently more scalable Users have to write (re-write) applications to exploit the interface All about scalability — using commodity server hardware — for a specific kind of workload Hardware-software co-design: ◦ append-only write semantics from parallel producers ◦ mostly write-once, read many times by consumers ◦ explicit contract on performance expectations: small reads and writes — Fuggedaboutit! Obviously quite successful, and Hadoop is becoming something of an industry standard (at least the APIs) Lesson: if solving the problem is really, really hard, look at it a different way, move interfaces around, change your assumptions (e.g., as in the parallel programming problem)

Google/Hadoop File Systems Google needed a storage system for its web index, various applications — enormous scale ◦ GFS paper at the SOSP conference in 2003 led to the development of Hadoop, an open source GoogleFS clone Co-designed file system with applications Applications use map-reduce paradigm ◦ Streaming (complete) reads of very large files/datasets, process this data into reduced form (e.g., an index) ◦ File access is write-once, append-only, read-many

Map-Reduce cat * | grep | sort | uniq -c | cat > file input | map | shuffle | reduce | output Simple model for parallel processing Natural for: – Log processing – Web search indexing – Ad-hoc queries Popular at Facebook, Google, Amazon etc. to determine what ads/products to throw at you
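To make the pipeline analogy concrete, here is a minimal word-count sketch in plain Python that mirrors the map | shuffle | reduce stages (purely illustrative; a real Hadoop job would implement Mapper/Reducer classes in Java or feed scripts to Hadoop Streaming):

    from collections import defaultdict

    def map_phase(lines):
        # map: emit (word, 1) for every word in the input
        for line in lines:
            for word in line.split():
                yield word, 1

    def shuffle_phase(pairs):
        # shuffle: group all values by key, like sort in the shell pipeline
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(groups):
        # reduce: aggregate each group, like uniq -c
        for key, values in groups:
            yield key, sum(values)

    lines = ["the quick brown fox", "the lazy dog"]
    print(dict(reduce_phase(shuffle_phase(map_phase(lines)))))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}

The framework’s value is that it runs the map and reduce functions on thousands of nodes and handles the shuffle, scheduling, and failure recovery for you.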

Scalable File System Goals Build with commodity components that frequently fail (to keep things cheap) ◦ So the design assumes failure is the common case Nodes incrementally join and leave the cluster Scale to 10s to 100s of petabytes, headed towards exabytes; 1,000s to 10s of 10,000s of storage nodes and clients Automated administration, simplified recovery (in theory, not practice)
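A quick back-of-the-envelope calculation (with purely assumed numbers) shows why failure has to be the common case at this scale:

    # Illustrative arithmetic only; the MTBF figure is an assumption, not a measurement.
    nodes = 10_000
    mtbf_years_per_node = 3.0                       # assumed mean time between failures
    failures_per_day = nodes / (mtbf_years_per_node * 365)
    print(round(failures_per_day, 1))               # ~9.1 node failures every day

Even with generously reliable hardware, a cluster this size loses several nodes a day, so recovery and re-replication must be automatic background activity, not an operator event.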

Hadoop Slides Hadoop is open source Until recently missing these GoogleFS features: ◦ Snapshots ◦ HA for the NameNode ◦ Multiple writers (appenders) to a single file GoogleFS ◦ Within 12 to 18 months, GoogleFS was found to be inadequate for Google’s scaling needs; Google had to build federated GoogleFS clusters

Other Research on Scalable Storage Clusters (none in production yet) Ceph – Sage Weil, UCSC: POSIX-lite ◦ Multiple metadata servers, dynamic workload balancing ◦ Mathematical hash to map file segments to nodes Sorrento – UCSB: POSIX with low write-sharing ◦ Distributed algorithm for capacity and load balancing, distributed metadata ◦ Lazy consistency semantics Ward Swarms – Lee Ward, Sandia ◦ Similar to Sorrento, uses victim cache and storage tiering
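The “mathematical hash” idea is easy to sketch: hash the (file, segment) pair to pick storage nodes, so no central table of segment locations is needed. The snippet below is my own simplified illustration (plain modulo placement), not Ceph’s actual CRUSH algorithm, which also handles weighting and rebalancing as nodes come and go:

    import hashlib

    NODES = ["node%02d" % i for i in range(8)]       # hypothetical storage nodes

    def place_segment(file_id, segment_no, replicas=3):
        key = ("%s:%d" % (file_id, segment_no)).encode()
        h = int(hashlib.sha1(key).hexdigest(), 16)
        # walk the node list starting at the hashed position to pick replicas
        return [NODES[(h + i) % len(NODES)] for i in range(replicas)]

    print(place_segment("/data/webindex.dat", 0))    # every client computes the same answer
    print(place_segment("/data/webindex.dat", 1))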

Reliability Motivates Action DoE building exascale machines with 100,000+ separate multicore processors for simulations Storage systems for current teraflop and petaflop systems do not provide sufficient reliability or speed at scale Just adding more disks does not solve the problem; it requires a whole new design philosophy

Semantics Motivate Action IEEE POSIX I/O is nice and familiar ◦ Portable and guarantees coherent access and views ◦ Stable API ◦ But it cannot scale

Architecture Motivates Action Existing, accepted solutions all use the same architecture ◦ Based on “Zebra”, it is a centrally managed metadata service and collection of stores ◦ Examples: Lustre, Minnesota GFS, Panasas File System, IBM GPFS, etc. Existing at-scale file systems are challenged ◦ Managing all the simultaneous transfers is daunting ◦ Managing the number of components is daunting Anticipate multiple orders of magnitude growth with the move to exascale

Nebula: A Peer-To-Peer Inspired Architecture A self-organizing store ◦ Peers join the swarm to meet aggregate bandwidth needs A self-reorganizing store ◦ Highly redundant components supply symmetric services ◦ Failure of a component or comms link only motivates the search for an equivalent peer

Node-Local Capacity Management Since we want to make more replicas than required, for performance reasons ◦ We must not tolerate a concept of “full” ◦ We must be able to eject file segments Can’t do this in a file system; file systems are persistent stores Can do this in a cache! ◦ But scary things can happen ◦ If all are equal partners we can thrash, even to the point of deadlock ◦ And at some point, somewhere, we must guarantee persistence

Cache Tiers Need to order the storage nodes by capability ◦ Some combination of storage latency, storage bandwidth, how full, communication performance characteristics ◦ A tuple (which can be ordered) can approximate all of these With an ordering, we can apply concepts like better and worse When we can apply “better” and “worse” we can generate a set of rules ◦ That tell us when and whether we can eject a segment ◦ That tell us how we may eject a segment Effectively, we will impose a tier-like structure
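A minimal sketch (names are hypothetical) of the ordering described above, where a node’s capability is approximated by a tuple and lower latency, higher bandwidth, and more free space sort as “better”:

    from collections import namedtuple

    Node = namedtuple("Node", "name latency_ms bandwidth_mbs pct_full")

    def rank_key(node):
        # Smaller key sorts first, i.e. "better": fast, high-bandwidth, relatively empty
        return (node.latency_ms, -node.bandwidth_mbs, node.pct_full)

    nodes = [
        Node("ssd-01", 0.2, 900, 70),
        Node("disk-07", 8.0, 120, 35),
        Node("archive-02", 5000.0, 40, 10),
    ]

    for n in sorted(nodes, key=rank_key):
        print(n.name)    # "better" tiers print first: ssd-01, disk-07, archive-02

Any total order like this is enough to talk about “better” and “worse” tiers and to hang ejection rules off of.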

Think of it as a Sliding Scale “Worse” end: more reliable components, larger capacity, lower performance (maybe even offline); tends to less churn and older data “Better” end: less reliable (maybe even volatile), high performance; tends to very high churn and only current data

Victim-Caches With a tier-like concept we can implement victim caches ◦ Just a cache that another cache ejects into We introduce a couple of relatively simple rules to motivate all kinds of things ◦ When ejecting, if a copy must be maintained then eject into a victim cache that lies within a similar or “worse” tier (“worse” translates as more persistence, potentially higher latency and lower bandwidth) ◦ When an application is recruiting for performance, the best candidates are those from “better” tiers (“better” translates as more volatile, lower latency and higher bandwidth)
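A hedged sketch of those two rules (my own simplification, not the actual Nebula code), reusing the idea that tiers can be numbered from “best” (volatile, fast) to “worst” (persistent, slow):

    def eviction_candidates(nodes, tier_of, my_tier):
        # Rule 1: a segment whose copy must be maintained is ejected into a
        # victim cache in a similar or "worse" (more persistent) tier
        return [n for n in nodes if tier_of[n] >= my_tier]

    def recruitment_candidates(nodes, tier_of, my_tier):
        # Rule 2: when recruiting for performance, prefer "better" (more volatile,
        # lower latency, higher bandwidth) tiers
        return [n for n in nodes if tier_of[n] < my_tier]

    tier_of = {"ssd-01": 0, "disk-07": 1, "archive-02": 2}   # 0 = best, 2 = worst
    nodes = list(tier_of)
    print(eviction_candidates(nodes, tier_of, my_tier=1))    # ['disk-07', 'archive-02']
    print(recruitment_candidates(nodes, tier_of, my_tier=1)) # ['ssd-01']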

Data Movement: Writing The app writes into a highly volatile, local swarm ◦ Which is small and must eject to less performant, more persistent nodes ◦ Which in turn must eject to a low-performance, high-persistence source

Reading is similar First, find all the parts of the file Start requesting segments If not happy with the achieved bandwidth ◦ Find some nodes in a high (“best”) tier and ask them to stage parts of the file (since they will use the same sources as the requestor, they will tend to ask near tiers to stage as well, in order to provide them the bandwidth they require; rinse, wash, repeat throughout) ◦ Start requesting segments from them as well, redundantly ◦ As they ramp, source-bias logic will shift focus onto the recruited nodes
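A rough, self-contained sketch of that read path (my own simplification with made-up helper names, not Nebula’s client code): request each segment from the current source set, and when measured bandwidth falls short, recruit a “better”-tier node and bias future requests toward it:

    import random

    def achieved_bandwidth_mbs():
        return random.choice([200, 800])             # stand-in for a real measurement

    def read_file(segments, sources, better_tier, target_mbs=500):
        active = list(sources)
        for seg in segments:
            print("requesting", seg, "from", active)  # redundant requests are allowed
            if achieved_bandwidth_mbs() < target_mbs and better_tier:
                helper = better_tier.pop(0)           # recruit a fast node to stage data
                active.insert(0, helper)              # shift focus onto the recruited node
        return active

    read_file(["seg0", "seg1", "seg2"], ["disk-07"], ["ssd-01", "ssd-03"])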

Current Status of Project Primitive client and storage node prototype working now Looking for students (undergrad and grad students) — happy to co-advise with other faculty Ramping up… Using memcached and libmemcached (name-value store) as a development and research tool: supports HPC and web workloads Nebula storage server semantics are richer ◦ Question: how much does that help?
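For readers unfamiliar with memcached, the name-value interface being leaned on here is tiny. The snippet below speaks the memcached text protocol directly over a socket just to show the semantics (it assumes a memcached server on localhost:11211; the project itself uses libmemcached as the client library):

    import socket

    s = socket.create_connection(("localhost", 11211))
    value = b"segment-0-bytes"
    # set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    s.sendall(b"set file123:seg0 0 0 %d\r\n%s\r\n" % (len(value), value))
    print(s.recv(1024))      # b'STORED\r\n'
    s.sendall(b"get file123:seg0\r\n")
    print(s.recv(1024))      # b'VALUE file123:seg0 0 15\r\n...\r\nEND\r\n'
    s.close()

In the storage-research context the key would simply name a file segment (as in the hypothetical "file123:seg0" above); the protocol stays the same.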

Software-Hardware Co-Design of First-Tier Storage Node New hardware technologies (SSD, hybrid DRAM) are pushing the limits of the OS storage/networking stack Question: Is it possible to co-design custom hardware and software for a first-tier storage node? File system in VLSI: BlueArc NAS ◦ design the FPGA for a specific performance goal

Nibbler: Co-Design Hardware and Software Accelerates memcached and Nebula storage node performance via SSDs and hybrid DRAM ◦ Hardware-software co-design ◦ e.g., BlueArc/HDS puts the file system in VLSI Large, somewhat volatile memory Performance first, but also power and density The first-tier storage node will drive the ultimate performance achievable by this design

Top 10 Lessons in Software Development ◦ Be humble (seriously) ◦ Be paranoid: for large projects with aggressive performance or reliability goals, you have to do just about everything right ◦ Be open: to new ideas, new people, new approaches ◦ Make good senior people mentor the next generation (e.g., pair programming): the temptation is to reduce/remove their distractions ◦ Let the smartest, most capable people bubble up to the top of the organization ◦ Perfect is the enemy of the good ◦ Take upstream processes seriously, but beware the fuzzy front-end ◦ Upstream (formal) inspections, not just reviews ◦ For projects of any size, never abandon good SE practices, especially rules around coding styles, regularity, commenting, and documentation ◦ Never stop learning: software engineering practices and tools, new algorithms and libraries, etc. — and if in management, create an environment that lets people learn

Questions?