Smart Storage and Linux: An EMC Perspective - Ric Wheeler


Why Smart Storage?  Central control of critical data  One central resource to fail over in disaster planning  Banks, trading floors, and airlines want zero downtime  Smart storage is shared by all hosts & OSes  Amortize the costs of high availability and disaster planning over all of your hosts  Use different OSes for different jobs (UNIX for the web, IBM mainframes for data processing)  Zero-time “transfer” from host to host when both are connected  Enables cluster file systems

Data Center Storage Systems  Change the way you think of storage  Shared Connectivity Model  “Magic” Disks  Scales to new capacity  Storage that runs for years at a time  Symmetrix case study  Symmetrix 8000 Architecture  Symmetrix Applications  Data center class operating systems

Traditional Model of Connectivity  Direct Connect  Disk attached directly to host  Private - OS controls access and provides security  Storage I/O traffic only; a separate system is used to support network I/O (networking, web browsing, NFS, etc.)

Shared Models of Connectivity  VMS Cluster  Shared disk & partitions  Same OS on each node  Scales to dozens of nodes  IBM Mainframes  Shared disk & partitions  Same OS on each node  Handful of nodes  Network Disks  Shared disk/private partition  Same OS  Raw/block access via network  Handful of nodes

New Models of Connectivity  Every host in a data center could be connected to the same storage system  Heterogeneous OS & data format (CKD & FBA)  Management challenge: no central authority to provide access control [Diagram: shared storage connected to IRIX, DGUX, FreeBSD, MVS, VMS, Linux, Solaris, HPUX, and NT hosts]

Magic Disks  Instant copy  Devices, files or databases  Remote data mirroring  Metropolitan area  100’s of kilometers  1000’s of virtual disks  Dynamic load balancing  Behind-the-scenes backup  No host involved

Scalable Storage Systems  Current systems support  10’s of terabytes  Dozens of SCSI, fibre channel, ESCON channels per host  Highly available (years of run time)  Online code upgrades  Potentially 100’s of hosts connected to the same device  Support for chaining storage boxes together locally or remotely

Longevity  Data should be forever  Storage needs to overcome network failures, power failures, blizzards, asteroid strikes …  Some boxes have run for over 5 years without a reboot or halt of operations  Storage features  No single point of failure inside the box  At least 2 connections to a host  Online code upgrades and patches  Call home on error, ability to fix field problems without disruptions  Remote data mirroring for real disasters

Symmetrix Architecture  32 PowerPC 750-based “directors”  Up to 32 GB of central “cache” for user data  Support for SCSI, Fibre Channel, ESCON, …  384 drives (over 28 TB with 73 GB units)

Symmetrix Basic Architecture

Data Flow through a Symm

Read Performance

Prefetch is Key  Read hit gets RAM speed, read miss is spindle speed  What helps cached storage array performance?  Contiguous allocation of files (extent-based file systems) preserves the logical-to-physical mapping  Hints from the host could help prediction  What might hurt performance?  Clustering small, unrelated writes into contiguous blocks (foils prefetch on a later read of the data)  Truly random read I/Os
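To make the hit/miss gap concrete, here is a minimal sketch (Python, with made-up latency numbers and a simple sequential read-ahead policy, not the Symmetrix algorithm) of why contiguous reads stay near cache speed while truly random reads fall back to spindle speed.

import random

# Toy model of a cached storage array with sequential prefetch.
# Latencies are illustrative only, not Symmetrix measurements.
CACHE_HIT_US = 10        # assumed "RAM speed" service time (microseconds)
DISK_MISS_US = 8000      # assumed spindle-speed service time
PREFETCH_DEPTH = 8       # hypothetical read-ahead window (blocks)

def run(workload):
    cache, last, total_us = set(), None, 0
    for block in workload:
        total_us += CACHE_HIT_US if block in cache else DISK_MISS_US
        if last is not None and block == last + 1:
            # Sequential run detected: read ahead behind the scenes.
            cache.update(range(block + 1, block + 1 + PREFETCH_DEPTH))
        cache.add(block)
        last = block
    return total_us

random.seed(0)
sequential = list(range(1024))
scattered = random.sample(range(10**6), 1024)
print("sequential:", run(sequential) // 1000, "ms")   # mostly cache hits
print("random:    ", run(scattered) // 1000, "ms")    # mostly misses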

Symmetrix Applications  Instant copy  TimeFinder  Remote data copy  SRDF (Symmetrix Remote Data Facility)  Serverless Backup and Restore  Fastrax  Mainframe & UNIX data sharing  IFS

Business Continuance Problem  [Timeline: the “normal” daily operations cycle - online day, a BACKUP/DSS window from 2 am to 6 am (the “race to sunrise”), then resume online day; roughly 4 hours of data inaccessibility]

TimeFinder  Creation and control of a copy of any active application volume  Capability to allow the new copy to be used by another application or system  Continuous availability of production data during backups, decision support, batch queries, DW loading, Year 2000 testing, application testing, etc.  Ability to create multiple copies of a single application volume  Non-disruptive re-synchronization when the second application is complete [Diagram: production application volumes with business continuance volume (BCV) copies used for backups, decision support, data warehousing, and Euro conversion - a BCV is a copy of real production data]

Business Continuance Volumes  A Business Continuance Volume (BCV) is created and controlled at the logical volume level  Physical drive sizes can be different, logical size must be identical  Several ACTIVE copies of data at once per Symmetrix

Using TimeFinder  Establish BCV  Stop transactions to clear buffers  Split BCV  Start transactions  Execute against BCVs  Re-establish BCV [Diagram: production mirrors M1 and M2 with an attached BCV]

Re-Establishing a BCV Pair  BCV pair “PROD” and “BCV” have been split  Tracks on “PROD” updated after split  Tracks on “BCV” updated after split  Symmetrix keeps a table of these “invalid” tracks after the split  At re-establish of the BCV pair, “invalid” tracks are written from “PROD” to “BCV”  Synch complete

Restore a BCV Pair  BCV pair “PROD” and “BCV” have been split  Tracks on “PROD” updated after split  Tracks on “BCV” updated after split  Symmetrix keeps a table of these “invalid” tracks after the split  At restore of the BCV pair, “invalid” tracks are written from “BCV” to “PROD”  Synch complete
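Re-establish and restore are mirror images: both replay only the tracks marked invalid since the split, just in opposite directions. The following sketch (Python, hypothetical data structures, not EMC code) illustrates that bookkeeping.

# Hypothetical model of BCV split / re-establish / restore track bookkeeping.
class BCVPair:
    def __init__(self, tracks):
        self.prod = dict(tracks)          # track id -> data on PROD
        self.bcv = dict(tracks)           # identical copy at establish time
        self.invalid_prod = set()         # tracks changed on PROD since split
        self.invalid_bcv = set()          # tracks changed on BCV since split
        self.split = False

    def do_split(self):
        self.split = True

    def write_prod(self, track, data):
        self.prod[track] = data
        if self.split:
            self.invalid_prod.add(track)

    def write_bcv(self, track, data):
        self.bcv[track] = data
        if self.split:
            self.invalid_bcv.add(track)

    def reestablish(self):
        # Copy only the changed tracks PROD -> BCV (BCV-side changes are discarded).
        for t in self.invalid_prod | self.invalid_bcv:
            self.bcv[t] = self.prod[t]
        self._resync()

    def restore(self):
        # Copy only the changed tracks BCV -> PROD.
        for t in self.invalid_prod | self.invalid_bcv:
            self.prod[t] = self.bcv[t]
        self._resync()

    def _resync(self):
        self.invalid_prod.clear()
        self.invalid_bcv.clear()
        self.split = False

pair = BCVPair({0: "a", 1: "b", 2: "c"})
pair.do_split()
pair.write_prod(1, "b'")      # production keeps running
pair.write_bcv(2, "scratch")  # backup host scribbles on the BCV
pair.reestablish()            # BCV again matches production
print(pair.bcv)               # {0: 'a', 1: "b'", 2: 'c'}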

Make as Many Copies as Needed  Establish BCV 1  Split BCV 1  Establish BCV 2  Split BCV 2  Establish BCV 3 [Diagram: production mirrors M1/M2 with BCV 1, BCV 2, and BCV 3 taken at 4 PM, 5 PM, and 6 PM]

The Purpose of SRDF  Local data copies are not enough  Maximalist  Provide a remote copy of the data that will be as usable after a disaster as the primary copy would have been.  Minimalist  Provide a means for generating periodic physical backups of the data.

Synchronous Data Mirroring  Write is received from the host into the cache of the source  I/O is transmitted to the cache of the target  ACK is provided by the target back to the cache of the source  Ending status is presented to the host  Symmetrix systems destage writes to disk  Useful for disaster recovery

Semi-Synchronous Mirroring  An I/O write is received from the host/server into the cache of the source  Ending status is presented to the host/server.  I/O is transmitted to the cache of the target  ACK is sent by the target back to the cache of the source  Each Symmetrix system destages writes to disk  Useful for adaptive copy
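The essential difference between the two modes is where the host’s ending status sits relative to the remote transfer. A minimal sketch of just that ordering (Python; real SRDF has many more states and error paths):

# Simplified ordering of the steps in the two SRDF mirroring modes.
# This models only the sequencing, not real SRDF semantics.
class Box:
    def __init__(self, name):
        self.name, self.cache = name, []
    def stage(self, data):
        self.cache.append(data)
        print(f"{self.name}: cached {data!r}")

def synchronous_write(data, source, target):
    source.stage(data)                     # 1. write lands in source cache
    target.stage(data)                     # 2. I/O transmitted to target cache
    print("target -> source: ACK")         # 3. target acknowledges the source
    print("host: ending status")           # 4. host sees completion last
    # both boxes destage to disk in the background

def semi_synchronous_write(data, source, target):
    source.stage(data)                     # 1. write lands in source cache
    print("host: ending status")           # 2. host gets completion immediately
    target.stage(data)                     # 3. transfer to the target follows
    print("target -> source: ACK")         # 4. target acknowledges the source
    # lower host latency, but the remote copy briefly lags the primary

src, tgt = Box("source Symmetrix"), Box("target Symmetrix")
synchronous_write("write A", src, tgt)
semi_synchronous_write("write B", src, tgt)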

Backup / Restore of Big Data  Exploding amounts of data cause backups to run too long  How long does it take you to back up 1 TB of data?  Shrinking backup window and constant pressure for continuous application up-time  Avoid using the production environment for backup  No server CPU or I/O channels  No involvement of the regular network  Performance must scale to match customer’s growth  Heterogeneous host support

Fastrax Overview  [Diagram: UNIX and Linux hosts running Fastrax-enabled backup/restore applications and SYMAPI, Symmetrix volumes (STD1, STD2, BCV1, R1/R2, BCV2), a Fastrax data engine, and a SCSI-attached tape library, spanning Location 1 and Location 2 over Fibre Channel point-to-point link(s)]

Host to Tape Data Flow  [Diagram: data path among host, Symmetrix, Fastrax data engine, and tape library]

Fastrax Performance  Performance scales with the number of data movers in the Fastrax box & the number of tape devices  Restore runs as fast as backup  No performance impact on the host during restore or backup [Diagram: RAF, DM, SRDF, Fastrax]

Moving Data from Mainframes to UNIX

InfoMover File System  Transparent availability of MVS data to Unix hosts  MVS datasets available as native Unix files  Sharing a single copy of MVS datasets  Uses MVS security and locking  Standard MVS access methods for locking + security

IFS Implementation  [Diagram: a mainframe (IBM MVS/OS390) attached via ESCON or parallel channel and open systems hosts (IBM AIX, HP HP-UX, Sun Solaris) attached via FWD SCSI, Ultra SCSI, or Fibre Channel share the MVS data on a Symmetrix with ESP - minimal network overhead, no data transfer over the network]

Symmetrix APIs

Symmetrix API Overview  SYMAPI Core Library  Used by “Thin” and Full Clients  SYMAPI Mapping Library  SYMCLI Command Line Interface

Symmetrix API’s  SYMAPI are the high level functions  Used by EMC’s ISV partners (Oracle, Veritas, etc) and by EMC applications  SYMCLI is the “Command Line Interface” which invoke SYMAPI  Used by end customers and some ISV applications.

Basic Architecture  [Diagram: the Symmetrix Command Line Interpreter (SymCli) and other storage management applications sit on top of the Symmetrix Application Programming Interface (SymAPI)]  User access to the Solutions Enabler is via SymCli or a storage management application
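The layering can be pictured as a thin command-line front end calling into a shared library that talks to the array; ISV and EMC applications link the library directly, while end customers drive it from the CLI. The sketch below is purely illustrative - none of the function or command names are real SYMAPI/SYMCLI interfaces.

# Hypothetical illustration of the CLI-on-top-of-library layering.
# None of these names are real SYMAPI/SYMCLI calls.
class FakeStorageApi:
    """Stand-in for the shared library layer that talks to the array."""
    def list_devices(self):
        return ["dev001", "dev002"]
    def establish_pair(self, std, bcv):
        return f"established {std} <-> {bcv}"

def cli(argv, api=FakeStorageApi()):
    """Stand-in for the command-line layer: parse, call the library, print."""
    if argv[:1] == ["list"]:
        print("\n".join(api.list_devices()))
    elif argv[:1] == ["establish"] and len(argv) == 3:
        print(api.establish_pair(argv[1], argv[2]))
    else:
        print("usage: list | establish <std> <bcv>")

cli(["list"])
cli(["establish", "dev001", "dev002"])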

Client-Server Architecture  SymAPI server runs on the host computer connected to the Symmetrix storage controller  SymAPI client runs on one or more host computers [Diagram: a server host running the SymAPI library and SymAPI Server, a client host running storage management applications over the SymAPI Client, and a thin client host running storage management applications over the thin SymAPI Client]

SymmAPI Components  [Diagram: initialization, discover and update configuration, gatekeepers, TimeFinder functions, device groups, DeltaMark functions, SRDF functions, statistics, mapping functions, base controls, Calypso controls, Optimizer controls, InfoSharing]

Data Object Resolve  [Diagram: mapping chain from an RDBMS data file down through the file system, logical volume, and host physical device to Symmetrix device extents]

File System Mapping  File system mapping information includes:  File system attributes and host physical location  Directory attributes and contents  File attributes and host physical extent information, including inode information and fragment size [Diagram: i-nodes, directories, file extents]
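As a rough illustration of what a mapping answer looks like, the sketch below (Python, toy tables, not the SYMAPI mapping library) resolves a file into host-device extents along the chain shown in the Data Object Resolve slide.

# Hypothetical file-to-extent resolution, loosely mirroring the
# RDBMS file -> file system -> logical volume -> physical device chain.
from dataclasses import dataclass

@dataclass
class Extent:
    device: str        # host physical device the extent lives on
    start_block: int   # starting block on that device
    length: int        # number of contiguous blocks

# Toy metadata: file -> logical volume extents, LV -> physical device.
FILE_TABLE = {"/oradata/users01.dbf": [("lv_ora", 0, 4096), ("lv_ora", 8192, 2048)]}
LV_TABLE = {"lv_ora": ("/dev/sdc", 100000)}   # (device, LV offset on that device)

def resolve(path):
    """Walk the toy mapping tables and return physical extents for a file."""
    extents = []
    for lv, lv_block, length in FILE_TABLE[path]:
        device, lv_offset = LV_TABLE[lv]
        extents.append(Extent(device, lv_offset + lv_block, length))
    return extents

for ext in resolve("/oradata/users01.dbf"):
    print(f"{ext.device}: blocks {ext.start_block}..{ext.start_block + ext.length - 1}")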

Data Center Hosts

Solaris & Sun Starfire  Hardware  Up to 62 IO channels  64 CPUs  64 GB of RAM  60 TB of disk  Supports multiple domains  Starfire & Symmetrix  ~20% use more than 32 IO channels  Most use 4 to 8 IO channels per domain  Oracle instance usually above 1 TB

HPUX & HP 9000 Superdome  Hardware  192 IO channels  64 CPUs  128 GB RAM  1 PB of storage  Superdome and Symm  16 LUNs per target  Want us to support more than 4000 logical volumes!

Solaris and Fujitsu GP7000F M1000  Hardware  6-48 I/O slots  4-32 CPUs  Cross-bar switch  32 GB RAM  64-bit PCI bus  Up to 70 TB of storage

Solaris and Fujitsu GP7000F M2000  Hardware  I/O slots  CPUs  Cross-bar switch  256 GB RAM  64-bit PCI bus  Up to 70 TB of storage

AIX 5L & IBM RS/6000 SP  Hardware  Scale to 512 Nodes (over 8000 CPUs)  32 TB RAM  473 TB Internal Storage Capacity  High Speed Interconnect 1GB/sec per channel with SP Switch2  Partitioned Workloads  Thousands of IO Channels

IBM RS/6000 pSeries 680 AIX 5L  Hardware  24 CPUs, 64-bit RS64 IV at 600 MHz  96 GB RAM  GB internal storage capacity  53 PCI slots (33 32-bit / 20 64-bit)

Really Big Data  IBM (Sequent) NUMA  16 NUMA “Quads”, each with four 450 MHz CPUs, 2 GB of memory, and 4 x 100 MB/s FC-SW  Oracle with up to 42 TB (mirrored) DB  EMC Symmetrix  20 small Symm 4’s  2 medium Symm 4’s

Windows 2000 on IA32  Usually lots of small (1u or 2u) boxes share a Symmetrix  4 to 8 IO channels per box  Qualified up to 1 TB per meta volume (although usually deployed with ½ TB or less)  Management is a challenge  Will 2000 on IA64 handle big data better?

Linux Data Center Wish List

Lots of Devices  Customers can use hundreds of targets and LUNs (logical volumes)  128 SCSI devices per system is too few  Better naming system to track lots of disks  Persistence for “not ready” devices in the name space would help some of our features  devfs solves some of this  Rational naming scheme  Potential for tons of disk devices (need SCSI driver work as well)
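One way to picture a “rational naming scheme” is to key device names on a stable identity (a WWN or serial number) rather than on probe order, so a device that goes “not ready” and comes back keeps its name. A toy sketch, with hypothetical identifiers and no claim to match devfs or any real policy:

# Toy persistent-naming registry: names derive from a stable device
# identity (here a made-up WWN), not from SCSI probe order.
class DeviceRegistry:
    def __init__(self):
        self.by_wwn = {}                 # stable identity -> persistent name

    def attach(self, wwn):
        """Return the same name for the same device across rescans."""
        return self.by_wwn.setdefault(wwn, f"disk/by-id/{wwn}")

reg = DeviceRegistry()
print(reg.attach("wwn-0x60000000c0ffee01"))   # first scan
print(reg.attach("wwn-0x60000000c0ffee02"))
# The device goes "not ready" and later returns; same identity, same name.
print(reg.attach("wwn-0x60000000c0ffee01"))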

Support for Dynamic Data  What happens when the LV changes under a running file system? Adding new logical volumes?  Happens with TimeFinder, RDF, Fastrax  Requires remounting, reloading drivers, rebooting?  APIs can be used to give a “heads up” before events  Must be able to invalidate  Data, name and attribute caches for individual files or logical volumes  Support for dynamically loaded, layered drivers  Dynamic allocation of devices  Especially important for LUNs  Add & remove devices as the fibre channel fabric changes
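The “heads up” wish is essentially a notification hook: the storage side tells the OS that a volume is about to change underneath it, and the OS invalidates the relevant caches instead of forcing a remount or reboot. A hypothetical sketch of such an interface (Python; the event and callback names are invented, this is not a real Linux API):

# Hypothetical "heads up" interface between smart storage and the OS.
# Event and callback names are invented for illustration only.
class VolumeChangeNotifier:
    def __init__(self):
        self.handlers = {}               # event name -> list of callbacks

    def register(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def announce(self, event, volume):
        """Storage-side code calls this just before changing a volume."""
        for handler in self.handlers.get(event, []):
            handler(volume)

def drop_caches(volume):
    # Stand-in for invalidating data, name, and attribute caches for the volume.
    print(f"invalidating cached data/names/attributes for {volume}")

notifier = VolumeChangeNotifier()
notifier.register("volume-about-to-change", drop_caches)

# e.g. a TimeFinder restore or SRDF failover is about to rewrite the LV:
notifier.announce("volume-about-to-change", "/dev/sdq")   # device name is illustrative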

Keep it Open  Open source is good for us  We can fix it or support it if you don’t want to  No need to reverse engineer some closed-source FS/LVM  Leverage storage APIs  Add hooks to Linux file systems, LVMs, sys admin tools  Make Linux manageable  Good management tools are crucial in large data centers

New Technology Opportunities  Linux can explore new technologies faster than most  iSCSI  SCSI over TCP for remote data copy?  SCSI over TCP for host storage connection?  High speed/zero-copy TCP is important to storage here!  Infiniband  Initially targeted at PCI replacement  High speed, high performance cluster infrastructure for file systems, LVMs, etc.  Multi gigabits/sec (2.5 Gb/sec up to 30 Gb/sec)  Support for IB as a storage connection?  Cluster file systems

Linux at EMC  Full support for Linux in SymAPI, RDF, TimeFinder, etc.  Working with partners in the application space and the OS space to support Linux  Oracle Open World demo of Oracle on Linux with over 20 Symms (could reach 1 PB of storage!) [Diagram: EMC Symmetrix Enterprise Storage, EMC Connectrix Enterprise Fiber Channel Switch, centralized monitoring and management]

MOSIX and Linux Cluster File Systems

Our Problem: Code Builds  Over 70 OS developers  Each developer builds 15 variations of the OS  Each variation compiles over a million lines of code  Full build uses gigabytes of space, with 100k temporary files  User sandboxes stored in home directory over NFS  Full build took around 2 hours  2 users could build at once

Our Original Environment  Software  GNU tool chain  CVS for source control  Platform Computing’s Load Sharing Facility  Solaris on build nodes  Hardware  EMC NFS server (Celerra) with EMC Symmetrix back end  26 SUN Ultra-2 (dual 300 MHz CPU) boxes  FDDI ring used for interconnect

EMC's LSF Cluster

LSF Architecture  Distributed process scheduling and remote execution  No kernel modifications  Prefers to use static placement for load balancing  Applications need to link against a special library  License server controls cluster access  Master node in cluster  Manages load information  Makes scheduling decisions for all nodes  Uses modified GNU Make (lsmake)

MOSIX Architecture  Provide transparent, dynamic migration  Processes can migrate at any time  No user intervention required  Process thinks it is still running on its creation node  Dynamic load balancing  Use decentralized algorithm to continually level load in the cluster  Based on number of CPUs, speed of CPUs, RAM, etc.  Worked great for distributed builds in 1989

MOSIX Mechanism  Each process has a unique home node (UHN)  The UHN is the node of the process’s creation  Process appears to be running at its UHN  Invisible after migration to others on its new node  The UHN runs a deputy  Encapsulates system state for the migrated process  Acts as a proxy for some location-sensitive system calls after migration  Significant performance hit for IO over NFS, for example
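The deputy is essentially a remote-procedure proxy: location-sensitive system calls issued by the migrated body are shipped back to the home node and executed there, which is exactly where the NFS I/O penalty comes from. A toy sketch of that split (Python; the call names and the choice of which calls are location-sensitive are illustrative, not the actual MOSIX implementation):

# Toy model of the MOSIX home-node "deputy" proxying syscalls for a
# process whose body has migrated to another node.
LOCATION_SENSITIVE = {"open", "read_nfs_file", "gettimeofday"}   # illustrative set

class Deputy:
    """Runs on the unique home node (UHN); holds home-node state."""
    def handle(self, call, *args):
        print(f"[home node] executing {call}{args} on behalf of migrated process")
        return f"result of {call}"

class MigratedProcess:
    """Runs on the remote node after migration."""
    def __init__(self, deputy):
        self.deputy = deputy
    def syscall(self, call, *args):
        if call in LOCATION_SENSITIVE:
            # Ship the call home; this round trip is the NFS I/O penalty.
            return self.deputy.handle(call, *args)
        print(f"[remote node] executing {call}{args} locally")
        return f"result of {call}"

proc = MigratedProcess(Deputy())
proc.syscall("compute_chunk", 42)                       # stays on the remote node
proc.syscall("read_nfs_file", "/home/dev/build.log")    # goes back to the UHN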

MOSIX Migration  [Diagram: user level and kernel on the home and remote nodes, with the local process and its deputy on the home node, the migrated (remote) part on the other node, the link layer between them, and NFS]

MOSIX Enhancements  MOSIX added static placement and remote execution  Leverage the load balancing infrastructure for placement decisions  Avoid creation of deputies  Lock remotely spawned processes down just in case  Fix several NFS caching related bugs  Modify some of our makefile rules

MOSIX Remote Execution  [Diagram: same two-node layout as MOSIX migration - user level and kernel on each node, local process, deputy, remote process, link layer, and NFS]

EMC MOSIX cluster  EMC’s original MOSIX cluster  Compute nodes changed from LSF to MOSIX  Network changed from FDDI to 100 megabit ethernet.  The MOSIX cluster immediately moved the bottleneck from the cluster to the network and I/O systems.  Performance was great, but we can do better!

Latest Hardware Changes  Network upgrades  New switch deployed  Nodes to switch use 100 megabit ethernet  Switch to NFS server uses gigabit ethernet  NFS upgrades  50 gigabyte, striped file systems per user (compared to 9 gigabyte non-striped file systems)  Fast/wide differential SCSI between server and storage  Cluster upgrades  Added 28 more compute nodes  Added 4 “submittal” nodes

EMC MOSIX Cluster  [Diagram: cluster nodes connected to the NFS server over Gigabit Ethernet, with SCSI between the server and storage]

Performance  Running Red Hat 6.0 with kernel (MOSIX and NFS patches applied)  Builds are now around minutes (down from hours)  Over 35 concurrent builds at once

Build Submissions

Cluster File System & MOSIX  [Diagram: cluster nodes attached to storage over Fibre Channel through a Connectrix switch]

DFSA Overview  DFSA provides the structure to allow migrated processes to always do local IO  MFS (MOSIX File System) created  No caching per node, write through  Serverless - all nodes can export/import files  Prototype for DFSA testing  Works like non-caching NFS

DFSA Requirements  One active inode/buffer in the cluster for each file  Time-stamps are cluster-wide, increasing  Some new FS operations  Identify: encapsulate dentry info  Compare: are two files the same?  Create: produce a new file from SB/ID info  Some new inode operations  Checkpath: verify path to file is unique  Dotdot: give true parent directory
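To make that list concrete, here is a rough sketch of the shape such an interface could take (Python standing in as pseudocode for the operation signatures; the real DFSA hooks are kernel C and may differ):

# Rough sketch of the extra operations a DFSA-capable file system exposes.
# Signatures are illustrative, not the real kernel interface.
class DFSAFileSystemOperations:
    def identify(self, dentry):
        """Encapsulate enough dentry info to name this file cluster-wide."""
        raise NotImplementedError

    def compare(self, ident_a, ident_b):
        """Are two identities actually the same file?"""
        raise NotImplementedError

    def create(self, superblock, ident):
        """Produce a usable file object on this node from SB/ID info."""
        raise NotImplementedError

class DFSAInodeOperations:
    def checkpath(self, inode):
        """Verify the path to the file is unique (no aliasing)."""
        raise NotImplementedError

    def dotdot(self, inode):
        """Return the true parent directory of this inode."""
        raise NotImplementedError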

Information  MOSIX:   GFS:   Migration Information  Process Migration, Milojicic, et al. To appear in ACM Computing Surveys,  Mobility: Processes, Computers and Agents, Milojicic, Douglis and Wheeler, ACM Press.