1 Smart Storage and Linux: An EMC Perspective. Ric Wheeler (ric@emc.com)

2 Why Smart Storage?
- Central control of critical data
  - One central resource to fail over in disaster planning
  - Banks, trading floors, and airlines want zero downtime
- Smart storage is shared by all hosts and OSes
  - Amortizes the cost of high availability and disaster planning across all of your hosts
  - Use different OSes for different jobs (UNIX for the web, IBM mainframes for data processing)
- Zero-time "transfer" from host to host when both are connected
  - Enables cluster file systems

3 Data Center Storage Systems
- Change the way you think of storage
  - Shared connectivity model
  - "Magic" disks
  - Scales to new capacities
  - Storage that runs for years at a time
- Symmetrix case study
  - Symmetrix 8000 architecture
  - Symmetrix applications
- Data center class operating systems

4 Traditional Model of Connectivity
- Direct connect
  - Disk attached directly to the host
  - Private: the OS controls access and provides security
  - Storage I/O traffic only; a separate system is used to support network I/O (networking, web browsing, NFS, etc.)

5 Shared Models of Connectivity
- VMS cluster
  - Shared disks and partitions
  - Same OS on each node
  - Scales to dozens of nodes
- IBM mainframes
  - Shared disks and partitions
  - Same OS on each node
  - Handful of nodes
- Network disks
  - Shared disk, private partitions
  - Same OS
  - Raw/block access over the network
  - Handful of nodes

6 New Models of Connectivity
- Every host in a data center could be connected to the same storage system
- Heterogeneous OSes and data formats (CKD and FBA)
- Management challenge: no central authority to provide access control
- (Diagram: shared storage connected to IRIX, DGUX, FreeBSD, MVS, VMS, Linux, Solaris, HPUX, and NT hosts)

7 Magic Disks
- Instant copy of devices, files, or databases
- Remote data mirroring across a metropolitan area (hundreds of kilometers)
- Thousands of virtual disks
- Dynamic load balancing
- Behind-the-scenes backup with no host involved

8 Scalable Storage Systems
- Current systems support:
  - Tens of terabytes
  - Dozens of SCSI, Fibre Channel, and ESCON channels per host
  - High availability (years of run time)
  - Online code upgrades
- Potentially hundreds of hosts connected to the same device
- Support for chaining storage boxes together locally or remotely

9 Longevity
- Data should be forever
- Storage needs to survive network failures, power failures, blizzards, asteroid strikes, ...
- Some boxes have run for over 5 years without a reboot or halt of operations
- Storage features:
  - No single point of failure inside the box
  - At least two connections to a host
  - Online code upgrades and patches
  - Call home on error; ability to fix field problems without disruption
  - Remote data mirroring for real disasters

10 Symmetrix Architecture
- 32 PowerPC 750-based "directors"
- Up to 32 GB of central "cache" for user data
- Support for SCSI, Fibre Channel, ESCON, ...
- 384 drives (over 28 TB with 73 GB units)

11 Symmetrix Basic Architecture

12 Data Flow through a Symm

13 Read Performance

14 Prefetch is Key
- A read hit runs at RAM speed; a read miss runs at spindle speed (see the prefetch sketch below)
- What helps cached storage array performance?
  - Contiguous allocation of files (extent-based file systems) preserves the logical-to-physical mapping
  - Hints from the host could help prediction
- What might hurt performance?
  - Clustering small, unrelated writes into contiguous blocks (foils prefetch on a later read of the data)
  - Truly random read I/Os
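A minimal sketch of why this slide matters, not EMC's actual cache algorithm: a toy read cache with sequential prefetch. The block counts, cache size, and prefetch depth are invented; the point is that extent-like, contiguous reads hit the cache almost every time, while truly random reads almost never do.

```python
import random

class PrefetchCache:
    def __init__(self, capacity=1024, prefetch_depth=8):
        self.capacity = capacity          # number of blocks the cache holds
        self.prefetch_depth = prefetch_depth
        self.blocks = set()
        self.last_block = None
        self.hits = 0
        self.misses = 0

    def _insert(self, block):
        if len(self.blocks) >= self.capacity:
            self.blocks.pop()             # crude eviction; real arrays use LRU etc.
        self.blocks.add(block)

    def read(self, block):
        if block in self.blocks:
            self.hits += 1
        else:
            self.misses += 1              # miss: go to the spindle
            self._insert(block)
        # Detect a sequential stream and prefetch the next few blocks.
        if self.last_block is not None and block == self.last_block + 1:
            for b in range(block + 1, block + 1 + self.prefetch_depth):
                self._insert(b)
        self.last_block = block

sequential = PrefetchCache()
for blk in range(10_000):                 # contiguous, extent-like layout
    sequential.read(blk)

scattered = PrefetchCache()
for blk in random.sample(range(1_000_000), 10_000):   # truly random read I/Os
    scattered.read(blk)

print("sequential hit rate:", sequential.hits / 10_000)
print("random hit rate:    ", scattered.hits / 10_000)
```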

15 Symmetrix Applications
- Instant copy: TimeFinder
- Remote data copy: SRDF (Symmetrix Remote Data Facility)
- Serverless backup and restore: Fastrax
- Mainframe and UNIX data sharing: IFS

16 Business Continuance Problem
- The "normal" daily operations cycle: online day, then a backup/DSS window, then resume the online day
- The backup window (roughly 2 am to 6 am) means about 4 hours of data inaccessibility and a nightly "race to sunrise"

17 TimeFinder
- Creation and control of a copy of any active application volume
- The new copy can be used by another application or system
- Continuous availability of production data during backups, decision support, batch queries, data warehouse loading, Year 2000 testing, application testing, etc.
- Ability to create multiple copies of a single application volume
- Non-disruptive re-synchronization when the second application is complete
- (Diagram: a Business Continuance Volume (BCV) is a copy of a real production application volume, used for backups, decision support, data warehousing, Euro conversion, and similar work)

18 Business Continuance Volumes
- A Business Continuance Volume (BCV) is created and controlled at the logical volume level
- Physical drive sizes can differ, but the logical size must be identical
- Several active copies of the data can exist at once per Symmetrix

19 Using TimeFinder (see the sketch after this list)
- Establish the BCV
- Stop transactions to flush buffers
- Split the BCV
- Restart transactions
- Execute against the BCVs
- Re-establish the BCV
- (Diagram: M1, BCV, M2)
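A sketch of the sequence on this slide as a small script. The helper names (establish_bcv, quiesce_app, and so on) are invented stand-ins; in practice the control path goes through SYMCLI/SYMAPI and the application's own quiesce mechanism, neither of which is shown here.

```python
def quiesce_app():
    # Assumption: the application can stop transactions and flush its buffers
    # so the standard volume is consistent at split time.
    print("stopping transactions, flushing buffers")

def resume_app():
    print("restarting transactions")

def establish_bcv(std, bcv):
    print(f"establish: syncing {bcv} from {std}")

def split_bcv(std, bcv):
    print(f"split: {bcv} is now an independent point-in-time copy of {std}")

def run_backup(bcv, target):
    # Stand-in for pointing a backup product (or Fastrax) at the BCV.
    print(f"backing up {bcv} to {target}")

def backup_via_bcv(std="PROD", bcv="BCV", target="tape0"):
    establish_bcv(std, bcv)      # 1. establish the pair
    quiesce_app()                # 2. stop transactions to flush buffers
    split_bcv(std, bcv)          # 3. split: freeze the point-in-time image
    resume_app()                 # 4. production continues immediately
    run_backup(bcv, target)      # 5. work runs against the BCV, not production
    establish_bcv(std, bcv)      # 6. re-establish for the next cycle

backup_via_bcv()
```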

20 Re-Establishing a BCV Pair
- BCV pair "PROD" and "BCV" have been split
- Tracks on "PROD" are updated after the split
- Tracks on "BCV" are updated after the split
- The Symmetrix keeps a table of these "invalid" tracks after the split
- At re-establish of the BCV pair, "invalid" tracks are written from "PROD" to "BCV"
- Synchronization complete

21 Restore a BCV Pair
- BCV pair "PROD" and "BCV" have been split
- Tracks on "PROD" are updated after the split
- Tracks on "BCV" are updated after the split
- The Symmetrix keeps a table of these "invalid" tracks after the split
- At restore of the BCV pair, "invalid" tracks are written from "BCV" to "PROD"
- Synchronization complete
- (A track-table sketch covering both re-establish and restore follows)
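A toy model of the invalid-track bookkeeping on slides 20 and 21, not the real Symmetrix implementation: after a split, each side records which tracks changed; re-establish copies the changed tracks from PROD to BCV, restore copies them from BCV to PROD, and only the invalid tracks ever move.

```python
class SplitPair:
    def __init__(self, prod, bcv):
        self.prod = prod                  # dict: track number -> data
        self.bcv = bcv
        self.prod_invalid = set()         # tracks updated on PROD since the split
        self.bcv_invalid = set()          # tracks updated on BCV since the split

    def write_prod(self, track, data):
        self.prod[track] = data
        self.prod_invalid.add(track)

    def write_bcv(self, track, data):
        self.bcv[track] = data
        self.bcv_invalid.add(track)

    def reestablish(self):
        # Only the "invalid" tracks move; everything else is already in sync.
        for track in self.prod_invalid | self.bcv_invalid:
            self.bcv[track] = self.prod[track]
        self.prod_invalid.clear()
        self.bcv_invalid.clear()

    def restore(self):
        for track in self.prod_invalid | self.bcv_invalid:
            self.prod[track] = self.bcv[track]
        self.prod_invalid.clear()
        self.bcv_invalid.clear()

pair = SplitPair(prod={i: f"v0-{i}" for i in range(8)},
                 bcv={i: f"v0-{i}" for i in range(8)})
pair.write_prod(2, "prod-update")         # the two sides diverge after the split
pair.write_bcv(5, "bcv-update")
pair.reestablish()                        # BCV again matches PROD
assert pair.bcv == pair.prod
```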

22 Make as Many Copies as Needed
- Establish BCV 1, then split BCV 1
- Establish BCV 2, then split BCV 2
- Establish BCV 3
- (Diagram: standard volume M1/M2 with BCV 1, BCV 2, and BCV 3 taken at 4 PM, 5 PM, and 6 PM)

23 The Purpose of SRDF
- Local data copies are not enough
- Maximalist goal: provide a remote copy of the data that will be as usable after a disaster as the primary copy would have been
- Minimalist goal: provide a means for generating periodic physical backups of the data

24 Synchronous Data Mirroring
- A write is received from the host into the cache of the source
- The I/O is transmitted to the cache of the target
- An ACK is returned by the target to the cache of the source
- Ending status is presented to the host
- Both Symmetrix systems destage writes to disk
- Useful for disaster recovery

25 Semi-Synchronous Mirroring
- A write is received from the host/server into the cache of the source
- Ending status is presented to the host/server
- The I/O is transmitted to the cache of the target
- An ACK is sent by the target back to the cache of the source
- Each Symmetrix system destages writes to disk
- Useful for adaptive copy (see the latency sketch below)
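A sketch contrasting the orderings on slides 24 and 25, with invented latency numbers: in synchronous mode the host does not get ending status until the remote ACK arrives, so the host-visible latency includes the remote round trip; in semi-synchronous mode the host is released after the source cache write and the remote copy lags slightly behind.

```python
LOCAL_CACHE_US = 50        # time to land a write in the source cache (made up)
REMOTE_LINK_US = 2000      # one-way link latency to the remote Symmetrix (made up)

def synchronous_write():
    # Host waits: source cache, transmit to target, ACK back, then ending status.
    host_latency = LOCAL_CACHE_US + 2 * REMOTE_LINK_US
    remote_current_at_status = True
    return host_latency, remote_current_at_status

def semi_synchronous_write():
    # Host gets ending status right after the source cache write; the transmit
    # and remote ACK happen afterwards, so the remote copy trails the host.
    host_latency = LOCAL_CACHE_US
    remote_current_at_status = False
    return host_latency, remote_current_at_status

for name, fn in [("synchronous", synchronous_write),
                 ("semi-synchronous", semi_synchronous_write)]:
    latency, current = fn()
    print(f"{name:17s} host sees {latency} us; remote copy current at status: {current}")
```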

26 Backup / Restore of Big Data
- Exploding amounts of data cause backups to run too long (how long does it take you to back up 1 TB of data?)
- Shrinking backup windows and constant pressure for continuous application up-time
- Avoid using the production environment for backup: no server CPU or I/O channels, no involvement of the regular network
- Performance must scale to match the customer's growth
- Heterogeneous host support

27 Fastrax Overview
- (Diagram: at location 1, UNIX and Linux hosts run Fastrax-enabled backup/restore applications and SYMAPI against Symmetrix volumes (STD1, STD2, BCV1, BCV2, R1/R2); Fibre Channel point-to-point links connect to the Fastrax Data Engine, which drives a SCSI tape library at location 2)

28 Host to Tape Data Flow
- (Diagram: host, Symmetrix, Fastrax, and tape library; data flows from the Symmetrix through Fastrax to tape rather than through the host)

29 Fastrax Performance
- Performance scales with the number of data movers in the Fastrax box and the number of tape devices
- Restore runs as fast as backup
- No performance impact on the host during restore or backup
- (Diagram labels: RAF, DM, SRDF, Fastrax)

30 Moving Data from Mainframes to UNIX

31 InfoMover File System
- Transparent availability of MVS data to UNIX hosts
- MVS datasets appear as native UNIX files
- A single copy of the MVS datasets is shared
- Uses standard MVS access methods for locking and security

32 IFS Implementation
- Mainframe: IBM MVS / OS390, attached via ESCON or parallel channels
- Open systems: IBM AIX, HP HP-UX, Sun Solaris, attached via FWD SCSI, Ultra SCSI, or Fibre Channel
- Minimal network overhead: no data transfer over the network; MVS data is shared through the Symmetrix with ESP

33 Symmetrix API’s

34 Symmetrix API Overview
- SYMAPI core library: used by "thin" and full clients
- SYMAPI mapping library
- SYMCLI command line interface

35 Symmetrix API’s  SYMAPI are the high level functions  Used by EMC’s ISV partners (Oracle, Veritas, etc) and by EMC applications  SYMCLI is the “Command Line Interface” which invoke SYMAPI  Used by end customers and some ISV applications.

36 Basic Architecture
- Symmetrix Application Programming Interface (SymAPI)
- Symmetrix Command Line Interpreter (SymCli)
- Other storage management applications
- User access to the Solutions Enabler is via SymCli or a storage management application

37 Client-Server Architecture
- The SymAPI server runs on a host computer connected to the Symmetrix storage controller
- SymAPI clients run on one or more other host computers
- (Diagram: a thin client host and a client host talk to the SymAPI server on the server host; storage management applications link against the SymAPI client or thin client libraries. A minimal sketch of this split follows.)
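A minimal sketch of the client/server split on this slide, assuming an invented JSON-over-TCP wire format and a list_devices() stub; it is not the real SymAPI client/server protocol. The only point is that the thin client never touches the storage directly, it asks the host that is actually attached to the Symmetrix.

```python
import json
import socket
import threading
import time

def list_devices():
    # Stand-in for a real query against the attached Symmetrix.
    return [{"dev": "0001", "size_gb": 8}, {"dev": "0002", "size_gb": 8}]

def serve(port=5555):
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            request = json.loads(conn.recv(4096).decode())
            if request.get("op") == "list_devices":
                conn.sendall(json.dumps(list_devices()).encode())

def thin_client(port=5555):
    with socket.socket() as c:
        c.connect(("127.0.0.1", port))
        c.sendall(json.dumps({"op": "list_devices"}).encode())
        return json.loads(c.recv(4096).decode())

server = threading.Thread(target=serve, daemon=True)
server.start()
time.sleep(0.5)               # give the server a moment to bind and listen
print(thin_client())          # the thin client never talks to the storage itself
```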

38 SymmAPI Components
- Initialization
- Discover and update configuration
- Gatekeepers
- TimeFinder functions
- Device groups
- DeltaMark functions
- SRDF functions
- Statistics
- Mapping functions
- Base controls
- Calypso controls
- Optimizer controls
- InfoSharing

39 Data Object Resolve
- RDBMS data file -> file system -> logical volume -> host physical device -> Symmetrix device extents

40 File System Mapping
- File system mapping information includes:
  - File system attributes and host physical location
  - Directory attributes and contents
  - File attributes and host physical extent information, including inode information and fragment size
- (Diagram: inodes, directories, file extents; see the mapping sketch below)
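Toy data structures for the resolve stack on slides 39 and 40; every field name here is invented for illustration. A file resolves to file-system extents, which resolve through a logical volume to extents on host physical / Symmetrix devices.

```python
from dataclasses import dataclass

@dataclass
class Extent:
    device: str        # host physical / Symmetrix device name
    start_block: int
    block_count: int

@dataclass
class FileMapping:
    path: str
    inode: int
    fragment_size: int
    fs_extents: list   # (fs block offset, length) pairs for this file

@dataclass
class LogicalVolume:
    name: str
    slices: list       # list of (lv_offset, Extent) backing this volume

def resolve(file_map: FileMapping, lv: LogicalVolume):
    """Walk a file's FS extents down to physical device extents."""
    physical = []
    for fs_offset, length in file_map.fs_extents:
        for lv_offset, ext in lv.slices:
            end = lv_offset + ext.block_count
            if lv_offset <= fs_offset < end:
                within = fs_offset - lv_offset
                physical.append(Extent(ext.device,
                                       ext.start_block + within,
                                       min(length, ext.block_count - within)))
    return physical

lv = LogicalVolume("lv_data", [(0, Extent("SymmDev-01A", 10_000, 5_000)),
                               (5_000, Extent("SymmDev-02B", 0, 5_000))])
f = FileMapping("/db/redo01.log", inode=1234, fragment_size=4096,
                fs_extents=[(100, 64), (5_200, 64)])
for ext in resolve(f, lv):
    print(ext)
```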

41 Data Center Hosts

42 Solaris & Sun Starfire
- Hardware
  - Up to 62 I/O channels
  - 64 CPUs
  - 64 GB of RAM
  - 60 TB of disk
  - Supports multiple domains
- Starfire & Symmetrix
  - About 20% of sites use more than 32 I/O channels
  - Most use 4 to 8 I/O channels per domain
  - Oracle instances are usually above 1 TB

43 HPUX & HP 9000 Superdome
- Hardware
  - 192 I/O channels
  - 64 CPU cards
  - 128 GB RAM
  - 1 PB of storage
- Superdome and Symmetrix
  - 16 LUNs per target
  - Customers want us to support more than 4000 logical volumes!

44 Solaris and Fujitsu GP7000F M1000
- Hardware
  - 6-48 I/O slots
  - 4-32 CPUs
  - Crossbar switch
  - 32 GB RAM
  - 64-bit PCI bus
  - Up to 70 TB of storage

45 Solaris and Fujitsu GP7000F M2000
- Hardware
  - 12-192 I/O slots
  - 8-128 CPUs
  - Crossbar switch
  - 256 GB RAM
  - 64-bit PCI bus
  - Up to 70 TB of storage

46 AIX 5L & IBM RS/6000 SP
- Hardware
  - Scales to 512 nodes (over 8000 CPUs)
  - 32 TB RAM
  - 473 TB internal storage capacity
  - High-speed interconnect: 1 GB/sec per channel with SP Switch2
  - Partitioned workloads
  - Thousands of I/O channels

47 IBM RS/6000 pSeries 680 & AIX 5L
- Hardware
  - 24 64-bit RS64 IV CPUs at 600 MHz
  - 96 GB RAM
  - 873.3 GB internal storage capacity
  - 53 PCI slots (33 32-bit, 20 64-bit)

48 Really Big Data
- IBM (Sequent) NUMA
  - 16 NUMA "quads", each with 4 450 MHz CPUs, 2 GB of memory, and 4 x 100 MB/s FC-SW
  - Oracle 8.1.5 with up to a 42 TB (mirrored) database
- EMC Symmetrix
  - 20 small Symm 4s
  - 2 medium Symm 4s

49 Windows 2000 on IA32
- Usually lots of small (1U or 2U) boxes share a Symmetrix
- 4 to 8 I/O channels per box
- Qualified up to 1 TB per meta volume (although usually deployed with ½ TB or less)
- Management is a challenge
- Will Windows 2000 on IA64 handle big data better?

50 Linux Data Center Wish List

51 Lots of Devices
- Customers can use hundreds of targets and LUNs (logical volumes)
- 128 SCSI devices per system is too few
- A better naming system is needed to track lots of disks
- Persistence for "not ready" devices in the name space would help some of our features
- devfs solves some of this: a rational naming scheme and the potential for tons of disk devices (SCSI driver work is needed as well)
- (A sketch of a stable-identifier naming scheme follows)
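A sketch of the kind of persistent naming this wish-list item asks for, under the assumption that names are keyed on a stable identifier (here an invented serial-number-plus-LUN key and a local JSON state file) rather than on probe order, so a device keeps its name across reboots and while it is "not ready".

```python
import json
from pathlib import Path

STATE = Path("persistent-names.json")     # illustration only

def load_names():
    return json.loads(STATE.read_text()) if STATE.exists() else {}

def name_for(serial: str, lun: int) -> str:
    names = load_names()
    key = f"{serial}:{lun}"
    if key not in names:
        names[key] = f"stor{len(names):04d}"   # allocate a new stable name
        STATE.write_text(json.dumps(names))
    return names[key]

# The same (serial, lun) pair always maps to the same name, regardless of the
# order in which the fabric presents devices at boot.
print(name_for("EMC000184500123", 7))
print(name_for("EMC000184500123", 7))
```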

52 Support for Dynamic Data
- What happens when the logical volume changes under a running file system, or when new logical volumes are added?
  - This happens with TimeFinder, RDF, and Fastrax
  - Does it require remounting, reloading drivers, or rebooting?
  - APIs can be used to give a "heads up" before such events (see the sketch below)
- Must be able to invalidate data, name, and attribute caches for individual files or logical volumes
- Support for dynamically loaded, layered drivers
- Dynamic allocation of devices
  - Especially important for LUNs
  - Add and remove devices as the Fibre Channel fabric changes
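A sketch of the "heads up" hook this slide asks for; every name here is invented. The idea is that storage-side tooling notifies the host that a volume is about to change underneath it (a TimeFinder restore, an RDF failover, a Fastrax operation), the host drops its data, name, and attribute caches for that volume, acknowledges, and then repopulates on demand afterwards.

```python
class VolumeChangeListener:
    def __init__(self, volume):
        self.volume = volume
        self.page_cache = {}        # stand-ins for data/name/attribute caches
        self.dentry_cache = {}
        self.attr_cache = {}

    def pre_change(self, event):
        # Called before the volume content is replaced out from under the host.
        print(f"{self.volume}: flushing caches before {event}")
        self.page_cache.clear()
        self.dentry_cache.clear()
        self.attr_cache.clear()
        return "ack"                # operation may proceed once the host acks

    def post_change(self, event):
        # Safe to repopulate caches from the (new) on-disk state on demand.
        print(f"{self.volume}: {event} complete, caches will refill on demand")

listener = VolumeChangeListener("/dev/sdc1")
listener.pre_change("timefinder-restore")
listener.post_change("timefinder-restore")
```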

53 Keep it Open
- Open source is good for us
  - We can fix it or support it if you don't want to
  - No need to reverse engineer some closed-source FS/LVM
- Leverage storage APIs
  - Add hooks to Linux file systems, LVMs, and sysadmin tools
- Make Linux manageable
  - Good management tools are crucial in large data centers

54 New Technology Opportunities
- Linux can explore new technologies faster than most
- iSCSI
  - SCSI over TCP for remote data copy?
  - SCSI over TCP for host storage connections?
  - High-speed/zero-copy TCP is important to storage here!
- InfiniBand
  - Initially targeted at PCI replacement
  - High-speed, high-performance cluster infrastructure for file systems, LVMs, etc.
  - Multiple gigabits per second (2.5 Gbit/s up to 30 Gbit/s)
  - Support for InfiniBand as a storage connection?
- Cluster file systems

55 Linux at EMC
- Full support for Linux in SymAPI, RDF, TimeFinder, etc.
- Working with partners in the application space and the OS space to support Linux
- Oracle Open World demo of Oracle on Linux with over 20 Symms (could reach 1 PB of storage!)
- (Diagram: EMC Symmetrix Enterprise Storage and EMC Connectrix Enterprise Fibre Channel Switch with centralized monitoring and management)

56 MOSIX and Linux Cluster File Systems

57 Our Problem: Code Builds
- Over 70 OS developers
- Each developer builds 15 variations of the OS
- Each variation compiles over a million lines of code
- A full build uses gigabytes of space, with 100k temporary files
- User sandboxes are stored in home directories over NFS
- A full build took around 2 hours
- Only 2 users could build at once

58 Our Original Environment
- Software
  - GNU tool chain
  - CVS for source control
  - Platform Computing's Load Sharing Facility (LSF)
  - Solaris on build nodes
- Hardware
  - EMC NFS server (Celerra) with an EMC Symmetrix back end
  - 26 Sun Ultra-2 (dual 300 MHz CPU) boxes
  - FDDI ring used for the interconnect

59 EMC's LSF Cluster

60 LSF Architecture
- Distributed process scheduling and remote execution
- No kernel modifications
- Prefers static placement for load balancing
- Applications need to link against a special library
- A license server controls cluster access
- The master node in the cluster manages load information and makes scheduling decisions for all nodes
- Uses a modified GNU Make (lsmake)

61 MOSIX Architecture
- Provides transparent, dynamic migration
  - Processes can migrate at any time
  - No user intervention required
  - A process thinks it is still running on its creation node
- Dynamic load balancing
  - Uses a decentralized algorithm to continually level load across the cluster (see the sketch below)
  - Based on the number of CPUs, CPU speed, RAM, etc.
- Worked great for distributed builds in 1989
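A toy version of the decentralized leveling idea on this slide, with invented thresholds and a simplified load metric; it is not the MOSIX algorithm itself. Each node periodically compares its load with one randomly chosen peer and hands off a process when it is clearly busier, and no central scheduler is involved.

```python
import random

class Node:
    def __init__(self, name, cpus, speed):
        self.name, self.cpus, self.speed = name, cpus, speed
        self.procs = []

    def load(self):
        # Normalize by CPU count and speed, as the slide suggests.
        return len(self.procs) / (self.cpus * self.speed)

def balance_step(nodes):
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        # Migrate one process if this node is clearly more loaded than the peer.
        if node.procs and node.load() > peer.load() * 1.5:
            peer.procs.append(node.procs.pop())

nodes = [Node("n0", cpus=2, speed=1.0), Node("n1", cpus=2, speed=1.0),
         Node("n2", cpus=4, speed=1.5)]
nodes[0].procs = [f"job{i}" for i in range(30)]      # all work starts on n0

for _ in range(200):
    balance_step(nodes)
print({n.name: len(n.procs) for n in nodes})          # the load spreads out
```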

62 MOSIX Mechanism
- Each process has a unique home node (UHN): the node on which the process was created
- The process appears to be running at its UHN and is invisible to others on its new node after migration
- The UHN runs a "deputy"
  - Encapsulates system state for the migrated process
  - Acts as a proxy for some location-sensitive system calls after migration
  - This is a significant performance hit for I/O over NFS, for example (see the sketch below)
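A sketch of the deputy idea with invented interfaces, not MOSIX code: the migrated process does its CPU work on the remote node but forwards location-sensitive operations (here, a read against the home node's file view) back to a deputy on its unique home node, which is exactly where the NFS performance hit comes from.

```python
class Deputy:
    """Runs on the home node; performs syscalls on the migrated process's behalf."""
    def __init__(self, home_files):
        self.home_files = home_files

    def handle(self, syscall, *args):
        if syscall == "read":
            path, = args
            return self.home_files.get(path, b"")
        raise NotImplementedError(syscall)

class MigratedProcess:
    def __init__(self, deputy):
        self.deputy = deputy

    def compute(self, data):
        return sum(data)                    # pure CPU work runs locally

    def read(self, path):
        # Location-sensitive: must be proxied through the home node's deputy.
        return self.deputy.handle("read", path)

deputy = Deputy(home_files={"/home/build/Makefile": b"all: ..."})
proc = MigratedProcess(deputy)
print(proc.compute(range(1_000)))           # fast, no home-node round trip
print(proc.read("/home/build/Makefile"))    # every call goes back home
```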

63 MOSIX Migration
- (Diagram: the home node runs the local process and its deputy at user level; the migrated "remote" part runs at user level on another node; the two kernels communicate at the link layer, with NFS access going through the home node)

64 MOSIX Enhancements
- Added static placement and remote execution to MOSIX
  - Leverage the load-balancing infrastructure for placement decisions
  - Avoid the creation of deputies
  - Lock remotely spawned processes down, just in case
- Fixed several NFS caching related bugs
- Modified some of our makefile rules

65 MOSIX Remote Execution
- (Diagram: same node layout as the MOSIX migration slide, showing a remotely executed process, the link layer path between kernels, and NFS)

66 EMC MOSIX Cluster
- EMC's original MOSIX cluster
  - Compute nodes changed from LSF to MOSIX
  - Network changed from FDDI to 100 Mbit Ethernet
- The MOSIX cluster immediately moved the bottleneck from the cluster to the network and I/O systems
- Performance was great, but we can do better!

67 Latest Hardware Changes
- Network upgrades
  - New switch deployed
  - Nodes to switch: 100 Mbit Ethernet
  - Switch to NFS server: gigabit Ethernet
- NFS upgrades
  - 50 GB striped file systems per user (compared to 9 GB non-striped file systems)
  - Fast/wide differential SCSI between server and storage
- Cluster upgrades
  - Added 28 more compute nodes
  - Added 4 "submittal" nodes

68 EMC MOSIX Cluster
- (Diagram: compute nodes, gigabit Ethernet to the NFS server, SCSI to the storage)

69 Performance
- Running Red Hat 6.0 with a 2.2.10 kernel (MOSIX and NFS patches applied)
- Builds now take around 15-20 minutes (down from 1-1.5 hours)
- Over 35 concurrent builds at once

70 Build Submissions

71 Cluster File System & MOSIX
- (Diagram: cluster nodes attached to shared storage through a Fibre Channel Connectrix switch)

72 DFSA Overview
- DFSA provides the structure to allow migrated processes to always do local I/O
- MFS (MOSIX File System) was created as a prototype for DFSA testing
  - No caching per node; write-through
  - Serverless: all nodes can export/import files
  - Works like a non-caching NFS

73 DFSA Requirements
- One active inode/buffer in the cluster for each file
- Time-stamps are cluster-wide and increasing
- Some new FS operations
  - Identify: encapsulate dentry info
  - Compare: are two files the same?
  - Create: produce a new file from SB/ID info
- Some new inode operations
  - Checkpath: verify the path to the file is unique
  - Dotdot: give the true parent directory
- (A sketch of these hooks as an abstract interface follows)
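A sketch of the extra hooks this slide lists, expressed as an abstract interface. The operation names follow the slide; the signatures, types, and docstrings are invented and are not the actual DFSA kernel interface.

```python
from abc import ABC, abstractmethod

class DFSACapableFS(ABC):
    # New FS-level operations
    @abstractmethod
    def identify(self, dentry) -> bytes:
        """Encapsulate enough dentry info to re-open this file from any node."""

    @abstractmethod
    def compare(self, file_a, file_b) -> bool:
        """Are these two handles the same underlying file?"""

    @abstractmethod
    def create(self, superblock_id, ident: bytes):
        """Produce a file handle on this node from superblock + identify info."""

    # New inode-level operations
    @abstractmethod
    def checkpath(self, inode) -> bool:
        """Verify that the path to this file is unique."""

    @abstractmethod
    def dotdot(self, inode):
        """Return the true parent directory of this inode."""
```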

74 Information
- MOSIX: http://www.mosix.org/
- GFS: http://www.globalfilesystem.org/
- Migration information:
  - Process Migration, Milojicic et al., to appear in ACM Computing Surveys, 2000.
  - Mobility: Processes, Computers and Agents, Milojicic, Douglis and Wheeler, ACM Press.

