Storage in vNEXT, Luka Manojlovic


1 Storage in vNEXT
Luka Manojlovic, luka@manojlovic.net
http://skrci.me/windays15

2 Software Defined Storage
Application data is stored on cost-effective, continuously available, high-performance SMB3 file shares backed by Storage Spaces. Compute and storage are disaggregated for independent management and scale.
1. Performance, scale: SMB3 file storage network
2. Continuous availability and seamless scale-out with file server nodes
3. Elastic, reliable, optimized, tiered Storage Spaces
4. Standard volume hardware for low cost
[Diagram: Hyper-V clusters connected over an SMB3 storage network fabric to Scale-Out File Server clusters backed by Storage Spaces on shared JBOD storage, with callouts 1-4 marking the points above]

3 Software Defined Storage – Storage Stack
- Scale-Out File Server: access point for Hyper-V, scale-out data access, data access resiliency
- Cluster Shared Volumes: single consistent namespace, fast failover
- Storage Spaces: storage pooling, virtual disks, data resiliency
- Hardware: standard volume hardware, fast and efficient networking, shared storage enclosures (SAS SSD, SAS HDD), shared JBOD storage
Together these layers form the software-defined storage system.

4 Storage Spaces
Storage virtualization
- Storage pool: unit of aggregation, administration, and isolation
- Storage space: a virtual disk with resiliency and performance characteristics
Clustered storage pool
- The pool is read-write on one node and read-only on all other nodes
- The cluster infrastructure routes pool operations to the read-write node
- Automatic failover if the read-write node fails
Clustered storage space
- Physical disk resource, online on one node
- IO is routed to the node where the storage space is online (CSVFS)
- The SMB client is redirected to the node where the storage space is online (SOFS)
- Automatic failover to another node
- Allocations are aware of fault domains (enclosures)
Interconnects: shared SAS for clusters (SAS, SATA, USB for stand-alone)
Enclosures: shared SAS for clusters (SAS, SATA, USB for stand-alone)
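As a rough illustration of the pool/space split above, the following PowerShell sketch creates a pool from the poolable shared disks and carves a mirrored space out of it. The pool and space names, the size, and the subsystem wildcard are illustrative, not taken from the deck.

    # Minimal sketch, assuming a cluster with shared SAS disks eligible for pooling.
    # The exact clustered subsystem friendly name varies by OS version.
    $disks = Get-PhysicalDisk -CanPool $true
    New-StoragePool -FriendlyName "Pool01" `
        -StorageSubSystemFriendlyName "Clustered*" `
        -PhysicalDisks $disks

    # A storage space is surfaced as a virtual disk with a chosen resiliency setting
    New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "Space01" `
        -ResiliencySettingName Mirror -Size 2TB -ProvisioningType Fixed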

5 Storage Spaces Reliability
Mirror resiliency
- 2-copy mirror tolerates one drive failure; 3-copy mirror tolerates two drive failures
- Suitable for random I/O
Parity resiliency
- Lower-cost storage using LRC encoding
- Tolerates up to two drive failures
- Suitable for large sequential I/O
Enclosure awareness
- Tolerates the failure of an entire drive enclosure
Parallel rebuild
- Pseudo-random distribution weighted to favor less-used disks
- Reconstructed space is spread widely and rebuilt in parallel
[Diagram: a storage pool hosting mirror and parity spaces; after a drive failure, data is rebuilt to multiple drives simultaneously using spare capacity]
Rebuild example:
- Data rebuilt: 2,400 GB
- Time taken: 49 min
- Rebuild throughput: > 800 MB/s
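A hedged sketch of how the resiliency options above map onto virtual disk settings; the pool name, disk names, and sizes are illustrative.

    # Three-copy mirror: tolerates two drive failures, enclosure aware
    New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "Mirror3Space" `
        -ResiliencySettingName Mirror -PhysicalDiskRedundancy 2 `
        -IsEnclosureAware $true -Size 1TB -ProvisioningType Fixed

    # Dual parity: tolerates two drive failures, suited to large sequential IO
    New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "ParitySpace" `
        -ResiliencySettingName Parity -PhysicalDiskRedundancy 2 `
        -Size 4TB -ProvisioningType Fixed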

6 Storage Spaces Tiering and WBC
Tiered spaces leverage file system intelligence
- The file system measures data activity at sub-file granularity
- Heat follows files
- Admin-controlled file pinning is possible
Data movement
- Automated promotion of hot data to the SSD tier
- Configurable scheduled task
Write-Back Cache (WBC)
- Helps smooth the effects of write bursts
- Uses a small amount of SSD capacity
- IO destined for the SSD tier bypasses the WBC; large IOs bypass the WBC
Complementary
- Together, the WBC and the SSD tier address data's short-term and long-term performance needs
[Diagram: compute nodes driving I/O; activity accumulates heat at sub-file granularity, hot data lands on SAS SSD and cold data on SAS HDD]
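The tiering and write-back-cache behavior above is configured when the space is created; a hedged sketch, assuming a pool named Pool01 and using illustrative tier sizes, cache size, file path, and drive letter.

    # Define SSD and HDD tiers in the pool, then create a tiered mirror space with a 1 GB WBC
    $ssd = New-StorageTier -StoragePoolFriendlyName "Pool01" -FriendlyName "SSDTier" -MediaType SSD
    $hdd = New-StorageTier -StoragePoolFriendlyName "Pool01" -FriendlyName "HDDTier" -MediaType HDD

    New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "TieredSpace" `
        -StorageTiers $ssd,$hdd -StorageTierSizes 100GB,2TB `
        -ResiliencySettingName Mirror -WriteCacheSize 1GB

    # Admin-controlled pinning: keep a file on the SSD tier regardless of heat
    $tier = Get-StorageTier -FriendlyName "*SSD*" | Select-Object -First 1
    Set-FileStorageTier -FilePath "E:\VMs\VM01.vhdx" -DesiredStorageTier $tier
    # Run the tier optimization now instead of waiting for the scheduled task
    Optimize-Volume -DriveLetter E -TierOptimize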

7 Storage Virtualization and Reliability (Performance / Scalability / Operability / Availability)
- Tiering: move data to appropriate storage
- Write-Back Cache: buffer random writes on flash
- Storage pools: up to 80 disks per pool, 4 pools per cluster, 480 TB per pool
- Storage Spaces: 64 storage spaces per pool
- Storage space resiliency to disk and enclosure failures; parallel data rebuild
- NTFS: significant improvements
- Storage pools: aggregation, administration, isolation
- PowerShell / SMAPI management
- Storage Spaces FAQ

8 CSVFS
CSVFS is a clustered file system
- Enables all nodes to access common volumes
- Single consistent namespace
- Provides a layer of abstraction above the on-disk file system
- Application-consistent distributed backup
- Interoperability with backup, antivirus, BitLocker
- Supports NTFS and ReFS on-disk file systems
- Support for Storage Spaces
Transparent fault tolerance
- Does not require drive ownership changes on failover
- No dismounting and remounting of volumes
- Faster failover times (i.e. less downtime)
Workloads
- Hyper-V and SQL Server
- Scale-Out File Server
[Diagram: two storage nodes, each running SMB Server (CSV and default), CSV Filter, and CSVFS over NTFS/ReFS on a shared volume/space; arrows show direct IO, block-redirected IO, and metadata/redirected IO between the nodes]
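A small sketch of how a clustered disk becomes part of the single CSV namespace described above; the cluster disk resource name is illustrative.

    # Convert a clustered disk into a CSV; every node then sees it under C:\ClusterStorage\...
    Add-ClusterSharedVolume -Name "Cluster Disk 1"
    # List CSV volumes, their state and the current owner node
    Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode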

9 CSVFS components
CSV File System (CSVFS)
- Proxy file system on top of NTFS or ReFS, mounted on every node
- Decides between direct IO and file-system-redirected IO
CSV Volume Manager
- Responsible for the creation of CSV volumes
- Direct IO for locally attached spaces; block-level IO redirect for non-locally attached spaces
CSV Filter
- Attaches to NTFS / ReFS for local clustered spaces
- Controls access to the on-disk file system
- Coordinates metadata operations
- Provides CSVFS support for NTFS features and operations
[Diagram: same two-node CSVFS component stack as the previous slide]
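Which of the IO paths in the diagram a given node is currently using can be inspected from PowerShell; the volume name below is illustrative.

    # Per-node IO mode (Direct, FileSystemRedirected, BlockRedirected) for a CSV volume
    Get-ClusterSharedVolumeState -Name "Cluster Disk 1" |
        Format-Table Node, VolumeFriendlyName, StateInfo, FileSystemRedirectedIOReason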

10 CSVFS Zero Downtime CHKDSK
Improved CHKDSK separates online scanning from offline repair; with CSV, the repair is also online.
CHKDSK processing with CSV
- The cluster checks once a minute whether CHKDSK (spotfix) is required
- The cluster pauses the affected CSV file system and dismounts the underlying NTFS volume
- CHKDSK (spotfix) runs against only the affected files, for a maximum of 15 seconds
- The underlying NTFS volume is remounted and the CSV namespace is un-paused
If CHKDSK (spotfix) did not process all records
- The cluster waits 3 minutes before continuing
- This enables a large set of affected files to be processed over time
If the corruption is too large, CHKDSK (spotfix) is not run and is marked to run at the next Physical Disk online.
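For reference, the same online-scan / brief spot-fix split is exposed directly through Repair-Volume; on a CSV the cluster drives these steps automatically. The drive letter is illustrative.

    Repair-Volume -DriveLetter D -Scan      # online scan; corruptions are logged for later repair
    Repair-Volume -DriveLetter D -SpotFix   # short pass that fixes only the logged records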

11 Cluster-Wide File System (Performance / Scalability / Operability / Availability)
- CSV Block Cache: 7x faster VDI VM boot time (avg.)
- Block-level I/O redirection
- Single namespace: aggregates all file systems, no more drive letters, leverages mount points
- Fault tolerance: fast failover on failures, zero-downtime CHKDSK
- Shared access: all nodes can access volumes
- Simple management: manage from any node
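The CSV block cache mentioned above is sized cluster-wide; a minimal sketch, assuming Windows Server 2012 R2 or later and an illustrative 512 MB value.

    # Set the CSV block cache size (in MB) for all nodes
    (Get-Cluster).BlockCacheSize = 512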

12 SMB Transparent Failover
Planned and unplanned failovers with zero downtime
SMB Client
- Server node failover is transparent to client-side applications
- Small IO delay during failover
SMB Server
- Handles are always opened write-through
- Stores handle state in the Resume Key Filter (RKF) database
Resume Key
- Persists protocol server state to file
- Reconciles handle reconnects/replays with local file system state
- Protects file state during the reconnect window
Witness Service
- Clients are proactively notified of server node failures
- Clients can be instructed to switch server nodes
[Diagram: Hyper-V host (VMs, VHD parser, SMB client, Witness client, RVSS provider, LBFO/NDK NIC pairs running SMBD/TCP) connected to SOFS node 1 (SMB server, Witness service, RVSS service, Resume Key filter, CSVFS over NTFS/ReFS on a LUN or space holding the shared VHD); witness traffic goes to SOFS node 2+]
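Transparent failover is tied to continuously available shares on the file server; a hedged sketch, with the share name, path, and computer accounts purely illustrative.

    New-SmbShare -Name "VMStore" -Path "C:\ClusterStorage\Volume1\Shares\VMStore" `
        -ContinuouslyAvailable $true -FullAccess "CONTOSO\HV01$","CONTOSO\HV02$"
    # Witness registrations (which node will notify which client) can be listed on the cluster
    Get-SmbWitnessClient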

13 SMB Scale-Out
Scaling out for throughput and management
Scale-out
- Active-active SMB shares accessible through all nodes simultaneously
- Distributed Network Name (DNN): physical node IP addresses are registered in DNS
- Clients round-robin through the IPs (multiple parallel connects)
- Clients are redirected to the "optimal" server node (CSV / storage space owner)
Management / backup
- Simple management, extensive PowerShell, fan-out requests
- Remote VSS (MS-FSRVP)
[Diagram: same SOFS / Hyper-V host architecture as the SMB Transparent Failover slide]
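A hedged sketch of creating the active-active role (the DNN) and scoping a continuously available share to it; the role and share names are illustrative.

    Add-ClusterScaleOutFileServerRole -Name "SOFS01"
    New-SmbShare -Name "VMStore2" -Path "C:\ClusterStorage\Volume2\Shares\VMStore2" `
        -ContinuouslyAvailable $true -ScopeName "SOFS01"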

14 SMB Multichannel
Network throughput and fault tolerance
Bandwidth aggregation and link fault tolerance
- IO is balanced over active interfaces
- Operations are replayed on alternate channels when a channel fails
- RSS aware, LBFO aware, NUMA aware
Zero configuration
- Client-driven NIC discovery and best pair(s) selection
- Transparent fallback to less desirable interfaces in failure cases
- Periodic re-evaluation and transparent 'upgrade'
[Diagram: same SOFS / Hyper-V host architecture as the SMB Transparent Failover slide, with client CSV traffic redirected to node 2+]
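Multichannel needs no configuration, but its decisions can be inspected and, if necessary, constrained; the interface aliases and server name below are illustrative.

    Get-SmbClientNetworkInterface      # per-NIC RSS/RDMA capability as seen by the SMB client
    Get-SmbMultichannelConnection      # active channels per server and the NIC pairs chosen
    # Optionally restrict which interfaces may be used toward a given server
    New-SmbMultichannelConstraint -ServerName "SOFS01" -InterfaceAlias "RDMA1","RDMA2"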

15 SMB Direct
Low network latency and low CPU consumption
SMB Direct
- Provides a sockets-like layer over NDK / RDMA
- Low latency: a combination of the fabric and skipping the TCP stack
- Supports RoCE, iWARP, and InfiniBand
- Efficient: cycles/byte comparable with DAS
Results
- More than 1 million 8K IOPS demonstrated by Violin Memory
- 16 GB/s of large IOs (multiple InfiniBand links) with low CPU
Also used (SMB Direct + SMB Multichannel) by:
- Hyper-V Live Migration, with a bandwidth limiter to avoid starving LM traffic
- CSVFS for internal traffic
[Diagram: same SOFS / Hyper-V host architecture as the SMB Transparent Failover slide, with client CSV traffic redirected to node 2+]
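SMB Direct engages automatically when RDMA-capable interfaces are found; a hedged sketch for confirming that, plus the Live Migration bandwidth cap mentioned above. The 2 GB/s value is illustrative, and Set-SmbBandwidthLimit assumes the SMB Bandwidth Limit feature is installed.

    Get-NetAdapterRdma                               # RDMA-capable NICs on this host
    Get-SmbClientNetworkInterface | Where-Object RdmaCapable
    Get-SmbMultichannelConnection
    # Cap Live Migration traffic over SMB so it does not starve storage IO
    Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 2GB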

16 Scale-Out File Server (Performance / Scalability / Operability / Availability)
- SMB Direct: low latency and minimal CPU usage
- SMB performance: optimized for server application IO profiles
- SMB Scale-Out: active/active file shares
- SMB Multichannel: network bandwidth aggregation and network fault tolerance
- SMB Transparent Failover: node failover transparent to VMs
- SMB PowerShell: manage from any node
- SMB analysis: performance counters

17 Windows cluster in a box

18 And we only discussed the stuff in blue
Windows Server 2012 and Windows Server 2012 R2 storage features:
Cluster-Aware Updating, SMB3 & SMB Direct, Virtual Fibre Channel, Hyper-V Replica, 8,000 VMs per cluster, VM prioritization, 64-node clusters, Dedup, Scale-Out File Server, Storage Spaces, Offloaded Data Transfer, VM storage migration, iSCSI Target Server, ReFS, VHDX, Shared VHDX, Hyper-V storage QoS, Work Folders, SMI-S Storage Service, NTFS Trim / Unmap, NFS 4.1 Server, SM API, CSVFS online CHKDSK, iSCSI Target Server with VHDX, Dedup (live files / CSV), SMB Direct (> 1M IOPS), Live Migration over SMB, Optimized Scale-Out File Server, Storage Spaces tiering, Storage Spaces write-back cache, Storage Spaces rebuild, SMB bandwidth management

19 Microsoft Cloud Platform System powered by Dell
- Tightly integrated components: Dell PowerEdge servers, Dell storage, Dell networking
- Windows Server 2012 R2, System Center 2012 R2, Windows Azure Pack
- Microsoft-designed architecture based on public cloud learning
- Microsoft-led support and orchestrated updates
- Optimized runbooks for Microsoft applications

20 Cloud Platform System – Capabilities
- Pre-deployed infrastructure: switches, load balancer, storage, compute, network edge; N+2 fault tolerant (N+1 networking); pre-configured per best practices
- Integrated management: configure, deploy, patching, monitoring, backup and DR, automation
- 8,000 VMs*, 1.1 PB of total storage
- Optimized deployment and operations for Microsoft and other standard workloads
[Diagram: Dell PowerEdge servers, Dell storage, and Dell networking (optimized racking and cabling for high density and reliability) hosting Hyper-V hosts and networking, SMB 3.0 & Storage Spaces, SQL Server, System Center, and Windows Azure Pack with the service management API, admin portal, and tenant portal]

21 Storage Cluster (Storage Scale Unit)
Storage Scale Unit hardware (4x4)
- 4x Dell PowerEdge R620v2 servers: dual-socket Intel Ivy Bridge (E5-2650v2 @ 2.6 GHz), 128 GB memory, 2x LSI 9207-8E SAS controllers (shared storage), 2x 10 GbE Chelsio T520 (iWARP/RDMA)
- 4x PowerVault MD3060e JBODs: 48x 4 TB HDDs -> 192 HDDs / 768 TB raw; 12x 800 GB SSDs -> 48 SSDs / 38 TB raw
Storage Spaces configuration
- 3 pools (2x tenant, 1x backup)
- VM storage: 16x enclosure-aware 3-copy mirror @ ~9.5 TB; automatic tiered storage, write-back cache
- Backup storage: 16x enclosure-aware dual parity @ 7.5 TB; SSD for logs only
- 24 TB of HDD and 3.2 TB of SSD capacity left unused for automatic rebuild
Available space
- Tenant: 156 TB
- Backup: 126 TB (+ deduplication)

22 Cloud Platform System (CPS)
Integrated solution for hardware and software
Networking
- 5x Force10 S4810P (64-port, 10 GbE, data)
- 1x Force10 S55 (48-port, 1 GbE, management)
Compute Scale Unit (32x Hyper-V hosts)
- Dell PowerEdge C6220ii, 4 compute nodes per 2U
- Dual-socket Intel Ivy Bridge (E5-2650v2 @ 2.6 GHz), 256 GB memory
- 2x 10 GbE Mellanox NICs (LBFO team, NVGRE offload)
- 2x 10 GbE Chelsio (iWARP/RDMA)
- 1 local 200 GB SSD (boot/paging)
Storage Scale Unit (4x file servers, 4x JBODs)
- Dell PowerEdge R620v2 servers (4 servers for Scale-Out File Server): dual-socket Intel Ivy Bridge (E5-2650v2 @ 2.6 GHz), 2x LSI 9207-8E SAS controllers (shared storage), 2x 10 GbE Chelsio T520 (iWARP/RDMA)
- PowerVault MD3060e JBODs (48 HDDs, 12 SSDs); 4 TB HDDs and 800 GB SSDs

23 Cluster Rolling Upgrades for Storage
- Seamless: zero-downtime cloud upgrades for Hyper-V and Scale-Out File Server
- Simple: easily roll in nodes with the new OS version
- Windows Server 2012 R2 and Windows Server vNext nodes can run within the same cluster
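A hedged sketch of the rolling-upgrade flow: each node is drained, reinstalled with the new OS, and re-added while the cluster keeps running at the old functional level. Update-ClusterFunctionalLevel is the cmdlet that ships with the vNext bits; the node name is illustrative.

    Suspend-ClusterNode -Name "FS-Node1" -Drain   # move workloads off the node
    Remove-ClusterNode -Name "FS-Node1"           # evict, then reinstall with the new OS
    Add-ClusterNode -Name "FS-Node1"              # rejoin; cluster still runs at the 2012 R2 level
    # Repeat for each node; once every node runs the new OS, commit the upgrade (one-way)
    Update-ClusterFunctionalLevel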

24 Storage Replica
BCDR
- Synchronous or asynchronous replication
- Microsoft Azure Site Recovery orchestration
Stretch Cluster
- Synchronous stretch clusters across sites for HA
Benefits
- Block-level, host-based volume replication
- End-to-end software stack from Microsoft
- Works with any Windows volume; hardware agnostic, existing SANs work
- Uses SMB3 as the transport
[Diagram: the two topologies, (1) stretch cluster and (2) server to server]
Available in the Windows Server Technical Preview for the Stretch Cluster and Server-to-Server scenarios; management tools are still in progress.
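A hedged sketch of a synchronous server-to-server partnership using the Storage Replica cmdlets from the vNext bits; computer, replication-group, volume, and log names and the log size are illustrative, and preview cmdlet syntax may differ from the released version.

    New-SRPartnership -ReplicationMode Synchronous `
        -SourceComputerName "SR-SRV01" -SourceRGName "RG01" `
        -SourceVolumeName "D:" -SourceLogVolumeName "L:" `
        -DestinationComputerName "SR-SRV02" -DestinationRGName "RG02" `
        -DestinationVolumeName "D:" -DestinationLogVolumeName "L:" `
        -LogSizeInBytes 8GB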

25 Storage Replica – More uptime
[Diagram: multi-site cluster stretched across Site 1 and Site 2]
Flexible
- Works with any Windows volume; uses SMB3 as transport
- Hardware agnostic: works with Storage Spaces or any SAN volume
Integrated management
- End-to-end Windows Server disaster recovery solution
- Failover Cluster Manager UI and PowerShell
Scalable
- Block-level synchronous volume replication
- Automatic cluster failover for a low Recovery Time Objective (RTO)
Cross-site HA/DR: stretch clusters across sites with synchronous volume replication

26 Storage Replica: Sync and Async modes

Synchronous (zero data loss RPO)
Deployment: mission-critical apps; on-prem or metro setup; short distance (< 5 ms, more likely < 30 km); usually a dedicated link with bigger bandwidth
Steps:
1. Application write
2. Log data written and the data is replicated to the remote site
3. Log data written at the remote site
4. Acknowledgement from the remote site
5. Application write acknowledged
(t, t1: data is flushed to the volume; logs always write through)

Asynchronous (near-zero data loss RPO, depends on multiple factors)
Deployment: non-critical apps; across region / country; unlimited distance; usually over WAN
Steps:
1. Application write
2. Log data written
3. Application write acknowledged
4. Data replicated to the remote site
5. Log data written at the remote site
6. Acknowledgement from the remote site
(t, t1: data is flushed to the volume; logs always write through)

[Diagrams: primary and remote server clusters with data and log volumes, showing the numbered write/replicate/acknowledge flow for each mode]
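Switching an existing partnership between the two modes above is a single call in the released cmdlets; a hedged sketch with illustrative names, and the preview parameter set may differ.

    Set-SRPartnership -SourceComputerName "SR-SRV01" -SourceRGName "RG01" `
        -DestinationComputerName "SR-SRV02" -DestinationRGName "RG02" `
        -ReplicationMode Asynchronous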

27 Storage QoS – Greater efficiency
Control and monitor storage performance
Flexible and customizable
- Policy per VHD, VM, service, or tenant
- Define minimum and maximum IOPS
- Fair distribution within a policy
Simple out-of-the-box behavior
- Enabled by default for Scale-Out File Server
- Automatic metrics (normalized IOPS and latency) per VM and VHD
Management
- System Center VMM and Operations Manager
- PowerShell built in for Hyper-V and SOFS
[Diagram: virtual machines on a Hyper-V cluster with per-host IO schedulers, connected over the SMB3 storage network fabric to rate limiters and a central policy manager on the Scale-Out File Server cluster]
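A hedged sketch using the Storage QoS cmdlets that ship with the vNext/2016 bits; the policy name, IOPS limits, and VM name are illustrative, and preview syntax may differ.

    # On the SOFS cluster: create a policy and inspect the automatic per-flow metrics
    $policy = New-StorageQosPolicy -Name "Gold" -MinimumIops 500 -MaximumIops 5000
    Get-StorageQosFlow | Sort-Object InitiatorIOPS -Descending | Select-Object -First 10
    # On the Hyper-V host: bind a VM's virtual disk to the policy
    Get-VM -Name "VM01" | Get-VMHardDiskDrive | Set-VMHardDiskDrive -QoSPolicyID $policy.PolicyId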

28 Storage Spaces Shared Nothing
Enabling cloud hardware designs
- Support for DAS (shared-nothing) storage hardware
- Prescriptive configurations
Scalable pools
- Supports large pools
- Simple storage expansion and rebalancing
Fault tolerance
- Tolerates disk, enclosure, and node failures
- 3-copy mirror and dual parity
Management
- System Center and PowerShell
Key use cases
- Hyper-V IaaS storage
- Storage for backup and replication targets
Does not need shared JBODs or a SAS fabric behind the Scale-Out File Server nodes
[Diagram: Hyper-V clusters connected over the SMB3 storage network fabric to a Scale-Out File Server cluster whose storage is local to the file server nodes rather than shared JBODs]
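A hedged sketch using the cmdlets this feature eventually shipped with in Windows Server 2016 (Storage Spaces Direct); the Technical Preview syntax may have differed, and the pool wildcard, volume name, and size are illustrative.

    # Run on a cluster whose nodes have only local (shared-nothing) drives
    Enable-ClusterStorageSpacesDirect
    # Eligible local drives from all nodes are pooled; create a mirrored CSV volume on top
    New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "VDisk01" `
        -FileSystem CSVFS_ReFS -Size 2TB -ResiliencySettingName Mirror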

29 Storage Spaces Shared Nothing – Low cost
[Diagram: Hyper-V cluster(s) connected over the SMB3 storage network fabric to SOFS clusters with no shared storage]
Does not need shared JBODs or a SAS fabric behind the Scale-Out File Server nodes
Reliability, scalability, flexibility
- Fault tolerance to disk, enclosure, and node failures
- Scale pools to a large number of drives
- Fine-grained storage expansion
Cloud design points and management
- Prescriptive configuration; reduced hardware costs with SATA drives
- Deploy, manage, and monitor with SCVMM and SCOM
Use cases
- Hyper-V IaaS storage
- Storage for backup and replication targets


