Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems
  Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)
Executive Summary
  Challenge: an SSD array might not be suitable for high-performance computing storage
  Our goal: propose a new high-performance storage architecture
  Observations
    High maintenance cost: caused by replacing worn-out flash SSDs
    Performance degradation: caused by contention for shared resources
  Key ideas
    Cost reduction: take the bare NAND flash out of the SSD box
    Contention resolution: redistribute the excessive I/O that creates bottlenecks
  Triple-A: a new architecture for HPC storage
    Consists of non-SSD bare flash memories
    Automatically detects and resolves performance bottlenecks
  Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than a traditional SSD array
Outline
  Motivations
  Triple-A Architecture
  Triple-A Management
  Evaluations
  Conclusions
HPC Starts to Employ SSDs
  [Figure: SSD deployments in HPC: SSD cache on HDD arrays, SSD buffer on compute nodes, SSD arrays]
  SSD arrays are in a position to (partially) replace HDD arrays
High-cost Maintenance of SSD Arrays
  [Figure: SSD life cycle with labels Live, Worn-out, Dead, Replace, Abandon]
  As time goes by, worn-out SSDs must be replaced
  A thrown-away SSD has complex internals
  Its other parts are still useful; only the flash memories are useless
I/O Services Suffer in SSD Arrays
  Data locality is varied in an array that consists of 80 SSDs
  A hot region is a group of SSDs holding 10% of the total data
  Arrays without a hot region show reasonable latency
  As the number of hot regions increases, the performance of SSD arrays degrades
Why Is Latency Delayed? Link Contention
  [Figure: SSD-1 to SSD-4 and SSD-5 to SSD-8 on two shared buses; queued requests labeled by destination (Dest-1 to Dest-4, Dest-6, Dest-8); labels IDLE and READY]
  A single data bus is shared by a group of SSDs
  When the target SSD is ready and the shared bus is idle, an I/O request is served right away
  Problem case: excessive I/Os are destined to a specific group of SSDs
Why Is Latency Delayed? Link Contention
  [Figure: SSD-1 to SSD-4 and SSD-5 to SSD-8 on two shared buses; queued requests Dest-1, Dest-2, Dest-4, Dest-6; labels STALL, BUSY, IDLE, READY]
  When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready
  This stall occurs because the SSDs in a group share a data bus: link contention
Why Is Latency Delayed? Storage Contention
  [Figure: SSD-1 to SSD-8 in two groups; requests Dest-1 to Dest-4 spread over the first group, four requests all labeled Dest-8 in the second; labels READY and BUSY]
  Problem case: excessive I/Os are destined to a specific SSD
Why Is Latency Delayed? Storage Contention
  [Figure: SSD-1 to SSD-8 in two groups; requests Dest-1 to Dest-4 in the first group, four requests labeled Dest-8 in the second; labels READY, BUSY, STALL]
  Problem case: excessive I/Os are destined to a specific SSD
  When the target SSD is busy, I/O requests must stay in the buffer even though the link is available
  This stall occurs because a specific SSD is continuously busy: storage contention
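The two stall causes described on these slides can be told apart from two observations about a queued request: whether the shared bus is busy and whether the target SSD is busy. The following is a minimal sketch of that distinction; the names `Stall` and `classify_stall` are hypothetical, and attributing the "both busy" case to storage contention is an assumption, since the slides do not cover it.

```python
from enum import Enum


class Stall(Enum):
    NONE = "served right away"
    LINK = "link contention"
    STORAGE = "storage contention"


def classify_stall(bus_busy: bool, target_busy: bool) -> Stall:
    """Classify why a queued I/O request cannot be issued right now."""
    if target_busy:
        # The destination SSD itself is occupied, so a free link does not help.
        # Assumption: if the bus is busy too, we still report storage contention.
        return Stall.STORAGE
    if bus_busy:
        # The destination SSD is ready, but the shared data bus is in use.
        return Stall.LINK
    # Target ready and bus idle: the request can be served immediately.
    return Stall.NONE


print(classify_stall(bus_busy=True, target_busy=False).value)
print(classify_stall(bus_busy=False, target_busy=True).value)
```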
Unboxing SSD for Cost Reduction
  SSD internals
    Still useful, so reusable (35~50% of total SSD cost): host interface controller, flash controllers, microprocessors, DRAM buffers, firmware
    Useless, so replaced: bare NAND flash packages
  Worn-out flash packages should be replaced
  Much of the logic in an SSD, including H/W controllers and firmware, is wasted when a worn-out SSD is replaced
  Instead of a whole SSD, let's use only bare flash packages
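The cost argument can be put as simple arithmetic: if the reusable parts account for 35~50% of an SSD's cost, replacing only the flash packages avoids paying that share again at every replacement. The sketch below illustrates this; the function name, the SSD price, and the replacement count are made-up inputs for illustration, not figures from the talk.

```python
def flash_only_savings(ssd_price: float, reusable_fraction: float, replacements: int) -> float:
    """Money saved by replacing only worn-out flash instead of whole SSDs."""
    # Cost of swapping the entire SSD each time it wears out.
    whole_ssd_cost = ssd_price * replacements
    # Cost when controllers, DRAM, and firmware are kept and only flash is swapped.
    flash_only_cost = ssd_price * (1.0 - reusable_fraction) * replacements
    return whole_ssd_cost - flash_only_cost


# Hypothetical inputs: a 1000-unit SSD replaced 10 times.
low = flash_only_savings(1000.0, 0.35, 10)
high = flash_only_savings(1000.0, 0.50, 10)
print(f"savings between {low:.0f} and {high:.0f} out of {1000.0 * 10:.0f}")
```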
Use of Unboxed Flash Packages, FIMM
  Multiple NAND flash packages are integrated onto a board
  Looks like a passive memory device such as a DIMM
  Referred to as a Flash Inline Memory Module (FIMM)
  Control signals and pin assignment are defined
  For convenient replacement of worn-out FIMMs, a FIMM has a hot-swappable connector
  NV-DDR2 interface designed by ONFi
How Are FIMMs Connected?
  [Figure: PCI-E tree from the HPC side: root complex, switches, endpoints, and FIMMs attached to the endpoints]
  PCI-E technology provides a high-performance interconnect
    Root complex: the component where I/O starts
    Switch: middle-layer component
    Endpoint: where FIMMs are directly attached
    Link: bus connecting the components
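The hierarchy on this slide maps naturally onto a small tree structure. The sketch below is one possible model, not the simulator used in the talk; the class names, the `FIMM-s.e.f` naming scheme, and the `route` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Endpoint:
    fimms: List[str] = field(default_factory=list)  # FIMMs attached directly


@dataclass
class Switch:
    endpoints: List[Endpoint] = field(default_factory=list)


@dataclass
class RootComplex:
    switches: List[Switch] = field(default_factory=list)


def build_fabric(n_switches: int, endpoints_per_switch: int, fimms_per_endpoint: int) -> RootComplex:
    """Build a regular root complex -> switch -> endpoint -> FIMM tree."""
    root = RootComplex()
    for s in range(n_switches):
        switch = Switch()
        for e in range(endpoints_per_switch):
            names = [f"FIMM-{s}.{e}.{f}" for f in range(fimms_per_endpoint)]
            switch.endpoints.append(Endpoint(fimms=names))
        root.switches.append(switch)
    return root


def route(root: RootComplex, s: int, e: int, f: int) -> List[str]:
    """List the components a request passes on its way down to one FIMM."""
    fimm = root.switches[s].endpoints[e].fimms[f]
    return ["root complex", f"switch {s}", f"endpoint {s}.{e}", fimm]


fabric = build_fabric(2, 3, 4)
print(" -> ".join(route(fabric, 1, 2, 3)))
```

Every link between two adjacent entries of the returned path is a place where requests to sibling FIMMs can collide, which is where the contention discussed earlier comes from.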
Connection between FIMMs and PCI-E
  The PCI-E endpoint is where the "PCI-E fabric" and the "FIMMs" meet
    Front-end: PCI-E protocol for the PCI-E fabric
    Back-end: ONFi NV-DDR2 interface for the FIMMs
  An endpoint consists of three parts
    PCI-E device layers: handle the PCI-E interface
    Control logic: handles FIMMs over the ONFi interface
    Upstream/downstream buffers: control traffic communication
Connection between FIMMs and PCI-E
  Communication example
    (1) A PCI-E packet arrives at the target endpoint
    (2) The PCI-E device layers disassemble the packet
    (3) The disassembled packet is enqueued into the downstream buffer
    (4) HAL dequeues the packet and constructs a NAND flash command
  Hot-swappable connector for FIMMs
    ONFi 78-pin NV-DDR2 slot
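The four-step communication example can be sketched as a tiny endpoint model. This is an illustration only: the packet layout (`header`, `payload`), the request fields, and the command dictionary are assumptions, and the real endpoint logic is hardware, not Python.

```python
from collections import deque
from typing import Optional


class EndpointSketch:
    """Toy model of the downstream path inside a PCI-E endpoint."""

    def __init__(self) -> None:
        self.downstream = deque()  # downstream buffer between device layers and HAL

    def receive(self, packet: dict) -> None:
        # Steps (1)-(3): the packet arrives, the device layers strip the
        # PCI-E framing, and the remaining request is enqueued.
        request = packet["payload"]  # hypothetical field name
        self.downstream.append(request)

    def hal_step(self) -> Optional[dict]:
        # Step (4): HAL dequeues one request and builds a NAND flash command.
        if not self.downstream:
            return None
        request = self.downstream.popleft()
        return {
            "fimm": request["fimm"],
            "op": request["op"].upper(),
            "page": request["page"],
        }


ep = EndpointSketch()
ep.receive({"header": {"dst": "endpoint-0"}, "payload": {"fimm": 2, "op": "read", "page": 7}})
print(ep.hal_step())
```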
Triple-A Architecture
  [Figure: multi-cores and DRAMs on top of a PCI-E fabric of RCs, switches, and endpoints, with FIMMs attached to the endpoints]
  PCI-E lets the architect build any configuration
  The endpoint is where FIMMs are directly attached
  Triple-A comprises a set of FIMMs connected over PCI-E
  The useful parts of SSDs are aggregated on top of the PCI-E fabric
Triple-A Architecture
  [Figure: hosts and CNs connected to a management module (multi-cores, DRAMs) on top of a PCI-E fabric of RCs, switches, and endpoints, with FIMMs attached to the endpoints]
  Flash control logic is also moved out of the SSD internals
    Address translation, garbage collection, I/O scheduler, and so on
    Autonomic I/O contention management
  Triple-A architectures interact with hosts or compute nodes
Link Contention Management
  [Figure: a PCI-E switch with four endpoints, each with four FIMMs on a shared data bus; one endpoint's bus is BUSY and marked as the hot cluster]
  (1) Hot cluster detection: I/O stalled due to link contention
Link Contention Management
  [Figure: a PCI-E switch with four endpoints, each with four FIMMs; one endpoint's bus is BUSY (hot cluster) and another's is IDLE (cold cluster)]
  Hot cluster detection: I/O stalled due to link contention
  Cold cluster securement: clusters with a free link
  Autonomic data migration: from the hot cluster to the cold cluster
  Shadow cloning can hide the migration overheads
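The three steps on this slide (detect hot clusters, secure cold clusters, migrate between them) can be sketched as a planning function. The slides do not give the detection criterion or the pairing policy, so the stall-count threshold, the "hottest first" ordering, and the one-to-one pairing below are all assumptions, and the name `plan_migrations` is hypothetical.

```python
from typing import Dict, List, Tuple


def plan_migrations(
    link_stalls: Dict[str, int],
    link_busy: Dict[str, bool],
    stall_threshold: int = 1,
) -> List[Tuple[str, str]]:
    """Pair each hot cluster with a cold cluster as a migration (source, target)."""
    # (1) Hot cluster detection: requests stalled on the cluster's shared link.
    hot = [c for c, n in link_stalls.items() if n >= stall_threshold]
    # (2) Cold cluster securement: clusters whose link is currently free.
    cold = [c for c, busy in link_busy.items() if not busy and c not in hot]
    # (3) Migration plan: hottest cluster first, each cold cluster used once.
    plan = []
    for src in sorted(hot, key=lambda c: -link_stalls[c]):
        if not cold:
            break
        plan.append((src, cold.pop(0)))
    return plan


stalls = {"A": 5, "B": 0, "C": 2, "D": 0}
busy = {"A": True, "B": False, "C": True, "D": False}
print(plan_migrations(stalls, busy))
```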
Storage Contention Management
  [Figure: a switch with endpoints holding FIMM-1 to FIMM-4; REQ-1, REQ-2, and one REQ-3 are issued while further REQ-3 entries in the queue are stalled on the laggard]
  Laggard detection: I/O stalled due to storage contention
  Autonomic data-layout reshaping for stalled I/O in the queue
Storage Contention Management
  [Figure: same setup; the stalled REQ-3 entries are now issued, retargeted from FIMM-3 to neighboring FIMMs such as FIMM-4 and FIMM-2]
  Laggard detection: I/O stalled due to storage contention
  Autonomic data-layout reshaping for stalled I/O in the queue
    Write I/O: physical data-layout reshaping (to non-laggard neighbors)
    Read I/O: shadow copying (to non-laggard neighbors) and reshaping
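One way to read the reshaping step is sketched below. It rests on an interpretation the slides do not spell out: a stalled write can simply be redirected to a non-laggard neighbor, while a stalled read must still be served by the laggard (the data lives there) and is shadow-copied to a neighbor so that later reads can go elsewhere. The round-robin choice of neighbor and all names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Optional, Tuple


def reshape(
    stalled: List[Tuple[int, str, int]],
    laggard: int,
    fimms: List[int],
) -> List[Dict[str, Optional[int]]]:
    """Decide, per stalled request (id, op, target), where it is served and copied."""
    neighbors = [f for f in fimms if f != laggard]
    if not neighbors:
        raise ValueError("no non-laggard neighbor available")
    next_neighbor = cycle(neighbors)  # assumption: spread load round-robin
    decisions = []
    for req_id, op, target in stalled:
        if target != laggard:
            # Not affected by the laggard: leave as is.
            decisions.append({"id": req_id, "serve": target, "copy_to": None})
        elif op == "write":
            # Write: place the data on a neighbor instead (layout reshaping).
            decisions.append({"id": req_id, "serve": next(next_neighbor), "copy_to": None})
        else:
            # Read: serve from the laggard, shadow-copy the data to a neighbor.
            decisions.append({"id": req_id, "serve": laggard, "copy_to": next(next_neighbor)})
    return decisions


queue = [(1, "write", 3), (2, "read", 3), (3, "write", 1)]
for decision in reshape(queue, laggard=3, fimms=[1, 2, 3, 4]):
    print(decision)
```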
Experimental Setup
  Flash array network simulation model
    Captures PCI-E specific characteristics
      Data movement delay, switching and routing latency (PLX 3734), contention cycles
    Configures diverse system parameters
    Will be made publicly available (an open-source framework is in preparation)
  Baseline all-flash array configuration
    4 switches x 16 endpoints x 4 FIMMs (64GB) = 16TB
    Network evaluation with 80 clusters and 320 FIMMs
  Workloads
    Enterprise workloads (cfs, fin, hm, mds, msnfs, …)
    HPC workload (eigensolver simulated at LBNL supercomputer)
    Micro-benchmarks (read/write, sequential/random)
  150 ns, 1000 queues
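The baseline capacity figure follows directly from the configuration numbers on this slide. The short check below reproduces it, assuming 1 TB = 1024 GB and, for the network evaluation, that each of the 80 clusters is an endpoint with 4 FIMMs; the function name is hypothetical.

```python
def array_capacity(switches: int, endpoints_per_switch: int,
                   fimms_per_endpoint: int, fimm_gb: int):
    """Return (number of FIMMs, capacity in TB) for a regular array."""
    fimms = switches * endpoints_per_switch * fimms_per_endpoint
    return fimms, fimms * fimm_gb / 1024  # assumption: 1 TB = 1024 GB


fimms, capacity_tb = array_capacity(4, 16, 4, 64)
print(fimms, "FIMMs,", capacity_tb, "TB")

# Network evaluation: 80 clusters of 4 FIMMs each.
print(80 * 4, "FIMMs in the 80-cluster network")
```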
Latency Improvement
  Triple-A latency normalized to a non-autonomic all-flash array
  Real-world workloads: enterprise and HPC I/O traces
  On average, 5x shorter latency
  Specific workloads (cfs and web) generate no hot clusters
Throughput Improvement
  Triple-A IOPS normalized for system throughput
  On average, 6x higher IOPS
  Specific workloads (cfs and web) generate no hot clusters
  Triple-A boosts the storage system by resolving contentions
Queue Stall Time Decrease
  Queue stall time comes from two resource contentions
  On average, stall time is shortened by 81%
  According to our analysis, Triple-A dramatically decreases link-contention time
  msnfs shows a low ratio of I/O on hot clusters
Network Size Sensitivity
  [Figure legend: non-autonomic array, Triple-A]
  The number of clusters (endpoints) is increased
  Execution time is broken down into stall times and storage latency
  Triple-A shows better performance on larger networks
  Stall times on PCI-E components are effectively reduced
  FIMM latency is outside Triple-A's scope
Related Work (1)
  Market products (SSD array)
    [Pure Storage] one large pool storage system with 100% NAND flash based SSDs
    [Texas Memory Systems] 2D flash-RAID
    [Violin Memory] flash memory array of 1000s of flash cells
  Academic studies (SSD array)
    [A.M.Caulfield, ISCA'13] proposed an SSD-based storage area network (QuickSAN) by integrating a network adapter into SSDs
    [A.Akel, Hotstorage'11] proposed a prototype of a PCM-based storage array (Onyx)
    [A.M.Caulfield, MICRO'10] proposed a high-performance storage array architecture for emerging non-volatile memories
Related Work (2)
  Academic studies (SSD RAID)
    [M.Balakrishnan, TS'10] proposed an SSD-optimized RAID for better reliability by creating age disparities within arrays
    [S.Moon, Hotstorage'13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential
  Academic studies (NVM usage for HPC)
    [A.M.Caulfield, ASPLOS'09] applied flash memory to clusters for performance and power consumption
    [A.M.Caulfield, SC'10] explored the impact of NVMs on HPC
Conclusions
  Challenge: an SSD array might not be suitable for high-performance storage
  Our goal: propose a new high-performance storage architecture
  Observations
    High maintenance cost: caused by replacing worn-out flash SSDs
    Performance degradation: caused by contention for shared resources
  Key ideas
    Cost reduction: take the bare NAND flash out of the SSD box
    Contention resolution: redistribute the excessive I/O that creates bottlenecks
  Triple-A: a new architecture suitable for HPC storage
    Consists of non-SSD bare flash memories
    Automatically detects and resolves performance bottlenecks
  Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than traditional SSD arrays
Backup
Data Migration Overhead
  [Figure: naïve migration]
  A data migration comprises a series of steps
    (1) Data is read from the source FIMM
    (2) Data moves to the parent switch
    (3) Data moves to the target endpoint
    (4) Data is written to the target FIMM
  Naïve data migration shares all-flash array resources with normal I/O requests
  I/O latency is delayed due to resource contention
Data Migration Overhead
  [Figure: naïve migration vs. shadow cloning]
  The data read of a migration (its first step) seriously hurts system performance
  Shadow cloning overlaps a normal read I/O request with the data read of the migration
  Shadow cloning successfully hides the data migration overhead and minimizes system performance degradation
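The benefit of overlapping the migration read with a normal read can be illustrated by counting reads on the source FIMM. This is a simplified sketch under two assumptions: a host read of a page that is pending migration supplies the data for that migration, and pages the host never reads still need one dedicated migration read. The function name and the page numbers are hypothetical.

```python
from typing import Iterable


def source_reads(normal_reads: Iterable[int], pages_to_migrate: Iterable[int],
                 shadow_cloning: bool) -> int:
    """Count reads issued to the source FIMM while a migration is in progress."""
    pending = set(pages_to_migrate)
    reads = 0
    for page in normal_reads:
        reads += 1  # the normal read I/O itself
        if shadow_cloning and page in pending:
            # The data is already in flight: clone it toward the target
            # instead of issuing a separate migration read later.
            pending.discard(page)
    # Pages still pending need a dedicated migration read each.
    return reads + len(pending)


host_reads = [1, 2, 2, 5]
migrating = [1, 2, 3]
print("naive:", source_reads(host_reads, migrating, shadow_cloning=False))
print("shadow cloning:", source_reads(host_reads, migrating, shadow_cloning=True))
```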
Real Workload Latency (1)
  [Figure: CDF of workload latency for the non-autonomic all-flash array and Triple-A; panels: proj, msnfs]
  Triple-A significantly improves I/O request latency
  Relatively low latency improvement in msnfs
    The ratio of I/O requests heading to hot clusters is not very high
    Hot clusters are detected, but they are less hot
Real Workload Latency (2)
  [Figure panels: prxy, websql]
  prxy sees a large latency improvement with Triple-A
  websql benefits less than expected
    It has more and hotter clusters than prxy, so a larger gain was expected
    But all of its hot clusters are located under the same switch
  In addition to 1) hotness, 2) the balance of I/O requests among switches determines the effectiveness of Triple-A
Network Size Sensitivity
  [Figure: contention times normalized to non-autonomic all-flash arrays]
  Triple-A successfully reduces both contention times
    By distributing the extra load of hot clusters
    Through data migration and physical data reshaping
  Link contention time is almost completely eliminated
  Storage contention time is steadily reduced
    It is bounded by the number of I/O requests to target clusters
Why Is Latency Delayed? Storage Contention
  [Figure: SSD-1 to SSD-8 in two groups; a request to Dest-3 finds its SSD READY, a request to Dest-8 finds its SSD BUSY]
  Regardless of the array condition, an individual SSD can be busy or idle (ready to serve a new I/O)
  When the SSD an I/O is destined to is ready, the I/O is served right away
  When the SSD an I/O is destined to is busy, the I/O must wait