Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems
Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)
Executive Summary
Challenge: an SSD array might not be suitable for high-performance computing storage
Our goal: propose a new high-performance storage architecture
Observations
– High maintenance cost: caused by replacing worn-out flash SSDs
– Performance degradation: caused by contention for shared resources
Key ideas
– Cost reduction: take the bare NAND flash out of the SSD box
– Contention resolution: redistribute the excessive I/O that creates bottlenecks
Triple-A: a new architecture for HPC storage
– Consists of non-SSD bare flash memories
– Automatically detects and resolves performance bottlenecks
Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than a traditional SSD array
SSD Arrays
SSD arrays are in a position to (partially) replace HDD arrays
HPC systems are starting to employ SSDs
Figure labels: HPC, HDD arrays, SSD cache on HDD arrays, SSD buffer on compute node
High-cost Maintenance of SSD Arrays
As time goes by, worn-out SSDs must be replaced
A discarded SSD has complex internals
Only the flash memories are useless; the other parts are still useful
Figure labels: SSD arrays; Worn-out, Abandon, Replace; Dead, Live
I/O Service Suffers in SSD Arrays
Data locality is varied in an array that consists of 80 SSDs
A hot region is a group of SSDs holding 10% of the total data
Arrays without a hot region show reasonable latency
As the number of hot regions increases, the performance of the SSD array degrades
Why Is Latency Delayed? Link Contention
A single data bus is shared by a group of SSDs
When the target SSD is ready and the shared bus is idle, an I/O request can be served right away
Case considered: excessive I/Os are destined for a specific group of SSDs
Figure labels: SSD-1 to SSD-8 on a shared bus; queued requests Dest-1, Dest-2, Dest-3, Dest-4, Dest-6, Dest-8; target SSD READY, bus IDLE
Why Is Latency Delayed? Link Contention
When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready
This stall occurs because the SSDs in a group share a data bus: link contention
Figure labels: SSD-1 to SSD-8; queued requests Dest-1, Dest-2, Dest-4, Dest-6; target SSDs READY, bus BUSY, requests STALL
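The stall described on this slide can be illustrated with a minimal sketch. It assumes every transfer occupies the shared bus for the same fixed time and that the target SSDs are always ready, so any waiting is due to the bus alone. The function name and the request format are hypothetical, not taken from the presentation.

```python
def shared_bus_wait_times(requests, transfer_time):
    """Return how long each request waits for the shared bus.

    requests: list of (arrival_time, ssd_id) pairs; all SSDs share one bus.
    transfer_time: time one transfer occupies the bus (assumed constant).
    """
    bus_free_at = 0
    waits = []
    for arrival, _ssd in sorted(requests):
        # A request starts only when it has arrived AND the bus is free.
        start = max(arrival, bus_free_at)
        waits.append(start - arrival)
        bus_free_at = start + transfer_time
    return waits
```

Three requests that arrive together for three different, ready SSDs are still served one after another, because the bus is the shared resource.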
Why Is Latency Delayed? Storage Contention
Case considered: excessive I/Os are destined for a specific SSD
Figure labels: SSD-1 to SSD-8; queued requests Dest-1, Dest-2, Dest-3, Dest-4, Dest-8, Dest-8; SSD states READY and BUSY
Why Is Latency Delayed? Storage Contention
Case considered: excessive I/Os are destined for a specific SSD
When the target SSD is busy, the I/O request must stay in the buffer even though the link is available
This stall occurs because a specific SSD is continuously busy: storage contention
Figure labels: SSD-1 to SSD-8; queued requests Dest-1, Dest-2, Dest-3, Dest-4, Dest-8 (repeated); SSD states READY and BUSY; requests to the busy SSD STALL
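The two stall causes can be told apart from two observations: whether the shared bus is busy and whether the target SSD is busy. The sketch below is only an illustration; the names are hypothetical, and attributing the stall to storage contention when both are busy is an assumption, since the slides do not cover that case.

```python
from enum import Enum


class Stall(Enum):
    NONE = "served right away"
    LINK = "link contention"
    STORAGE = "storage contention"


def classify_stall(bus_busy, target_busy):
    """Classify why a buffered I/O request cannot be issued."""
    if target_busy:
        # Assumption: a busy target dominates, even if the bus is also busy.
        return Stall.STORAGE
    if bus_busy:
        # Target is ready, but another SSD in the group holds the bus.
        return Stall.LINK
    return Stall.NONE
```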
Unboxing SSD for Cost Reduction
Worn-out flash packages should be replaced
Much of the logic in an SSD, including H/W controllers and firmware, is wasted when a worn-out SSD is replaced
Instead of a whole SSD, let’s use only bare flash packages
SSD internals
– Bare NAND flash packages: useless once worn out, so replaced
– Host interface controller, flash controllers, microprocessors, DRAM buffers, firmware: still useful, so reusable (35~50% of total SSD cost)
Use of Unboxed Flash Packages: FIMM
Multiple NAND flash packages are integrated onto a board
– Looks like a passive memory device such as a DIMM
– Referred to as a Flash Inline Memory Module (FIMM)
Control signals and pin assignment are defined
For convenient replacement of worn-out FIMMs
– A FIMM has a hot-swappable connector
– NV-DDR2 interface designed by ONFi
Figure label: flash package
How Are FIMMs Connected? PCI-E
PCI-E technology provides a high-performance interconnect
– Root complex: the component where I/O starts
– Switch: middle-layer component
– Endpoint: where FIMMs are directly attached
– Link: bus connecting the components
Figure labels: HPC, root complex, switch, endpoint, FIMM
Connection between FIMMs and PCI-E
The PCI-E endpoint is where the “PCI-E fabric” and the “FIMMs” meet
– Front end: PCI-E protocol for the PCI-E fabric
– Back end: ONFi NV-DDR2 interface for the FIMMs
The endpoint consists of three parts
– PCI-E device layers: handle the PCI-E interface
– Control logic: handles FIMMs over the ONFi interface
– Upstream/downstream buffers: control traffic communication
Connection between FIMMs and PCI-E
Communication example
– (1) A PCI-E packet arrives at the target endpoint
– (2) The PCI-E device layers disassemble the packet
– (3) The disassembled packet is enqueued into the downstream buffer
– (4) HAL dequeues the packet and constructs a NAND flash command
Hot-swappable connector for FIMMs
– ONFi 78-pin NV-DDR2 slot
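The four steps of the communication example can be sketched as a small queue-based model. This is not the authors' design: the class, field, and method names are hypothetical, and a PCI-E packet is modeled as a plain dictionary whose header is simply dropped during disassembly.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class NandCommand:
    op: str    # "read" or "write"
    fimm: int  # target FIMM behind this endpoint
    page: int  # page address on that FIMM


class Endpoint:
    def __init__(self):
        self.downstream = deque()  # downstream buffer (FIFO)

    def receive(self, packet):
        # Steps (1)-(3): the packet arrives, the device layers strip the
        # PCI-E framing, and the payload goes into the downstream buffer.
        payload = {k: packet[k] for k in ("op", "fimm", "page")}
        self.downstream.append(payload)

    def hal_step(self):
        # Step (4): HAL dequeues one request and builds a NAND command.
        if not self.downstream:
            return None
        req = self.downstream.popleft()
        return NandCommand(req["op"], req["fimm"], req["page"])
```

A deque keeps requests in arrival order, which matches the buffer behaviour the slide implies; a real endpoint would also handle upstream traffic and completions.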
Triple-A Architecture
PCI-E lets the architect build any configuration
The endpoint is where FIMMs are directly attached
Triple-A comprises a set of FIMMs connected through PCI-E
The useful parts of SSDs are aggregated on top of the PCI-E fabric
Figure labels: multi-cores, DRAMs, RCs, switches, PCI-E fabric, endpoints, FIMMs
Triple-A Architecture
Flash control logic is also moved out of the SSD internals
– Address translation, garbage collection, I/O scheduler, and so on
– Autonomic I/O contention management
The Triple-A architecture interacts with hosts or compute nodes
Figure labels: hosts, CNs, management module, multi-cores, DRAMs, RCs, switches, PCI-E fabric, endpoints, FIMMs
Link Contention Management
(1) Hot cluster detection – I/O stalled due to link contention
Figure labels: PCI-E switch, endpoints, FIMMs, shared data bus; BUSY cluster marked as hot cluster
Link Contention Management
(1) Hot cluster detection – I/O stalled due to link contention
(2) Cold cluster securement – clusters with a free link
(3) Autonomic data migration – from the hot cluster to a cold cluster
Shadow cloning can hide the migration overhead
Figure labels: PCI-E switch, endpoints, FIMMs; BUSY cluster marked as hot cluster, IDLE cluster marked as cold cluster
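The three steps above can be sketched as a planning function. The slides do not give the detection criteria, so this sketch assumes a cluster is hot when its accumulated link-contention stall time exceeds a threshold and cold when it has recorded no link stall at all. The function name and both criteria are hypothetical.

```python
def plan_migrations(link_stall, hot_threshold):
    """Pair hot clusters with cold clusters for data migration.

    link_stall: dict mapping cluster id -> accumulated link stall time.
    hot_threshold: stall time above which a cluster counts as hot (assumed).
    Returns a list of (hot_cluster, cold_cluster) pairs.
    """
    # (1) Hot cluster detection, hottest first.
    hot = sorted(
        (c for c, s in link_stall.items() if s > hot_threshold),
        key=lambda c: -link_stall[c],
    )
    # (2) Cold cluster securement: clusters whose link never stalled.
    cold = sorted(c for c, s in link_stall.items() if s == 0)
    # (3) Migration plan: one cold target per hot source, while cold ones last.
    return list(zip(hot, cold))
```

If there are more hot clusters than cold ones, the least hot clusters are left unpaired in this sketch; a real policy would need to decide what to do with them.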
Storage Contention Management
(1) Laggard detection – I/O stalled due to storage contention
(2) Autonomic data-layout reshaping for stalled I/O in the queue
Figure labels: switch, endpoint, FIMM-1 to FIMM-4, queue (REQ-1, REQ-2, REQ-3), issued, stalled, laggard
Storage Contention Management
(1) Laggard detection – I/O stalled due to storage contention
(2) Autonomic data-layout reshaping for stalled I/O in the queue
– Write I/O: physical data-layout reshaping (to non-laggard neighbors)
– Read I/O: shadow copying (to non-laggard neighbors) and reshaping
Figure labels: switch, endpoint, FIMM-1 to FIMM-4, queue (REQ-1, REQ-2, REQ-3), issued, stalled, laggard
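For the write case, reshaping can be pictured as redirecting stalled writes away from the laggard. The following is a minimal sketch under stated assumptions: laggards are given as a set, redirected writes are spread round-robin over the non-laggard FIMMs of the same endpoint, and the address-mapping update a real system would need is omitted. All names are hypothetical.

```python
def reshape_writes(pending, laggards, fimms):
    """Redirect queued writes that target a laggard FIMM.

    pending: list of (request_id, target_fimm) in queue order.
    laggards: set of FIMMs that are continuously busy.
    fimms: all FIMMs attached to this endpoint.
    Returns the queue with laggard targets replaced by free neighbors.
    """
    free = [f for f in fimms if f not in laggards]
    if not free:
        return list(pending)  # nowhere to redirect; leave the queue as is
    reshaped = []
    turn = 0
    for req_id, target in pending:
        if target in laggards:
            target = free[turn % len(free)]  # round-robin (assumed policy)
            turn += 1
        reshaped.append((req_id, target))
    return reshaped
```

Reads cannot be redirected this simply, because the data must already exist at the new location, which is why the slide pairs read reshaping with shadow copying.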
Experimental Setup
Flash array network simulation model
– Captures PCI-E specific characteristics: data movement delay, switching and routing latency (PLX 3734), contention cycles
– Configures diverse system parameters
– Will be made publicly available (an open-source framework is in preparation)
Baseline all-flash array configuration
– 4 switches x 16 endpoints x 4 FIMMs (64GB) = 16TB
– 80 clusters, 320 FIMMs for the network evaluation
Workloads
– Enterprise workloads (cfs, fin, hm, mds, msnfs, …)
– HPC workload (eigensolver simulated on an LBNL supercomputer)
– Micro-benchmarks (read/write, sequential/random)
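The configuration figures on this slide can be checked with a few lines of arithmetic. The sketch assumes 1TB = 1024GB and that each cluster is one endpoint with 4 FIMMs attached; the constant and function names are hypothetical.

```python
# Baseline figures quoted on the slide.
SWITCHES = 4
ENDPOINTS_PER_SWITCH = 16
FIMMS_PER_ENDPOINT = 4
GB_PER_FIMM = 64


def baseline_capacity():
    """Return (number of FIMMs, capacity in TB) for the baseline array."""
    fimms = SWITCHES * ENDPOINTS_PER_SWITCH * FIMMS_PER_ENDPOINT
    return fimms, fimms * GB_PER_FIMM / 1024  # assumes 1 TB = 1024 GB


def fimms_in_network(clusters, fimms_per_cluster=FIMMS_PER_ENDPOINT):
    """Return the FIMM count for a network of the given number of clusters."""
    return clusters * fimms_per_cluster
```

The baseline works out to 256 FIMMs and 16TB, and an 80-cluster network to 320 FIMMs, matching the numbers on the slide.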
Latency Improvement
Triple-A latency is normalized to that of a non-autonomic all-flash array
Real-world workloads: enterprise and HPC I/O traces
On average, 5x shorter latency
Specific workloads (cfs and web) generate no hot clusters
Throughput Improvement
System throughput is reported as normalized Triple-A IOPS
On average, 6x higher IOPS
Specific workloads (cfs and web) generate no hot clusters
Triple-A boosts the storage system by resolving contention
Queue Stall Time Decrease
Queue stall time comes from the two kinds of resource contention
On average, stall time is shortened by 81%
According to our analysis, Triple-A dramatically decreases link-contention time
msnfs shows a low ratio of I/O to hot clusters
Network Size Sensitivity
The number of clusters (endpoints) is increased
Execution time is broken down into stall times and storage latency
Triple-A performs better on larger networks
– Stall times in PCI-E components are effectively reduced
– FIMM latency is outside Triple-A’s scope
Figure panels: non-autonomic array, Triple-A
Related Works (1)
Market products (SSD array)
– [Pure Storage] one large pool storage system with 100% NAND flash based SSDs
– [Texas Memory Systems] 2D flash-RAID
– [Violin Memory] flash memory array of 1000s of flash cells
Academic studies (SSD array)
– [A.M.Caulfield, ISCA’13] proposed an SSD-based storage area network (QuickSAN) by integrating a network adapter into SSDs
– [A.Akel, Hotstorage’11] proposed a prototype of a PCM-based storage array (Onyx)
– [A.M.Caulfield, MICRO’10] proposed a high-performance storage array architecture for emerging non-volatile memories
Related Works (2)
Academic studies (SSD RAID)
– [M.Balakrishnan, TS’10] proposed an SSD-optimized RAID for better reliability by creating age disparities within arrays
– [S.Moon, Hotstorage’13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential
Academic studies (NVM usage for HPC)
– [A.M.Caulfield, ASPLOS’09] applied flash memory to clusters for performance and power consumption
– [A.M.Caulfield, SC’10] explored the impact of NVMs on HPC
Conclusions
Challenge: an SSD array might not be suitable for high-performance storage
Our goal: propose a new high-performance storage architecture
Observations
– High maintenance cost: caused by replacing worn-out flash SSDs
– Performance degradation: caused by contention for shared resources
Key ideas
– Cost reduction: take the bare NAND flash out of the SSD box
– Contention resolution: redistribute the excessive I/O that creates bottlenecks
Triple-A: a new architecture suitable for HPC storage
– Consists of non-SSD bare flash memories
– Automatically detects and resolves performance bottlenecks
Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than traditional SSD arrays
Data Migration Overhead
A data migration comprises a series of steps
– (1) Data is read from the source FIMM
– (2) Data moves to the parent switch
– (3) Data moves to the target endpoint
– (4) Data is written to the target FIMM
Naïve data migration shares all-flash array resources with normal I/O requests
– I/O latency is delayed due to resource contention
Figure panel: naïve migration
Data Migration Overhead
The data read of a migration (the first step) seriously hurts system performance
Shadow cloning overlaps the normal read I/O requests with the data read of the migration
Shadow cloning successfully hides the data migration overhead and minimizes the system performance degradation
Figure panels: naïve migration, shadow cloning
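One way to picture the saving is to count the source-FIMM reads that migration adds on top of normal traffic. The sketch below assumes that shadow cloning reuses the data of a normal read whenever that read touches a page scheduled for migration, so only pages never read by normal I/O need a dedicated migration read. That interpretation, and every name in the code, is an assumption rather than the authors' stated mechanism.

```python
def extra_source_reads(pages_to_migrate, normal_reads, shadow_cloning):
    """Count migration-only reads issued to the source FIMM.

    pages_to_migrate: pages that must move from the hot to the cold cluster.
    normal_reads: pages read by ordinary I/O requests during the migration.
    shadow_cloning: if True, normal reads double as migration reads (assumed).
    """
    pending = set(pages_to_migrate)
    if not shadow_cloning:
        # Naive migration: every page needs its own read from the source.
        return len(pending)
    # Shadow cloning: pages already fetched by normal reads cost nothing extra.
    return len(pending - set(normal_reads))
```

With pages 10, 11, and 12 to migrate and normal reads hitting 11 and 12, the naive scheme issues three extra reads while the shadow-cloning sketch issues one.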
Real Workload Latency (1)
CDF of workload latency for the non-autonomic all-flash array and Triple-A
Triple-A significantly improves I/O request latency
Relatively low latency improvement in msnfs
– The ratio of I/O requests heading to hot clusters is not very high
– Hot clusters are detected, but they are less hot
Figure panels: proj, msnfs
Real Workload Latency (2)
prxy sees a large latency improvement with Triple-A
websql benefits less than expected
– It has more and hotter clusters than prxy
– But all of its hot clusters are located under the same switch
Besides (1) hotness, (2) the balance of I/O requests among switches determines the effectiveness of Triple-A
Figure panels: prxy, websql
Network Size Sensitivity
Triple-A successfully reduces both contention times
– By distributing the extra load of hot clusters
– Through data migration and physical data reshaping
Link contention time is almost completely eliminated
Storage contention time is steadily reduced
– It is bounded by the number of I/O requests to the target clusters
Results are normalized to non-autonomic all-flash arrays
Why Is Latency Delayed? Storage Contention
Regardless of the array condition, each independent SSD can be busy or idle (ready to serve a new I/O)
When the SSD an I/O is destined for is ready, the I/O can be served right away
When the SSD an I/O is destined for is busy, the I/O must wait
Figure labels: SSD-1 to SSD-8; queued requests Dest-3, Dest-8; SSD states READY and BUSY