Myoungsoo Jung (UT-Dallas)

Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems
Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)

Executive Summary
Challenge: an SSD array might not be suitable for high-performance computing (HPC) storage
Our goal: propose a new high-performance storage architecture
Observations
  High maintenance cost: caused by replacing worn-out flash SSDs
  Performance degradation: caused by contention for shared resources
Key ideas
  Cost reduction: take the bare NAND flash out of the SSD box
  Contention resolution: redistribute the excessive I/O that creates bottlenecks
Triple-A: a new architecture for HPC storage
  Consists of non-SSD bare flash memories
  Automatically detects and resolves performance bottlenecks
Results: non-SSD all-flash arrays are expected to save 35-50% of cost and offer 53% higher throughput than a traditional SSD array

Outline
  Motivations
  Triple-A Architecture
  Triple-A Management
  Evaluations
  Conclusions

HPC starts to employ SSDs
[Figure: HPC storage configurations: SSD cache on HDD arrays, SSD buffer on compute nodes with HDD arrays, and SSD arrays]
SSD arrays are in a position to (partially) replace HDD arrays

High-cost Maintenance of SSD Arrays
[Figure: SSD life cycle: live, worn out, dead, then abandoned and replaced]
As time goes by, worn-out SSDs must be replaced
A discarded SSD has complex internals
Its other parts are still useful; only the flash memories are unusable

I/O Services Suffered in SSD Arrays
Data locality is varied across an array consisting of 80 SSDs
A hot region is a group of SSDs holding 10% of the total data
Arrays without a hot region show reasonable latency
As the number of hot regions increases, the performance of SSD arrays degrades

Why Latency Delayed? Link Contention
[Figure: two SSD groups (SSD-1 to SSD-4 and SSD-5 to SSD-8); labels: IDLE, READY; queued requests: Dest-1, Dest-2, Dest-3, Dest-4, Dest-6, Dest-8]
A single data bus is shared by a group of SSDs
When the target SSD is ready and the shared bus is idle, an I/O request is serviced right away
Problem case: excessive I/Os are destined to a specific group of SSDs

Why Latency Delayed? Link Contention
[Figure: two SSD groups (SSD-1 to SSD-4 and SSD-5 to SSD-8); labels: STALL, BUSY, IDLE, READY; queued requests: Dest-1, Dest-2, Dest-4, Dest-6]
When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready
This stall occurs because the SSDs in a group share a data bus -> link contention

Why Latency Delayed? Storage Contention
[Figure: two SSD groups (SSD-1 to SSD-4 and SSD-5 to SSD-8); labels: READY, BUSY; queued requests: Dest-1, Dest-2, Dest-3, Dest-4, and several for Dest-8]
Problem case: excessive I/Os are destined to a specific SSD

[Figure: two SSD groups (SSD-1 to SSD-4 and SSD-5 to SSD-8); labels: READY, BUSY, STALL; queued requests: Dest-1, Dest-2, Dest-3, Dest-4, and several for Dest-8]
Excessive I/Os are destined to a specific SSD
When the target SSD is busy, I/O requests must stay in the buffer even though the link is available
This stall occurs because a specific SSD is continuously busy -> storage contention
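
The two stall conditions described on the contention slides can be summarized as a small decision rule. The following is a minimal sketch, not the authors' simulator: the function names, the grouping of four SSDs per shared bus (taken from the diagrams), and the label used when both resources are busy are assumptions made for illustration.

```python
GROUP_SIZE = 4  # assumption: the diagrams draw four SSDs per shared bus


def group_of(ssd_id: int) -> int:
    """Map SSD-1..SSD-4 to group 0, SSD-5..SSD-8 to group 1, and so on."""
    return (ssd_id - 1) // GROUP_SIZE


def classify(bus_busy: bool, ssd_busy: bool) -> str:
    """Classify what happens to a request given its bus and target SSD state."""
    if not bus_busy and not ssd_busy:
        return "served"              # bus idle and target ready
    if bus_busy and not ssd_busy:
        return "link contention"     # target ready, but the shared bus is busy
    if ssd_busy and not bus_busy:
        return "storage contention"  # link available, but the target is busy
    # The slides do not name this case; the label is an assumption.
    return "link and storage contention"


def classify_request(dest: int, busy_buses: set, busy_ssds: set) -> str:
    """Classify a request destined to SSD number `dest`."""
    return classify(group_of(dest) in busy_buses, dest in busy_ssds)


# Example: the bus of the first group is busy and SSD-8 is busy.
for dest in (2, 6, 8):
    print(dest, classify_request(dest, busy_buses={0}, busy_ssds={8}))
```

In this example, the request for SSD-2 stalls on the link, the request for SSD-6 is served, and the request for SSD-8 stalls on the storage device.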

Unboxing SSD for Cost Reduction
[Figure: SSD internals. Still useful, so reusable (35-50% of total SSD cost): host interface controller, flash controllers, microprocessors, DRAM buffers, firmware. Useless, so replaced: bare NAND flash packages]
Worn-out flash packages must be replaced
Much of the logic in an SSD, including hardware controllers and firmware, is wasted when a worn-out SSD is replaced
Instead of whole SSDs, let's use only bare flash packages
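
As a rough illustration of the cost argument, the sketch below compares replacing a whole SSD with replacing only its flash packages. It assumes the 35-50% figure is the share of SSD cost taken by the reusable parts, and the SSD price of 1000 is a made-up value; neither the function name nor the price comes from the slides.

```python
def flash_only_replacement_cost(ssd_price: float, reusable_percent: int) -> float:
    """Cost of one replacement when the reusable parts are kept.

    reusable_percent: share of the SSD price spent on controllers, DRAM,
    firmware, and other parts that survive a flash wear-out (assumed 35-50).
    """
    return ssd_price * (100 - reusable_percent) / 100


ssd_price = 1000  # hypothetical price, for illustration only
for pct in (35, 50):
    cost = flash_only_replacement_cost(ssd_price, pct)
    print(f"reusable {pct}%: pay {cost} instead of {ssd_price}")
```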

Use of Unboxed Flash Packages, FIMM
Multiple NAND flash packages are integrated onto a board
  It looks like a passive memory device such as a DIMM
  Referred to as a Flash Inline Memory Module (FIMM)
Control signals and pin assignments are defined
For convenient replacement of worn-out FIMMs
  A FIMM has a hot-swappable connector
  NV-DDR2 interface designed by ONFi

How FIMMs Connected? PCI-E
[Figure: PCI-E tree: root complex at the HPC host, switches below it, endpoints below the switches, FIMMs attached to the endpoints]
PCI-E technology provides a high-performance interconnect
  Root complex: the component where I/O starts
  Switch: middle-layer component
  Endpoint: where FIMMs are directly attached
  Link: bus connecting the components
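
The hierarchy above (root complex, switches, endpoints, FIMMs) can be pictured as a nested mapping. This is only an illustrative sketch: the naming scheme (`switch-0`, `endpoint-0-1`, `fimm-0-1-2`) and both helper functions are hypothetical and do not reflect how the actual fabric addresses its components.

```python
def build_fabric(num_switches: int, endpoints_per_switch: int,
                 fimms_per_endpoint: int) -> dict:
    """Return {switch: {endpoint: [fimm, ...]}} below a single root complex."""
    return {
        f"switch-{s}": {
            f"endpoint-{s}-{e}": [
                f"fimm-{s}-{e}-{f}" for f in range(fimms_per_endpoint)
            ]
            for e in range(endpoints_per_switch)
        }
        for s in range(num_switches)
    }


def path_to(fabric: dict, fimm: str):
    """Return the components an I/O crosses from the root complex to a FIMM."""
    for switch, endpoints in fabric.items():
        for endpoint, fimms in endpoints.items():
            if fimm in fimms:
                return ["root complex", switch, endpoint, fimm]
    return None  # unknown FIMM


fabric = build_fabric(2, 2, 2)
print(path_to(fabric, "fimm-1-0-1"))
```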

Connection between FIMMs and PCI-E
The PCI-E endpoint is where the “PCI-E fabric” and the “FIMMs” meet
  Front end: PCI-E protocol for the PCI-E fabric
  Back end: ONFi NV-DDR2 interface for the FIMMs
An endpoint consists of three parts
  PCI-E device layers: handle the PCI-E interface
  Control logic: handles the FIMMs over the ONFi interface
  Upstream/downstream buffers: control traffic communication

Connection between FIMMs and PCI-E
Communication example
  (1) A PCI-E packet arrives at the target endpoint
  (2) The PCI-E device layers disassemble the packet
  (3) The disassembled packet is enqueued into the downstream buffer
  (4) The HAL dequeues the packet and constructs a NAND flash command
Hot-swappable connector for FIMMs
  ONFi 78-pin NV-DDR2 slot
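
The four communication steps can be sketched as a tiny pipeline. Everything here is a simplified assumption for illustration: the `PciePacket` and `NandCommand` records, their fields, and the opcode names are hypothetical and are not the real PCI-E packet layout or ONFi command set.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PciePacket:        # hypothetical, heavily simplified packet
    is_write: bool
    fimm_id: int
    page_addr: int
    payload: bytes = b""


@dataclass
class NandCommand:       # hypothetical flash command record
    opcode: str          # "PROGRAM" or "READ" (names are assumptions)
    fimm_id: int
    page_addr: int
    data: bytes


class Endpoint:
    def __init__(self):
        self.downstream = deque()  # buffer for traffic heading to the FIMMs

    def receive(self, packet: PciePacket) -> None:
        # Steps (1)-(3): the packet arrives, the device layers take it
        # apart, and the resulting request is enqueued downstream.
        request = (packet.is_write, packet.fimm_id,
                   packet.page_addr, packet.payload)
        self.downstream.append(request)

    def hal_issue(self):
        # Step (4): dequeue one request and build a NAND flash command.
        if not self.downstream:
            return None
        is_write, fimm_id, page_addr, payload = self.downstream.popleft()
        opcode = "PROGRAM" if is_write else "READ"
        return NandCommand(opcode, fimm_id, page_addr, payload)


ep = Endpoint()
ep.receive(PciePacket(is_write=True, fimm_id=2, page_addr=0x10, payload=b"ab"))
print(ep.hal_issue())
```

The upstream direction (read data and completions returning to the fabric) would mirror this with a second queue; it is left out to keep the sketch short.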

Triple-A Architecture
[Figure: multi-cores and DRAMs on top of a PCI-E fabric of RCs and switches, with endpoints below and FIMMs attached to each endpoint]
PCI-E lets the architect build any configuration
An endpoint is where FIMMs are directly attached
Triple-A comprises a set of FIMMs connected over PCI-E
The useful parts of SSDs are aggregated on top of the PCI-E fabric

Triple-A Architecture
[Figure: hosts and compute nodes (CNs) connected to Triple-A: multi-cores, DRAMs, and a management module on top of the PCI-E fabric (RCs, switches), endpoints, and FIMMs]
Flash control logic is also moved out of the SSD internals
  Address translation, garbage collection, I/O scheduler, and so on
Autonomic I/O contention management
Triple-A interacts with hosts or compute nodes

Link Contention Management
[Figure: a PCI-E switch with four endpoints, each with four FIMMs on a shared data bus; one cluster is marked BUSY (hot cluster)]
(1) Hot cluster detection: I/O stalled due to link contention

Link Contention Management
[Figure: the same switch and endpoints; one cluster is marked BUSY (hot cluster) and another IDLE (cold cluster)]
(1) Hot cluster detection: I/O stalled due to link contention
(2) Cold cluster securement: clusters with a free link
(3) Autonomic data migration: from the hot cluster to the cold cluster
Shadow cloning can hide the migration overhead
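
The three link-contention steps can be sketched as a planning function. This is a minimal sketch under stated assumptions: per-cluster link stall time is used as the hotness metric, a fixed threshold separates hot from cold clusters, and the hottest cluster is paired with the coldest. The slides do not specify the metric, the threshold, or the pairing policy.

```python
def plan_migrations(link_stall: dict, hot_threshold: float) -> dict:
    """Pair each hot cluster with a cold cluster to receive migrated data.

    link_stall: cluster name -> stall time caused by link contention
                (the metric and its unit are assumptions).
    Returns {hot_cluster: cold_cluster}.
    """
    # (1) Hot cluster detection: stall time above the threshold, hottest first.
    hot = sorted((c for c, s in link_stall.items() if s > hot_threshold),
                 key=lambda c: link_stall[c], reverse=True)
    # (2) Cold cluster securement: remaining clusters, least-stalled first.
    cold = sorted((c for c, s in link_stall.items() if s <= hot_threshold),
                  key=lambda c: link_stall[c])
    # (3) Migration plan: hottest source goes to the coldest destination.
    return dict(zip(hot, cold))


stalls = {"A": 90, "B": 5, "C": 0, "D": 70}  # made-up stall times
print(plan_migrations(stalls, hot_threshold=50))
```

If there are more hot clusters than cold ones, `zip` leaves the surplus hot clusters unpaired; a real policy would need to decide what to do with them.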

Storage Contention Management
[Figure: a switch with endpoints and FIMM-1 to FIMM-4; REQ-1 and REQ-2 are issued while several REQ-3 requests are stalled in the queue behind a laggard]
Laggard detection: I/O stalled due to storage contention
Autonomic data-layout reshaping for stalled I/O in the queue

Storage Contention Management
[Figure: the same setup; the stalled REQ-3 requests are issued after redirection (3 -> 4, 3 -> 2, 3 -> 4)]
Laggard detection: I/O stalled due to storage contention
Autonomic data-layout reshaping for stalled I/O in the queue
  Write I/O: physical data-layout reshaping (to non-laggard neighbors)
  Read I/O: shadow copying (to non-laggard neighbors) and reshaping
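
For the write path, laggard detection and data-layout reshaping can be sketched as follows. The stalled-request threshold, the round-robin choice of neighbor, and the page-to-FIMM mapping table are all assumptions for illustration; the read path (shadow copying) is not modeled.

```python
from itertools import cycle


def detect_laggards(stalled_per_fimm: dict, threshold: int) -> set:
    """FIMMs with at least `threshold` stalled requests (threshold is assumed)."""
    return {f for f, n in stalled_per_fimm.items() if n >= threshold}


def reshape_writes(stalled_writes, laggards, fimms, mapping):
    """Redirect stalled writes from laggards to non-laggard neighbors.

    stalled_writes: list of (logical_page, target_fimm)
    mapping: logical_page -> fimm, updated in place so that later reads
             can find the reshaped data.
    Returns the list of (logical_page, fimm) actually issued.
    """
    neighbors = [f for f in fimms if f not in laggards]
    if not neighbors:
        return list(stalled_writes)  # nowhere to redirect; keep the layout
    targets = cycle(neighbors)       # round-robin policy is an assumption
    issued = []
    for page, fimm in stalled_writes:
        if fimm in laggards:
            fimm = next(targets)
            mapping[page] = fimm
        issued.append((page, fimm))
    return issued


fimms = [1, 2, 3, 4]
laggards = detect_laggards({1: 0, 2: 0, 3: 3, 4: 0}, threshold=2)
mapping = {100: 3, 101: 3, 102: 3}
print(reshape_writes([(100, 3), (101, 3), (102, 3)], laggards, fimms, mapping))
print(mapping)
```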

Experimental Setup
Flash array network simulation model
  Captures PCI-E-specific characteristics
    Data movement delay, switching and routing latency (PLX 3734), contention cycles
  Configures diverse system parameters
  Will be made publicly available (an open-source framework is in preparation)
Baseline all-flash array configuration
  4 switches x 16 endpoints x 4 FIMMs (64 GB each) = 16 TB
  80-cluster, 320-FIMM network evaluation
Workloads
  Enterprise workloads (cfs, fin, hm, mds, msnfs, …)
  HPC workload (eigensolver simulated at LBNL supercomputer)
  Micro-benchmarks (read/write, sequential/random)
[Figure labels: 150 ns, 1000 queues]
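
The baseline capacity and the size of the larger network follow from simple arithmetic, sketched below. It assumes that "16 endpoints" means 16 per switch, that 1 TB = 1024 GB, and that a cluster is one endpoint with its four FIMMs.

```python
SWITCHES = 4
ENDPOINTS_PER_SWITCH = 16   # assumption: 16 endpoints per switch
FIMMS_PER_ENDPOINT = 4
FIMM_CAPACITY_GB = 64

# Baseline: 4 x 16 x 4 FIMMs of 64 GB each.
baseline_fimms = SWITCHES * ENDPOINTS_PER_SWITCH * FIMMS_PER_ENDPOINT
baseline_capacity_tb = baseline_fimms * FIMM_CAPACITY_GB / 1024

# Larger network: 80 clusters, each assumed to be one endpoint with 4 FIMMs.
large_network_fimms = 80 * FIMMS_PER_ENDPOINT

print(baseline_fimms, baseline_capacity_tb, large_network_fimms)
```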

Latency Improvement
Triple-A latency is normalized to that of a non-autonomic all-flash array
Real-world workloads: enterprise and HPC I/O traces
On average, 5x shorter latency
Specific workloads (cfs and web) generate no hot clusters

Throughput Improvement
System throughput is reported as normalized Triple-A IOPS
On average, 6x higher IOPS
Specific workloads (cfs and web) generate no hot clusters
Triple-A boosts the storage system by resolving contention

Queue Stall Time Decrease
Queue stall time comes from the two resource contentions
On average, stall time is shortened by 81%
According to our analysis, Triple-A dramatically decreases link-contention time
msnfs shows a low ratio of I/O on hot clusters

Network Size Sensitivity
[Figure: execution time of the non-autonomic array and Triple-A as the number of clusters (endpoints) increases]
Execution time is broken down into stall times and storage latency
Triple-A performs better on larger networks
Stall times of PCI-E components are effectively reduced
FIMM latency is outside Triple-A's scope

Related Works (1)
Market products (SSD array)
  [Pure Storage] one-large-pool storage system with 100% NAND-flash-based SSDs
  [Texas Memory Systems] 2D flash-RAID
  [Violin Memory] flash memory array of 1000s of flash cells
Academic studies (SSD array)
  [A.M.Caulfield, ISCA’13] proposed an SSD-based storage area network (QuickSAN) by integrating a network adapter into SSDs
  [A.Akel, Hotstorage’11] proposed a prototype of a PCM-based storage array (Onyx)
  [A.M.Caulfield, MICRO’10] proposed a high-performance storage array architecture for emerging non-volatile memories

Related Works (2)
Academic studies (SSD RAID)
  [M.Balakrishnan, TS’10] proposed an SSD-optimized RAID for better reliability by creating age disparities within arrays
  [S.Moon, Hotstorage’13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential
Academic studies (NVM usage for HPC)
  [A.M.Caulfield, ASPLOS’09] applied flash memory to clusters for performance and power consumption
  [A.M.Caulfield, SC’10] explored the impact of NVMs on HPC

Conclusions
Challenge: an SSD array might not be suitable for high-performance storage
Our goal: propose a new high-performance storage architecture
Observations
  High maintenance cost: caused by replacing worn-out flash SSDs
  Performance degradation: caused by contention for shared resources
Key ideas
  Cost reduction: take the bare NAND flash out of the SSD box
  Contention resolution: redistribute the excessive I/O that creates bottlenecks
Triple-A: a new architecture suitable for HPC storage
  Consists of non-SSD bare flash memories
  Automatically detects and resolves performance bottlenecks
Results: non-SSD all-flash arrays are expected to save 35-50% of cost and offer 53% higher throughput than traditional SSD arrays

Backup

Data Migration Overhead
[Figure: naïve migration]
A data migration comprises a series of steps
  (1) Data read from the source FIMM
  (2) Data move to the parent switch
  (3) Data move to the target endpoint
  (4) Data write to the target FIMM
Naïve data migration shares all-flash array resources with normal I/O requests
  I/O latency is delayed by the resulting resource contention

Data Migration Overhead
[Figure: naïve migration vs. shadow cloning]
The data read of a migration (its first step) seriously hurts system performance
Shadow cloning overlaps a normal read I/O request with the data read of the migration
Shadow cloning successfully hides the data migration overhead and minimizes system performance degradation
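
The difference between naïve migration and shadow cloning can be illustrated with a toy cost model. It assumes that shadow cloning removes only the separate source read, because that read is served by a normal read request already in flight. The step names and latency values are made up for illustration and are not measurements from the evaluation.

```python
MIGRATION_STEPS = (
    "read_source_fimm",    # (1) data read from the source FIMM
    "move_to_switch",      # (2) data move to the parent switch
    "move_to_endpoint",    # (3) data move to the target endpoint
    "write_target_fimm",   # (4) data write to the target FIMM
)


def migration_overhead(step_cost: dict, shadow_cloning: bool) -> float:
    """Extra work a migration adds on top of normal I/O.

    With shadow cloning, step (1) is assumed to be free because it overlaps
    with a normal read of the same data.
    """
    steps = MIGRATION_STEPS[1:] if shadow_cloning else MIGRATION_STEPS
    return sum(step_cost[s] for s in steps)


# Hypothetical per-step costs in microseconds.
cost = {"read_source_fimm": 50, "move_to_switch": 5,
        "move_to_endpoint": 5, "write_target_fimm": 200}
print(migration_overhead(cost, shadow_cloning=False))
print(migration_overhead(cost, shadow_cloning=True))
```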

Real Workload Latency (1)
[Figure: CDFs of workload latency for the non-autonomic all-flash array and Triple-A; panels: proj, msnfs]
Triple-A significantly improves I/O request latency
Latency improvement is relatively low in msnfs
  The ratio of I/O requests heading to hot clusters is not very high
  Hot clusters are detected, but they are not that hot

Real Workload Latency (2)
[Figure: workload latency; panels: prxy, websql]
prxy gets a large latency improvement from Triple-A
websql benefits less than expected
  It has more and hotter clusters than prxy
  But all of its hot clusters are located under the same switch
In addition to (1) hotness, (2) the balance of I/O requests among switches determines the effectiveness of Triple-A

Network Size Sensitivity
[Figure: contention times of Triple-A, normalized to the non-autonomic all-flash array]
Triple-A successfully reduces both contention times
  By distributing the extra load of hot clusters
  Data migration and physical data reshaping
Link contention time is almost completely eliminated
Storage contention time is steadily reduced
  It is bounded by the number of I/O requests to the target clusters

[Figure: two SSD groups (SSD-1 to SSD-4 and SSD-5 to SSD-8); a request for Dest-3 finds its SSD READY, a request for Dest-8 finds its SSD BUSY]
Regardless of the array condition, individual SSDs can be busy or idle (ready to serve a new I/O)
When the SSD an I/O is destined for is ready, the I/O is serviced right away
When the SSD an I/O is destined for is busy, the I/O must wait
