
Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems
Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)

Executive Summary
Challenge: an SSD array might not be suitable for high-performance computing storage
Our goal: propose a new high-performance storage architecture
Observations
High maintenance cost: caused by worn-out flash-SSD replacements
Performance degradation: caused by shared-resource contention
Key Ideas
Cost reduction: take bare NAND flash out of the SSD box
Contention resolution: distribute the excessive I/Os that create bottlenecks
Triple-A: a new architecture for HPC storage
Consists of non-SSD bare flash memories
Automatically detects and resolves performance bottlenecks
Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than a traditional SSD array

Outline Motivations Triple-A Architecture Triple-A Management Evaluations Conclusions

HPC Starts to Employ SSDs
[Figure: HPC deployments - SSD caches on HDD arrays, SSD buffers on compute nodes, and all-SSD arrays]
SSD arrays are in a position to (partially) replace HDD arrays

High-Cost Maintenance of SSD Arrays
[Figure: SSD lifecycle - live, worn-out, dead, replaced, abandoned]
As time goes by, worn-out SSDs must be replaced
The thrown-away SSD has complex internals
Only the flash memories are worn out; the other parts are still useful

I/O Services Suffer in SSD Arrays
We vary the data locality in an array consisting of 80 SSDs
A hot region is a group of SSDs holding 10% of the total data
Arrays without a hot region show reasonable latency
As the number of hot regions increases, the performance of the SSD array degrades

Why Is Latency Delayed? Link Contention
[Figure: two groups of SSDs (SSD-1 to SSD-4, SSD-5 to SSD-8), each sharing an idle data bus, with requests destined to various SSDs]
A single data bus is shared by a group of SSDs
When the target SSD is ready and the shared bus is idle, the I/O request is serviced right away
The problem arises when excessive I/Os are destined to a specific group of SSDs

Why Is Latency Delayed? Link Contention
[Figure: the shared bus of SSD-1 to SSD-4 is busy, so requests stall even though the target SSDs are ready]
When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready
This stall occurs because the SSDs in a group share a data bus: link contention

Why Is Latency Delayed? Storage Contention
[Figure: many requests (Dest-8) queued for a single SSD while other SSDs sit ready]
The problem arises when excessive I/Os are destined to a specific SSD

Why Is Latency Delayed? Storage Contention
[Figure: SSD-8 stays busy, so its queued requests stall even though the link is available]
When excessive I/Os are destined to a specific SSD and that SSD is busy, the I/O requests must stay in the buffer even though the link is available
This stall occurs because a specific SSD is continuously busy: storage contention
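To make the two stall conditions concrete, here is a minimal sketch (not from the talk) of the admission check an array controller could apply per request; the class, the grouping, and the busy sets are illustrative assumptions.

# Illustrative sketch: a request is serviced immediately only if BOTH the shared
# link of its group and the target SSD are free; otherwise it stalls in the buffer.
class Request:
    def __init__(self, target_ssd, group):
        self.target_ssd = target_ssd   # e.g. "SSD-8"
        self.group = group             # group of SSDs sharing one data bus (link)

def can_issue(req, busy_links, busy_ssds):
    link_free = req.group not in busy_links       # otherwise: link contention
    ssd_free = req.target_ssd not in busy_ssds    # otherwise: storage contention
    return link_free and ssd_free

# Example: the link of group 0 (SSD-1..4) is busy, and SSD-8 itself is busy.
busy_links = {0}
busy_ssds = {"SSD-8"}
print(can_issue(Request("SSD-6", 1), busy_links, busy_ssds))  # True: serviced right away
print(can_issue(Request("SSD-2", 0), busy_links, busy_ssds))  # False: stalls (link contention)
print(can_issue(Request("SSD-8", 1), busy_links, busy_ssds))  # False: stalls (storage contention)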

Outline Motivations Triple-A Architecture Triple-A Management Evaluations Conclusions

Unboxing the SSD for Cost Reduction
[Figure: SSD internals - the host interface controller, flash controllers, microprocessors, DRAM buffers, and firmware (35~50% of total SSD cost) are still useful and reusable; the bare NAND flash packages are what wear out and need replacement]
Worn-out flash packages should be replaced
Much of the logic in an SSD, including the H/W controllers and firmware, is wasted when a worn-out SSD is replaced
Instead of a whole SSD, let's use only the bare flash packages

Use of Unboxed Flash Packages: FIMM
Multiple NAND flash packages are integrated onto a board
It looks like a passive memory device such as a DIMM, and is referred to as a Flash Inline Memory Module (FIMM)
Control signals and pin assignments follow the NV-DDR2 interface defined by ONFi
For convenient replacement of worn-out FIMMs, a FIMM has a hot-swappable connector
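As a rough mental model of how host software might describe such a module, a minimal sketch; the field names and the package count are illustrative, and the real pin and timing definitions come from the ONFi NV-DDR2 specification.

from dataclasses import dataclass

# Illustrative descriptor of a FIMM as array-level software might track it.
@dataclass
class FIMMDescriptor:
    slot_id: int              # which hot-swappable slot the FIMM occupies
    num_packages: int         # NAND flash packages integrated on the board
    interface: str = "ONFi NV-DDR2 (78-pin)"
    hot_swappable: bool = True

fimm = FIMMDescriptor(slot_id=0, num_packages=8)   # package count is a made-up example
print(fimm)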

How Are FIMMs Connected?
[Figure: PCI-E tree - a root complex at the host, switches in the middle layer, and endpoints with FIMMs attached at the leaves]
PCI-E technology provides a high-performance interconnect
Root complex – the component where I/O starts
Switch – a middle-layer component
Endpoint – where FIMMs are directly attached
Link – the bus connecting components
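A minimal sketch of that tree, assuming a simple nested-dictionary representation; the fan-outs below are placeholders, not the paper's configuration.

# Illustrative PCI-E fabric: root complex -> switches -> endpoints -> FIMMs.
def build_fabric(num_switches=2, endpoints_per_switch=2, fimms_per_endpoint=4):
    return {
        "root_complex": {
            f"switch-{s}": {
                f"endpoint-{s}-{e}": [f"FIMM-{s}-{e}-{f}" for f in range(fimms_per_endpoint)]
                for e in range(endpoints_per_switch)
            }
            for s in range(num_switches)
        }
    }

fabric = build_fabric()
# Each parent-child edge in this tree corresponds to a PCI-E link.
for switch, endpoints in fabric["root_complex"].items():
    for endpoint, fimms in endpoints.items():
        print(switch, "->", endpoint, "->", fimms)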

Connection between FIMMs and PCI-E
The PCI-E endpoint is where the PCI-E fabric and the FIMMs meet
Front end: the PCI-E protocol facing the PCI-E fabric
Back end: the ONFi NV-DDR2 interface facing the FIMMs
An endpoint consists of three parts
PCI-E device layers: handle the PCI-E interface
Control logic: handles the FIMMs over the ONFi interface
Upstream/downstream buffers: manage traffic in each direction

Connection between FIMMs and PCI-E
Communication example:
(1) A PCI-E packet arrives at the target endpoint
(2) The PCI-E device layers disassemble the packet
(3) The disassembled packet is enqueued into the downstream buffer
(4) The HAL dequeues the packet and constructs a NAND flash command
[Figure: hot-swappable connector for FIMMs, an ONFi 78-pin NV-DDR2 slot]
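A minimal sketch of that downstream path; the helper names (disassemble, build_nand_command) and the packet fields stand in for the endpoint's device layers and HAL and are assumptions, not the paper's interfaces.

from collections import deque

downstream_buffer = deque()

def disassemble(pcie_packet):
    # (2) The PCI-E device layers strip the transaction-layer framing (illustrative).
    return {"op": pcie_packet["op"], "lba": pcie_packet["lba"], "data": pcie_packet.get("data")}

def build_nand_command(payload):
    # (4) The HAL turns the payload into a NAND flash command issued over ONFi (illustrative).
    return ("NAND_PROGRAM" if payload["op"] == "write" else "NAND_READ", payload["lba"])

# (1) A PCI-E packet arrives at the target endpoint.
packet = {"op": "write", "lba": 0x1234, "data": b"..."}
# (3) The disassembled payload is enqueued into the downstream buffer.
downstream_buffer.append(disassemble(packet))
# (4) The HAL dequeues it and constructs the NAND flash command.
print(build_nand_command(downstream_buffer.popleft()))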

Triple-A Architecture
[Figure: multi-cores and DRAMs above a PCI-E fabric of root complexes, switches, and endpoints with FIMMs attached]
PCI-E allows the architect to choose any configuration
The endpoint is where FIMMs are directly attached
Triple-A comprises a set of FIMMs connected over PCI-E
The useful parts of SSDs are aggregated on top of the PCI-E fabric

Triple-A Architecture
[Figure: hosts/compute nodes connect to a management module that sits on top of the PCI-E fabric of root complexes, switches, endpoints, and FIMMs]
Flash control logic is also moved out of the SSD internals
Address translation, garbage collection, the I/O scheduler, and so on
Autonomic I/O contention management
The Triple-A architecture interacts with hosts or compute nodes
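A sketch of how the relocated flash control logic could be organized inside the management module; the class and method names are illustrative assumptions, not the paper's interfaces.

# Illustrative skeleton of the management module that replaces per-SSD firmware.
class ManagementModule:
    def __init__(self):
        self.l2p = {}        # address translation: logical block -> (endpoint, fimm, page)
        self.io_queue = []   # simple FIFO stand-in for the array-wide I/O scheduler

    def translate(self, lba):
        # Address translation, normally done inside each SSD's FTL, now done array-wide.
        return self.l2p.get(lba)

    def schedule(self, request):
        # I/O scheduling across the whole PCI-E fabric.
        self.io_queue.append(request)

    def garbage_collect(self, fimm):
        # Placeholder: reclaim invalidated flash pages on one FIMM.
        pass

    def manage_contention(self, stats):
        # Autonomic part: detect hot clusters / laggards and trigger migration or reshaping.
        pass

mm = ManagementModule()
mm.schedule({"op": "read", "lba": 7})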

Outline Motivations Triple-A Architecture Triple-A Management Evaluations Conclusions

Link Contention Management
[Figure: endpoints and their FIMMs under a PCI-E switch; the cluster whose shared data bus is busy is the hot cluster]
(1) Hot cluster detection – I/Os stalled due to link contention

Link Contention Management
[Figure: data is migrated from the busy hot cluster to an idle cold cluster; shadow cloning can hide the migration overhead]
(1) Hot cluster detection – I/Os stalled due to link contention
(2) Cold cluster securement – clusters with a free link
(3) Autonomic data migration – from the hot cluster to a cold cluster
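A minimal sketch of the hot/cold decision, assuming per-cluster counters of link-stalled I/Os and a hypothetical threshold; the actual detector and migration policy in Triple-A are not specified at this level of detail here.

# Illustrative hot/cold cluster selection for link contention management.
link_stalls = {"cluster-0": 120, "cluster-1": 3, "cluster-2": 0}   # stalled I/Os per cluster
HOT_THRESHOLD = 50                                                  # assumed threshold

hot = [c for c, s in link_stalls.items() if s >= HOT_THRESHOLD]     # (1) hot cluster detection
cold = [c for c, s in link_stalls.items() if s == 0]                # (2) cold cluster securement

for hot_cluster in hot:
    if cold:
        target = cold[0]
        # (3) Autonomic data migration from the hot cluster to a cold one;
        # shadow cloning would piggyback the source reads on normal read I/Os.
        print(f"migrate hot data: {hot_cluster} -> {target}")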

Storage Contention Management
[Figure: requests to a laggard FIMM (FIMM-3) stall in the queue while requests to other FIMMs are issued]
(1) Laggard detection – I/Os stalled due to storage contention
(2) Autonomic data-layout reshaping for the stalled I/Os in the queue

Storage Contention Management
[Figure: stalled requests to the laggard FIMM-3 are redirected to its neighbors (3 → 4, 3 → 2) and issued]
(1) Laggard detection – I/Os stalled due to storage contention
(2) Autonomic data-layout reshaping for the stalled I/Os in the queue
Write I/O – physical data-layout reshaping (to non-laggard neighbors)
Read I/O – shadow copying (to non-laggard neighbors) and reshaping
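A sketch of the per-request handling under the description above (a laggard is a FIMM whose queued I/Os keep stalling); the neighbor map and the redirection rule are illustrative assumptions.

# Illustrative redirection of stalled requests away from a laggard FIMM.
laggard = "FIMM-3"
neighbors = {"FIMM-3": ["FIMM-4", "FIMM-2"]}   # non-laggard neighbors (assumed layout)

def reshape(request, i):
    target = neighbors[laggard][i % len(neighbors[laggard])]
    if request["op"] == "write":
        # Write I/O: physically reshape the data layout onto a non-laggard neighbor.
        return {**request, "fimm": target, "action": "reshape-write"}
    else:
        # Read I/O: shadow-copy the data to a non-laggard neighbor, then reshape the layout.
        return {**request, "fimm": target, "action": "shadow-copy-then-reshape"}

stalled = [{"op": "write", "fimm": "FIMM-3"}, {"op": "read", "fimm": "FIMM-3"}]
for i, req in enumerate(stalled):
    print(reshape(req, i))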

Outline Motivations Triple-A Architecture Triple-A Management Evaluations Conclusions

Experimental Setup
Flash array network simulation model
Captures PCI-E specific characteristics: data movement delay, switching and routing latency (PLX 3734), contention cycles
Configures diverse system parameters
Will be publicly available (an open-source framework is in preparation)
Baseline all-flash array configuration
4 switches x 16 endpoints x 4 FIMMs (64GB) = 16TB
80-cluster, 320-FIMM network evaluation
Workloads
Enterprise workloads (cfs, fin, hm, mds, msnfs, …)
HPC workload (eigensolver, simulated at an LBNL supercomputer)
Micro-benchmarks (read/write, sequential/random)
150 ns, 1000 queues
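For reference, the baseline capacity follows directly from the listed configuration; a quick arithmetic check, assuming 1 TB = 1024 GB.

# Baseline all-flash array: 4 switches x 16 endpoints x 4 FIMMs of 64 GB each.
switches, endpoints_per_switch, fimms_per_endpoint, fimm_gb = 4, 16, 4, 64
total_fimms = switches * endpoints_per_switch * fimms_per_endpoint   # 256 FIMMs
total_tb = total_fimms * fimm_gb / 1024                              # 16.0 TB
print(total_fimms, "FIMMs,", total_tb, "TB")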

Latency Improvement
Triple-A latency normalized to a non-autonomic all-flash array
Real-world workloads: enterprise and HPC I/O traces
On average, 5x shorter latency
Specific workloads (cfs and web) generate no hot clusters

Throughput Improvement
Triple-A IOPS normalized to the non-autonomic all-flash array (system throughput)
On average, 6x higher IOPS
Specific workloads (cfs and web) generate no hot clusters
Triple-A boosts the storage system by resolving contentions

Queue Stall Time Decrease
Queue stall time comes from the two resource contentions
On average, stall time is shortened by 81%
According to our analysis, Triple-A dramatically decreases link-contention time
msnfs shows a low ratio of I/Os heading to hot clusters

Network Size Sensitivity
[Figure: non-autonomic array vs. Triple-A as the number of clusters (endpoints) increases]
Execution time is broken down into stall times and storage latency
Triple-A shows better performance on larger networks
Stall times at the PCI-E components are effectively reduced
FIMM latency is outside Triple-A's concern

Related Works (1)
Market products (SSD arrays)
[Pure Storage] a one-large-pool storage system built from 100% NAND-flash-based SSDs
[Texas Memory Systems] 2D flash-RAID
[Violin Memory] a flash memory array of 1000s of flash cells
Academic studies (SSD arrays)
[A.M. Caulfield, ISCA'13] proposed an SSD-based storage area network (QuickSAN) by integrating the network adapter into SSDs
[A. Akel, HotStorage'11] proposed a prototype of a PCM-based storage array (Onyx)
[A.M. Caulfield, MICRO'10] proposed a high-performance storage array architecture for emerging non-volatile memories

Related Works (2)
Academic studies (SSD RAID)
[M. Balakrishnan, TS'10] proposed an SSD-optimized RAID for better reliability by creating age disparities within arrays
[S. Moon, HotStorage'13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential
Academic studies (NVM usage for HPC)
[A.M. Caulfield, ASPLOS'09] exploited flash memory in clusters for performance and power consumption
[A.M. Caulfield, SC'10] explored the impact of NVMs on HPC

Conclusions
Challenge: an SSD array might not be suitable for high-performance storage
Our goal: propose a new high-performance storage architecture
Observations
High maintenance cost: caused by worn-out flash-SSD replacements
Performance degradation: caused by shared-resource contention
Key Ideas
Cost reduction: take bare NAND flash out of the SSD box
Contention resolution: distribute the excessive I/Os that create bottlenecks
Triple-A: a new architecture suitable for HPC storage
Consists of non-SSD bare flash memories
Automatically detects and resolves performance bottlenecks
Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than traditional SSD arrays

Backup

Data Migration Overhead
Naïve migration
A data migration comprises a series of steps:
(1) Data read from the source FIMM
(2) Data movement to the parent switch
(3) Data movement to the target endpoint
(4) Data write to the target FIMM
Naïve data migration shares the all-flash array resources with normal I/O requests
I/O latency is delayed by the resulting resource contention

Data Migration Overhead
Naïve migration vs. shadow cloning
The data read of a migration (the first step) seriously hurts system performance
Shadow cloning overlaps normal read I/O requests with the data read of the migration
Shadow cloning successfully hides the data migration overhead and minimizes the performance degradation
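A minimal sketch of the idea, assuming the migration's source read can be satisfied as a by-product of a normal read that already touches the same data; the function and variable names are illustrative.

# Illustrative shadow cloning: piggyback the migration's source read on a normal
# read I/O so the first (most expensive) migration step adds no extra flash reads.
pending_migrations = {"block-42", "block-77"}   # blocks waiting to be migrated
shadow_copies = {}                              # blocks already captured from normal reads

def serve_read(block, flash_read):
    data = flash_read(block)                    # the normal read the host asked for
    if block in pending_migrations:             # shadow cloning: reuse this data for migration
        shadow_copies[block] = data
        pending_migrations.discard(block)
    return data

fake_flash = {"block-42": b"A", "block-99": b"B"}
serve_read("block-42", fake_flash.__getitem__)  # normal read also feeds the migration
serve_read("block-99", fake_flash.__getitem__)  # unrelated read, nothing cloned
print(sorted(shadow_copies), sorted(pending_migrations))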

Real Workload Latency (1)
[Figure: latency CDFs for proj and msnfs, non-autonomic all-flash array vs. Triple-A]
Triple-A significantly improves I/O request latency
Relatively low latency improvement for msnfs
The ratio of I/O requests heading to hot clusters is not very high
Hot clusters are detected, but they are not that hot

Real Workload Latency (2)
[Figure: latency CDFs for prxy and websql]
prxy experiences a large latency improvement from Triple-A
websql did not benefit as much as expected, despite having more and hotter clusters than prxy
All of its hot clusters are located under the same switch
In addition to (1) hotness, (2) the balance of I/O requests among switches determines the effectiveness of Triple-A

Network Size Sensitivity
Normalized to the non-autonomic all-flash array
Triple-A successfully reduces both contention times
By distributing the extra load of hot clusters (data migration and physical data reshaping)
Link contention time is almost completely eliminated
Storage contention time is steadily reduced; it is bounded by the number of I/O requests to the target clusters

Why Is Latency Delayed? Storage Contention
[Figure: independent SSDs in the array, some ready and some busy]
Regardless of the array condition, individual SSDs can be busy or idle (ready to serve a new I/O)
When the SSD an I/O is destined for is ready, the I/O can get service right away
When the SSD an I/O is destined for is busy, the I/O must wait