Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device. Shuang Liang, Ranjit Noronha, Dhabaleswar K. Panda. IEEE International Conference on Cluster Computing (Cluster 2005).

Presentation transcript:

Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device
Shuang Liang, Ranjit Noronha, Dhabaleswar K. Panda
IEEE International Conference on Cluster Computing (Cluster 2005), Burlington, Massachusetts, Sept. 2005
Reporter: Po-Sen Wang (王柏森)

Abstract
In this paper, we take on the challenge of designing a remote paging system for remote memory utilization in InfiniBand clusters. We present the design and implementation of a high performance network block device (HPBD) over the InfiniBand fabric, which serves as a swap device for the kernel Virtual Memory (VM) system, enabling efficient page transfer to and from remote memory servers.

Outline
Introduction
Background
Designing HPBD
Implementation
Performance Experiments and Evaluation
Conclusion

Introduction
Even with the dramatic increase in memory capacities, modern applications are quickly keeping pace with, and even exceeding, the resources of these systems. In these situations, systems with virtual memory management start to swap memory regions to and from the disk.

Introduction (cont.)
Modern networking technologies such as InfiniBand, Myrinet, and Quadrics offer latencies of a few microseconds and throughput of up to 10 Gbps. Their Remote Direct Memory Access (RDMA) operations, which incur low CPU utilization, provide a new way to utilize remote resources to improve local system performance.

Background
InfiniBand Overview
Linux Swapping Mechanism
Network Block Device

InfiniBand Overview
In an InfiniBand network, compute nodes are connected to the fabric by Host Channel Adapters (HCAs). The HCA exposes a queue-pair-based transport-layer interface to the host:
–The send queue keeps control information for outgoing messages.
–The receive queue keeps descriptors for incoming messages.
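The queue-pair interface described above is visible in the userspace verbs API. Below is a minimal sketch, assuming libibverbs, that creates a reliably connected queue pair and posts one receive descriptor; connection setup, memory registration, and error handling are omitted, and setup_qp/post_recv are illustrative names, not part of the paper.

```c
/* Minimal sketch of the queue-pair interface, assuming libibverbs.
 * Connection setup, memory registration, and error handling are omitted. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_qp *setup_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;            /* completions for outgoing messages */
    attr.recv_cq = cq;            /* completions for incoming messages */
    attr.qp_type = IBV_QPT_RC;    /* reliable connection transport     */
    attr.cap.max_send_wr = 64;    /* send queue depth                  */
    attr.cap.max_recv_wr = 64;    /* receive queue depth               */
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &attr);
}

/* Post one receive descriptor so an incoming message has a buffer to land in. */
int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id = (uintptr_t)buf, .sg_list = &sge, .num_sge = 1,
    };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);
}
```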

InfiniBand Overview (cont.)

Linux Swapping Mechanism
The device driver for each swap device serves swap requests as normal I/O requests and handles device-specific operations. This mechanism places remote memory between local memory and the local disk in the memory hierarchy, with the kernel's caching mechanisms enabled at no additional cost.
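To make this concrete, the sketch below shows how a block device is handed to the kernel VM as a swap device via the swapon(2) system call; the device path /dev/hpbd0 and the priority value are hypothetical. Giving the remote-memory device a higher swap priority than the disk partition is what places it between local memory and local disk in the hierarchy.

```c
/* Sketch: enable a (hypothetical) network block device as swap from
 * userspace, after `mkswap` has initialized it. */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
    /* Priority 10 (higher than a typical disk swap partition), so the
     * kernel prefers the remote-memory device over disk. */
    if (swapon("/dev/hpbd0",
               (10 << SWAP_FLAG_PRIO_SHIFT) | SWAP_FLAG_PREFER) != 0) {
        perror("swapon");
        return 1;
    }
    return 0;
}
```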

Network Block Device
A network block device is a software emulation of local block storage using remote resources at the block level. The idea is to provide a local block-level interface to the upper OS management layers, while allocating and deallocating remote resources over the network.
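As an illustration of that block-level interface, a network block device protocol typically reduces each kernel request to a (type, offset, length) tuple plus payload. The structs below are a hedged sketch of such a protocol, not the actual HPBD wire format.

```c
/* Illustrative request/reply headers for a network block device;
 * field layout is a sketch, not the HPBD protocol. */
#include <stdint.h>

enum nbd_req_type { REQ_READ = 0, REQ_WRITE = 1 };

struct nbd_request {
    uint32_t type;     /* REQ_READ or REQ_WRITE                  */
    uint64_t handle;   /* matches a reply back to its request    */
    uint64_t offset;   /* byte offset within the emulated device */
    uint32_t length;   /* transfer size in bytes                 */
    /* for REQ_WRITE, `length` bytes of payload follow           */
};

struct nbd_reply {
    uint64_t handle;   /* echoes the request handle              */
    uint32_t error;    /* 0 on success                           */
    /* for REQ_READ, `length` bytes of payload follow            */
};
```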

Designing HPBD
RDMA Operations and Remote Server Design
Registration Buffer Pool Management
Event Based Asynchronous Communication
Multiple Server Support

RDMA Operations and Remote Server Design
In HPBD, there are two types of messages: control messages and data messages.
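The slides do not detail how the two message types map onto InfiniBand operations; a common split, assumed here, is send/receive for small control messages and one-sided RDMA for bulk page data. The sketch below posts a single RDMA write of a 4 KB page to a remote buffer whose address and rkey would have arrived in an earlier control message.

```c
/* Sketch: one-sided RDMA write of a page, assuming libibverbs and that
 * remote_addr/rkey were exchanged via a control message beforehand. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int rdma_write_page(struct ibv_qp *qp, struct ibv_mr *mr, void *page,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)page, .length = 4096, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = (uintptr_t)page;
    wr.opcode = IBV_WR_RDMA_WRITE;      /* one-sided: no remote CPU involvement */
    wr.send_flags = IBV_SEND_SIGNALED;  /* generate a completion when done      */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;
    return ibv_post_send(qp, &wr, &bad);
}
```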

Registration Buffer Pool Management
The registration buffer pool is a pre-registered memory area for data message transfers. Memory buffers are allocated from the pool by a first-fit algorithm.

Registration Buffer Pool Management (cont.)
One problem with this allocation algorithm is external fragmentation of the registration buffer pool. To address it, a merging algorithm is applied at buffer deallocation time:
–It checks the neighboring regions of the freed buffer and merges with them if they are free (see the sketch below).
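A minimal sketch of first-fit allocation with merge-on-free over a doubly linked list of regions kept in address order; splitting of oversized regions, locking, and HCA registration are omitted, and all names are illustrative.

```c
/* Sketch: first-fit allocation with coalescing of free neighbors. */
#include <stddef.h>
#include <stdbool.h>

struct region {
    struct region *prev, *next;  /* neighbors in address order */
    size_t size;
    bool free;
};

struct region *alloc_region(struct region *head, size_t size)
{
    for (struct region *r = head; r; r = r->next)
        if (r->free && r->size >= size) {   /* first fit           */
            r->free = false;                /* (splitting omitted) */
            return r;
        }
    return NULL;                            /* pool exhausted      */
}

void free_region(struct region *r)
{
    r->free = true;
    /* merge with the next neighbor if it is free */
    if (r->next && r->next->free) {
        struct region *n = r->next;
        r->size += n->size;
        r->next = n->next;
        if (n->next) n->next->prev = r;
    }
    /* then merge into the previous neighbor if it is free */
    if (r->prev && r->prev->free) {
        r->prev->size += r->size;
        r->prev->next = r->next;
        if (r->next) r->next->prev = r->prev;
    }
}
```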

Event Based Asynchronous Communication
The client side performs asynchronous communication using two threads:
–One thread sends requests to the servers as soon as they are issued by the kernel.
–The other thread receives replies from the servers.

Event Based Asynchronous Communication (cont.)
The receiver works in a bursty manner:
–It sleeps until a receive completion event is triggered.
–When it wakes up, it processes all available replies, then goes back to sleep until the next event.
The server works in a similar way.
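This bursty pattern corresponds closely to the completion-channel idiom in libibverbs: block on an event, re-arm notification, then drain every available completion. A sketch follows, with handle_reply() as a hypothetical callback.

```c
/* Sketch of the receiver thread's event loop, assuming libibverbs. */
#include <infiniband/verbs.h>

void handle_reply(struct ibv_wc *wc);   /* hypothetical: process one reply */

void receiver_loop(struct ibv_comp_channel *channel)
{
    struct ibv_cq *cq;
    void *cq_ctx;
    struct ibv_wc wc;

    for (;;) {
        /* sleep until the HCA signals a completion event */
        if (ibv_get_cq_event(channel, &cq, &cq_ctx))
            break;
        ibv_ack_cq_events(cq, 1);
        /* re-arm notification before draining, so a completion
         * arriving in between is not missed */
        ibv_req_notify_cq(cq, 0);
        /* process every reply that is already available */
        while (ibv_poll_cq(cq, 1, &wc) > 0)
            handle_reply(&wc);
    }
}
```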

Multiple Server Support
Multiple server support allows several nodes to export their memory to meet page-store demands. With multiple servers, load balancing becomes a new design issue.

Multiple Server Support (cont.)
Also, because of InfiniBand's high bandwidth, splitting a single request into multiple smaller ones may offset the benefit as well. Thus we choose a non-striping scheme and distribute the swap area across the servers in a blocking pattern, as sketched below.
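A sketch of the resulting non-striped layout: the swap area is divided into one large contiguous chunk per server, so every request maps to exactly one server. NUM_SERVERS and chunk_bytes are illustrative values, not taken from the paper.

```c
/* Sketch: map a swap-area offset to a server under a blocking
 * (contiguous-chunk) distribution, i.e., no striping. */
#include <stdint.h>

#define NUM_SERVERS 4

/* bytes handled by each server; set at device setup time,
 * e.g., device_size / NUM_SERVERS */
static uint64_t chunk_bytes;

static inline int server_for_offset(uint64_t byte_offset)
{
    return (int)(byte_offset / chunk_bytes);   /* contiguous, not striped */
}

static inline uint64_t offset_within_server(uint64_t byte_offset)
{
    return byte_offset % chunk_bytes;
}
```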

Implementation
In the HPBD client driver, we associate each minor device with an IBA context, which contains the IBA communication-specific state, such as HCA information, the completion queue, the shared registered memory pool, and the queue pair arrays.
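A hedged sketch of such a per-minor-device context is shown below. Field names are illustrative, not taken from the HPBD source, and the userspace verbs types are used for brevity; the actual in-kernel driver would use the kernel verbs API (ib_* types).

```c
/* Illustrative per-minor-device IBA context for an HPBD-like driver. */
#include <infiniband/verbs.h>

#define MAX_SERVERS 16

struct hpbd_ctx {
    struct ibv_context      *hca;        /* opened HCA                     */
    struct ibv_pd           *pd;         /* protection domain              */
    struct ibv_cq           *cq;         /* shared completion queue        */
    struct ibv_comp_channel *channel;    /* event channel for the receiver */
    struct ibv_mr           *pool_mr;    /* registered buffer pool         */
    void                    *pool;
    struct ibv_qp *qps[MAX_SERVERS];     /* one queue pair per server      */
    int            num_servers;
};
```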

Performance Experiments and Evaluation
Micro-benchmark Performance Results

Performance Experiments and Evaluation (cont.)
Application Performance Results

Performance Experiments and Evaluation (cont.)
Multiple Server Performance

Conclusion
Our experimental results show that with HPBD for remote paging, quicksort runs only 1.45 times slower than with the local memory system, and up to 21 times faster than when swapping to local disk. We also identify host overhead as a key issue for further performance improvement of remote paging over clusters with high-performance interconnects.