DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience
Arkady Kanevsky & Peter Corbett, Network Appliance
Vijay Velusamy & Anthony Skjellum, MPI Software Technology

Why DAFS?
• DAFS for Parallel I/O
  – Industry storage standards for the HPC community
  – Performance, Performance, Performance
  – Reduced TCO (total cost of ownership)
• Performance of an RDMA-based file system
  – Bandwidth
  – Latency
  – CPU overhead
• Transport independence
  – Virtual Interface (VI) Architecture
  – InfiniBand Architecture
  – iWARP
• Network Appliance filer as the DAFS server, for transport independence, performance, and multi-protocol support

What is DAFS?
• Direct Access File System protocol
• A file access protocol designed specifically for high-performance data center file sharing
  – Optimized for high performance
  – Semantics for clustered file sharing environments
• A fundamentally new way for high-performance and cluster applications to access file storage
  – Provides direct application access to transport resources
  – Avoids operating system overhead

What Does DAFS Do?
• File access protocol providing all the features of NFS v3
• Includes NFS v4 features
  – File locking, CIFS sharing, security, etc.
• Adds data sharing features for clusters
  – Clustered applications
  – Graceful fail-over of clustered file servers
  – High volume, optimized I/O applications
  – Efficient multi-client sharing
• Designed for Direct Access (RDMA) transports
  – Optimal use of transport capabilities
  – Transport independent

File Access Methods (diagram): compares three I/O paths across the user, kernel, and hardware layers. Local FS: application buffers → FS switch → local file system → buffer cache → SCSI driver → HBA. NFS: application buffers → FS switch → NFS → buffer cache → TCP/IP packet buffers → NIC driver → NIC. DAFS: application buffers → user-level DAFS client → DAPL → HCA driver → HCA, avoiding the kernel buffer cache and intermediate copies.

Direct Access Performance Benefits
• No data packet fragmentation or reassembly
  – Benefit similar to IP Jumbo Frames, but with larger packets
    - Less transmission overhead, fewer interrupts
    - No ordering and space management issues
    - No data copying to recreate contiguous buffers
• No realignment of data copies
  – Protocol headers and data buffers transmitted separately
  – Allows data alignment to be preserved
• No user/kernel boundary crossing
  – Less system call overhead
• No user/kernel data copies
  – Data transferred directly to application buffers

DAFS Performance (chart): app server CPU cost per operation (CPU sec/op) on a Sun E3500 (400 MHz) running Solaris 2.8; OLTP workload with 66% reads, 4 kB transfers, async I/O. Configurations compared: raw access to direct-attached (FC) storage; direct-attached (FC) storage with a local FS (ufs); direct-attached (FC) storage with a volume manager; DAFS kernel device driver (w/ VI/IP HBA); user-level DAFS client (w/ VI/IP HBA); user-level DAFS client (w/ 4X IB HCA), estimated <20.

Why MPI-IO?
• Parallel file system API
• Combined API for I/O and communication (see the sketch after this slide)
• File I/O and direct storage semantic support
• File Info for file “partitioning”
• Memory registration for both I/O and communication
• ChaMPIon/Pro for parallelism and portability
  – First commercial MPI-2.1 implementation
  – Scales to thousands and tens of thousands of processors and beyond
  – Multi-device support (including InfiniBand Architecture)
  – Topology awareness
  – Thread safety
  – Optimized collective operations
  – Efficient memory (and NIC resource) usage
  – Integration with debuggers and profilers
  – Optimized MPI-IO
    - Early binding
    - Persistency
    - Layering blocking MPI calls on asynchronous transport operations
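
The combined I/O-and-communication API mentioned above is easiest to see in code. Below is a minimal sketch assuming a generic MPI-2 installation, not the authors' ChaMPIon/Pro benchmark code: a nonblocking file write started with MPI_File_iwrite_at is overlapped with an ordinary MPI collective and then completed with MPI_Wait. The file name and buffer size are arbitrary.

/* Sketch only: overlap nonblocking MPI-IO with MPI communication. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File    fh;
    MPI_Request req;
    int         rank, local, sum;
    char        iobuf[4096];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(iobuf, rank & 0xff, sizeof(iobuf));

    MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* Start a nonblocking write of this rank's 4 kB block... */
    MPI_File_iwrite_at(fh, (MPI_Offset)rank * (MPI_Offset)sizeof(iobuf),
                       iobuf, (int)sizeof(iobuf), MPI_BYTE, &req);

    /* ...and overlap it with communication from the same library. */
    local = rank;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete the file write */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}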

Design overview
• MPI-IO partitions the user file into cells according to MPI_FILE_INFO (a hint-passing sketch follows this slide)
• Uses the uDAFS API on the client to reduce CPU overhead and improve other performance measures
• Each cell is a separate file stored on a DAFS server (NetApp filer)
  – Cells are distributed across multiple DAFS servers
  – Multiple cells can be stored on the same DAFS server
• Metadata per file
  – Metadata is stored as a file and accessed on file open, close, or attribute changes
• MPI file reads and writes access cells directly, with no or minimal conflict
  – No metadata accesses
  – DAFS supports locking for conflict resolution
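
As a hypothetical illustration of the MPI_FILE_INFO-driven partitioning described above, the sketch below passes hints through an MPI_Info object at open time. The keys "striping_factor" and "striping_unit" are standard MPI-IO reserved hints; whether this MPI-IO layer maps them onto its DAFS cell layout, and the file name used, are assumptions made for illustration only.

/* Sketch only: passing partitioning hints to MPI_File_open. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int      rank;
    char     buf[4096] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "2");      /* spread cells over 2 servers */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB per cell/stripe        */

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* Each rank writes a disjoint 4 kB block, so data accesses go straight
     * to the cells without touching the metadata file. */
    MPI_File_write_at(fh, (MPI_Offset)rank * (MPI_Offset)sizeof(buf),
                      buf, (int)sizeof(buf), MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}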

Metadata & File Virtualization  Metadata contains: – File ACL – File attributes – Cell list Cell number Cell file name  Cell file convention – “file_name”_ “unique_identifier”_ “cell_number”  Metadata Files Separate volume on DAFS server Volume mirrored between 2 DAFS servers  Cells are in a separate volume on each DAFS server  Security – Metadata ACL determines access to all of its cells

Early Experience - I
• 2 DAFS servers, 2 MPI clients
• Clients: Sun Ultra 30 workstations with 296 MHz processors
• Network: 1 Gb VI/IP

Early Experience - II
• 2 DAFS servers, 2 MPI clients
• Clients: Sun Ultra 30 workstations with 296 MHz processors
• Network: 1 Gb VI/IP