Implementing Babel RMI with ARMCI
Jian Yin, Khushbu Agarwal, Daniel Chavarría, Manoj Krishnan, Ian Gorton, Vidhya Gurumoorthi, Patrick Nichols

Motivation
- Remote Method Invocation (RMI) provides a useful abstraction for distributed computing
  - Example: event service for the CCA framework
- The existing TCP/IP-based implementation has performance problems
- Question: can we speed up Babel RMI with high-performance communication protocols?

Objectives
- Demonstrate that it is feasible to build a high-performance Babel RMI
- Prototype Babel RMI with ARMCI and measure its performance experimentally
- Produce a quality implementation of high-performance RMI

Outline
- Motivation
- Objectives
- Background
  - Babel RMI
  - ARMCI
- Preliminary performance results
- Future work

Babel RMI
- Babel supports Remote Method Invocation
  - Transparent
  - Flexible
- Implemented with extensive marshalling code and a runtime library
- The existing TCP/IP-based implementation incurs high overhead
  - Multiple copying
  - Context switching
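The copying overhead mentioned above can be illustrated with a toy marshaller. This is only a sketch, not Babel's runtime: `rmi_buf`, `pack_int32`, `pack_string`, and `marshal_call` are hypothetical names, and the wire layout (length-prefixed strings, big-endian integers) is an assumption chosen for illustration. The `copies` counter shows that every argument is copied once into the marshalling buffer before the transport touches it; a socket-based path would then copy the buffer again into kernel space.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size marshalling buffer; not part of Babel's runtime. */
typedef struct {
    unsigned char data[256];
    size_t len;     /* bytes used so far */
    size_t copies;  /* number of memcpy calls performed while packing */
} rmi_buf;

void buf_init(rmi_buf *b) {
    assert(b != NULL);
    b->len = 0;
    b->copies = 0;
}

/* Append raw bytes; returns 0 on success, -1 if the buffer would overflow. */
int pack_bytes(rmi_buf *b, const void *src, size_t n) {
    if (n > sizeof b->data - b->len) return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->copies++;
    return 0;
}

/* Pack a 32-bit integer in big-endian (network) byte order. */
int pack_int32(rmi_buf *b, int32_t v) {
    uint32_t u = (uint32_t)v;
    unsigned char be[4] = {
        (unsigned char)(u >> 24), (unsigned char)(u >> 16),
        (unsigned char)(u >> 8),  (unsigned char)u
    };
    return pack_bytes(b, be, sizeof be);
}

/* Pack a string as a 32-bit length prefix followed by its bytes. */
int pack_string(rmi_buf *b, const char *s) {
    size_t n = strlen(s);
    if (pack_int32(b, (int32_t)n) != 0) return -1;
    return pack_bytes(b, s, n);
}

/* Marshal a call with one integer argument: method name, then the argument. */
int marshal_call(rmi_buf *b, const char *method, int32_t arg) {
    buf_init(b);
    if (pack_string(b, method) != 0) return -1;
    return pack_int32(b, arg);
}
```

Calling `marshal_call(&b, "sayHello", 7)` fills 16 bytes using three separate copies, which is the kind of per-call cost a one-sided transport tries to avoid.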

TCP RMI Performance

ARMCI
- Middleware for remote memory access (RMA)
- Supports many networks and HPC systems
  - Myrinet, InfiniBand, Quadrics, Giganet, …
  - Cray XT4, XT, X1, IBM BlueGene, …
- Efficient
  - Minimal copying
  - Truly one-sided communication protocol
- Put, get, accumulate
- Atomic read-modify-write, mutex
- Blocking and non-blocking interfaces
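The semantics of the put, get, and accumulate operations listed above can be sketched with a single-process simulation. This is not the ARMCI API: `sim_put`, `sim_get`, `sim_acc`, and the `window` array are hypothetical stand-ins, and the "remote" memory is just a local array indexed by a simulated process rank. The point it illustrates is that the initiator names the target rank and offset, and the target runs no matching receive code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPROC   4  /* simulated process count (assumption) */
#define WIN_LEN 8  /* doubles exposed per process (assumption) */

/* Each simulated process exposes one memory window. A real RMA library
   would register such memory with the network adapter instead. */
static double window[NPROC][WIN_LEN];

/* Check that [off, off + n) lies inside the window of a valid rank. */
static int in_range(int proc, size_t off, size_t n) {
    return proc >= 0 && proc < NPROC && off <= WIN_LEN && n <= WIN_LEN - off;
}

/* One-sided put: copy n doubles from local src into proc's window. */
int sim_put(const double *src, int proc, size_t off, size_t n) {
    assert(src != NULL);
    if (!in_range(proc, off, n)) return -1;
    memcpy(&window[proc][off], src, n * sizeof(double));
    return 0;
}

/* One-sided get: copy n doubles from proc's window into local dst. */
int sim_get(double *dst, int proc, size_t off, size_t n) {
    assert(dst != NULL);
    if (!in_range(proc, off, n)) return -1;
    memcpy(dst, &window[proc][off], n * sizeof(double));
    return 0;
}

/* One-sided accumulate: window[proc][off + i] += scale * src[i]. */
int sim_acc(const double *src, double scale, int proc, size_t off, size_t n) {
    assert(src != NULL);
    if (!in_range(proc, off, n)) return -1;
    for (size_t i = 0; i < n; i++)
        window[proc][off + i] += scale * src[i];
    return 0;
}
```

In a real deployment the copy inside each function would be carried out by the network hardware (for example via RDMA), which is where the reduction in copying and context switching would come from.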

Experiment Setup
- Hardware
  - Cluster with 11 nodes
  - 4-core 2.4 GHz Intel Xeon processor
  - InfiniBand DDR network
- Software
  - Babel
  - ARMCI 1.4
  - OpenMPI

Implementation
- Implemented an extensive set of functions in the runtime library
  - InstanceHandle, Server, Invocation, Response, Call, Return, …
- Usage examples
    hello_World h = hello_World__createRemote(armcihandler:// :, &_ex);
    hello_World h2 = hello_World__connect(armcihandler:// : / &_ex);

ARMCI RMI Performance

Next Steps
- Reduce protocol overhead
  - Reduce function call overhead
  - Reduce copying
  - Batch RMI calls
  - Reduce RDMA overhead
- Prefetch in the background
  - Preload libraries
  - Prefetch arguments
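The batching idea above can be sketched as a small queue that groups several calls into one network message. This is an illustrative assumption about how batching might work, not the authors' design: `rmi_batcher`, `batcher_call`, `batcher_flush`, and the batch size `BATCH_MAX` are all hypothetical, and "sending" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 4  /* calls per message; an illustrative assumption */

/* Hypothetical batcher: queues call ids and counts messages issued. */
typedef struct {
    int    pending[BATCH_MAX]; /* queued call ids */
    size_t n_pending;          /* calls waiting in the current batch */
    size_t messages_sent;      /* network messages issued so far */
    size_t calls_sent;         /* calls delivered so far */
} rmi_batcher;

void batcher_init(rmi_batcher *b) {
    assert(b != NULL);
    b->n_pending = 0;
    b->messages_sent = 0;
    b->calls_sent = 0;
}

/* Send all queued calls as one message; a no-op when nothing is queued. */
void batcher_flush(rmi_batcher *b) {
    if (b->n_pending == 0) return;
    b->messages_sent++;
    b->calls_sent += b->n_pending;
    b->n_pending = 0;
}

/* Queue a call and flush automatically once the batch is full. */
void batcher_call(rmi_batcher *b, int call_id) {
    b->pending[b->n_pending++] = call_id;
    if (b->n_pending == BATCH_MAX) batcher_flush(b);
}
```

With a batch size of 4, ten queued calls followed by a final flush produce three messages instead of ten, so the fixed per-message cost is paid less often. The trade-off is added latency for calls that wait in a partially filled batch.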

Where to Use High Performance Babel RMI
- Applications for high-performance RMI
  - Fine-grained distribution
  - Hybrid computing
- Suggestions …