Developing a Scalable Coherent Interface (SCI) device for MPJ Express

Developing a Scalable Coherent Interface (SCI) device for MPJ Express
Guillermo López Taboada
14th October, 2005
Dept. of Electronics and Systems, University of A Coruña (Spain), http://www.des.udc.es
Visitor at the Distributed Systems Group, http://dsg.port.ac.uk

Outline: Introduction, Design of scidev, Implementation issues, Benchmarking, Future work, Conclusions

Introduction
The interconnection network and its associated software libraries play a key role in high performance clustering technology.
Cluster interconnection technologies: Gb & 10 Gb Ethernet, Myrinet, SCI, InfiniBand, QsNet (Quadrics), GSN (HIPPI), Giganet.
Latencies are small (usually under 10 us) and bandwidths are high (usually above 1 Gbps).

Introduction
SCI (Scalable Coherent Interface):
Latency: 1.42 us (theoretical)
Bandwidth: 5333 Mbps (bi-directional)
Usually deployed without a switch (small clusters)
Topologies: 1D (ring) / 2D (torus)

Introduction
Example of a 2D torus SCI cluster with Fast Ethernet as the administration network (diagram).

Introduction
Software available from Dolphinics (Dolphin Interconnect Solutions).
Software available from Scali:
ScaIP: IP emulation
ScaSISCI: SISCI (Software Infrastructure for SCI)
ScaMPI: proprietary MPI implementation

Introduction
In networking, Java's portability means that only the widely deployed TCP/IP is supported by the JDK.
Previously, IP emulations were used (ScaIP & SCIP), but their performance is similar to Fast Ethernet.
Now there is a high performance socket implementation: SCI SOCKETS.
This mirrors other interconnection technologies, e.g. Myrinet (IPoGM -> GM Sockets).

Introduction
Several research projects have tried to support these System Area Networks in Java, mainly on Myrinet:
KaRMI/GM (JavaParty, Univ. Karlsruhe)
Manta/LFC/Panda/Ibis (Vrije Universiteit, The Netherlands)
Java GM Sockets
RMIX Myrinet
mpiJava/MPICH-GM or MPICH-MX
…
But nothing for SCI.

Introduction
My PhD project: “Designing Efficient Mechanisms for Java communications on SCI systems”.
The motivation is to fill the gap between Java and this high-speed interconnect, which lacks software support for Java:
SCI Java Fast Sockets
An SCI communication device, the base of a messaging system
An SCI channel for Java NIO
Wrappers for some libraries
Optimized RMI for high speed networks
A low level Java buffering and communication system

Introduction
MPJ Express, a reference implementation of the MPI bindings for the Java language, has been released.
The C, C++, and Fortran bindings are already mature, but work on the Java binding is ongoing at the DSG.
This is a good opportunity to provide SCI support to a messaging system.

Outline: Introduction, Design of scidev, Implementation issues, Benchmarking, Future work, Conclusions

Design of scidev
The use of the Java Native Interface (JNI) is unavoidable: to provide support and good performance we have to rely on specific low level libraries.
In the presence of SCI hardware the device should use it, losing portability in exchange for higher performance.
Differences between mpiJava and scidev:
mpiJava: a thin wrapper providing a large number of Java MPI primitives
scidev: a thicker layer providing a small API
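
For illustration, the JNI boundary can be as thin as a handful of native method declarations on the Java side, implemented in C over SCILib/SISCI. A minimal sketch; the class name and signatures below are assumptions, not the actual scidev code:

    // Hypothetical sketch of a thin JNI boundary for an SCI device.
    public class SciNative {
        static {
            System.loadLibrary("scidev"); // native side implemented in C over SCILib/SISCI
        }
        static native void init();                                   // set up SCI resources
        static native void send(int node, java.nio.ByteBuffer buf, int len);
        static native int  recv(int node, java.nio.ByteBuffer buf);  // returns bytes read
        static native void finish();                                 // release SCI resources
    }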

Design of scidev
Implementing the xdev API:
init()
finish()
id()
iprobe(ProcessID srcID, int tag, int context)
irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status)
isend(Buffer buf, ProcessID destID, int tag, int context)
...and the blocking counterparts of these functions: probe, recv, send, plus issend & ssend.
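
In Java, that contract might look as follows. This is a sketch with simplified placeholder types (ProcessID, Status, Request, Buffer); the real xdev classes in MPJ Express differ in detail:

    // Placeholder types, simplified for this sketch.
    class ProcessID { final int rank; ProcessID(int rank) { this.rank = rank; } }
    class Status    { int source, tag, numBytes; }
    class Request   { /* completion handle; wait/test methods omitted */ }
    class Buffer    { java.nio.ByteBuffer data; }

    // The device API listed on the slide, expressed as an abstract class.
    abstract class Device {
        abstract ProcessID[] init(String[] args);  // start the device, learn the peers
        abstract void finish();                    // shut the device down
        abstract ProcessID id();                   // this process's identifier

        // Non-blocking primitives:
        abstract Status  iprobe(ProcessID srcID, int tag, int context);
        abstract Request irecv(Buffer buf, ProcessID srcID, int tag,
                               int context, Status status);
        abstract Request isend(Buffer buf, ProcessID destID, int tag, int context);

        // Blocking counterparts, plus the synchronous-mode sends:
        abstract Status  probe(ProcessID srcID, int tag, int context);
        abstract Status  recv(Buffer buf, ProcessID srcID, int tag, int context);
        abstract void    send(Buffer buf, ProcessID destID, int tag, int context);
        abstract void    ssend(Buffer buf, ProcessID destID, int tag, int context);
        abstract Request issend(Buffer buf, ProcessID destID, int tag, int context);
    }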

Design of scidev
Layered architecture (diagram): the mpjdev layer runs in the JVM on top of the xdev device layer; xdev devices such as mxdev and scidev call the native libraries through JNI, on top of the O.S.

Design of scidev
Native libraries: SCILib and SISCI.

Outline: Introduction, Design of scidev, Implementation issues, Benchmarking, Future work, Conclusions

Implementation Issues
Optimizations in the initialization process:
JNI: caching field identifiers and references to objects.
Long protocol: sending 2 messages, the first from a 4-byte multiple address and the second from a 128-byte multiple address, up to a 128-byte multiple address (going past the end of the message is safe since the raw Buffer has a 2^n length).
Algorithm to initialize the SCILib message queues (sketched below):
Connect (to the nodes with a lower rank)
Create (for all nodes, beginning with the following rank)
Connect (to the remaining nodes)
The complexity is O(n).
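
A sketch of that O(n) initialization, with hypothetical connectQueue()/createQueue() helpers standing in for the native SCILib calls made through JNI:

    // Sketch of the SCILib message-queue initialization described above.
    class QueueInit {
        void initQueues(int rank, int size) {
            // 1. Connect to the queues of lower-ranked nodes, which are
            //    expected to have created theirs already.
            for (int peer = 0; peer < rank; peer++)
                connectQueue(peer);
            // 2. Create this node's queues for all peers, beginning with
            //    the following rank and wrapping around.
            for (int i = 1; i < size; i++)
                createQueue((rank + i) % size);
            // 3. Connect to the remaining (higher-ranked) nodes.
            for (int peer = rank + 1; peer < size; peer++)
                connectQueue(peer);
            // Overall: O(n) connect/create operations per node.
        }
        // Hypothetical stand-ins for the native SCILib calls:
        void connectQueue(int peer) { /* JNI: connect to peer's queue */ }
        void createQueue(int peer)  { /* JNI: create local queue for peer */ }
    }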

Implementation Issues
Transport protocols. There are 3 native protocols:
Inline: 1-113 bytes
Short: 114 bytes - 64 KB
Long: 64 KB - 1 MB
scidev fragments messages larger than 1 MB and uses:
Inline for control messages and small messages (up to 113 bytes)
Short with PIO (Programmed Input/Output) for messages < 8 KB
Short with DMA (Direct Memory Access) for messages of 8-64 KB
The Long protocol of the user level libraries does not use DMA transfers, so it is replaced by our own Long protocol with DMA transfers.
The resulting size-based selection is sketched below.
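
A sketch of that selection, assuming hypothetical sendInline/sendShortPIO/sendShortDMA/sendLongDMA wrappers over the native transfers (thresholds follow the slide):

    // Size-based transport protocol selection as described above.
    class ProtocolSelect {
        static final int INLINE_MAX   = 113;          // Inline: control and tiny messages
        static final int PIO_MAX      = 8 * 1024;     // Short with PIO below 8 KB
        static final int SHORT_MAX    = 64 * 1024;    // Short with DMA up to 64 KB
        static final int FRAGMENT_MAX = 1024 * 1024;  // fragment above 1 MB

        void send(byte[] msg) {
            if (msg.length > FRAGMENT_MAX) {
                // Fragment and send each piece with the Long (DMA) protocol.
                for (int off = 0; off < msg.length; off += FRAGMENT_MAX)
                    sendLongDMA(msg, off, Math.min(FRAGMENT_MAX, msg.length - off));
            } else if (msg.length <= INLINE_MAX) {
                sendInline(msg);
            } else if (msg.length < PIO_MAX) {
                sendShortPIO(msg);
            } else if (msg.length <= SHORT_MAX) {
                sendShortDMA(msg);
            } else {
                sendLongDMA(msg, 0, msg.length);      // own Long protocol with DMA
            }
        }
        // Hypothetical wrappers over the native transfer primitives:
        void sendInline(byte[] m) {}
        void sendShortPIO(byte[] m) {}
        void sendShortDMA(byte[] m) {}
        void sendLongDMA(byte[] m, int off, int len) {}
    }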

Implementation Issues
Communications: scidev is based on non-blocking communications, and it is coded using niodev as a template.
Asynchronous sends are used for message sizes > 1 MB.
Notification strategy: following the approach of SCI SOCKETS, using the mbox interrupt library.
Interrupts are created without transferring the references (SCI interrupt handlers).
Each interrupt type (both user interrupts and DMA interrupts) registers a callback method.
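
For illustration, the callback registration might take the following shape; the names are assumptions, not the actual mbox API:

    // Hypothetical shape of the interrupt-callback registration described above.
    interface InterruptCallback {
        void onUserInterrupt(int node);  // e.g. a control message arrived
        void onDmaInterrupt(int node);   // e.g. a DMA transfer completed
    }

    class Notifier {
        private InterruptCallback callback;
        void register(InterruptCallback cb) { callback = cb; }
        // Invoked from the native side (via JNI) when an interrupt fires:
        void userInterrupt(int node) { callback.onUserInterrupt(node); }
        void dmaInterrupt(int node)  { callback.onDmaInterrupt(node); }
    }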

Implementation Issues
Sending/receiving:
2 threads, a user thread and a selector thread, synchronized to reduce latency.
1 message queue in which the control messages of pending communications are kept.
Sends go directly from the “Buffer” direct ByteBuffer.
If the selector thread receives a message that has not been posted, it creates an intermediate buffer for temporary storage.
If the message has been posted, it copies the message directly into the “Buffer” direct ByteBuffer.
This matching logic is sketched below.
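
A sketch of that matching logic, with a simplified (source, tag) key; context and wildcards are omitted, and this is not the actual scidev code:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Posted receives land directly in the user's buffer; unexpected
    // messages are parked in an intermediate buffer until matched.
    class MessageMatcher {
        private final Map<Long, ByteBuffer> posted     = new HashMap<>();
        private final Map<Long, ByteBuffer> unexpected = new HashMap<>();

        private static long key(int src, int tag) {
            return ((long) src << 32) | (tag & 0xffffffffL);
        }

        // Selector thread: deliver an incoming message.
        synchronized void onIncoming(int src, int tag, ByteBuffer data) {
            ByteBuffer user = posted.remove(key(src, tag));
            if (user != null) {
                user.put(data);                       // direct copy, no staging
            } else {
                ByteBuffer copy = ByteBuffer.allocate(data.remaining());
                copy.put(data);
                copy.flip();
                unexpected.put(key(src, tag), copy);  // temporary storage
            }
        }

        // User thread: post a receive (the irecv path).
        synchronized void post(int src, int tag, ByteBuffer user) {
            ByteBuffer early = unexpected.remove(key(src, tag));
            if (early != null) user.put(early);       // message already arrived
            else posted.put(key(src, tag), user);     // wait for the selector
        }
    }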

Implementation Issues
This scheme is replicated for each pair of nodes (diagram): the user and selector threads exchange data through SBUFFER/RBUFFER in the user level library (ULL), over Long, Short, and Inline message queues on SCI, with an intermediate buffer for unexpected messages.

Outline: Introduction, Design of scidev, Implementation issues, Benchmarking, Future work, Conclusions

Benchmarking
JDK 1.5 on holly. Latency (us):

    Network   MPJE   mpiJava   C sockets   Java sockets
    SCI         51        12           5             11
    FE         161       145          83            109
    GbE        131       101          65             86

scidev latency is 33 us!

Benchmarking
JDK 1.5 on holly. Asymptotic bandwidths (Mbps):

    Network   MPJE   mpiJava   C sockets   Java sockets
    SCI       1200      1480         400            360
    FE          90        92          93              -
    GbE        680       587         900           600*

scidev throughput is 1280 Mbps!

Outline: Introduction, Design of scidev, Implementation issues, Benchmarking, Future work, Conclusions

Future work
Immediately:
Testing of collective communications (so far only point-to-point has been tested)
A design with lower interdependence between xdev and mpjbuf
Reading the different formats of SCI configuration files
Benchmarking with MPJ applications, and developing MPJ and xdev applications
A new buffering implementation

Future work
Buffering system with SBuffer and RBuffer in the user level library (ULL), still with an intermediate buffer (diagram: SBUFFER/RBUFFER in the ULL, with Long, Short, and Inline queues over SCI).

Outline: Introduction, Design of scidev, Implementation issues, Benchmarking, Future work, Conclusions

Conclusions
Performance is still a problem. Try to avoid control messages, perhaps by integrating this data in the user level library. Aim: 30 us latency & 1350 Mbps bandwidth.
Current development phase: testing. It is hard to do multiple initializations in a single thread (restarting the device).
The design is somewhat coupled with MPJ (strong interdependence).
It needs evaluation and an implementation using a kernel level library (handling threads and spawning processes natively).

Questions?

Appendix
Visitor at the DSG during summer 2005.
Pursuing a PhD at the University of A Coruña (Spain).

Appendix
BS in Computing Technology in 2002 at the University of A Coruña.
Member of the Computer Architecture Group. Areas of interest of the group:
High performance compilers (automatic detection of parallelism)
Cluster computing
Grid applications
Management of parallel/distributed systems
Fault tolerance in MPI
Computer graphics (rendering, radiosity)
Geographical Information Systems
12 staff members, 8 PhD students.

Appendix
Computer Architecture Group: CrossGrid (EU project within GRIDSTART).

Appendix
The Computer Architecture Group is young, with an average age of 32 years.
Some achievements (2000-2004):
Papers in international conferences: 102
Papers in journals: 53 (41 in the JCR/SCI list)
Regional, national, and European funded projects (approx. 1M € over 5 years).

Acknowledgements
The DSG, for providing full support for my work; especially Aamir and Raz for late, smoky, and caffeinated DSG office hours.
Mark, for hosting the visit and for his valuable support.
The ICG and UoP, for the facilities and services.
Bryan Carpenter, for his rare but valuable comments and his help with some JNI problems.
DXIDI (Xunta de Galicia), for funding the visit.

A Coruña
You will always be welcome in A Coruña!
