1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet
CAC Workshop, Santa Fe, NM, 2004
Sameer Kumar* and Laxmikant V. Kale
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

2 Outline
- Processor virtualization
- QsNet
- Opportunities
- Performance evaluation of QsNet
- Challenges of QsNet
- Summary

3 Processor Virtualization
- Basic idea: the user specifies the interaction between objects (virtual processors, or VPs); the runtime system (RTS) maps VPs onto physical processors
- Typically, # virtual processors > # physical processors
- Embodied in Charm++ and AMPI
(Figure: user view vs. system implementation.)
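To make the idea concrete, here is a minimal Charm++ sketch (hypothetical module and names; assumes the standard Charm++ toolchain, where charmc generates the .decl.h/.def.h headers from the interface file shown in the comment):

```cpp
#include "vps.decl.h"  // generated by: charmc vps.ci

/* vps.ci (hypothetical interface file):
   mainmodule vps {
     mainchare Main    { entry Main(CkArgMsg*); };
     array [1D] Worker { entry Worker(); entry void step(); };
   };
*/

class Main : public CBase_Main {
public:
  Main(CkArgMsg* msg) {
    delete msg;
    // Many more virtual processors than physical PEs; the RTS decides
    // (and may later change) which PE each VP runs on.
    int numVPs = 8 * CkNumPes();
    CProxy_Worker workers = CProxy_Worker::ckNew(numVPs);
    workers.step();  // broadcast an entry-method invocation to all VPs
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}  // required for migratable array elements
  void step() {
    CkPrintf("VP %d running on PE %d\n", thisIndex, CkMyPe());
    // (Termination via a reduction to CkExit() omitted for brevity.)
  }
};

#include "vps.def.h"
```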

4 QsNet
- Popular interconnect from Quadrics
- Several parallel systems in the Top500 use QsNet: Pittsburgh's Lemieux (6 TF), ASCI Q (20 TF)
- Components: Elite network, Elan adapter

5 Elite Network
- 320 MB/s each way after protocol overhead
- Reliable fat-tree network
- Multiple routes provide fault tolerance
- Adaptive wormhole routing
- 35 ns per hop

6 Elan Network Adapter
Features:
- Low latency (4.5 μs for MPI)
- High bandwidth (320 MB/s per node)
Components:
- SPARC processor
- DMA engine
- 64 MB RAM
- On-chip cache

7 Low CPU Overhead
CPU overhead is small and does not change much with message size.

8 Traditional Message Passing
(Timeline: processors P0 and P1; send overhead, receive overhead, idle time.)
Traditional message passing does not utilize the low CPU overhead of Elan: the processor idles while a message is in flight.

9 Adaptive Overlap
(Timeline: processors P0 and P1, each hosting VP 0 and VP 1; send and receive overheads overlapped with computation.)
Processor virtualization takes full advantage of the low CPU overhead of Elan: while one VP waits on a message, the scheduler runs another VP.
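For contrast, the closest plain MPI gets to this is manual overlap with nonblocking calls. A hedged sketch (hypothetical buffer and function names) of what AMPI/Charm++ does automatically by switching to another VP while data is in flight:

```cpp
#include <mpi.h>
#include <vector>

// Manual communication/computation overlap in MPI. With processor
// virtualization, the runtime supplies the "independent work" for us by
// scheduling a different VP while the NIC moves the data.
void exchangeAndCompute(std::vector<double>& sendBuf,
                        std::vector<double>& recvBuf, int peer) {
  MPI_Request reqs[2];
  MPI_Irecv(recvBuf.data(), (int)recvBuf.size(), MPI_DOUBLE, peer, 0,
            MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(sendBuf.data(), (int)sendBuf.size(), MPI_DOUBLE, peer, 0,
            MPI_COMM_WORLD, &reqs[1]);

  // ... independent computation here, overlapped with the transfer,
  // which the Elan DMA engine performs with little CPU involvement ...

  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```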

10 Benefit of Adaptive Overlap
Problem setup: a 3D stencil calculation run on Lemieux; shows AMPI with virtualization ratios of 1 and 8.

11 Charm++ Message-Driven Execution
(Diagram: scheduler loop with handler dispatch, network pump, garbage collection of sent messages, tport sends, posted receives, and message receipt.)
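A generic C++ sketch of the loop in the diagram (illustrative only, not the actual Converse source): pump the network, dispatch one message to its handler, and reclaim buffers of completed sends:

```cpp
#include <queue>
#include <functional>

using Message = std::function<void()>;  // a handler bound to its payload

std::queue<Message> schedQueue;   // messages awaiting execution
std::queue<void*>   sentBuffers;  // send buffers awaiting completion

// Pump: poll the NIC (e.g., the Elan tport receive side), post fresh
// receives, and enqueue any arrived messages. Stubbed out here.
void pump() { /* post receives, move arrived messages to schedQueue */ }

// Garbage collection: free buffers whose tport sends have completed.
void collectSent() { /* test send completions, release buffers */ }

void schedulerLoop() {
  for (;;) {
    pump();
    collectSent();
    if (!schedQueue.empty()) {
      Message m = std::move(schedQueue.front());
      schedQueue.pop();
      m();  // Handler: execute the entry method for this message
    }
  }
}
```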

12 NAMD: A Production MD System
- Written in Charm++
- Fully featured program
- NIH-funded development
- Distributed free of charge (5000+ downloads so far), as binaries and source code
- Installed at NSF centers
- Large published simulations (e.g., the aquaporin simulation featured in the keynote)

13 Scaling NAMD
Several QsNet challenges had to be overcome to scale NAMD.

14 QsNet Challenge: Latency
Applications need to post receives for messages of different sizes.

15 Latency Bottlenecks
- Slow NIC processor with a 100 MHz clock
- Cache size is only 8 KB: traversing a large receive loop flushes it
(Figure: cache misses vs. number of receives posted.)

16 Managing Latency: Message Combining
- Organize processors in a 2D (virtual) mesh
- Phase 1: processors send combined messages to their row neighbors; a message from (x1,y1) to (x2,y2) travels via (x1,y2)
- Phase 2: processors send combined messages to their column neighbors
- Each processor sends 2(√P − 1) messages instead of P − 1 (see the sketch below)
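The routing arithmetic is simple; a small self-contained sketch (assuming, for simplicity, that P is a perfect square):

```cpp
#include <cmath>
#include <cstdio>

struct Coord { int x, y; };

Coord toCoord(int pe, int side) { return { pe / side, pe % side }; }
int   toPe(Coord c, int side)   { return c.x * side + c.y; }

// Phase 1 (row sends) delivers a message from (x1,y1) toward (x2,y2)
// to the intermediate processor (x1,y2); phase 2 (column sends)
// forwards it to (x2,y2).
int intermediate(int src, int dst, int side) {
  Coord s = toCoord(src, side), d = toCoord(dst, side);
  return toPe({ s.x, d.y }, side);
}

int main() {
  int P = 64, side = (int)std::sqrt((double)P);
  // Each processor sends side-1 combined messages per phase:
  // 2*(sqrt(P)-1) messages instead of P-1 for a direct all-to-all.
  std::printf("direct: %d messages, mesh: %d messages\n",
              P - 1, 2 * (side - 1));
  std::printf("route 5 -> 58 via %d\n", intermediate(5, 58, side));
  return 0;
}
```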

17 NAMD PME Performance
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages.

18 QsNet Challenge: Bandwidth
Achieved node bandwidth (MB/s):
  One way: 290
  Two way: 128
The QsNet network bandwidth is 320 MB/s; PCI/DMA contention restricts bandwidth on the AlphaServer nodes.

19 Improving Bandwidth
(Table: node bandwidth in MB/s, one way and two way, for different placements of the source and destination buffers: Main-Main, Elan-Main, Elan-Elan.)
Sending messages from Elan memory is faster.

20 QsNet Challenge: Stretched Handlers
(NAMD timeline: processors vs. time, showing force-compute and integrate phases.)
Stretched sends are marked by green superscripts in the timeline; similar stretches are observed in the middle of entry methods.

21 Stretching Solution
- Stretched sends: an Elan Isend blocked when the rendezvous for a previous Isend to any destination had not yet been acknowledged
- Solved by working closely with Quadrics and obtaining a patch: with the patch, an Isend only blocks on the rendezvous of the previous message to the same destination

22 Stretching Solution (contd.)
- Stretches in the middle of entry methods are caused by OS daemons
- Using blocking receives minimized these stretches: daemons can be scheduled while the processor is idle (see the sketch below)
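The underlying mechanism, in a generic C++ sketch (illustrative; the actual implementation used the Elan layer): a spin-wait keeps the CPU busy so daemons preempt useful work, while a blocking wait yields the processor until a message arrives:

```cpp
#include <mutex>
#include <condition_variable>
#include <queue>

std::mutex mtx;
std::condition_variable cv;
std::queue<int> inbox;  // stands in for the NIC message queue

int blockingReceive() {
  std::unique_lock<std::mutex> lock(mtx);
  // Blocks (yields the processor) until a message arrives, so OS
  // daemons run during the wait instead of stretching entry methods.
  cv.wait(lock, [] { return !inbox.empty(); });
  int msg = inbox.front();
  inbox.pop();
  return msg;
}

void deliver(int msg) {  // e.g., called when the network pump sees data
  { std::lock_guard<std::mutex> lock(mtx); inbox.push(msg); }
  cv.notify_one();
}
```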

23 NAMD with Blocking Receives
(Timeline: processors vs. time with blocking receives enabled.)

24 NAMD Performance on Lemieux

25 Summary
- QsNet is an excellent network; the NIC co-processor is ideal for message-driven execution
- Programming guidelines:
  - Send messages from Elan memory
  - Post a limited number of receives, and post them before the sends
  - Use blocking receives to avoid stretching

26 Future Work
- One-sided communication (barrier?)
- Persistent one-sided communication: reserve buffers on the destination

27 Fat-Tree Topology
(Figure: (a) 4-ary 1-tree, (b) 4-ary 2-tree, (c) 4-ary 3-tree.)
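For reference, a k-ary n-tree connects k^n processing nodes through n levels of k^(n−1) switches each (the standard k-ary n-tree definition); a quick C++ check of the sizes in the figure:

```cpp
#include <cstdio>

long long ipow(long long b, int e) {
  long long r = 1;
  while (e-- > 0) r *= b;
  return r;
}

int main() {
  int k = 4;
  for (int n = 1; n <= 3; ++n) {  // the three trees in the figure
    long long nodes    = ipow(k, n);            // processing nodes
    long long switches = (long long)n * ipow(k, n - 1);
    std::printf("4-ary %d-tree: %lld nodes, %lld switches\n",
                n, nodes, switches);
  }
  return 0;
}
```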

28 Elan3 Adapter
- DMA engine
- Thread processor
- On-chip shared cache
- 64-bit, 66 MHz PCI interface
- 64 MB RAM

29 Object-Based Communication Framework
(Layer diagram: Application → AMPI / Charm++ → communication framework, object layer → Converse communication framework, processor layer → communication layer.)
The object layer performs object-level optimizations; the processor layer optimizes the inter-processor communication strategy.
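A hypothetical C++ interface, loosely modeled on the layering above (not the actual comlib API): a strategy buffers messages at the object layer, then combines and forwards them at the processor layer:

```cpp
#include <vector>

struct MessageHolder { void* data; int size; int destPe; };

class Strategy {
public:
  virtual ~Strategy() {}
  // Object layer: buffer an outgoing message for this iteration.
  virtual void insertMessage(const MessageHolder& m) = 0;
  // Processor layer: combine buffered messages (e.g., along the 2D
  // virtual mesh) and hand them to the network.
  virtual void doneInserting() = 0;
protected:
  std::vector<MessageHolder> pending;  // messages collected so far
};
```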

30 AAPC Processor Overhead
(Charts: completion time and CPU overhead for the Direct and Mesh all-to-all personalized communication (AAPC) strategies.)
Lower CPU overhead enables applications using the Mesh strategy to perform better even for large messages.