An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments
G. Narayanaswamy, P. Balaji and W. Feng
Dept. of Computer Science, Virginia Tech
Mathematics and Computer Science, Argonne National Laboratory

High-end Computing Trends
High-end Computing (HEC) systems:
– Continue to increase in scale and capability
– Multicore architectures: a significant driving force for this trend (quad-core processors from Intel/AMD; IBM Cell, Sun Niagara, the Intel Terascale processor)
– High-speed network interconnects: 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics
– Different stacks use different amounts of hardware support
How do these two components interact with each other?

Multicore Architectures
Multi-processor vs. multicore systems:
– Not all of the processor hardware is replicated in a multicore system
– Hardware units such as caches may be shared between the different cores
– Multiple processing units are embedded on the same processor die, so inter-core communication is faster than inter-processor communication
On most architectures (Intel, AMD, Sun), all cores are equally powerful, which makes scheduling easier
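On Linux, the cache sharing mentioned above can be inspected directly through sysfs. Below is a minimal sketch, assuming a Linux system; it uses cache index2 (typically L2) as an example, since the level that is actually shared varies by processor.

```c
#include <stdio.h>

int main(void)
{
    char path[128], cpus[64];

    /* For each core, print which cores share its index2 (e.g., L2) cache. */
    for (int cpu = 0; cpu < 8; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_list",
                 cpu);
        FILE *fp = fopen(path, "r");
        if (!fp)
            break;                /* no such CPU or no such cache level */
        if (fgets(cpus, sizeof cpus, fp))
            printf("cpu%d shares this cache with cores %s", cpu, cpus);
        fclose(fp);
    }
    return 0;
}
```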

Interactions of Protocols with Multicores
Depending on how the stack works, different protocols have different interactions with multicore systems. This study is based on host-based TCP/IP and iWARP.
– TCP/IP has significant interaction with multicore systems, with large impacts on application performance
– The iWARP stack itself does not interact directly with multicore systems, but software libraries built on top of iWARP DO interact (buffering of data, copies)
– This interaction is similar to that of other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)
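One way to see the TCP/IP side of this in practice: on Linux, /proc/interrupts lists per-core interrupt counts for each device, revealing which core the NIC's packet processing is statically tied to. A minimal sketch follows; "eth2" is a hypothetical interface name and would need to be replaced with the 10GE adapter's actual name.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "eth2" is a hypothetical interface name; substitute the 10GE NIC's. */
    const char *nic = "eth2";
    char line[1024];

    FILE *fp = fopen("/proc/interrupts", "r");
    if (!fp) { perror("fopen"); return 1; }

    /* Print the CPU header row and the NIC's rows; the column with the
     * large count is the core doing the stack's interrupt-driven work. */
    while (fgets(line, sizeof line, fp))
        if (strstr(line, "CPU") || strstr(line, nic))
            fputs(line, stdout);

    fclose(fp);
    return 0;
}
```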

TCP/IP Interaction vs. iWARP Interaction
[Figure: packet-arrival and packet-processing paths, contrasting a host-based TCP/IP stack beneath the application with an offloaded iWARP stack accessed through a user-level library]
TCP/IP is in some ways more asynchronous or "centralized" with respect to host processing than iWARP (or other high-performance software stacks):
– TCP/IP: host processing is independent of the application process (statically tied to a single core)
– iWARP: host processing is closely tied to the application process

Presentation Layout
– Introduction and Motivation
– Treachery of Multicore Architectures
– Application Process to Core Mapping Techniques
– Conclusions and Future Work

MPI Bandwidth over TCP/IP

MPI Bandwidth over iWARP

TCP/IP Interrupts and Cache Misses

MPI Latency over TCP/IP (Intel Platform)

Presentation Layout
– Introduction and Motivation
– Treachery of Multicore Architectures
– Application Process to Core Mapping Techniques
– Conclusions and Future Work

Application Behavior Pre-analysis
A four-core system is effectively a 3.5-core system:
– Part of a core has to be dedicated to communication (interrupts, cache misses)
How do we schedule 4 application processes on 3.5 cores?
– If the application is exactly synchronized, there is not much we can do
– Otherwise, we have an opportunity!
Study with GROMACS and LAMMPS
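Selective scheduling of the kind studied here can be prototyped by pinning each application process to a core other than the one servicing network interrupts. The sketch below is an illustration, not the paper's implementation: it assumes core 0 is the interrupt core and that the MPI launcher exports the rank in PMI_RANK (as MPICH-style launchers do).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Assumption: core 0 services the NIC's interrupts (hypothetical). */
    const int interrupt_core = 0;

    /* Assumption: the launcher exports the process rank as PMI_RANK. */
    const char *r = getenv("PMI_RANK");
    int rank = r ? atoi(r) : 0;

    int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);

    /* Map ranks round-robin over every core except the interrupt core. */
    int core = 1 + (interrupt_core + rank % (ncores - 1)) % (ncores - 1);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0)
        perror("sched_setaffinity");

    printf("rank %d pinned to core %d\n", rank, core);
    /* ... MPI_Init() and the application's work would follow here ... */
    return 0;
}
```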

GROMACS Overview
– Developed at the University of Groningen
– Simulates the molecular dynamics of biochemical particles
– The root process distributes a "topology" file corresponding to the molecular structure
– Simulation time is broken down into a number of steps; processes synchronize at each step
– Performance is reported as the number of nanoseconds of molecular interactions that can be simulated per day
[Figure: two process-to-core mapping combinations, A and B, across cores 0–7]

GROMACS: Random Scheduling
[Chart: performance for each scheduling placement across Machine 1 cores and Machine 2 cores]

GROMACS: Selective Scheduling
[Chart: performance for each scheduling placement across Machine 1 cores and Machine 2 cores]

LAMMPS Overview
– Molecular dynamics simulator developed at Sandia National Laboratories
– Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains
– Each subdomain is allotted to a different process
– Interaction is required only between neighboring subdomains, which improves scalability
– We used the Lennard-Jones liquid simulation within LAMMPS
[Figure: two four-core machines, cores 0–3 each, connected by a network]

LAMMPS: Random Scheduling
[Chart: performance for each scheduling placement across Machine 1 cores and Machine 2 cores]

LAMMPS: Intended Communication Pattern
[Figure: timeline of two processes alternating computation with matched MPI_Send()/MPI_Irecv()/MPI_Wait() calls, staying in lock-step]
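In MPI terms, the intended pattern is the usual pre-posted-receive exchange between neighboring subdomain processes. The following is a minimal sketch of that pattern (not LAMMPS's actual code); buffer sizes and the neighbor rank are assumptions passed in by the caller.

```c
#include <mpi.h>

/* Sketch of the intended lock-step exchange between two neighboring
 * subdomain processes (not LAMMPS's actual code). */
void exchange_step(double *sendbuf, double *recvbuf, int n, int neighbor)
{
    MPI_Request req;

    /* Pre-post the receive so the incoming data has a landing spot. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

    /* Send this process's boundary data to the neighbor. */
    MPI_Send(sendbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);

    /* compute_on_interior();  -- local computation can overlap here */

    /* Both sides are intended to synchronize (roughly) at this wait. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```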

LAMMPS: Actual Communication Pattern
[Figure: timeline of a "slower" core (sharing its cycles with communication) and a faster core; data flows from the MPI buffer through the socket send and receive buffers to the application receive buffer, and the faster core's process issues its next MPI_Send() while the slower core is still computing toward its MPI_Wait()]
"Out-of-sync" communication between processes

LAMMPS: Selective Scheduling
[Chart: performance for each scheduling placement across Machine 1 cores and Machine 2 cores]

Presentation Layout
– Introduction and Motivation
– Treachery of Multicore Architectures
– Application Process to Core Mapping Techniques
– Conclusions and Future Work

Concluding Remarks and Future Work
Multicore architectures and high-speed networks are becoming prominent in high-end computing systems; the interaction of these two components is important and interesting!
– For TCP/IP, scheduling order drastically impacts performance
– For iWARP, scheduling order adds no overhead
– Scheduling processes in a more intelligent manner significantly improves application performance
– It does not affect iWARP and other high-performance stacks, making the approach portable as well as efficient
Future work: dynamic process-to-core scheduling!

Thank You
Contacts:
– Ganesh Narayanaswamy:
– Pavan Balaji:
– Wu-chun Feng:
For More Information:

Backup Slides

MPI Latency over TCP/IP (AMD Platform)