Techniques for Enabling Highly Efficient Message Passing on Many-Core Architectures
Min Si, PhD student at the University of Tokyo, Tokyo, Japan. Advisor: Yutaka Ishikawa.

Presentation transcript:

Techniques for Enabling Highly Efficient Message Passing on Many-Core Architectures
Min Si, PhD student at the University of Tokyo, Tokyo, Japan. Advisor: Yutaka Ishikawa.
Guest graduate student at Argonne National Laboratory, IL, USA. Mentor: Antonio J. Peña. Supervisor: Pavan Balaji.
Homepage:

Scientific Programming on Many-Core Architectures
– Massively parallel many-core environments: Intel® Xeon Phi coprocessor, Blue Gene/Q.
– Hardware characteristics: many simple, low-frequency cores, while other on-chip resources grow at a lower rate.
– We therefore need parallelism and resource sharing, so we study two popular programming models:
1. Hybrid MPI+Threads model: massive parallelism in computation and on-chip resource sharing between threads.
2. MPI RMA-based PGAS model: global shared arrays and memory sharing across nodes on distributed-memory systems.
[Figure: the typical get-compute-update pattern, in which an MPI process with OpenMP threads GETs remote data into a local buffer, computes on it, and ACCumulates the result back.]
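As a concrete illustration of the get-compute-update pattern mentioned above, here is a minimal sketch written against standard MPI-3 RMA (not code from the slides); the window win, the target rank, and the doubling computation are illustrative assumptions.

    #include <mpi.h>

    /* Sketch: GET one value from the target's window into a local buffer,
     * compute on the local copy, then ACCUMULATE the result back. */
    void get_compute_update(MPI_Win win, int target)
    {
        double local;                                /* local buffer */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        MPI_Get(&local, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(target, win);                  /* ensure the GET has completed */

        local *= 2.0;                                /* compute on the local copy */

        MPI_Accumulate(&local, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE,
                       MPI_SUM, win);                /* update the remote array */
        MPI_Win_unlock(target, win);
    }

In the hybrid picture on the slide, the OpenMP threads would perform the compute step while the MPI process drives the GET/ACC traffic.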

Communication Optimization in the Hybrid MPI+Threads Model
– Problem statement: only a single thread issues MPI calls, so most threads are idle during MPI communication.
– MT-MPI: Multithreaded MPI [1]: shares the idle application threads with the MPI library and uses this massive parallelism to optimize MPI internal processing.
Slide pseudocode (application side, then library side):

    #pragma omp parallel
    {
        /* user computation */
    }
    MPI_Function();

    MPI_Function()
    {
        #pragma omp parallel
        {
            /* internal processing */
        }
    }

[Figure: Hybrid NAS MG class E using 64 MPI processes on the TACC Stampede supercomputer.]
[1] M. Si, A. J. Peña, P. Balaji, M. Takagi, and Y. Ishikawa, "MT-MPI: Multithreaded MPI for Many-Core Environments," in Proceedings of the 28th ACM International Conference on Supercomputing (ICS), 2014.
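To make the slide's pseudocode concrete, here is a minimal sketch of the pattern MT-MPI targets (an assumed illustration, not the actual MT-MPI code): the application computes with all OpenMP threads, a single thread then calls MPI, and a library-side routine opens its own parallel region so the otherwise idle threads can help with internal work such as packing. The helper name internal_pack is hypothetical.

    #include <mpi.h>
    #include <omp.h>
    #include <stddef.h>

    /* Library side (conceptual): parallelize internal processing, e.g. a
     * large copy/pack loop, over the threads that are idle during the call. */
    static void internal_pack(char *dst, const char *src, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Application side: all threads compute, then one thread issues MPI
     * (MPI_THREAD_FUNNELED); the other threads sit idle during the send. */
    void step(double *buf, int count, int peer)
    {
        #pragma omp parallel
        {
            /* ... user computation on buf ... */
        }
        MPI_Send(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    }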

Process-based Asynchronous Progress Model for MPI RMA
– MPI RMA communication is not truly one-sided: as the left diagram shows, an ACC(data) issued by process 0 is delayed while process 1 is busy computing, until process 1's next MPI call.
– Casper: process-based asynchronous progress [2]. Flexible, portable, and low overhead; a ghost process (e.g., blocked in MPI Recv) makes progress on behalf of the computing target, improving software-handled RMA operations without affecting hardware-handled RMA.
[Figure: NWChem CCSD(T) simulation of C20 on the NERSC Edison supercomputer, showing a 45% reduction.]
[2] M. Si, A. J. Peña, J. Hammond, P. Balaji, M. Takagi, and Y. Ishikawa, "Casper: An Asynchronous Progress Model for MPI RMA on Many-Core Architectures," in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015.
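For context, a minimal sketch (an assumed example, not code from the paper) of the origin-side situation in the diagram: process 0 issues an accumulate while process 1 is busy computing, so with software-handled RMA the operation may not complete until process 1 next enters the MPI library; a dedicated ghost process, as in Casper, can make that progress on the target's behalf without changing this user code.

    #include <mpi.h>

    /* The "+=" from the diagram: add a contribution into the target's window. */
    void origin_side(MPI_Win win, double *contrib, int target)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Accumulate(contrib, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE,
                       MPI_SUM, win);
        /* With software-handled RMA this unlock can stall until the target
         * makes an MPI call; asynchronous progress removes that delay. */
        MPI_Win_unlock(target, win);
    }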