Unified Parallel C at LBNL/UCB Message Strip-Mining Heuristics for High Speed Networks Costin Iancu, Parry Husbands, Wei Chen

Motivation
- Increasing productivity requires compiler- and runtime-based optimizations, and those optimizations need to be performance portable.
- Reducing communication overhead is an important optimization for parallel applications.
- Applications are often written with bulk transfers, or the compiler may perform message "coalescing".
- Coalescing reduces message start-up time, but does not hide communication latency.
- Can we do better?

Message Strip-Mining
MSM (Wakatani): divide communication and computation into phases and pipeline their execution.

Initial loop (N = number of remote elements):

  shared [] double *p;
  double *buf;
  get(buf, p, N*8);                     /* one bulk transfer of N doubles */
  for (i = 0; i < N; i++)
      ... = buf[i];

Strip-mined loop (S = strip size, U = unroll depth; shown with U = 1):

  h0 = nbget(buf, p, S*8);              /* prefetch the first strip */
  for (i = 0; i < N; i += S) {
      if (i + S < N)                    /* issue the next strip early */
          h1 = nbget(buf + i + S, p + i + S, S*8);
      sync(h0);                         /* wait for the current strip */
      for (ii = i; ii < min(i + S, N); ii++)
          ... = buf[ii];
      h0 = h1;
  }

[Timeline figure: N = 3 strips, S = U = 1; communicate and compute phases pipelined, with sync points 1 and 2 before the compute phases.]
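The slide shows U = 1; for unroll depth U = 2 a minimal sketch looks like the following, keeping two strips in flight. It assumes the same get/nbget/sync primitives, plus that N is a multiple of S and N >= 2*S (tail handling elided):

  /* Sketch: strip-mining with unroll depth U = 2. Two transfers are
   * always in flight, so the wait for strip i overlaps the transfers
   * of strips i+S and i+2*S. */
  h[0] = nbget(buf,     p,     S*8);
  h[1] = nbget(buf + S, p + S, S*8);
  k = 0;                                /* handle index for the current strip */
  for (i = 0; i < N; i += S) {
      sync(h[k]);                       /* wait for the current strip */
      if (i + 2*S < N)                  /* keep the pipeline full */
          h[k] = nbget(buf + i + 2*S, p + i + 2*S, S*8);
      for (ii = i; ii < i + S; ii++)
          sum += buf[ii];               /* consume the strip */
      k = 1 - k;                        /* rotate the two handles */
  }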

Performance Aspects of MSM
- MSM increases total message start-up time, but creates the potential for overlapping communication with computation.
- Unrolling increases message contention.
- Goal: find heuristics that allow us to automate MSM in a performance-portable way. This benefits both compiler-based and "manual" optimizations.
- The decomposition strategy depends on:
  - system characteristics (network, processor, memory performance)
  - application characteristics (computation, communication pattern)
- How do we combine them?

Machine Characteristics
- Network performance: the LogGP performance model (o = overhead, g = gap between messages, G = gap per byte, i.e., inverse bandwidth).
- Contention on the local NIC, due to the increased number of requests issued.
- Contention on the local memory system, due to remote communication requests (DMA interference).

[LogGP timing diagram: processors P0 and P1, showing send overhead (o_send), latency L, and receive overhead (o_recv); under MSM with unrolling, successive sends on P0 are separated by the gap g.]

Application Characteristics
- Transfer size: must be long enough to tolerate the increased start-up times (N, S).
- Computation: there must be enough available computation to hide the cost of communication (C(S)).
- Communication pattern: determines contention in the network system (one-to-one or many-to-one).

Questions
- What is the minimum transfer size that benefits from MSM?
- What is the minimum computation latency required?
- What is an optimal transfer decomposition?

Analytical Understanding
Vectorized loop:

  T_vect = o + G*N + C(N)

MSM with unrolling (waiting time per strip):

  W(S_1) = G*S_1 - issue(S_2)
  W(S_2) = G*S_2 - C(S_1) - W(S_1) - issue(S_3)
  ...
  W(S_m) = G*S_m - C(S_{m-1}) - W(S_{m-1})

Minimize the communication cost:

  T_strip+unroll = sum_{i=1..m} [ issue(S_i) + W(S_i) ]

[Pipeline diagram: strips of size S, unroll depth U, showing per-strip issue and waiting time W against the computation C(S).]
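To make the recurrence concrete, a small self-contained C sketch that evaluates it for a given decomposition. The constants in main are made-up illustrative values, and modeling issue(S) as a constant o is an assumption:

  #include <stdio.h>

  /* Evaluate the recurrence above for a decomposition S[0..m-1] (bytes).
   * G = per-byte transfer cost, o = per-message issue cost, K = compute
   * cost per byte, so C(S) = K*S and issue(S) is modeled as o. */
  double pipeline_cost(const double *S, int m, double G, double o, double K) {
      double total = 0.0, W_prev = 0.0;
      for (int i = 0; i < m; i++) {
          double issue_next = (i + 1 < m) ? o : 0.0;    /* issue(S_{i+1}) */
          double C_prev = (i > 0) ? K * S[i-1] : 0.0;   /* C(S_{i-1}) */
          double W = G * S[i] - C_prev - W_prev - issue_next;
          if (W < 0.0) W = 0.0;         /* fully hidden: no waiting time */
          total += o + W;               /* issue(S_i) + W(S_i) */
          W_prev = W;
      }
      return total;
  }

  int main(void) {
      double S[4] = { 2048, 2048, 2048, 2048 };         /* fixed strips */
      printf("%g usec\n", pipeline_cost(S, 4, 0.01, 10.0, 0.002));
      return 0;
  }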

Experimental Setup
- GASNet communication layer (performance close to native).
- Synthetic and application benchmarks.
- Varied: N (total problem size), S (strip size), U (unroll depth), P (number of processors), and the communication pattern.

  System                     Network       CPU
  IBM Netfinity cluster      Myrinet       ... MHz Pentium III
  IBM RS/6000                SP Switch 2   375 MHz Power 3+
  Compaq AlphaServer ES45    Quadrics      1 GHz Alpha

Minimum Message Size
What is the minimum transfer size that benefits from MSM?
- The minimum cost of a decomposition is o + max(o,g) + ...
- At least two transfers are needed.
- Lower bound: N > max(o,g)/G.
- Experimental results: 1 KB < N < 3 KB.
- In practice: 2 KB.
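As a worked example with hypothetical LogGP values (illustrative numbers, not measurements from the paper): with max(o,g) = 12 µsec and G = 0.01 µsec per byte, the bound requires N > 12/0.01 = 1200 bytes, roughly 1.2 KB, consistent with the experimentally observed 1 KB-3 KB range.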

Computation
What is the minimum computation latency required to see a benefit?
- Computation cost = cache miss penalties + computation time.
- Memory cost: compare the cost of moving data over the network to the cost of moving data over the memory system.

  System           Inverse Network Bandwidth (µsec/KB)   Inverse Memory Bandwidth (µsec/KB)   Ratio (Memory/Network)
  Myrinet/PIII     ...                                   ...                                  ... %
  SP Switch/PPC    ...                                   ...                                  ... %
  Quadrics/Alpha   ...                                   ...                                  ... %

- No minimum exists: MSM always benefits, due to the memory-system costs.

NAS Multi-Grid (ghost region exchange)

  Network     No. Threads   Base (1)   Strip-Mining   Speed-up
  Myrinet     ...           ...        ...            ...
  SP Switch   ...           ...        ...            ...
  Quadrics    ...           ...        ...            ...

Decomposition Strategy
What is an optimal transfer decomposition? It depends on:
- transfer size N
- computation C(S_i) = K*S_i
- communication pattern (one-to-one or many-to-one)

A fixed decomposition is simple, but it requires searching the space of possible decompositions, and the overlap is not optimal because the waiting times oscillate.
Idea: try a variable block-size decomposition, where the block size continuously increases: S_i = (1 + f)*S_{i-1}. How do we determine values for f?
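A minimal sketch of generating such a variable-size schedule; the function name, the starting strip size s0, and the clamping of the final strip are illustrative assumptions, not the paper's implementation:

  /* Generate strip sizes S_i = (1+f)*S_{i-1} until N elements are
   * covered. Returns the number of strips written into S. */
  int variable_decomposition(int N, int s0, double f, int *S, int max_strips) {
      int covered = 0, m = 0;
      double s = s0;
      while (covered < N && m < max_strips) {
          int strip = (int)s;
          if (strip < 1)
              strip = 1;                /* guard against a degenerate s0 */
          if (strip > N - covered)
              strip = N - covered;      /* clamp the final strip */
          S[m++] = strip;
          covered += strip;
          s *= 1.0 + f;                 /* block size grows continuously */
      }
      return m;
  }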

Benchmarks
Two benchmarks:
- Multiply-accumulate (MAC) reduction, with computation on the same order of magnitude as communication: C(S) = G*S.
- Increased computation (~20X): C(S) = 20*G*S.
Total problem size N: 2^8 to 2^20 elements (2 KB to 8 MB).
The variable strip decomposition factor f was tuned for the Myrinet platform; the same value was then used on all systems.

Transfer Size
[Figure: variation of the strip size in the optimal decomposition (Myrinet), MAC reduction benchmark.]

Computation: MAC Reduction
[Figure: MAC reduction results.]

Increased Computation
[Figure: results for the increased-computation benchmark.]

Communication Pattern
- Contention arises on the memory system and on the NIC.
- Memory system: measured as the slowdown of computation on the "node" serving communication requests: a 3%-6% slowdown.
- NIC contention: resource usage and message serialization.

Network Contention
[Figure: network contention results.]

Summary of Results
- MSM improves performance and is able to hide most of the communication overhead.
- Variable-size decomposition is performance portable (0%-4% on Myrinet, 10%-15% with un-tuned implementations).
- Unrolling is influenced by g; large unroll degrees are not worthwhile (U = 2, 4 suffice).
- For more details, see the full paper.

MSM in Practice
- Fixed decomposition: performance depends on N/S. Search the decomposition space, pruning it with heuristics that relate the strip size S to N, C, and P. Requires retuning after any parameter change.
- Variable-size decomposition: performance depends on f. Choose f based on the memory overhead (starting from 0.5) and search; only a small number of experiments is needed.
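Read as a sketch, that tuning step could look like the following; the timing callback run(f) is a hypothetical placeholder, not an API from the paper:

  /* Tune the growth factor f with a small number of experiments.
   * run(f) is assumed to time one MSM execution with growth factor f. */
  double tune_f(double (*run)(double f)) {
      double candidates[] = { 0.3, 0.4, 0.5, 0.6, 0.7 };  /* centered on 0.5 */
      double best_f = candidates[0];
      double best_t = run(candidates[0]);
      for (int i = 1; i < 5; i++) {
          double t = run(candidates[i]);
          if (t < best_t) { best_t = t; best_f = candidates[i]; }
      }
      return best_f;
  }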

Implications and Future Work
- Message decomposition for latency hiding is worth applying on a regular basis, ideally transparently through run-time support rather than source-level transformations.
- The current work explored only communication primitives on contiguous data. The same principles apply to strided/"vector" accesses, but a unified performance model for complicated communication operations is needed.
- This should be combined with a framework for estimating the optimality of compound loop optimizations in the presence of communication, which would benefit all PGAS languages.

END

Performance Aspects of MSM
- MSM decomposes a large transfer into strips; the transfer of each strip is overlapped with computation.
- Unrolling increases the overlap potential by increasing the number of messages that can be issued.
- However: MSM increases the message start-up time, and unrolling increases message contention.
- How to combine them? This is determined by both hardware and application characteristics.