Experiments with auto-parallelizing SPEC2000FP benchmarks
Guansong Zhang
CASCON 2004, October 6, 2004
Software Group, Compiler Technology © 2004 IBM Corporation

Authors
Guansong Zhang, Priya Unnikrishnan, James Ren
Compiler Development Team, IBM Toronto Lab, Canada
(most slides from Priya's LCPC 2004 presentation)

Overview
- Introduction and motivation
- Structure of our auto-parallelizer
- Performance results
- Limitations and future work

Auto-parallelization, again?
- HPF (Fortran D) – successive versions
- MPI – V1.0 in 1994, with later revisions
- OpenMP – V1.0 in 1997, with later revisions
- Other parallel programming tools/models – Global Arrays (1994), HPJava (1998), UPC (2001), ...
- Auto-parallelization tools – ParaWise (CAPtools, 2000); other efforts: Polaris, SUIF, ...

The most effective way to parallelize
Discover the parallelism in the algorithm itself.

So, why?
- User expertise required
  – knowledge of parallel programming and dependence analysis
  – knowledge of the application; time and effort
- Extra cycles available in desktop, even laptop, computing
  – hyper-threading
  – multicore

SPEC, from CPU2000 to OMP
- Parallel programming is difficult
  – even just getting it right is hard
- Parallelizable problems exist
  – Amdahl's Law vs. Gustafson's Law
  – 10 of the 14 SPEC CPU FP tests are in SPECOMPM
  – 9 of the 11 SPECOMPM tests are in SPECOMPL
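As a reminder of the two laws contrasted above (standard textbook forms, not taken from the slides; the symbols are mine), with p the parallelizable fraction of the work and N the number of processors:

    S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}
    \qquad
    S_{\text{Gustafson}}(N) = (1 - p) + pN

Amdahl's Law bounds the speedup of a fixed-size problem, while Gustafson's Law gives the scaled speedup when the problem size grows with the processor count.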

Strategy
- Make parallelism accessible to general users
  – shield users from low-level details
- Take advantage of the extra hardware
  – do not waste the cycles and the power
- Our emphasis
  – use simple parallelization techniques
  – balance performance gain against compilation time

Compiler infrastructure

Our auto-parallelizer features
- Uses the OpenMP compiler and runtime infrastructure
  – for parallelization and parallel execution
- Essentially a loop parallelizer
  – inserts "parallel do" directives (see the example below)
- Can further optimize an OpenMP program
  – general optimizations specific to OpenMP programs
- Depends on the dependence analyzer
  – the core of the parallelizer, shared with other loop optimizations
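A minimal sketch of what this means in practice (the loop and the subroutine name AXPY are mine, not from the slides): for a loop the analyzer can prove independent, the parallelizer effectively wraps it in an OpenMP "parallel do" directive and lets the existing OpenMP runtime execute it.

!     Illustrative only: a loop with no cross-iteration dependences and the
!     directive the parallelizer would effectively insert around it.
      SUBROUTINE AXPY(N, ALPHA, X, Y)
      INTEGER N, I
      REAL ALPHA, X(N), Y(N)
!$OMP PARALLEL DO
      DO I = 1, N
         Y(I) = Y(I) + ALPHA * X(I)
      END DO
!$OMP END PARALLEL DO
      END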

Pre-parallelization phase
- Induction variable elimination
- Scalar privatization
- Reduction finding
- Loop transformations favoring parallelism
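A small illustration of the kind of loop these passes enable (my sketch; the names DOTSQ, T, and S are not from the slides): the scalar T can be privatized and S is recognized as a sum reduction, after which the loop is parallelizable.

!     Illustrative only: T is privatized per thread and S becomes a reduction.
      SUBROUTINE DOTSQ(N, X, S)
      INTEGER N, I
      REAL X(N), S, T
      S = 0.0
!$OMP PARALLEL DO PRIVATE(T) REDUCTION(+:S)
      DO I = 1, N
         T = X(I) * X(I)
         S = S + T
      END DO
!$OMP END PARALLEL DO
      END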

Basic loop parallelizer

(Flow diagram.) For each loop nest in the procedure, and for each loop in the nest: loops that are already user-parallel or must stay sequential are left alone; the parallelizer checks for side effects and dependences, splitting the loop where possible; it applies the compile-time cost test and rejects loops with insufficient cost; it inserts a runtime cost expression where the cost is unknown at compile time; otherwise it marks the loop as an auto-parallel loop.

Loop cost
- LoopCost = IterationCount * ExecTimeOfLoopBody
- Compile-time test: if LoopCost < Threshold, reject the loop
- Runtime loop-cost expression – extremely lightweight (sketch below)
- Runtime profiling – finer-granularity filtering
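A minimal sketch of the runtime test, assuming loop versioning on the cost expression (the subroutine name SCALE and the constants BODY_COST and THRESHOLD are hypothetical stand-ins for the compiler's internal estimates, not XL identifiers): when the trip count is unknown at compile time, the parallel copy runs only if the estimated cost is large enough.

!     Illustrative only: versioned loop guarded by a runtime cost expression.
      SUBROUTINE SCALE(N, A)
      INTEGER N, I
      REAL A(N)
      INTEGER, PARAMETER :: BODY_COST = 4, THRESHOLD = 10000
      IF (N * BODY_COST .GE. THRESHOLD) THEN
!$OMP PARALLEL DO
         DO I = 1, N
            A(I) = 2.0 * A(I)
         END DO
!$OMP END PARALLEL DO
      ELSE
         DO I = 1, N
            A(I) = 2.0 * A(I)
         END DO
      END IF
      END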

Impact of loop cost on performance

Accuracy of the loop cost algorithm

Benchmark | #Parallelizable high-cost loops (from PDF) | #High-cost loops selected by parallelizer | #Low-cost loops selected by parallelizer
swim      | 5  | 5  | 0
mgrid     | 7  | 7  | 0
applu     | 1  | 1  | 0
galgel    | 49 | 36 | 0
sixtrack  | 13 | 11 | 0
fma3d     | 3  | 3  | 0

Balance coarse-grained parallelism and locality
- Loop interchange for data locality
- Loop interchange to exploit parallelism
- The two transformations do not always work in harmony, as in this Fortran loop nest:

      DO I = 1, N
         DO J = 1, N
            DO K = 1, N
               A(I, J, K) = A(I, J, K+1)
            END DO
         END DO
      END DO
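To make the tension concrete, here is a sketch of one resolution (my illustration, not from the slides; the wrapper SHIFTK and the assumption that A has N+1 planes in its third dimension are mine): only the K loop carries a dependence, since A(I,J,K) reads A(I,J,K+1), so I or J may run in parallel. Parallelism wants the parallel loop outermost, while locality in column-major Fortran wants the stride-1 I loop innermost; interchanging to J, K, I serves both.

!     Illustrative only: J moved outermost and parallelized, I kept innermost
!     for stride-1 accesses, K left sequential because it carries the dependence.
      SUBROUTINE SHIFTK(N, A)
      INTEGER N, I, J, K
      REAL A(N, N, N+1)
!$OMP PARALLEL DO PRIVATE(I, K)
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               A(I, J, K) = A(I, J, K+1)
            END DO
         END DO
      END DO
!$OMP END PARALLEL DO
      END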

Performance: loop permutation for parallelism

Auto-parallelization performance – 10%

Expose limitations
- Compare SPEC2000FP and SPECOMP
- SPECOMP achieves good performance and scalability
  – disparity between explicit parallelization and auto-parallelization
- Expose missed opportunities
- 10 common benchmarks
  – compared on a loop-by-loop basis

Limitations
- Loop body contains function calls
- Array privatization needed

      COMPLEX*16 AUX1(12), AUX3(12)
      ....
      DO 100 JKL = 0, N2 * N3 * N4 - 1
      DO 100 I = (MOD(J+K+L,2)+1), N1, 2
         IP = MOD(I,N1) + 1
         CALL GAMMUL(1, 0, X(1,(IP+1)/2,J,K,L), AUX1)
         CALL SU3MUL(U(1,1,1,I,J,K,L), 'N', AUX1, AUX3)
  100 CONTINUE
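As a generic illustration of what array privatization means (my sketch, not the loop above; the subroutine SMOOTH and the work array W are invented names): W is completely redefined in each outer iteration before it is used, so giving each thread its own copy removes the storage-related dependence and makes the outer loop parallel.

!     Illustrative only: W is a per-iteration scratch array, so PRIVATE(W)
!     lets the I loop run in parallel.
      SUBROUTINE SMOOTH(N, M, A, B)
      INTEGER N, M, I, J
      REAL A(M, N), B(M, N)
      REAL W(M)
!$OMP PARALLEL DO PRIVATE(J, W)
      DO I = 1, N
         DO J = 1, M
            W(J) = 2.0 * A(J, I)
         END DO
         DO J = 1, M
            B(J, I) = W(J) + 1.0
         END DO
      END DO
!$OMP END PARALLEL DO
      END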

Limitations ... contd
- Zero-trip loops

      IV = 0
      DO J = 1, M
         DO I = 1, N
            IV = IV + 1
            A(IV) = 0
         ENDDO
      ENDDO

  – Induction variable substitution gives IV = I + (J-1)*N
  – This is valid only if N is positive
  – The outer loop cannot be parallelized if N can be zero
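A sketch of how such a loop can still be handled (my illustration, not the compiler's actual output; the subroutine name ZEROFILL is invented): substitute the closed form for IV and version the loop on a runtime guard so the J loop is parallelized only when N is positive.

!     Illustrative only: closed-form subscript plus loop versioning on N.
      SUBROUTINE ZEROFILL(M, N, A)
      INTEGER M, N, I, J
      REAL A(*)
      IF (N .GT. 0) THEN
!$OMP PARALLEL DO PRIVATE(I)
         DO J = 1, M
            DO I = 1, N
               A(I + (J - 1) * N) = 0.0
            END DO
         END DO
!$OMP END PARALLEL DO
      END IF
      END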

Improved auto-parallelization performance

System configuration
- SPEC2000 CPU FP benchmark suite
- IBM XL Fortran/C/C++ commercial compiler infrastructure, which implements OpenMP 2.0
- Hardware: 1.1 GHz POWER4 with 1-8 nodes
- Compiler options: -O5 -qsmp
  – compared against -O5 as the sequential baseline

Future work
- Fine-tune the heuristics
  – loop cost, permutation, unroll
- Further loop parallelization
  – array dataflow analysis, array privatization
  – doacross, loops with carried dependences
  – interprocedural and runtime dependence analysis
- Speculative execution
  – OpenMP threadprivate, sections, task queue
- Keep the increase in compilation time reasonable
  – not aiming to compete with auto-parallelization tools in the near future