HiPERiSM Consulting, LLC
George Delic, Ph.D.
(919) 484-9803
P.O. Box 569, Chapel Hill, NC
Copyright, HiPERiSM Consulting, LLC

PERFORMANCE OF AQM APPLICATIONS ON COMMODITY LINUX PLATFORMS
George Delic, Ph.D.
Models-3 User's Workshop, October 18-20, 2004, Chapel Hill, NC

Overview
1. Introduction
2. Choice of Hardware, Operating System, and Compilers
3. Choice of Benchmarks
4. Results of Benchmarks
5. Conclusions
6. Outlook

1. Introduction
Motivation:
- Air quality models (AQMs) are migrating to COTS hardware
- Linux is the preferred operating system
- A rich choice of compilers and tools is available
- Portability issues need to be understood
What is known about compilers for COTS?
- A requirements analysis is needed of the differences in:
  - Performance
  - Numerical accuracy and stability
  - Portability

2. Choice of Hardware, Operating System, and Compilers
Hardware:
- Intel Pentium 4 Xeon (3 GHz, dual processor) with SSE2 extensions and 1 MB L3 cache
Operating system:
- Linux kernel
Fortran compilers for IA-32 Linux:
- Absoft 8.0 (9.0 is the current release)
- Intel 8.0 (8.1 released 9/13/04)
- Lahey 6.2
- Portland CDK 5.1 (5.2 is the current release)

3. Choice of Benchmarks
3.1 STREAM memory benchmark
- Developed by John D. McCalpin (hosted at the University of Virginia)
- Available from the STREAM web site
- Four kernels to measure memory bandwidth
- Useful on commodity hardware where multiple CPUs share the system bus
- Serial (used here) or OpenMP versions (see HiPERiSM's URL)

The memory bandwidth problem
(STREAM logo used with permission)

Compute Kernels of the STREAM benchmark
Best results of ten trials, with an iteration range of 1 to 20 x 10^6 (Mops = memory operations):

No  Name   Kernel                 Bytes/iterate  Flops/iterate  Mops/iterate
1   COPY   a(i) = b(i)            16             0              2
2   SCALE  a(i) = q*b(i)          16             1              2
3   ADD    a(i) = b(i) + c(i)     24             1              3
4   TRIAD  a(i) = b(i) + q*c(i)   24             2              3
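To make the byte counts concrete: with 8-byte operands, TRIAD reads b(i) and c(i) and writes a(i), i.e. 24 bytes and 2 flops per iterate. Below is a minimal Fortran sketch of a timed TRIAD loop. It is illustrative only, not the actual STREAM source, and the array size is an assumption chosen to be far larger than the 1 MB L3 cache:

    program triad_sketch
      implicit none
      integer, parameter :: n = 2000000          ! 2 x 10^6 elements >> 1 MB L3 cache
      real(8), allocatable :: a(:), b(:), c(:)
      real(8) :: q, seconds, mbps
      integer :: i, t0, t1, rate

      allocate (a(n), b(n), c(n))
      b = 1.0d0;  c = 2.0d0;  q = 3.0d0

      call system_clock(t0, rate)                ! wall-clock timer
      do i = 1, n
         a(i) = b(i) + q*c(i)                    ! TRIAD: 24 bytes, 2 flops per iterate
      end do
      call system_clock(t1)

      seconds = real(t1 - t0, 8) / real(rate, 8)
      mbps = 24.0d0 * real(n, 8) / seconds / 1.0d6
      print *, 'TRIAD rate (MB/s): ', mbps
      print *, 'check value: ', a(n)             ! keeps the loop live under optimization
    end program triad_sketch

The other three kernels differ only in the assignment line, as given in the table above.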

3. Choice of Benchmarks (cont.)
3.2 Princeton Ocean Model (POM)
- An example of "real-world" code that is numerically unstable with single-precision (sp) arithmetic
- 500+ vectorizable loops to exercise compilers
- 9 procedures account for 85% of the CPU time
- 2-day simulation for three (i,j,k) grids:
  Grid 1: 100 x 40 x 15   (scaling = 1)
  Grid 2: 128 x 120 x 16  (scaling = 4.4)
  Grid 3: 256 x 256 x 16  (scaling = 17.5)
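The precision sensitivity is easy to demonstrate in isolation. The following sketch (illustrative only, not POM code) accumulates the same value in real(4) and real(8); the single-precision sum drifts visibly once the running total grows large relative to the addend:

    program sp_drift
      implicit none
      integer :: i
      real(4) :: s4
      real(8) :: s8

      s4 = 0.0
      s8 = 0.0d0
      do i = 1, 10000000             ! 10^7 additions of 0.1
         s4 = s4 + 0.1
         s8 = s8 + 0.1d0
      end do
      print *, 'real(4) sum: ', s4   ! visibly far from 1.0e6
      print *, 'real(8) sum: ', s8   ! close to 1.0d6
    end program sp_drift

In a long time integration such rounding drift feeds back into the solution, which is one way sp arithmetic can destabilize a model run.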

3. Choice of Benchmarks (cont.)
3.3 MM5 Community Model v5.3
- History of development on vector machines
- Performance results are available for many platforms
- Used the Storm-of-the-Century (SOC) benchmark
- Compared the Intel and Portland compilers

3. Choice of Benchmarks (cont.)
3.4 CAMx
- Developed by ENVIRON
- Available from ENVIRON's web site
- The benchmark used the 8/22-8/31, 2000 scenario for the Houston Greater Metro area with TCEQ data
- Used the UT/UNC version of CAMx
Grateful acknowledgements to Dr. Harvey Jeffries and his Ph.D. student, Byeong-Uk Kim, for sharing the code and for help on how to use it.

4. Results of Benchmarks
4.1 STREAM results
- Tested with four compilers
- Run without and with compiler optimization switches
Note: the STREAM kernels are designed to "fool" optimizing compilers, so differences with and without optimization should be small.

Memory bandwidth for STREAM (MB/second, no optimization)
(Table of COPY, SCALE, ADD, and TRIAD rates for the Absoft, Intel, Lahey, and Portland compilers; the numeric values were not preserved in this transcript.)

Memory bandwidth for STREAM (MB/second, with optimization)
(Table of COPY, SCALE, ADD, and TRIAD rates for the Absoft, Intel, Lahey, and Portland compilers; the numeric values were not preserved in this transcript.)

Memory bandwidth for STREAM: ratio of rates (chart)

Summary of STREAM results
- Different compilers produce code with different memory performance.
- The performance enhancement with optimization enabled is most noticeable for kernels 1 and 2 with all compilers.
- Enabling optimization for the Intel compiler boosts memory bandwidth by factors of:
  x 2 for kernels 1 and 2
  x 1.7 for kernels 3 and 4
Note: these results are specific to serial code; for OpenMP results see the HiPERiSM URL.
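For reference, the OpenMP version of STREAM work-shares the same kernels across processors; a minimal sketch of the TRIAD loop using standard OpenMP Fortran directives (variable names as in the TRIAD sketch in section 3.1):

    !$omp parallel do
    do i = 1, n
       a(i) = b(i) + q*c(i)    ! iterations divided among the threads
    end do
    !$omp end parallel do

On the dual-processor Xeon used here both CPUs share one system bus, so a second thread helps only until that bus saturates (the point made in section 3.1).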

4. Results of Benchmarks (cont.)
4.2 POM results
- Tested four compilers without SSE2
- Tested three compilers with SSE2
- Examined differences with problem-size scaling:
  Grid 1: 100 x 40 x 15   (scaling = 1)
  Grid 2: 128 x 120 x 16  (scaling = 4.4)
  Grid 3: 256 x 256 x 16  (scaling = 17.5)

Comparing Execution Times: POM without SSE2 (seconds)
(Table of run times per grid for the Absoft, Intel, Lahey, and Portland compilers; the numeric values were not preserved in this transcript.)
Note: the Lahey compiler required the "--long" switch for large integers with the larger grids, and this reduced performance.

Comparing Execution Times: POM (chart)

Comparing Execution Times: POM with SSE2 (% gain vs. no SSE2)
(Table of percentage gains per grid for the Intel, Lahey, and Portland compilers; the numeric values were not preserved in this transcript.)

Summary of POM results
Without SSE2:
- Grid 2 and Grid 3 show large variability in performance, due to the Lahey results
- For Grids 1-3 the Intel compiler is best (memory boost?)
With SSE2:
- The performance gain increases with problem size for both the Intel and Portland compilers:
  ~26% for Grid 2
  ~30% for Grid 3
- Intel shows the smallest wall clock time (vs. Portland):
  x 1.1 faster for Grid 2
  x 1.2 faster for Grid 3

4. Results of Benchmarks (cont.)
4.3 MM5 results
- Used the Storm-of-the-Century (SOC) benchmark
- Compared the Intel and Portland compilers with three switch groups: noopt, opt, SSE
Note: there is no exact Intel equivalent of the vect switch group because the Intel compiler does not distinguish vector and SSE instructions.

Comparing Execution Times: MM5 (chart)

Summary of MM5 results
- Intel performance gain vs. noopt:
  the opt switch group yields 55%
  the SSE switch group yields 66%
- Portland performance gain vs. noopt:
  the opt switch group yields 32%
  the SSE switch group yields 38%
- Speedup of Intel vs. Portland:
  x 1.26 for the opt switch group
  x 1.51 for the SSE switch group
- Overall, Intel delivers the smallest wall clock time.

4. Results of Benchmarks (cont.)
4.4 CAMx results
Tested the Portland pgf90 compiler with five switch groups that progressively enable higher-level optimizations:
- noopt (baseline: no optimization)
- opt (scalar optimizations)
- vect (vector optimizations)
- SSE (SSE for vector instructions)
- FSSE (SSE for scalar and vector instructions)
Note: Portland compiler technical support made some suggestions.

Comparing Execution Time: CAMx (chart)

Summary of CAMx results
- No reduction in wall clock time from any of the compiler optimizations.
- CAMx receives no benefit from this generation of hardware and compiler technology, which has been demonstrated to produce such benefits in the other cases studied here.
- Possible reasons (see the sketch below):
  - The code is scalar and its vector potential is inhibited
  - CAMx is I/O bound and writes one word at a time in implied-DO nested loops
  - CAMx is sensitive to memory and I/O bandwidth
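To illustrate the write pattern at issue, compare an unformatted write driven by implied-DO loops with a whole-array write of the same data. This is an illustrative sketch, not actual CAMx source; iunit, conc, and the bounds nx, ny, nz are assumed names:

    ! Element by element: the I/O library is handed one word at a time.
    write (iunit) (((conc(i,j,k), i = 1, nx), j = 1, ny), k = 1, nz)

    ! Whole array: a single contiguous transfer of the same data.
    write (iunit) conc

An I/O runtime may or may not recognize the implied-DO form as a contiguous block; when it does not, the per-element overhead dominates, which is consistent with the I/O-bound behavior observed here.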

5. Conclusions
- STREAM shows that memory bandwidth, not clock speed, is the principal differentiator for COTS platforms.
- POM and MM5 show that legacy vector codes receive performance boosts from optimizations and SSE on COTS hardware with current compilers, and that the gains increase with problem size.
- CAMx has serious problems on COTS platforms and does not benefit from the performance potential available now.
- Consequence for CAMx: a re-design of the code to reach its potential performance limits is advisable.

6. Outlook
- Hardware: COTS delivers good performance on legacy vector code, and the outlook for such code is good since future HPC CPUs will have longer pipelines and larger caches.
- Linux: the operating system is sufficiently reliable.
- Programming environment: rich in compiler and tool technology for code developers.
- Consequence for AQMs: this outlook for hardware, Linux, and the programming environment calls for careful, ongoing re-design of code to reach the potential performance limits of future COTS technology.

HiPERiSM's URL
See the Technical Reports pages for details of the compiler switches.