E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 1

Enzo Papandrea
COMPUTING PERFORMANCE

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 2

COMPUTING PERFORMANCE/1
MTR - PT, H2O, O3 - Orbit 2081 - 72 Sequences

November 2003:
T = 10h 30m  [ AlphaServer ES45, CPU 1 GHz ]
T = 5h 45m   [ Intel P-IV, CPU 2.8 GHz ]

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 3

COMPUTING PERFORMANCE/2
FM = FORWARD MODEL, AL = MATRIX ALGEBRA

November 2003:
T = 5h 45m  [ P-IV ]
  T_FM = 101m x 3 = 303m  (88%)
  T_AL = 20m x 2 = 40m    (12%)

February 2004:
T = 54m 54s  [ P-IV ]
  T_FM = 53m 32s  (98%)
  T_AL = 77s      (2%)
T = 2h 0m 24s  [ Alpha ]

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 4

LOOP OPTIMIZATION

When the code was developed, attention was paid mainly to the correctness of the results, not so much to speed.

Stride: the constant offset between the addresses of successive elements of an array as the loop traverses it. In Fortran, arrays are stored in column-major order.

Stride 1000 - slow access:

do i=1, 1000
  do j=1, 1000
    a(i,j) = b(i,j)
  enddo
enddo

Stride 1 - fast access!

do j=1, 1000
  do i=1, 1000
    a(i,j) = b(i,j)
  enddo
enddo
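A minimal sketch of how the two orderings can be compared in practice (the timing harness, the array size, and the use of the cpu_time intrinsic are our assumptions, not part of the slides):

program stride_test
  implicit none
  integer, parameter :: n = 1000
  integer :: i, j
  real*8 :: a(n,n), b(n,n)
  real :: t0, t1, t2

  b = 1.0d0

  call cpu_time(t0)
  do i = 1, n            ! inner loop over j: stride n, slow
    do j = 1, n
      a(i,j) = b(i,j)
    enddo
  enddo
  call cpu_time(t1)

  do j = 1, n            ! inner loop over i: stride 1, fast
    do i = 1, n
      a(i,j) = b(i,j)
    enddo
  enddo
  call cpu_time(t2)

  print *, 'stride-n version: ', t1 - t0, ' seconds'
  print *, 'stride-1 version: ', t2 - t1, ' seconds'
end program stride_test

On a machine like the P-IV of the slides the second loop nest should be markedly faster; the exact ratio depends on the cache hierarchy.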

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 5

INNER LOOP VECTORIZATION/1

SSE2 architecture (P-IV): can execute two 64-bit double-precision floating-point operations, or four 32-bit single-precision operations, per clock cycle:

[ x0 | x1 | x2 | x3 ]  +  [ y0 | y1 | y2 | y3 ]  =  [ x0+y0 | x1+y1 | x2+y2 | x3+y3 ]

INTEL FORTRAN COMPILER (IFC) - PUBLIC DOMAIN
CAN VECTORIZE REAL*4 AND REAL*8 WITH -xW
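As an illustration (the subroutine and compile line below are our sketch; the -vec_report reporting flag is an assumption about the IFC of that era, so check your compiler version):

subroutine daxpy_like(n, a, x, y)
  ! unit-stride REAL*8 loop: IFC can map it onto SSE2,
  ! processing two doubles per instruction
  implicit none
  integer :: n, i
  real*8 :: a, x(n), y(n)
  do i = 1, n
    y(i) = y(i) + a*x(i)
  enddo
end subroutine daxpy_like

! compile with P-IV vectorization and ask for a vectorization report:
!   ifc -xW -vec_report2 -c daxpy_like.f90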

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 6

INNER LOOP VECTORIZATION/2

Non-unit stride, inner loop too small:

do i=1,1000
  do j=1,3
    x(i,j) = y(i,j) + 9*z(i,j)
  enddo
enddo

Vectorizable!

do i=1,1000
  x(i,1) = y(i,1) + 9*z(i,1)
  x(i,2) = y(i,2) + 9*z(i,2)
  x(i,3) = y(i,3) + 9*z(i,3)
enddo

The same transformation with the arrays stored transposed (first dimension 3):

do i=1,1000
  x(1,i) = y(1,i) + 9*z(1,i)
  x(2,i) = y(2,i) + 9*z(2,i)
  x(3,i) = y(3,i) + 9*z(3,i)
enddo

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 7

PARALLEL ALGORITHM

February 2004 [ Linux cluster, 8 Intel P-IV, CPU 2.8 GHz ]:
T = 8m 20s
  T_FM = 6m 57s  (83%)
  T_AL = 77s     (16%)

IF THE FM WERE COMPLETELY PARALLEL, THE FM COMPUTING TIME WOULD BE: (53m 32s)/8 = 6m 41s
FM IS PARALLEL (WITH 8 CPUs) AT 96%
WORK OVER THE CPUs IS WELL BALANCED
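The 96% figure matches the ratio of the ideal to the measured FM time; reading it as a parallel efficiency is our interpretation of the slide:

\[
T_{\mathrm{ideal}} = \frac{53\,\mathrm{m}\,32\,\mathrm{s}}{8} = \frac{3212\,\mathrm{s}}{8} \approx 401.5\,\mathrm{s} \approx 6\,\mathrm{m}\,41\,\mathrm{s},
\qquad
E = \frac{T_{\mathrm{ideal}}}{T_{\mathrm{FM}}} = \frac{401.5\,\mathrm{s}}{417\,\mathrm{s}} \approx 0.96,
\]

where 417 s is the measured T_FM = 6m 57s.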

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 8

SCALABILITY AND SPEED-UP

SCALABILITY DESCRIBES THE ABILITY TO ACHIEVE PERFORMANCE PROPORTIONAL TO THE NUMBER OF PROCESSORS.

A MEASURE OF THE SCALABILITY IS PROVIDED BY THE SPEED-UP: THE TIME SPENT TO SOLVE A PROBLEM ON ONE PROCESSOR DIVIDED BY THE TIME SPENT TO SOLVE THE SAME PROBLEM ON P PROCESSORS.
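In symbols (the speed-up as defined above; the efficiency E(p) is the standard companion definition, not stated on the slide):

\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}.
\]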

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 9

SCALABILITY

[ scalability plot not reproduced in the transcript ]

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 10

SPEED-UP

[ speed-up plot not reproduced in the transcript ]

72 is not divisible by 5 or 7!
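The remark explains the dips in the measured speed-up: with 72 sequences distributed over p CPUs, the busiest CPU processes \(\lceil 72/p \rceil\) of them, which bounds the attainable speed-up. A short check (our arithmetic, assuming equal cost per sequence):

\[
S_{\max}(p) = \frac{72}{\lceil 72/p \rceil}:\qquad
S_{\max}(5) = \frac{72}{15} = 4.8,\quad
S_{\max}(7) = \frac{72}{11} \approx 6.5,\quad
S_{\max}(8) = \frac{72}{9} = 8.
\]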

E.Papandrea PM3 - Paris, 2nd Mar 2004 - DFCI COMPUTING PERFORMANCE - Page 11

CONCLUSIONS & PERSPECTIVES

COMPUTING TIME: SOME IMPROVEMENTS CAN STILL BE OBTAINED (ESPECIALLY ON THE ALPHA SYSTEM)

MEMORY REQUIREMENTS: FOR THIS TEST CASE THE MEMORY REQUIREMENT IS 1.05 Gbyte (1.7 Gbyte AT THE LAST MEETING); IT CAN BE REDUCED AT THE EXPENSE OF CODE READABILITY (WORK IS IN PROGRESS)