Performance and Productivity of Emerging Architectures. Presented by Jeremy Meredith, Sadaf Alam, and Jeffrey Vetter, Future Technologies Group.

Presentation transcript:

1 Performance and Productivity of Emerging Architectures. Presented by Jeremy Meredith, Sadaf Alam, and Jeffrey Vetter, Future Technologies Group.

2 Meredith_Architectures_SC07: Overview

Motivation:
- Technologies on the horizon are several orders of magnitude more powerful and power-efficient than conventional microprocessors.
- Grand-challenge applications require several orders of magnitude more performance than is now available.

Goals:
- Gauge the performance of emerging devices using high-level languages.
- Estimate productivity with respect to contemporary microprocessors.

[Chart: 2006 CPU vs. GPU performance (GFLOPS)]

3 Intel Xeon multicore CPU
- Dual-core (Woodcrest)
- Quad-core (Clovertown)
- Programming models: MPI, OpenMP, Pthreads
- Resource contention: memory bandwidth, I/O bandwidth

[Diagram: Blackford MCH chipset with FBD memory channels (1066/1333 MHz, 8.5/10.5 GB/s per channel, up to 21 GB/s aggregate), configurable PCIe ports, and ESB-2 I/O bridge (PCI-X, 10 GbE, I/OAT, iSCSI, SAS/SATA-2) serving Dempsey, Woodcrest, and Clovertown sockets over a dual independent bus]

4 IBM Cell Broadband Engine
- Cell heterogeneous multicore processor
- 8 synergistic processing element (SPE) cores
- 200 GFLOPS at 3.2 GHz
- 200 GB/s memory bandwidth
- Programming models: multisource compilation; threadlike launch semantics for SPEs

[Diagram: one PPE (PPU with L2 cache) and eight SPEs (each an SPU with local store and MFC) connected by the element interconnect bus (EIB, up to 96 B/cycle), with bus interface controllers to dual XDR memory and FlexIO. Source: M. Gschwind et al., Hot Chips 17, August 2005]

5 Graphics processing unit (GPU)
- NVIDIA 7900GTX
- 24 SIMD pixel pipelines
- 200 GFLOPS at 650 MHz
- 50 GB/s memory bandwidth
- Programming models: multiple-source compilation; gather semantics define parallelism; host CPU drives program setup and data transfer

[Diagram: graphics pipeline with vertex shader units, triangle setup, Z-cull, shader instruction dispatch, pixel pipelines, fragment crossbar, ROP engines, L2/texture caches, and memory partitions]

6 Extreme multithreading with Cray MTA-2
- Cray XMT system: MTA-1/MTA-2 processor architecture on XT3/4 scalable infrastructure via AMD Torrenza technology
- Fine-grain multithreading (128 concurrent hardware streams)
- Uniform memory hierarchy in MTA-1 and MTA-2

[Diagram: programs run in parallel as concurrent threads of computation; each processor multiplexes up to 128 hardware streams through an instruction-ready pool into a single pipeline of executing instructions, with streams left unused when a subproblem lacks parallelism or runs serial code]

7 Application kernels

Molecular dynamics (MD) calculations:
- Applications in biology, chemistry, and materials
- Integrates Newton's second law of motion; bonded calculations plus nonbonded electrostatic and van der Waals forces
- Workload characteristics: floating-point intensive; irregular memory access patterns

Covariance matrix calculation:
- Applications in hyperspectral imaging and machine learning
- Determines covariance among samples in the data
- Workload characteristics: memory-bandwidth and floating-point intensive; regular memory access patterns

8 Performance
- OpenMP improves performance moderately on commodity CPUs.
- The high parallelism of Cell and GPU can improve performance, but more so when memory access is regular.

[Chart: speedup on the molecular dynamics and covariance matrix kernels, relative to the single-thread Woodcrest reference, for OpenMP (Woodcrest, 2 cores; Clovertown, 4 cores), Cell (8 SPEs), GPU (NVIDIA 7900GTX), and MTA-2 (1 processor, 128 streams)]

9 Productivity
- Despite the modest performance gains from OpenMP, its ease of use means productivity can still improve.
- Conversely, Cell and GPU are fast enough that even substantial porting effort results in higher productivity.

[Chart: productivity improvement on the molecular dynamics and covariance matrix kernels, relative to the single-thread Woodcrest reference, for the same set of platforms]

10 Scaling when accessing multiple devices
- Increased parallelism from all architectures, with little additional effort, results in higher productivity.

[Charts: speedup and productivity scaling on the molecular dynamics and covariance matrix kernels for the reference (Woodcrest, single thread), OpenMP (Woodcrest, 2 cores; Clovertown, 4 cores), and MTA-2 (32 processors)]

11 Contacts
- Jeremy Meredith, Future Technologies Group, (865)
- Sadaf Alam, Future Technologies Group, (865)
- Jeffrey Vetter, Future Technologies Group, (865)