CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware www.cis.udel.edu/~cavazos/cisc879.

Presentation transcript:

CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture 3 Laws, Equality, and Inside a Cell

Lecture 2: Overview
  Know the Laws
  All are NOT Created Equal
  Inside a Cell

Two Important Laws
Amdahl’s Law
  Gene Amdahl’s observation in 1967
  Speedup is limited by serial portions
  Assumes fixed workloads and fixed problem size
Gustafson’s Law
  John Gustafson’s observation in 1988
  Rescues parallel processing from Amdahl’s Law
  Proposes fixed time and increasing work
  Sequential portions have diminishing effect

Amdahl’s Law
[Figure: blocks of 100 time units run sequentially; each parallelized part takes 50]
Parallelize parts 2 and 4 with 2 processors
Speedup: 25%

Amdahl’s Law (cont’d)
[Figure: blocks of 100 time units run sequentially; each parallelized part shrinks from 50 to 25]
Parallelize parts 2 and 4 with 4 processors
Speedup: 40%

Amdahl’s Law (cont’d)
[Figure: blocks of 100 time units run sequentially; the parallelized parts take negligible time]
Parallelize parts 2 and 4 with infinite processors
Speedup: only 70%
Multicore doesn’t look very appealing!
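
The arithmetic behind these three Amdahl slides can be checked with a short sketch. It assumes, as the figures suggest but do not state outright, a program of five parts costing 100 time units each, of which only parts 2 and 4 can be parallelized. The function and variable names are illustrative and do not come from the lecture.

```python
def amdahl_speedup(serial_time, parallel_time, processors):
    """Speedup for a fixed workload when only parallel_time is split across processors."""
    original = serial_time + parallel_time
    improved = serial_time + parallel_time / processors
    return original / improved


# Assumed workload: parts 1, 3, 5 are serial (300 units); parts 2 and 4 are parallel (200 units).
SERIAL_UNITS = 300.0
PARALLEL_UNITS = 200.0

for p in (2, 4):
    print(p, "processors:", round(amdahl_speedup(SERIAL_UNITS, PARALLEL_UNITS, p), 2))

# With unlimited processors the parallel parts cost nothing, so the serial time sets the bound.
amdahl_limit = (SERIAL_UNITS + PARALLEL_UNITS) / SERIAL_UNITS
print("upper bound:", round(amdahl_limit, 2))
```

Under these assumptions the speedup factors are 1.25, about 1.43, and at most about 1.67, which the slides round to gains of 25%, 40%, and 70%.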

Gustafson’s Law (cont’d)
[Figure: sequential blocks of 100 units; each parallel part now holds 200 units of work]
Boxes contain units of work now!
500 units of time, but 700 units of work!
Speedup: 40%

Gustafson’s Law (cont’d)
[Figure: sequential blocks of 100 units; each parallel part grows from 200 to 400 units of work]
Boxes contain units of work now!
500 units of time, but 1100 units of work!
Speedup: 120% (2.2x)
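
The Gustafson figures can be reproduced the same way. This sketch assumes the same five-part program (300 serial time units, 200 parallel time units), but holds elapsed time fixed at 500 units and lets the parallel parts do more work as processors are added. The name gustafson_speedup is illustrative, not taken from the lecture.

```python
def gustafson_speedup(serial_time, parallel_time, processors):
    """Scaled speedup: elapsed time stays fixed while parallel work grows with processors."""
    elapsed = serial_time + parallel_time
    work_done = serial_time + parallel_time * processors
    return work_done / elapsed


# Assumed workload: 300 serial units and 200 parallel units of time per run.
for p in (2, 4):
    work = 300.0 + 200.0 * p
    print(p, "processors:", work, "units of work, speedup", gustafson_speedup(300.0, 200.0, p))
```

With 2 processors this gives 700 units of work and a factor of 1.4; with 4 processors it gives 1100 units of work and a factor of 2.2. The work done keeps growing linearly with the processor count instead of flattening out as in the fixed-workload case.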

Gustafson’s Law (cont’d)
Gustafson made an important observation
  As processor counts grow, people scale the problem size
  Serial bottlenecks do not grow with problem size
  Increasing processors gives linear speedup
  20 processors roughly twice as fast as 10
This is why supercomputers are successful
  More processors allow increased dataset size
Reference:

Lecture 2: Overview
  Know the Laws
  All are NOT Created Equal
  Inside a Cell

All Multicores Not Equal
Multicore CPUs and GPUs are very different!
  CPUs run general purpose programs well
  GPUs run graphics (or similar programs) well
General Purpose Programs have
  Less parallelism
  More complex control requirements
GPU programs
  Highly parallel
  Arithmetic intense
  Simple control requirements

Floating-Point Operations
GPUs: more computational units and take better advantage of them.
[Figure: 32-bit FP operations per second]
Slide Source: NVIDIA CUDA Programming Guide 1.1

CPUs versus GPUs
CPUs devote lots of area to control and storage.
GPUs devote most area to computational units.
Slide Source: NVIDIA CUDA Programming Guide 1.1

CPU Programming Model
Slide Source: John Owens, EEC 227 Graphics Arch course
Scalar programming model
  No native data parallelism
Few arithmetic units
  Very small area
Optimized for complex control
Optimized for low latency, not high bandwidth

AMD K7 “Deerhound”
Slide Source: John Owens, EEC 227 Graphics Arch course

GPU Programming Model
Slide Source: John Owens (EEC 227 Graphics Arch) and Pat Hanrahan (Stream Prog. Env., GP^2 Workshop)
Streams
  Collections of data records
  Amenable to data parallelism
Kernels
  Inputs/outputs are streams
  Perform computation on each element of a stream
  No dependencies between stream elements
Stream storage
  Not cache (input read once/output written once)
  Producer-consumer locality
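
The stream/kernel model described on this slide can be sketched in a few lines. This is only an illustration in plain sequential Python, not GPU code: run_kernel, scale_and_offset, and square are hypothetical names, and a real GPU would process the records in parallel.

```python
def run_kernel(kernel, stream):
    """Apply a kernel to every record of an input stream, producing an output stream.

    Each record is processed independently, so the iterations could run in any
    order or in parallel without changing the result.
    """
    return [kernel(record) for record in stream]


def scale_and_offset(x):
    # Hypothetical first kernel: reads one input record, writes one output record.
    return 2.0 * x + 1.0


def square(x):
    # Hypothetical second kernel: consumes what the first kernel produced.
    return x * x


input_stream = [0.0, 1.0, 2.0, 3.0]
intermediate = run_kernel(scale_and_offset, input_stream)  # producer
output_stream = run_kernel(square, intermediate)           # consumer
print(output_stream)
```

Each input is read once and each output written once, and the second kernel consumes exactly what the first produced, which is the producer-consumer locality the slide refers to.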

Lecture 2: Overview
  Know the Laws
  All are NOT Created Equal
  Inside a Cell

Cell B.E. Design Goals
An accelerator extension to Power
  Exploits parallelism and achieves high frequency
  Sustains high memory bandwidth through DMA
Designed for flexibility
  Heterogeneous architecture
    PPU for control, general-purpose
    SPU for computation-intensive, little control
  Applicable to a wide variety of applications
The Cell Architecture has characteristics of both a CPU and GPU.

Cell Chip Highlights
Slide Source: Michael Perrone, MIT Fall 2007 course
241M Transistors
9 cores, 10 threads
>200 GFlops (SP)
>20 GFlops (DP)
>300 GB/s EIB
3.2 GHz shipping
Top freq. 4.0 GHz (in lab)
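
The single-precision figure on this slide can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on assumptions not stated on the slide: that 8 of the 9 cores are SPEs, that each SPE's 128-bit SIMD unit holds four 32-bit lanes, and that each lane completes one fused multiply-add (counted as 2 floating-point operations) per cycle. The PPE's contribution is ignored, and all names are illustrative.

```python
SPE_COUNT = 8                 # assumption: 9 cores = 1 PPE + 8 SPEs
SIMD_LANES_SP = 128 // 32     # four 32-bit lanes in a 128-bit SIMD register
FLOPS_PER_LANE_PER_CYCLE = 2  # assumption: one fused multiply-add per lane per cycle
CLOCK_GHZ = 3.2               # shipping frequency from the slide

# Peak single-precision throughput of the SPEs alone, in GFlops.
peak_sp_gflops = SPE_COUNT * SIMD_LANES_SP * FLOPS_PER_LANE_PER_CYCLE * CLOCK_GHZ
print(round(peak_sp_gflops, 1))
```

Under these assumptions the SPEs alone reach roughly 205 GFlops, which is consistent with the ">200 GFlops (SP)" figure.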

Cell Details
Slide Source: Michael Perrone, MIT Fall 2007 course
Heterogeneous multicore architecture
  Power Processor Element (PPE) for control tasks
  Synergistic Processor Element (SPE) for data-intensive processing
SPE Features
  No cache
  Large unified register file
  Synergistic Memory Flow Control (MFC)
  Interface to high-perf. EIB

Cell PPE Details
Slide Source: Michael Perrone, MIT Fall 2007 course
Power Processor Element (PPE)
  General Purpose 64-bit PowerPC RISC processor
  2-way hardware threaded
  L1: 32KB I; 32KB D
  L2: 512 KB
  For operating systems and program control

Cell SPE Details
Slide Source: Michael Perrone, MIT Fall 2007 course
Synergistic Processor Element (SPE)
  128-bit SIMD architecture
  Dual Issue
  Register File 128x128-bit
  Local Store (256KB)
  Simplified Branch Arch.
    No hardware BR predictor
    Compiler-managed hint
  Memory Flow Controller
    Dedicated DMA engine - Up to 16 outstanding requests

Compiler Tools
Slide Source: Michael Perrone, MIT Fall 2007 course
GNU-based C/C++ compiler (Sony)
  ppu-gcc/ppu-g++ - generates ppu code
  spu-gcc/spu-g++ - generates spu code
GDB debugger
  Supports both PPU and SPU debugging
  Different modes of execution

Compiler Tools
Slide Source: Michael Perrone, MIT Fall 2007 course
The XL C/C++ compiler
  ppuxlc/ppuxlc++ - generates ppu code
  spuxlc/spuxlc++ - generates spu code
Includes the following optimization levels
  -O0: almost no optimization
  -O2: strong, low-level optimization
  -O3: intense, low-level opts with basic loop opts
  -O4: all of -O3 plus detailed loop analysis and good whole-program analysis
  -O5: all of -O4 plus detailed whole-program analysis

Performance Tools
Slide Source: Michael Perrone, MIT Fall 2007 course
GNU-based tools
  OProfile - System level profiler (only PPU)
  gprof - generates call graphs
IBM Tools
  Static analysis tool (spu_timing)
    Annotates assembly file with scheduling and instruction issue estimates
  Dynamic analysis tool (CellBE system simulator)
    Can run your code on an x86 machine
    Can collect a variety of statistics

Compiling with the SDK
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
README_build_env.txt (IMPORTANT: you should read this!)
  Provides details on the build environment features, including files, structure and variables.
make.footer
  Specifies all of the build rules needed to properly build binaries
  Must be included in all SDK Makefiles (referenced relatively if $CELL_TOP is not defined)
  Includes make.header
make.header
  Specifies definitions needed to process the Makefiles
  Includes make.env
make.env
  Specifies the default compilers and tools to be used by make
make.footer and make.header should not be modified

Compiling with the SDK
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Defaults to gcc
Set in make.env with three variables set to gcc or xlc
  PPU32_COMPILER
  PPU64_COMPILER
  PPU_COMPILER [overrides PPU32_COMPILER and PPU64_COMPILER]
  SPU_COMPILER
Can change from the command line
  PPU_COMPILER=xlc SPU_COMPILER=xlc make
  make -e PPU64_COMPILER:=gcc -e PPU32_COMPILER:=gcc -e SPU_COMPILER:=gcc
  export PPU_COMPILER=xlc SPU_COMPILER=xlc ; make

Compiling with the SDK
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Use CELL_TOP or maintain relative directory structure
  ifdef CELL_TOP
  include $(CELL_TOP)/make.footer
  else
  include ../../../make.footer
  endif

Makefile variables
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
DIRS
  List of subdirectories to build first
PROGRAM_ppu, PROGRAMS_ppu
  32-bit PPU program (or list of programs) to build.
PROGRAM_ppu64, PROGRAMS_ppu64
  64-bit PPU program (or list of programs) to build.
PROGRAM_spu, PROGRAMS_spu
  SPU program (or list of programs) to build.
  If written as a standalone binary, can run without being embedded in a PPU program.

Makefile variables (cont’d)
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
LIBRARY_embed, LIBRARY_embed64
  Creates a linked library from an SPU program to be embedded into a 32-bit or 64-bit PPU program.
CC_OPT_LEVEL
  Optimization level for compiler to use
CFLAGS, CFLAGS_gcc, CFLAGS_xlc
  Additional flags for compiler to use (general or specific to gcc/xlc)
TARGET_INSTALL_DIR
  Specifies where built targets are installed

Sample Project
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Next Time
  Chapters 1-3 of the NVIDIA CUDA Programming Guide, version 1.1
  And all of Chapter 29 from GPU Gems 2
  Links on website