Cell Broadband Engine. INF5062, Carsten Griwodz & Pål Halvorsen, University of Oslo.


Cell Broadband Engine

Cell Broadband Engine Structure
[Figure: SPE, PPE, MIC and EIB blocks]

Cell Broadband Engine Structure
• PPE - Power Processing Element
• SPE - Synergistic Processing Element
• MIC - Memory Interface Controller
• EIB - Element Interconnect Bus
[Figure: SPE, PPE, MIC and EIB blocks]

Cell Broadband Engine Structure
• EIB
  − 4 rings
  − Each 128 bytes wide (128 bytes concurrently in transfer)
  − Separate send and receive "ramps"
[Figure: EIB highlighted among SPE, PPE and MIC]

Cell Broadband Engine Structure
• MIC
  − One or two RAMBUS XDR data interfaces
  − Read or write lengths: 1, 2, ..., 8, 16, 32, 64 or 128 bytes
  − 64-entry read queue, 64-entry write queue
  − Up to 64 GB XDR
[Figure: MIC highlighted among SPE, PPE and EIB]

Cell Broadband Engine Structure
• PPE
  − Power processing element
• PPU
  − Power processing unit
  − 32 kB L1 instruction cache
  − 32 kB L1 data cache
• PPSS
  − PowerPC processor storage subsystem
  − 512 kB unified L2 cache
  − EIB accesses are sequential and ordered
[Figure: PPE = PPU (L1 instruction and data caches) + PPSS (L2 cache), with ≤32-byte and ≤16-byte interfaces]

Cell Broadband Engine Structure
• PPU
  − 2-way symmetric multithreading (duplicated register set)
  − 32 64-bit general purpose registers
  − 32 64-bit floating point registers
  − 32 128-bit vector registers
  − 1 link register for branching
  − 1 count register for loop count or branching
[Figure: PPU register files (32 GP integer, 32 FP, 32 vector) plus condition, link, count, exception and control/save-restore registers]

Cell Broadband Engine Structure
• SPE
  − Synergistic processing element
• SPU
  − Synergistic processing unit
  − 4 execution units (for 4 hardware contexts)
  − RF: register file
  − LS: local store
• MFC: memory flow controller
  − MMIO registers: memory-mapped I/O registers
  − DMA controller
[Figure: SPE = SPU (register file, local store) + MFC (MMIO registers, DMA controller)]

Cell Broadband Engine Structure
• Register file
  − 128 registers
  − Each 128 bits wide
• Local store
  − 256 KB
  − 128-bit-aligned access
  − Transparent (but less efficient) read-modify-write cycle for unaligned access
  − Access times and priority:
    1. MMIO: 16 bytes/cycle
    2. DMA: 128 bytes/cycle
    3. Data fetch: 16 bytes/cycle
    4. Instruction fetch: 128 bytes/cycle
[Figure: SPE with SPU (register file, local store) and MFC (MMIO registers, DMA controller)]

Cell Broadband Engine Structure
• DMA controller
  − Can have 16 DMA operations concurrently in flight, more than normal CPUs (i.e. 128 total)
  − Transfer size up to 16 KB per operation
• Mailboxes
  − 32-bit dedicated registers
  − Implemented as channels
  − Used for communication with the PPE
  − 2 outgoing mailboxes
  − 1 incoming mailbox
[Figure: SPE with SPU and MFC (MMIO registers, DMA controller)]

Cell Broadband Engine Structure
• SPE
  − Each SPE's memory controller can have 16 DMA operations concurrently in flight, more than normal CPUs (i.e. 128 total)
• Observation
  − SPEs can manage memory accesses without PPE help
[Figure: SPE, PPE, MIC and EIB blocks]

Cell Broadband Engine Structure
• Either
  − 64-bit flat address space
  − no mapping of local store to PPE address space
• Or
  − virtual addressing
  − 42-bit real address space, independently set up for data and instruction addressing
  − 64-bit effective address space
  − 65-bit virtual address space
• 256 MB segment sizes
  − Mappable from the 65-bit space to the 42-bit space
• 4 kB plus 2 of (16 kB, 1 MB, 16 MB) page sizes
  − Swappable
[Figure: flat address space of the PPE mapped onto main memory]

Cell Broadband Engine Structure
[Figure: flat address space of the PPE mapped onto main memory, alongside SPE, PPE, MIC and EIB]

LS Access Methods
• Main storage (effective address space)
• DMA requests can be sent to an MFC either by software on its associated SPU or on the PPE, or by any other processing device that has access to the MFC's MMIO problem-state registers

nVIDIA

Kinds of memory
• Constant memory cache, texture memory cache, shared memory and registers are on chip
• Texture memory, constant memory, global memory and local memory are in device memory
• Reads from texture memory and constant memory are cached
• Reads from and writes to local and global memory are not cached
[Figure: SM with SPs (registers, local memory), shared memory, constant and texture caches; constant, texture and global (device) memory off chip]
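A minimal CUDA C sketch of where these kinds of memory appear in code (all names and sizes are illustrative, not from the slides): constant memory and texture references are declared at file scope, shared memory inside the kernel, and global memory is whatever cudaMalloc returned.

#include <cuda_runtime.h>

__constant__ float coeff[16];                        // constant memory: cached, read-only in kernels
texture<float, 1, cudaReadModeElementType> texRef;   // texture reference: cached reads
                                                     // (bind with cudaBindTexture() before launching)

__global__ void kinds_of_memory(const float *gmem, float *out, int n)
{
    __shared__ float tile[64];          // shared memory, on chip, one tile per block
                                        // (assumes blockDim.x == 64)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = gmem[i];                 // uncached global (device) memory read
        float r = tile[threadIdx.x] * coeff[0]       // register operand
                + tex1Dfetch(texRef, i);             // cached texture fetch
        out[i] = r;                                  // uncached global memory write
    }
}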

Kinds of memory
• Coherency mechanisms: no such thing
  − Global, shared: no coherency, explicit synchronization
  − Constant, texture: read-only, 4 cycles per read, broadcast capable
  − Local: isn't shared between threads
  − Global: not cached
[Figure: GPU memory hierarchy]
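To make the "explicit synchronization" point concrete, here is a minimal sketch (kernel name and the block size of 128 are assumptions) of a per-block sum: __syncthreads() makes shared-memory writes visible within a block, while different blocks only exchange results through global memory, typically across kernel launches.

__global__ void block_sum(const float *in, float *block_results, int n)
{
    __shared__ float partial[128];                 // assumes blockDim.x == 128 (a power of two)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // make all writes visible to the block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                           // synchronize after every reduction step
    }

    if (threadIdx.x == 0)
        block_results[blockIdx.x] = partial[0];    // one global write per block; other blocks
                                                   // do not see it within this launch
}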

Threads per SM
• Accessing a register is zero extra clock cycles per instruction
  − delays may occur due to register read-after-write dependencies and register memory bank conflicts
• Delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them
• The thread scheduler works best when the number of threads per block is a multiple of 64
[Figure: GPU memory hierarchy]
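A hedged sketch of a launch configuration that follows these rules, with an illustrative saxpy kernel: 192 threads per block is both a multiple of 64 and, with enough resident blocks, gives at least 192 active threads per multiprocessor to hide register read-after-write latency.

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch_saxpy(float a, const float *d_x, float *d_y, int n)
{
    const int threadsPerBlock = 192;                          // multiple of 64, >= 192
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // one thread per element
    saxpy<<<blocks, threadsPerBlock>>>(a, d_x, d_y, n);
}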

Threads per SM
• Per SM: 8192 registers, 16 KB shared memory, 8 KB constant memory cache, 8 KB texture memory cache
[Figure: GPU memory hierarchy annotated with these per-SM sizes]

Multiprocessor facts
• The maximum number of threads per block is 512
• The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, respectively
• The maximum size of each dimension of a grid of thread blocks is 65535
• The warp size is 32 threads
• The maximum number of active blocks per multiprocessor is 8
• The maximum number of active warps per multiprocessor is 24
• The maximum number of active threads per multiprocessor is 768
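These limits are for the compute-capability-1.x parts discussed here; a small sketch like the following queries them at runtime instead of hard-coding them (device 0 is assumed).

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims:        %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("warp size:             %d\n", prop.warpSize);
    printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:   %d\n", prop.regsPerBlock);
    return 0;
}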


Performance
• Speed host - device memory
  − PCIe x16: 3.2 GB/s
• Speed device memory - shared memory
  − 80 GB/s
• Latencies are counted in cycles
  − equivalent to the clock speed of the card
  − 1.4 GHz on the GeForce GTX 280 GPU
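As a rough way to check the host-device figure on a given machine, the following sketch times a cudaMemcpy with CUDA events (the 64 MB buffer size is arbitrary, and pageable memory is used; pinned memory allocated with cudaMallocHost would measure higher).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;                 // 64 MB test buffer
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host -> device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return 0;
}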

Shared memory hints
• Shared memory
  − amount available per multiprocessor is 16 KB
  − organized into 16 banks
  − 32 bits readable from each bank in 2 clock cycles
  − concurrent access to the same bank by several threads of a warp leads to a bank conflict
  − avoid high-degree bank conflicts
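A common illustration of the bank rule is a tiled transpose: padding the shared-memory tile by one element keeps column reads conflict-free. This is a minimal sketch assuming a square matrix whose width is a multiple of the 16x16 tile, launched with dim3 block(TILE, TILE) and dim3 grid(width/TILE, width/TILE).

#define TILE 16

__global__ void transpose_tile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding: column accesses below
                                             // fall into different banks

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row write: no conflict
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read: conflict-free
                                                           // thanks to the padding
}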

Shared memory
• Hundreds of times faster than global memory
• Threads can cooperate via shared memory
• Use one / a few threads to load / compute data shared by all threads
• Use it to avoid non-coalesced access
  − Stage loads and stores in shared memory to re-order non-coalesceable addressing
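A minimal sketch of the staging idea (kernel name, BLOCK and RADIUS are assumptions): the block cooperatively loads a tile plus halo with coalesced reads, only a few threads fetch the halo, and every thread then reuses its neighbours' values from shared memory instead of re-reading global memory.

#define RADIUS 3
#define BLOCK  128

__global__ void stencil_1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];     // assumes blockDim.x == BLOCK

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    int lidx = threadIdx.x + RADIUS;

    // coalesced load of the main tile
    tile[lidx] = (gidx < n) ? in[gidx] : 0.0f;

    // a few threads also load the halo on each side
    if (threadIdx.x < RADIUS) {
        int left  = gidx - RADIUS;
        int right = gidx + blockDim.x;
        tile[lidx - RADIUS]     = (left  >= 0) ? in[left]  : 0.0f;
        tile[lidx + blockDim.x] = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gidx < n) {
        float sum = 0.0f;
        for (int off = -RADIUS; off <= RADIUS; ++off)
            sum += tile[lidx + off];               // neighbours come from shared memory
        out[gidx] = sum;
    }
}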

[Figure: BAD vs. GOOD memory access pattern]

[Figure: BAD memory access pattern]

Coalescing
• Coalescing
  − fast, bursty transfer between multiple registers allocated to different threads in the same warp
  − if the per-thread memory accesses for a single warp (on today's hardware actually a half-warp) form a contiguous range of addresses, the accesses are coalesced into a single access
  − is not a dynamic ability: the compiler must identify it and generate appropriate code
  − occurs when thread n accesses "base-address + size * n", where base-address is aligned to 16 * size and size is the size of the read operation: 4, 8 or 16 bytes
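The rule is easiest to see in two copy kernels (illustrative names): in the first, thread n reads element base + n, so a half-warp's accesses form one contiguous, properly aligned range and coalesce; in the second, the stride breaks the pattern and each thread pays for a separate memory transaction.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];             // thread i touches base-address + 4 * i: coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];    // addresses within a half-warp are not contiguous:
                                    // not coalesced
}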

Memory latency and features
• Read from global memory
  − 16-bit aligned, otherwise read-modify-write
  − 32-bit, 64-bit and 128-bit read operations supported
• Coalesced read
  − if all threads in a warp read concurrently in a suitable pattern, the reads are "coalesced"
  − a coalesced read of 32 bits per thread is slightly faster than a coalesced read of 64 bits per thread and much faster than a coalesced read of 128 bits per thread
  − but all are warp-size times faster than non-coalesced reads
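A sketch of the three coalesced read widths using float, float2 and float4 per thread (bounds checks are omitted; the array length is assumed to be a multiple of the total number of threads).

__global__ void read32(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                  // 32-bit read per thread
}

__global__ void read64(const float2 *in, float2 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                  // 64-bit read per thread
}

__global__ void read128(const float4 *in, float4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                  // 128-bit read per thread
}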

Device (global) memory hints
• Device memory
  − coalesced vs. non-coalesced access times differ by an order of magnitude
• Global/local device memory

Memory access
• Access to device memory is slow
  − 4 cycles to set up, 400 to 600 cycles to complete
• Use multithreading to hide that time
• Partition your computation to keep the GPU multiprocessors equally busy
  − Many threads, many thread blocks
• Keep resource usage low enough to support multiple active thread blocks per multiprocessor
  − Registers, shared memory
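A minimal sketch of sizing the grid from the number of multiprocessors so that every SM has several resident blocks to switch between while memory requests are outstanding; the kernel, the 128-thread block size and the factor of 4 blocks per SM are illustrative choices.

#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                             // extra blocks simply do nothing
        data[i] = data[i] * 2.0f + 1.0f;
}

void launch_busy(float *d_data, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threadsPerBlock = 128;                       // modest register/shared-memory footprint
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    int minBlocks = prop.multiProcessorCount * 4;          // a few blocks per SM
    if (blocks < minBlocks)
        blocks = minBlocks;                                // oversubscribe small problems

    busy_kernel<<<blocks, threadsPerBlock>>>(d_data, n);
}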

Texture memory
• Texture memory
  − optimize for spatial locality to increase cache performance
• Features and advantages
  − Reads are cached, useful if accesses have locality
  − No coalescing or bank conflict problems
  − Address calculation latency is hidden
  − Packed data can be delivered to several registers in a single operation
  − 8-bit and 16-bit integer data may be converted to 32-bit floating point on the fly
[Figure: GPU memory hierarchy]
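A minimal sketch of the texture path using the legacy texture-reference API of this CUDA generation (kernel and variable names are assumptions): linear device memory is bound to a file-scope texture reference and read with tex1Dfetch, so neighbouring threads fetching neighbouring elements benefit from the texture cache without any coalescing constraints.

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texIn;    // texture reference, file scope

__global__ void blur3(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (tex1Dfetch(texIn, i - 1) +
                  tex1Dfetch(texIn, i) +
                  tex1Dfetch(texIn, i + 1)) / 3.0f;  // cached reads with spatial locality
}

void run_blur(const float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, texIn, d_in, n * sizeof(float));   // bind linear device memory
    blur3<<<(n + 127) / 128, 128>>>(d_out, n);
    cudaUnbindTexture(texIn);
}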

Texture memory raw facts
• cache working set is 8 KB per multiprocessor
• texture reference bound to a one-dimensional CUDA array: maximum width 2^13
• texture reference bound to a two-dimensional CUDA array: maximum width 2^16, maximum height 2^15
• texture reference bound to linear memory: maximum width 2^27

Optimization priorities
• Memory coalescing has highest priority
  − Largest effect
  − Optimize for locality
• Take advantage of shared memory
  − Very high bandwidth
  − Threads can cooperate to save work
• Use parallelism efficiently
  − Keep the GPU busy at all times
  − High arithmetic / bandwidth ratio
  − Many threads & thread blocks
• Leave bank conflicts for last
  − 4-way and smaller conflicts are usually not worth avoiding if avoiding them costs more instructions