
Cell Broadband Processor
Daniel Bagley, Meng Tan

Agenda
• General intro
• History of development
• Technical overview of architecture
• Detailed technical discussion of components
• Design choices
• Other processors like the Cell
• Programming for the Cell

History of Development
• Sony PlayStation 2
  - Announced March 1999
  - Released March 2000 in Japan
  - 128-bit “Emotion Engine”
  - 294 MHz MIPS CPU
  - Single-precision FP optimizations
  - 6.2 GFLOPS

History Continued
• Partnership between Sony, Toshiba, and IBM
• Summer 2000: high-level development talks
• Initial goal of 1000x the PS2's performance
• March 2001: Sony-IBM-Toshiba design center opened
• $400M investment

Overall Goals for Cell
• High performance in multimedia applications
• Real-time performance
• Low power consumption
• Low cost
• Available by 2005
• Avoid memory latency issues associated with control structures

The Cell Itself
• PowerPC-based main core (PPE)
• Multiple SPEs
• On-die memory controller
• Inter-core transport bus
• High-speed I/O

Cell Die Layout

Cell Implementation
• Cell is an architecture
• Preliminary PS3 implementation
  - 1 PPE
  - 8 SPEs on die, 1 disabled to improve yield (7 usable)
  - 221 mm² die on a 90 nm process
  - Clocked at 3-4 GHz
  - 256 GFLOPS single precision at 4 GHz

Why a Cell Architecture
• Follows a trend in computing architecture
• Natural extension of dual- and multi-core designs
• Extremely low hardware overhead
• Software controllable
• Specialized hardware is more useful for multimedia

Possible Uses
• PlayStation 3 (obviously)
• Blade servers (IBM)
  - Amazing single-precision FP performance
  - Scientific applications
• Toshiba HDTV products

Power Processing Element
• PowerPC instruction set with AltiVec
• Used for general-purpose computing and for controlling the SPEs
• Simultaneous multithreading
• Separate 32 KB L1 instruction and data caches and a unified 512 KB L2 cache

PPE (cont.)
• Slow but power-efficient implementation of the PowerPC instruction set
• Two-issue, in-order instruction fetch
• Conspicuous lack of an instruction window
• Compare to conventional PowerPC implementations (G5)
• Performance depends on SPE utilization

Synergistic Processing Element (SPE)
• Specialized hardware
• Meant to be used in parallel
  - 7 on the PS3 implementation
• On-chip memory (256 KB)
• No branch prediction
• In-order execution
• Dual issue

SPE Architecture
• 0.99 µm² on a 90 nm process
• 128 registers (128 bits wide)
  - Instructions assumed to be 4 x 32-bit
• Variant of the VMX instruction set
  - Modified for 128 registers
• On-chip memory is NOT a cache
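To make the register layout concrete, here is a minimal Python sketch of a 128-bit register treated as four 32-bit single-precision lanes, with one lane-wise add standing in for a single SIMD instruction. The function names and the big-endian packing are illustrative assumptions; this is not SPE code, only a model of the data layout described on the slide.

```python
import struct

REGISTER_BYTES = 16   # one 128-bit register
LANES = 4             # four 32-bit single-precision values per register
NUM_REGISTERS = 128   # register count from the slide

def pack_register(values):
    """Pack four floats into a 16-byte register image (big-endian assumed)."""
    if len(values) != LANES:
        raise ValueError("a register holds exactly four 32-bit lanes")
    return struct.pack(">4f", *values)

def vector_add(reg_a, reg_b):
    """Add two register images lane by lane, as one SIMD add would."""
    a = struct.unpack(">4f", reg_a)
    b = struct.unpack(">4f", reg_b)
    return struct.pack(">4f", *(x + y for x, y in zip(a, b)))

a = pack_register([1.0, 2.0, 3.0, 4.0])
b = pack_register([0.5, 0.5, 0.5, 0.5])
result = struct.unpack(">4f", vector_add(a, b))

# Total size of the register file: 128 registers x 16 bytes each.
register_file_bytes = NUM_REGISTERS * REGISTER_BYTES
```

With these inputs each lane is exactly representable in single precision, so the four sums come back unchanged, and the full register file works out to 2 KB.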

SPE Execution
• Dual issue, in-order
• Seven execution units
• Vector logic
• 8 single-precision operations per cycle
• Significant performance hit for double precision
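The 256 GFLOPS figure on the Cell Implementation slide can be checked against the 8 single-precision operations per cycle quoted here. The sketch below is plain arithmetic. The SPE count of 8 is an assumption, chosen because it reproduces the quoted figure; with the 7 usable SPEs the same formula gives a lower number. Any contribution from the PPE is ignored.

```python
def peak_gflops(spe_count, ops_per_cycle, clock_ghz):
    """Peak single-precision throughput: SPEs x ops per cycle x clock (GHz)."""
    return spe_count * ops_per_cycle * clock_ghz

OPS_PER_CYCLE = 8  # from the slide

all_spes_at_4ghz = peak_gflops(8, OPS_PER_CYCLE, 4.0)     # assumed: all 8 SPEs counted
usable_spes_at_4ghz = peak_gflops(7, OPS_PER_CYCLE, 4.0)  # 7 usable SPEs on the PS3
all_spes_at_3ghz = peak_gflops(8, OPS_PER_CYCLE, 3.0)     # low end of the 3-4 GHz range

print(all_spes_at_4ghz, usable_spes_at_4ghz, all_spes_at_3ghz)
```

These are peak numbers only; sustained throughput depends on keeping every SPE fed with data.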

SPE Execution Diagram

SPE Local Storage Area
• NOT a cache
• 256 KB, organized as 4 x 64 KB ECC single-port SRAM
• Completely private to each SPE
• Directly addressable by software
• Can be used as a cache, but only with software controls
• No tag bits or any extra hardware
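Because the local store has no tag bits, using it as a cache means the program keeps the tags itself. The following is a minimal sketch of a direct-mapped, read-only software cache. The line size, the number of lines, and the class name are hypothetical, and the slice from `main_memory` stands in for a DMA transfer from main memory.

```python
LINE_BYTES = 128   # hypothetical cache line size
NUM_LINES = 64     # 64 x 128 bytes = 8 KB of local store set aside as cache

class SoftwareCache:
    """Direct-mapped read cache whose tags live in ordinary software state."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.tags = [None] * NUM_LINES          # kept by software, not hardware
        self.lines = [bytes(LINE_BYTES)] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def read_byte(self, address):
        line_number = address // LINE_BYTES
        slot = line_number % NUM_LINES
        if self.tags[slot] != line_number:
            # Miss: fetch the whole line (stands in for a DMA get).
            start = line_number * LINE_BYTES
            self.lines[slot] = self.main_memory[start:start + LINE_BYTES]
            self.tags[slot] = line_number
            self.misses += 1
        else:
            self.hits += 1
        return self.lines[slot][address % LINE_BYTES]

memory = bytes(i % 256 for i in range(64 * 1024))
cache = SoftwareCache(memory)
values = [cache.read_byte(a) for a in (0, 1, 127, 128, 8192, 0)]
```

Addresses 0 and 8192 map to the same slot here, so the final read of address 0 misses again. Every tag check is extra instructions on the SPE, which is the cost of having no cache hardware.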

SPE LS Scheduling
• Software-controlled DMA
• DMA to and from main memory
• Scheduling is a HUGE problem
  - Done primarily in software
  - IBM predicts 80-90% usage ideally
• Request queue handles 16 simultaneous requests
  - Up to 16 KB per transfer
  - Priority: DMA, load/store, fetch
• Fetch / execute parallelism
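The queue limits and the fetch/execute overlap on this slide can be sketched in a few lines. The code below splits a transfer into requests of at most 16 KB, groups them into batches of 16 outstanding requests, and compares a serial schedule with a double-buffered one. The timing model and all function names are simplifying assumptions, and the time units are arbitrary.

```python
MAX_DMA_BYTES = 16 * 1024  # per-request limit from the slide
QUEUE_DEPTH = 16           # simultaneous requests from the slide

def plan_dma(total_bytes):
    """Split a transfer into (offset, size) requests of at most 16 KB each."""
    return [(offset, min(MAX_DMA_BYTES, total_bytes - offset))
            for offset in range(0, total_bytes, MAX_DMA_BYTES)]

def batches(requests):
    """Group requests so that no more than 16 are outstanding at once."""
    return [requests[i:i + QUEUE_DEPTH]
            for i in range(0, len(requests), QUEUE_DEPTH)]

def serial_time(n_chunks, dma_time, compute_time):
    """Fetch a chunk, compute on it, then fetch the next one."""
    return n_chunks * (dma_time + compute_time)

def double_buffered_time(n_chunks, dma_time, compute_time):
    """Fetch chunk i+1 into a second buffer while computing on chunk i."""
    if n_chunks == 0:
        return 0
    overlapped_steps = (n_chunks - 1) * max(dma_time, compute_time)
    return dma_time + overlapped_steps + compute_time

requests = plan_dma(300 * 1024)
grouped = batches(requests)
print(len(requests), [len(g) for g in grouped])
print(serial_time(10, 2, 3), double_buffered_time(10, 2, 3))
```

In this model the double-buffered schedule approaches the cost of the slower of the two activities per chunk, which is why the scheduling has to be planned carefully in software.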

SPE Control Logic
• Very little control logic compared with a conventional core
• Represents a shift in focus
• Complete lack of hardware branch prediction
  - Software branch prediction
  - Loop unrolling
  - 18-cycle branch penalty
• Software-controlled DMA
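A rough cost model shows why loop unrolling and software branch hints matter with an 18-cycle penalty. The sketch assumes, as a simplification, that an unhinted loop-back branch pays the full penalty every time it is taken, and that a correctly hinted loop pays it once, at loop exit. The function name and the cycle counts for the loop body are hypothetical.

```python
BRANCH_PENALTY = 18  # cycles, from the slide

def loop_cycles(iterations, body_cycles, unroll=1, hinted=False):
    """Estimate total cycles for a counted loop under the assumed model."""
    if iterations <= 0:
        return 0
    trips = -(-iterations // unroll)      # ceiling division: loop-back branches
    work = iterations * body_cycles       # useful work is unchanged by unrolling
    if hinted:
        mispredicted = 1                  # only the final fall-through
    else:
        mispredicted = trips - 1          # every taken loop-back branch
    return work + mispredicted * BRANCH_PENALTY

plain = loop_cycles(1000, 4)
unrolled = loop_cycles(1000, 4, unroll=4)
with_hint = loop_cycles(1000, 4, hinted=True)
print(plain, unrolled, with_hint)
```

Under this model a short loop body spends far more time on branch penalties than on work, and unrolling by four removes three quarters of that overhead.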

SPE Pipeline
• Little ILP, and thus little control logic
• Dual issue
• Simple commit unit (no reorder buffer or other complexities)
• Same execution unit for FP and integer

SPE Summary
• Essentially a small vector computer
• Based on the AltiVec/VMX ISA
  - Extensions for DMA and LS management
  - Extended for a 128 x 128-bit register file
• Uniquely suited for real-time applications
• Extremely fast for certain FP operations
• Offloads a large amount of work onto the compiler and software

Element Interconnect Bus
• 4 concentric rings connecting all Cell elements
• 128-bit-wide interconnects

EIB (cont.)
• Designed to minimize coupling noise
• Rings carry data in alternating directions
• Buffers and repeaters at each SPE boundary
• Architecture can be scaled up, at the cost of increased bus latency

EIB (cont.)
• Total bandwidth of ~200 GB/s
• EIB controller located physically in the center of the chip, between the SPEs
• Controller reserves channels for each individual data transfer request
• Implementation allows SPEs to be added horizontally
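Since the rings run in alternating directions, a transfer can take whichever direction is shorter. The sketch below models that choice on a single ring of numbered stops. The element count of 12 and the shortest-path policy are assumptions for illustration, not details taken from the slides.

```python
NUM_ELEMENTS = 12  # assumed number of stops on the ring

def route(src, dst, n=NUM_ELEMENTS):
    """Return (direction, hops) for the shorter way around the ring."""
    clockwise = (dst - src) % n
    counterclockwise = (src - dst) % n
    if clockwise <= counterclockwise:
        return "clockwise", clockwise
    return "counterclockwise", counterclockwise

def worst_case_hops(n=NUM_ELEMENTS):
    """Longest shortest-path distance between any two stops."""
    return max(route(0, dst, n)[1] for dst in range(n))

print(route(0, 3), route(0, 9), route(2, 8), worst_case_hops())
```

With rings in both directions no transfer has to travel more than halfway around. This also shows why adding elements increases bus latency: the worst-case hop count grows with the ring size.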

Memory Interface
• Rambus XDR memory to keep Cell at full utilization
• 3.2 Gbps data rate per pin on each device connected to the XDR interface
• Cell uses dual-channel XDR with four devices and 16-bit-wide buses to achieve 25.6 GB/s total memory bandwidth

Input / Output Bus
• Rambus FlexIO bus
• I/O interface consists of 12 unidirectional byte lanes
• Each lane supports 6.4 GB/s bandwidth
• 7 outbound lanes and 5 inbound lanes
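The bandwidth figures on the memory and I/O slides follow from simple multiplication. The sketch below works them out, assuming the 3.2 Gbps rate applies per data pin and that the four 16-bit devices transfer in parallel. The function name is illustrative.

```python
def xdr_bandwidth_gb_per_s(gbps_per_pin, bus_width_bits, devices):
    """Aggregate memory bandwidth in GB/s (8 bits per byte)."""
    return round(gbps_per_pin * bus_width_bits * devices / 8, 1)

LANE_GB_PER_S = 6.4  # per FlexIO byte lane, from the slide

memory_bw = xdr_bandwidth_gb_per_s(3.2, 16, 4)
outbound_bw = round(7 * LANE_GB_PER_S, 1)
inbound_bw = round(5 * LANE_GB_PER_S, 1)
total_io_bw = round(outbound_bw + inbound_bw, 1)

print(memory_bw, outbound_bw, inbound_bw, total_io_bw)
```

Multiplying out the memory figures gives 25.6 GB/s, and the twelve I/O lanes together carry about three times that.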

Design Choices
• In-order execution
  - Abandoning ILP
  - ILP: 10-20% increase per generation
  - Reducing control logic
  - Real-time responsiveness
• Cache design
  - Software configuration on the SPEs
  - Standard L2 cache on the PPE

Cell Programming Issues
• No existing Cell compiler manages SPE utilization at compile time
• SPEs do not natively support context switching; it must be managed by the OS
• SPEs are vector processors and are not efficient for general-purpose computation
• PPEs and SPEs use different instruction sets

Cell Programming (cont.)
• Functional Offload Model
• Simplest model for Cell programming
• Optimize existing libraries for SPE computation
• Requires no rebuild of the main application logic, which runs on the PPE
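The functional offload idea can be sketched with an ordinary thread pool standing in for the SPEs. The library function keeps its serial signature, so the application code does not change, while the work inside is split across workers. The worker count, the dot-product example, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 7  # stands in for the 7 usable SPEs

def _dot_chunk(pair):
    """Work done by one worker on its slice of the inputs."""
    xs, ys = pair
    return sum(x * y for x, y in zip(xs, ys))

def dot(xs, ys):
    """Library entry point with the same signature as a serial dot product."""
    size = max(1, -(-len(xs) // NUM_WORKERS))   # ceiling division
    chunks = [(xs[i:i + size], ys[i:i + size])
              for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return sum(pool.map(_dot_chunk, chunks))

def application():
    """Main application logic: unaware that dot() offloads its work."""
    return dot(list(range(100)), [2] * 100)

print(application())
```

Only the library changes under this model; `application()` would look the same whether `dot` ran serially or on the workers.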

Cell Programming (cont.)
• Device Extension Model
• Take advantage of SPE DMA
• Use SPEs as interfaces to external devices

Cell Programming (cont.)
• Computational Acceleration Model
• Traditional supercomputing methods applied to Cell
• Shared-memory or message-passing paradigm for accelerating inherently parallel math operations
• Can replace compute-intensive math libraries without rewriting applications

Cell Programming (cont.)
• Streaming Model
• Use the Cell processor as one large programmable pipeline
• Partition algorithms into logically sensible steps, then execute each step separately, in series, on separate processors
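A streaming pipeline of this kind can be sketched with one thread per stage and a queue between neighboring stages. Each thread stands in for a processor running one step of the algorithm. The stage functions and the `run_pipeline` helper are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage the stream has ended

def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass results downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)   # forward the end-of-stream marker
            return
        outbox.put(func(item))

def run_pipeline(funcs, items):
    """Run items through the stages in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage,
                                args=(func, queues[i], queues[i + 1]))
               for i, func in enumerate(funcs)]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results

steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results = run_pipeline(steps, range(5))
print(results)
```

Each stage handles one item at a time and the queues are first-in, first-out, so results arrive in input order while different items occupy different stages at once.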

Cell Programming (cont.)
• Asymmetric Thread Runtime Model
• Abstract the Cell architecture away from the programmer
• Let the OS schedule a different thread on each processor

Sample Performance
• Demonstration physics engine for a real-time game
• hitepapers/cell_online_game.pdf
• Compute-to-DMA ratio of 182 on the SPEs
• For the right tasks, the Cell architecture can be extremely efficient