1/21 Cell Processor (Cell Broadband Engine Architecture) Mark Budensiek.

Presentation transcript:

1/21 Cell Processor (Cell Broadband Engine Architecture) Mark Budensiek

2/21 Background
Joint collaboration of IBM, Sony, and Toshiba (STI)
- First implementation of the architecture in 2005
Goal: develop a new, next-generation processor
- Initially for the PlayStation 3
- Other multimedia applications (Blu-ray, HDTV)
- Server systems
- Supercomputers

3/21 Synergistic Processing Element

4/21 Power Processor Element (PPE)
The PPE is a 64-bit "Power Architecture" core
- Capable of running POWER or PowerPC binaries
- Acts as the controller for the 8 SPEs

5/21 Element Interconnect Bus
Connects the various on-chip elements
- PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
Data-ring structure with the control of a bus
- 4 unidirectional rings; 2 rings run in the opposite direction to the other 2
- As a result, the worst-case latency is only half the distance around the ring
Each ring is 16 bytes wide and runs at half the core clock frequency (core clock ~3.2 GHz)
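As a rough check on the ring figures on this slide, the arithmetic can be sketched in Python. The function names and the count of 12 ring stops are assumptions made for illustration, not values from the slides; the sketch covers only the raw per-ring data rate and hop counts, not arbitration or concurrent transfers.

```python
def ring_bandwidth_gb_s(width_bytes=16, core_clock_ghz=3.2):
    """Raw data rate of one ring: ring width times the bus clock.

    The bus clock is half the core clock, as stated on the slide.
    """
    bus_clock_ghz = core_clock_ghz / 2
    return width_bytes * bus_clock_ghz


def hops(src, dst, stops):
    """Shortest hop count between two ring stops when rings run both ways."""
    d = (dst - src) % stops
    return min(d, stops - d)


def worst_case_hops(stops):
    """Largest shortest-path hop count from stop 0 to any other stop."""
    return max(hops(0, dst, stops) for dst in range(stops))


if __name__ == "__main__":
    print(ring_bandwidth_gb_s())   # per-ring rate in GB/s
    print(worst_case_hops(12))     # 12 stops is an assumed value
```

With counter-rotating rings a transfer can always take the shorter way round, which is why the worst case is half the ring rather than the full ring.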

6/21 Synergistic Processing Elements
An SPE is composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).
- The SPU is a SIMD, RISC-based processor (3.2 GHz)
- The SPU's ISA is a cross between VMX and the PS2's Emotion Engine
Single Instruction Multiple Data (SIMD) organization
- Multiple processing elements perform the same operation on multiple data items simultaneously
Statically scheduled (the compiler plays a big role)
- No dynamic branch prediction hardware either (relies on compiler-generated hints)
Each SPE consists of:
- 128 x 128-bit register file
- Local Store (SRAM)
- DMA unit
- FP, LD/ST, Permute, and Branch units (each pipelined)

7/21 SPE Architecture Copyright: IBM

8/21 SPU Architecture Overview
128 general-purpose registers (GPRs), each 128 bits wide
Support for 16-bit (halfword) and 32-bit (word) signed integers and 8-bit unsigned integers
Support for single-precision (32-bit) and double-precision (64-bit) floating-point data
No condition register
Local storage
- SPU loads and stores transfer quadwords between GPRs and local storage
- Storage size can vary, but the address space is limited to 4 GB
Channel interface to external devices
- Data moves between the GPRs and the channel interface
- Up to 128 channels
Supports up to 128 special-purpose registers
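To make the register layout concrete, here is a minimal Python sketch that treats one 128-bit register as four 32-bit word lanes and adds two registers lane by lane. The helper names are invented for this example, and the model is only an illustration of the SIMD idea, not SPU code.

```python
import struct

MASK32 = 0xFFFFFFFF


def to_words(reg: bytes):
    """Split a 16-byte register image into four big-endian 32-bit words."""
    assert len(reg) == 16
    return list(struct.unpack(">4I", reg))


def from_words(words) -> bytes:
    """Pack four words back into a 16-byte register image (wraps to 32 bits)."""
    return struct.pack(">4I", *(w & MASK32 for w in words))


def add_word(ra: bytes, rb: bytes) -> bytes:
    """Add corresponding word lanes; carries do not cross lane boundaries."""
    return from_words(a + b for a, b in zip(to_words(ra), to_words(rb)))


if __name__ == "__main__":
    x = from_words([1, 2, 3, 0xFFFFFFFF])
    y = from_words([10, 20, 30, 1])
    print(to_words(add_word(x, y)))  # the last lane wraps around to 0
```

One instruction's worth of work here is four independent additions, which is the sense in which a single instruction operates on multiple data.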

9/21 Data Layout in Registers
The leftmost word (bytes 0, 1, 2, and 3) of a register is called the preferred slot.
When instructions use or produce scalar operands or addresses, the values are in the preferred slot.
A set of store-assist instructions is available to help store bytes, halfwords, words, and doublewords.
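The preferred-slot convention can be illustrated with a short Python sketch. Both function names are hypothetical, and zero-filling the other three word slots is a choice made for this example only; the slide does not say what those slots hold.

```python
def scalar_to_register(value: int) -> bytes:
    """Place a 32-bit scalar in bytes 0..3 (the preferred slot).

    The remaining 12 bytes are zero-filled here purely for illustration.
    """
    return (value & 0xFFFFFFFF).to_bytes(4, "big") + bytes(12)


def preferred_slot(reg: bytes) -> int:
    """Read the scalar held in the leftmost word of a 16-byte register."""
    assert len(reg) == 16
    return int.from_bytes(reg[0:4], "big")


if __name__ == "__main__":
    reg = scalar_to_register(0x12345678)
    print(hex(preferred_slot(reg)))
```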

10/21 SPE Local Store
Each SPE has local on-chip memory, a.k.a. the Local Store (LS)
- Holds both instructions and data
- Visible to the PPE and can be addressed directly
- Does not operate like a cache
Data and instructions are transferred between the LS and system memory, or another SPE's LS, by the DMA unit
- 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
- DMA transactions are coherent
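Since the DMA unit moves data 128 bytes at a time, a larger buffer has to be covered by a sequence of transfers. The Python sketch below shows one way to enumerate them. The function name is hypothetical, and the sketch deliberately ignores the alignment and size rules that real DMA commands impose.

```python
def dma_chunks(start: int, size: int, chunk: int = 128):
    """Yield (address, length) pairs that cover [start, start + size)."""
    end = start + size
    addr = start
    while addr < end:
        length = min(chunk, end - addr)
        yield addr, length
        addr += length


if __name__ == "__main__":
    for addr, length in dma_chunks(0, 320):
        print(hex(addr), length)
```

Because the local store is not a cache, the program (or its compiler and runtime) is responsible for issuing transfers like these explicitly.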

11/21 SPU ISA Instructions
Instructions are 32 bits in length
6 basic instruction formats
- RR instruction format
- RRR instruction format
- RI7 instruction format

12/21 SPU ISA Instructions (cont)
- RI10 instruction format
- RI16 instruction format
- RI18 instruction format
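The format diagrams did not survive in this transcript, so the following Python sketch shows how a fixed 32-bit RR-style word could be packed and unpacked. The field widths used here (an 11-bit opcode followed by three 7-bit register numbers, in the order RB, RA, RT) are my reading of the RR format and should be checked against the SPU ISA document; the opcode value in the usage lines is just a placeholder.

```python
def encode_rr(opcode: int, rb: int, ra: int, rt: int) -> int:
    """Pack an RR-style instruction: 11-bit opcode, then RB, RA, RT (7 bits each)."""
    assert 0 <= opcode < 2 ** 11
    for reg in (rb, ra, rt):
        assert 0 <= reg < 128  # 7 bits select one of 128 registers
    return (opcode << 21) | (rb << 14) | (ra << 7) | rt


def decode_rr(word: int):
    """Unpack a 32-bit RR-style word into (opcode, rb, ra, rt)."""
    return (
        (word >> 21) & 0x7FF,
        (word >> 14) & 0x7F,
        (word >> 7) & 0x7F,
        word & 0x7F,
    )


if __name__ == "__main__":
    word = encode_rr(0x0C0, rb=5, ra=4, rt=3)  # 0x0C0 is a placeholder opcode
    print(hex(word), decode_rr(word))
```

Seven-bit register fields follow directly from the 128-entry register file: 2^7 = 128.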

13/21 Types of Instructions
- Memory – Load/Store
- Constant-Formation
- Integer and Logical
- Shift and Rotate
- Compare, Branch, Halt
- Hint for Branch
- Floating Point
- Control
- Channel

14/21 Memory – Load/Store Instructions
The local storage address space is (up to) 2^32 bytes = 4 GB
Local storage is byte-addressed
Load/store instructions combine operands from one or two registers and/or an immediate value to form the effective address of the memory operand.
Only aligned 16-byte quadwords can be loaded and stored, so the rightmost 4 bits of an effective address are always ignored and treated as zero.
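The address rule on this slide can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the full 32-bit address space; an implementation with a smaller local store would presumably wrap addresses to its actual size.

```python
LS_ADDRESS_MASK = 0xFFFFFFFF  # assumes the full 2^32-byte address space


def quadword_address(ra_slot: int, rb_slot: int = 0, imm: int = 0) -> int:
    """Form an effective address and force it onto a 16-byte boundary."""
    ea = (ra_slot + rb_slot + imm) & LS_ADDRESS_MASK
    return ea & ~0xF  # the rightmost 4 bits are ignored


def load_quadword(local_store: bytes, ra_slot: int, rb_slot: int = 0) -> bytes:
    """Return the aligned 16-byte quadword that contains the effective address."""
    ea = quadword_address(ra_slot, rb_slot)
    return local_store[ea:ea + 16]


if __name__ == "__main__":
    ls = bytes(range(256)) * 16  # a toy 4 KB "local store"
    print(hex(quadword_address(0x1000, 0x2F)))
    print(load_quadword(ls, 0x20, 0x05).hex())
```

A consequence is that fetching a single byte or word means loading the whole enclosing quadword and then extracting the wanted part in registers.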

15/21 Memory – Load/Store Instructions Example: Load Quad-word (RR format)

16/21 Constant-Formation Instruction Immediate Load Word Loads immediate values to target register Example:

17/21 Integer and Logical Instructions
Full complement of arithmetic functions, e.g. Add, Subtract, Multiply, Generate Carry, Generate Borrow, Average, Sum, …
Logical functions: And, Or, Xor, Nand, Nor, Equivalent, …
Both register and immediate instruction formats
Examples: Add Word

18/21 Integer and Logical Instructions (cont) And

19/21 Shift and Rotate Instructions Shift Left halfword

20/21 Shift and Rotate Instructions Rotate halfword
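The worked examples for the shift and rotate slides are missing from this transcript. As a stand-in, here is a small Python sketch of what shifting and rotating a single 16-bit halfword means. The function names are invented, and the handling of counts (shifts of 16 or more give zero, rotates use the count modulo 16) is an assumption for this example rather than a statement of the exact ISA rules.

```python
MASK16 = 0xFFFF


def shift_left_halfword(value: int, count: int) -> int:
    """Shift a 16-bit value left; bits shifted out of the top are lost."""
    if count > 15:
        return 0  # assumed behaviour for large counts
    return (value << count) & MASK16


def rotate_halfword_left(value: int, count: int) -> int:
    """Rotate a 16-bit value left; bits leaving the top re-enter at the bottom."""
    value &= MASK16
    count %= 16
    return ((value << count) | (value >> (16 - count))) & MASK16


if __name__ == "__main__":
    print(hex(shift_left_halfword(0x8001, 1)))
    print(hex(rotate_halfword_left(0x8001, 1)))
```

In a SIMD register the same operation would be applied independently to each of the eight halfword lanes.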

21/21 Compare, Branch, and Halt Instructions
Conditional branch
- No condition code register
- Uses a GPR value, usually set by a compare instruction
- The register value is set to all 1's or all 0's based on the compare result
- Logical compare instructions treat the operands as unsigned integers
Halt instructions
- Stop execution when the tested condition is met
- The stop is not precise; as a result, execution cannot generally be restarted

22/21 Compare, Branch, and Halt Instructions (cont)
- Compare Equal Word
- Branch If Not Zero Word
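The compare-then-branch pattern, with no condition register involved, can be modelled in Python roughly as follows. Registers are represented as lists of four 32-bit words, the function names are descriptive inventions, and testing only the first (preferred-slot) word in the branch is my understanding of the convention rather than something stated on the slide.

```python
ALL_ONES = 0xFFFFFFFF


def compare_equal_word(ra, rb):
    """Per-lane compare: all 1's where the words match, all 0's elsewhere."""
    return [ALL_ONES if a == b else 0 for a, b in zip(ra, rb)]


def compare_logical_greater_word(ra, rb):
    """Per-lane unsigned compare: operands are treated as unsigned 32-bit values."""
    return [
        ALL_ONES if (a & ALL_ONES) > (b & ALL_ONES) else 0
        for a, b in zip(ra, rb)
    ]


def branch_if_not_zero_word(rt) -> bool:
    """Decide the branch from the preferred slot (word 0) of a register."""
    return rt[0] != 0


if __name__ == "__main__":
    mask = compare_equal_word([1, 2, 3, 4], [1, 0, 3, 0])
    print([hex(m) for m in mask], branch_if_not_zero_word(mask))
```

Because the result is an ordinary register value, the same mask can also feed select-style logic instead of a branch, which suits a design without branch prediction hardware.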

23/21 Compare, Branch, and Halt Instructions (cont) Halt If Greater Than

24/21 SPU ISA
Purpose: achieve high performance on critical workloads for game, media, and broadband systems.
Key SPU workloads:
- Graphics pipeline, which includes subdivision and rendering
- Stream processing, which includes encoding, decoding, encryption, and decryption
- Modeling, which includes game physics
Implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because they require half the power and half the chip area for equivalent performance.

25/21 SPU ISA and the 4 Principles
1. Simplicity favors regularity
- All instructions are the same length. All immediate instructions follow a similar format (fields in a common location). Register-type instructions can vary in format depending on the number of registers used.
- Register block is 128 x 128 bits
2. Smaller is faster
- Large number of GPRs and SPRs
- 32-bit instructions
3. Make the common case fast
- Single-precision floating-point calculations
4. Good design demands good compromises
- Large register size facilitates SIMD computations

26/21 Summary (of Cell)
Cell processor architecture is optimized for digital media and entertainment
Facilitates convergence between supercomputing and entertainment (driven by the desire for realism)
Enables new classes of applications

27/21 Programming the Cell is challenging
Issues:
- Dividing the program among the different cores
- Writing code for the 8 SPEs in a different instruction set than for the PowerPC core
- Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
- Each SPU must use coherent DMA transfers through its local store to access system memory

28/21 Compiling and Binding of a program on CELL Copyright: IBM

29/21 Questions?