1/21 Cell Processor (Cell Broadband Engine Architecture) Mark Budensiek.

2/21 Background
- Joint collaboration of IBM, Sony, and Toshiba (STI)
  - First implementation of the architecture in 2005
- Develop a new, next-generation processor
  - Initially for the PlayStation 3
  - Other multimedia applications (Blu-ray, HDTV)
  - Server systems
  - Supercomputers

3/21 Synergistic Processing Element

4/21 Power Processor Element (PPE)
- The PPE is a 64-bit "Power Architecture" core
  - Capable of running POWER or PowerPC binaries
  - Acts as the controller for the 8 SPEs

5/21 Element Interconnect Bus
- Connects the various on-chip elements
  - PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
- Data-ring structure with the control of a bus
  - 4 unidirectional rings; 2 rings run counter to the other 2
  - Worst-case latency is therefore only half the distance around the ring
- Each ring is 16 bytes wide and runs at half the core clock frequency (core clock ~3.2 GHz)
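The ring figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the count of 12 bus elements is an assumption (PPE, 8 SPEs, MIC, and two I/O interfaces; the slide only says "I/O interfaces"), and the function names are made up for this example.

```python
CORE_CLOCK_HZ = 3.2e9      # approximate core clock from the slide
RING_WIDTH_BYTES = 16      # each ring is 16 bytes wide
NUM_ELEMENTS = 12          # assumption: PPE + 8 SPEs + MIC + 2 I/O interfaces


def ring_bandwidth_bytes_per_s(core_clock_hz=CORE_CLOCK_HZ):
    """Peak rate of one transfer on one ring: 16 bytes per bus cycle at half the core clock."""
    return RING_WIDTH_BYTES * (core_clock_hz / 2)


def hops(src, dst, n=NUM_ELEMENTS):
    """Shortest hop count when rings are available in both directions."""
    d = (dst - src) % n
    return min(d, n - d)


# The farthest element is never more than half the ring away.
worst_case = max(hops(0, dst) for dst in range(NUM_ELEMENTS))
```

Under these assumptions one transfer on one ring peaks at 16 bytes x 1.6 GHz = 25.6 GB/s, and the longest path is 6 hops out of 12.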

6/21 Synergistic Processing Elements
- An SPE is composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
  - The SPU is a SIMD, RISC-based processor (3.2 GHz)
  - The SPU's ISA is a cross between VMX and the PS2's Emotion Engine
- Single Instruction Multiple Data (SIMD) organization
  - Multiple processing elements perform the same operation on multiple data items simultaneously
- Statically scheduled (the compiler plays a big role)
  - No dynamic branch prediction hardware either (relies on compiler-generated hints)
- Each SPE consists of:
  - 128 x 128-bit register file
  - Local Store (SRAM)
  - DMA unit
  - FP, LD/ST, Permute, and Branch units (each pipelined)
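To make the SIMD idea concrete, here is a minimal sketch in plain Python (not SPU code). It assumes four 32-bit lanes per 128-bit register, and `simd_add` and `vector_sum` are hypothetical names chosen for illustration.

```python
LANES = 4  # a 128-bit register holds four 32-bit values


def simd_add(ra, rb):
    """One SIMD 'instruction': the same add applied to all four lanes at once."""
    return [a + b for a, b in zip(ra, rb)]


def vector_sum(xs, ys):
    """Add two equal-length arrays four elements per step (length assumed to be a multiple of 4)."""
    out = []
    for i in range(0, len(xs), LANES):
        out.extend(simd_add(xs[i:i + LANES], ys[i:i + LANES]))
    return out


# 8 elements need 2 SIMD steps instead of 8 scalar additions.
steps = len(range(0, 8, LANES))
result = vector_sum([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80])
```

Real SPU code has to be laid out by the programmer or compiler so that data falls into such groups of four, which is part of why static scheduling matters so much on this architecture.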

7/21 SPE Architecture Copyright: IBM

8/21 SPU Architecture Overview
- 128 general-purpose registers (GPRs), each 128 bits wide
- Support for 16-bit (halfword) and 32-bit (word) signed integers and 8-bit unsigned integers
- Support for single-precision (32-bit) and double-precision (64-bit) floating-point data
- No condition register
- Local storage
  - SPU loads and stores transfer quadwords between GPRs and local storage
  - Storage size can vary, but the address space is limited to 4 GB
- Channel interface to external devices
  - Data moves between GPRs and the channel interface
  - Up to 128 channels
- Supports up to 128 special-purpose registers

9/21 Data Layout in Registers
- The leftmost word (bytes 0, 1, 2, and 3) of a register is called the preferred slot.
- When instructions use or produce scalar operands or addresses, the values are in the preferred slot.
- A set of store-assist instructions is available to help store bytes, halfwords, words, and doublewords.
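A small sketch of the preferred slot, modelling a register as 16 bytes with byte 0 on the left. The big-endian byte order matches the SPU; the helper names and the zero-filled remaining bytes are assumptions made purely for illustration.

```python
def preferred_slot(reg: bytes) -> int:
    """Return the 32-bit value held in bytes 0..3 of a 16-byte register (big-endian)."""
    assert len(reg) == 16
    return int.from_bytes(reg[0:4], "big")


def scalar_to_register(value: int) -> bytes:
    """Place a 32-bit scalar in the preferred slot; the other 12 bytes are zeroed here for illustration."""
    return (value & 0xFFFFFFFF).to_bytes(4, "big") + bytes(12)


reg = scalar_to_register(0x12345678)
scalar = preferred_slot(reg)
```

A scalar address or loop counter therefore occupies only a quarter of a register; the other three word slots are simply not used by scalar operations.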

10/21 SPE Local Store
- Each SPE has local on-chip memory, a.k.a. the Local Store (LS)
  - Holds both instructions and data
  - Visible to the PPE and can be addressed directly
  - Does not operate like a cache
- Data and instructions are transferred between the LS and system memory, or another SPE's LS, by the DMA unit
  - 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
  - DMA transactions are coherent

11/21 SPU ISA Instructions
- Instructions are 32 bits in length
- 6 basic instruction formats
  - RR instruction format
  - RRR instruction format
  - RI7 instruction format

12/21 SPU ISA Instructions (cont)
  - RI10 instruction format
  - RI16 instruction format
  - RI18 instruction format
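The format diagrams did not survive in this transcript. As a rough sketch, the six formats can be written as lists of field widths and packed or unpacked generically. The widths below are as recalled from the SPU ISA document and are worth checking against it; the opcode value in the example is a placeholder, not a real encoding.

```python
# (field name, width in bits), listed from the most significant bit down.
FORMATS = {
    "RR":   [("op", 11), ("rb", 7), ("ra", 7), ("rt", 7)],
    "RRR":  [("op", 4), ("rt", 7), ("rb", 7), ("ra", 7), ("rc", 7)],
    "RI7":  [("op", 11), ("i7", 7), ("ra", 7), ("rt", 7)],
    "RI10": [("op", 8), ("i10", 10), ("ra", 7), ("rt", 7)],
    "RI16": [("op", 9), ("i16", 16), ("rt", 7)],
    "RI18": [("op", 7), ("i18", 18), ("rt", 7)],
}


def encode(fmt, **fields):
    """Pack named fields into a 32-bit instruction word."""
    word = 0
    for name, width in FORMATS[fmt]:
        value = fields[name]
        assert 0 <= value < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | value
    return word


def decode(fmt, word):
    """Split a 32-bit instruction word back into its named fields."""
    out = {}
    shift = 32
    for name, width in FORMATS[fmt]:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


word = encode("RR", op=0x0C0, rb=5, ra=4, rt=3)  # 0x0C0 is a placeholder opcode
fields = decode("RR", word)
```

Every register field is 7 bits wide, which is exactly what is needed to name one of 128 registers, and every format adds up to 32 bits.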

13/21 Types of Instructions
- Memory – Load/Store
- Constant-Formation
- Integer and Logical
- Shift and Rotate
- Compare, Branch, Halt
- Hint for Branch
- Floating Point
- Control
- Channel

14/21 Memory – Load/Store Instructions
- The local storage address space is (up to) 2^32 bytes = 4 GB
- Local storage is byte-addressed
- Load/store instructions combine operands from one or two registers and/or an immediate value to form the effective address of the memory operand.
- Only aligned 16-byte quadwords can be loaded and stored. Therefore, the rightmost 4 bits of an effective address are always ignored and assumed to be zero.

15/21 Memory – Load/Store Instructions Example: Load Quad-word (RR format)
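The worked example from this slide is missing in the transcript. A minimal sketch of an indexed (RR-format) quadword load, assuming the address is the sum of the preferred slots of two registers, might look like this. The 256 KB local store size and the helper names are assumptions for illustration.

```python
def effective_address(ra_pref, rb_pref, ls_size):
    """Sum the two preferred-slot values, wrap to the local store size, and clear the low 4 bits."""
    assert ls_size & (ls_size - 1) == 0, "local store size assumed to be a power of two"
    return (ra_pref + rb_pref) & (ls_size - 1) & ~0xF


def load_quadword(local_store, ra_pref, rb_pref):
    """Return the aligned 16-byte quadword that contains the effective address."""
    ea = effective_address(ra_pref, rb_pref, len(local_store))
    return bytes(local_store[ea:ea + 16])


LS_SIZE = 256 * 1024                      # assumption: 256 KB local store
local_store = bytearray(LS_SIZE)
local_store[0x40:0x50] = bytes(range(16))

# 0x30 + 0x17 = 0x47, which is forced down to the quadword at 0x40.
quad = load_quadword(local_store, 0x30, 0x17)
```

Because the low 4 bits are dropped, loading a single byte or word means loading its whole quadword and then extracting the piece that is needed.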

16/21 Constant-Formation Instruction Immediate Load Word Loads immediate values to target register Example:
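The example itself is not in the transcript. As a sketch of the behaviour, assuming the 16-bit immediate is sign-extended to 32 bits and copied into all four word slots of the target register (as recalled from the SPU ISA; the function names are illustrative):

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def immediate_load_word(i16):
    """Sign-extend the 16-bit immediate to 32 bits and copy it into all four word slots."""
    word = sign_extend16(i16) & 0xFFFFFFFF
    return [word] * 4


rt_small = immediate_load_word(5)        # small positive constant
rt_minus1 = immediate_load_word(0xFFFF)  # immediate -1 fills every slot with ones
```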

17/21 Integer and Logical Instructions
- Full complement of arithmetic functions, e.g. Add, Subtract, Multiply, Generate Carry, Generate Borrow, Average, Sum, …
- Logical functions: And, Or, XOR, Nand, Nor, Equivalent, …
- Both register and immediate instruction formats
- Examples: Add Word

18/21 Integer and Logical Instructions (cont) And
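The two examples named on these slides, Add Word and And, can be sketched on a register modelled as four 32-bit word slots. This is a plain-Python illustration with made-up function names, not SPU code.

```python
MASK32 = 0xFFFFFFFF


def add_word(ra, rb):
    """Add each of the four word slots independently; carries do not cross slot boundaries."""
    return [(a + b) & MASK32 for a, b in zip(ra, rb)]


def and_register(ra, rb):
    """Bitwise AND of two registers, slot by slot."""
    return [a & b for a, b in zip(ra, rb)]


ra = [0xFFFFFFFF, 1, 2, 3]
rb = [1, 1, 1, 1]
total = add_word(ra, rb)       # the first slot wraps around to 0
masked = and_register(ra, rb)
```

The wrap-around in the first slot is why the ISA also offers Generate Carry: multi-word arithmetic has to propagate carries explicitly.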

19/21 Shift and Rotate Instructions Shift Left halfword

20/21 Shift and Rotate Instructions Rotate halfword
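A sketch of the two halfword operations, with a register modelled as eight 16-bit slots. It assumes, as recalled from the SPU ISA, that the shift count comes from the low 5 bits of the matching halfword in the second register (counts above 15 give zero) and that the rotate count comes from the low 4 bits.

```python
MASK16 = 0xFFFF


def shift_left_halfword(ra, rb):
    """Shift each 16-bit slot of ra left by the count in the matching slot of rb."""
    out = []
    for a, b in zip(ra, rb):
        count = b & 0x1F                      # 5-bit count, 0..31
        out.append((a << count) & MASK16 if count < 16 else 0)
    return out


def rotate_halfword(ra, rb):
    """Rotate each 16-bit slot of ra left by the count in the matching slot of rb."""
    out = []
    for a, b in zip(ra, rb):
        count = b & 0xF                       # 4-bit count, 0..15
        out.append(((a << count) | (a >> (16 - count))) & MASK16)
    return out


shifted = shift_left_halfword([0x0001] * 8, [0, 1, 4, 15, 16, 31, 32, 33])
rotated = rotate_halfword([0x8001] * 8, [0, 1, 4, 8, 15, 16, 17, 1])
```

Each slot takes its own count, so a single instruction can apply eight different shift amounts.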

21/21 Compare, Branch, and Halt Instructions
- Conditional branch
  - No condition code register
  - Uses a GPR value, usually set by a compare instruction
  - The register value is set to all 1’s or all 0’s based on the compare result
  - Logical compare instructions treat the operands as unsigned integers
- Halt instructions
  - Stop execution when the tested condition is met
  - The stop is not precise; as a result, execution generally cannot be restarted

22/21 Compare, Branch, and Halt Instructions (cont) Compare Equal Word Branch if not Zero Word
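Without a condition register, a compare writes its result into an ordinary register and the branch tests that register. The sketch below models this with four word slots. The branch is simplified to take an absolute target, whereas the real instruction encodes an offset; all names are illustrative.

```python
ALL_ONES = 0xFFFFFFFF


def compare_equal_word(ra, rb):
    """Each result slot becomes all ones if the operand slots are equal, otherwise all zeros."""
    return [ALL_ONES if a == b else 0 for a, b in zip(ra, rb)]


def branch_if_not_zero_word(rt, pc, target):
    """Return the next program counter: target if the preferred-slot word is non-zero, else pc + 4."""
    return target if rt[0] != 0 else pc + 4


mask = compare_equal_word([1, 2, 3, 4], [1, 0, 3, 0])
taken = branch_if_not_zero_word(mask, pc=0x100, target=0x200)
not_taken = branch_if_not_zero_word([0, ALL_ONES, ALL_ONES, ALL_ONES], pc=0x100, target=0x200)
```

Only the preferred slot decides the branch; the other three compare results are still useful as masks for select-style operations.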

23/21 Compare, Branch, and Halt Instructions (cont) Halt If Greater Than
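Halt If Greater Than can be pictured as a built-in assertion. This sketch assumes a signed comparison of the two preferred-slot words; the exception class merely stands in for the processor stopping, and the names are hypothetical.

```python
class SpuHalt(Exception):
    """Stands in for the SPU stopping execution."""


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def halt_if_greater_than(ra, rb):
    """Halt if the preferred slot of ra is greater (signed) than the preferred slot of rb."""
    if to_signed32(ra[0]) > to_signed32(rb[0]):
        raise SpuHalt("halt condition met")


halt_if_greater_than([1, 0, 0, 0], [2, 0, 0, 0])            # 1 > 2 is false: continues
halt_if_greater_than([0xFFFFFFFF, 0, 0, 0], [0, 0, 0, 0])   # -1 > 0 is false: continues
```

A typical use would be a cheap bounds or sanity check; since the stop is not precise, it suits fatal errors rather than recoverable ones.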

24/21 SPU ISA
- Purpose: achieve high performance on critical workloads for game, media, and broadband systems.
- Key SPU workloads:
  - Graphics pipeline, which includes subdivision and rendering
  - Stream processing, which includes encoding, decoding, encryption, and decryption
  - Modeling, which includes game physics
- Implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because they require half the power and half the chip area for equivalent performance.

25/21 SPU ISA and the 4 Principles
1. Simplicity favors regularity
  - All instructions are the same length. All immediate instructions follow a similar format (fields in a common location). Register-type instructions can vary in format depending on the number of registers used.
  - Register block is 128 x 128 bits
2. Smaller is faster
  - Large number of GPRs and SPRs
  - 32-bit instructions
3. Make the common case fast
  - Single-precision floating-point calculations
4. Good design demands good compromises
  - Large register size facilitates SIMD computations

26/21 Summary (of Cell)
- The Cell processor architecture is optimized for digital media and entertainment
- Facilitates convergence between supercomputing and entertainment, driven by the desire for realism
- Enables new classes of applications

27/21 Programming the Cell is challenging
- Issues
  - Dividing the program among the different cores
  - Creating instructions in a different language for the 8 SPEs than for the PowerPC core
  - Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
  - Code running from the SPU local store needs to perform coherent DMA to access system memory

28/21 Compiling and Binding of a program on CELL Copyright: IBM

29/21 Questions?
