Kevin Eady, Ben Plunkett, Prateeksha Satyamoorthy
History
– Jointly designed by Sony, Toshiba, and IBM (STI)
– Design began March 2001
– First used in Sony's PlayStation 3
– IBM's Roadrunner cluster contains over 12,000 Cell processors
Cell Broadband Engine
– Nine cores
– One Power Processing Element (PPE): the main processor
– Eight Synergistic Processing Elements (SPEs): fully functional co-processors, each comprising a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
– Designed for stream processing
Power Processing Element
– In-order, dual-issue design
– 64-bit Power Architecture
– Two 32 KB L1 caches (instruction, data), one 512 KB L2 cache
– Instruction Unit handles instruction fetch, decode, branch, issue, and completion: it fetches 4 instructions per cycle per thread into a buffer, then dispatches instructions from the buffer, dual-issuing them to the Execution Unit
– Branch prediction: 4-KB x 2-bit branch history table (see the sketch below)
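The table of 2-bit saturating counters behind that predictor can be illustrated with a short sketch. This is a generic 2-bit branch history table in C, not the PPE's actual design; the entry count, PC indexing, and function names are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Generic 2-bit saturating-counter branch history table (illustrative only;
 * table size and indexing do not reflect the PPE's exact implementation). */
#define BHT_ENTRIES 4096
static uint8_t bht[BHT_ENTRIES];              /* each entry is a 2-bit counter, 0..3 */

static bool predict_taken(uint32_t pc)
{
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2; /* 2 or 3 means predict taken */
}

static void train(uint32_t pc, bool taken)
{
    uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;             /* saturate at strongly taken */
    if (!taken && *c > 0) (*c)--;             /* saturate at strongly not-taken */
}
```

The deeper the pipeline (23 stages, next slide), the more a mispredicted branch costs, which is why even a small table like this pays off.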
Pipeline depth: 23 stages
Synergistic Processing Element
– Implements a new instruction-set architecture
– Each SPU contains a dedicated DMA management queue
– 256 KB local store memory
  – Stores instructions and data
  – Data is transferred via DMA between the local store and system memory (see the sketch below)
– No data-load or branch prediction hardware
  – Relies on "prepare-to-branch" instructions to pre-fetch instructions
  – Loads at least 17 instructions at the branch target address
– Two instructions per cycle
  – 128-bit SIMD
  – In-order, dual-issue, statically scheduled
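As a rough illustration of the DMA-based local-store model, here is a minimal SPU-side fetch, assuming the Cell SDK's spu_mfcio.h interface (mfc_get plus tag-group waits); the buffer size, alignment, and tag choice are illustrative.

```c
#include <spu_mfcio.h>   /* Cell SDK SPU-side MFC intrinsics (assumed available) */

#define CHUNK 16384                            /* 16 KB chunk, must fit in the 256 KB local store */
static char buf[CHUNK] __attribute__((aligned(128)));   /* DMA buffers are 128-byte aligned */

void fetch_chunk(unsigned long long ea)        /* ea: effective address in system memory */
{
    unsigned int tag = 0;                      /* tag group 0..31 */
    mfc_get(buf, ea, CHUNK, tag, 0, 0);        /* enqueue DMA: system memory -> local store */
    mfc_write_tag_mask(1 << tag);              /* select which tag group to wait on */
    mfc_read_tag_status_all();                 /* block until the transfer completes */
    /* buf now holds the data; compute on it with 128-bit SIMD code */
}
```

The transfer is queued on the SPE's own MFC, so the SPU could keep computing instead of blocking here; the multi-threading slide below makes that overlap explicit.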
On-chip Interconnect: Element Interconnect Bus (EIB)
– Provides internal connection for 12 'units':
  – PPE
  – 8 SPEs
  – Memory Interface Controller (MIC)
  – 2 off-chip I/O interfaces
– Each 'unit' has one 16B read port and one 16B write port
– Circular ring: four 16-byte-wide unidirectional channels which counter-rotate in pairs
– Includes an arbitration unit which functions as a set of traffic lights
– Runs at half the system clock rate
– Peak instantaneous EIB bandwidth is 96B per clock: 12 concurrent transactions x 16 bytes wide / 2 system clocks per transfer
– An EIB channel is not permitted to convey data requiring more than six steps, i.e. half way around the ring (see the sketch below)
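The six-step limit follows from the ring topology: with 12 units and channel pairs running in both directions, the shorter direction around the ring is never more than half of 12 hops. A small sketch of that distance calculation (the unit numbering is hypothetical):

```c
#define EIB_UNITS 12

/* Shortest hop count between two ring positions when channels run in both
 * directions; for 12 units this never exceeds 6. */
static int eib_hops(int src, int dst)
{
    int cw  = (dst - src + EIB_UNITS) % EIB_UNITS;   /* clockwise distance */
    int ccw = EIB_UNITS - cw;                        /* counter-clockwise distance */
    return cw < ccw ? cw : ccw;
}
```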
– Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle
– The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system
– Theoretical peak data bandwidth on the EIB at 3.2 GHz: 128B x 1.6 GHz = 204.8 GB/s
– Actual peak data bandwidth achieved: 197 GB/s
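The bandwidth figures on the last two slides come straight from this arithmetic; a quick sketch that reproduces them:

```c
#include <stdio.h>

int main(void)
{
    /* Peak instantaneous bandwidth: 12 concurrent transfers x 16 B per channel,
     * each transfer occupying 2 system clocks => 96 B per system clock. */
    double bytes_per_clock = 12.0 * 16.0 / 2.0;   /* 96 B/clock */

    /* Snoop-limited peak: one 128 B cache-line transfer per bus cycle,
     * with the EIB clocked at half of 3.2 GHz. */
    double bus_ghz  = 3.2 / 2.0;                  /* 1.6 GHz */
    double peak_gbs = 128.0 * bus_ghz;            /* 204.8 GB/s theoretical peak */

    printf("%.0f B/clock, %.1f GB/s\n", bytes_per_clock, peak_gbs);
    return 0;
}
```

The measured 197 GB/s sits just under the snoop-limited 204.8 GB/s, so address snooping, not the ring itself, is the practical ceiling.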
David Krolak explains: “Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”
Multi-threading Organization
– The PPE is an in-order, 2-way Simultaneous Multi-Threading (SMT) core
– Each SPU is a vector accelerator targeted at the execution of SIMD code
– All architectural state is duplicated to perform interleaved instruction issuing
– Asynchronous DMA transfers: setting up a DMA takes the SPE a few cycles, whereas a cache miss on a conventional system can stall the CPU for up to thousands of cycles
– SPEs can perform other calculations while waiting for data (see the double-buffering sketch below)
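A common way to realize that overlap is double buffering: while the SPU computes on one local-store buffer, the MFC fills the other. A minimal sketch, again assuming the spu_mfcio.h interface; process() is a hypothetical compute kernel and the chunk size is arbitrary.

```c
#include <spu_mfcio.h>

#define CHUNK 16384
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *chunk, int n);   /* hypothetical compute kernel */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);          /* start fetching chunk 0 (tag 0) */
    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                          /* kick off the next transfer early */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK, CHUNK, next, 0, 0);
        mfc_write_tag_mask(1 << cur);                 /* wait only for the current chunk */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);                     /* compute overlaps the in-flight DMA */
        cur = next;
    }
}
```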
Scheduling Policy
– Two classes of threads are defined (see the launch sketch below):
  – PPU threads: run on the PPU
  – SPU tasks: run on the SPUs
– PPU threads are managed by the Completely Fair Scheduler (CFS)
– The SPU scheduler supports time-sharing in multi-programmed workloads and allows preemption of SPU tasks
– Cell-based systems allow only one active application to run at a time to avoid performance degradation
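To make the PPU-thread / SPU-task split concrete, here is a rough PPU-side launch of a single SPU task, assuming the libspe2 interface (spe_context_create, spe_program_load, spe_context_run); the embedded program handle name and the pthread wrapper are assumptions.

```c
#include <libspe2.h>
#include <pthread.h>
#include <stdio.h>

extern spe_program_handle_t spu_kernel;       /* hypothetical embedded SPU ELF image */

/* One PPU thread per SPU task: spe_context_run() blocks while the SPU runs,
 * so the PPU thread simply waits for the task to finish. */
static void *run_spu_task(void *arg)
{
    (void)arg;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL || spe_program_load(ctx, &spu_kernel) != 0) {
        fprintf(stderr, "failed to set up SPU context\n");
        return NULL;
    }
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, run_spu_task, NULL);   /* PPU thread, managed by CFS */
    pthread_join(t, NULL);                          /* SPU task, managed by the SPU scheduler */
    return 0;
}
```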
Completely Fair Scheduler
– Runnable tasks are ranked by their accumulated virtual runtime
– Consider an example with two users, A and B, who are running jobs on a machine: user A has just two jobs running, while user B has 48 jobs running
– Group scheduling enables CFS to be fair to users A and B, rather than being fair to all 50 jobs running in the system
– Both users get a 50-50 share; B would use his 50% share to run his 48 jobs and would not be able to encroach on A's 50% share (see the arithmetic below)
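The per-job shares in that example work out as follows; a tiny sketch of the arithmetic using the numbers from the slide:

```c
#include <stdio.h>

int main(void)
{
    /* Group scheduling: the CPU is first divided fairly between users,
     * then each user's share is divided among that user's jobs. */
    double user_share = 0.50;            /* users A and B each get 50% */
    int jobs_a = 2, jobs_b = 48;

    printf("each of A's jobs: %5.2f%%\n", 100.0 * user_share / jobs_a);  /* 25.00% */
    printf("each of B's jobs: %5.2f%%\n", 100.0 * user_share / jobs_b);  /*  1.04% */
    return 0;
}
```

Without group scheduling, all 50 jobs would be treated alike: each would get 2%, letting B's 48 jobs consume 96% of the machine.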