1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)

2/21 Background
Joint collaboration of IBM, Sony, and Toshiba to develop a next-generation processor
  - Initially for the PlayStation 3
  - Other multimedia applications (Blu-ray, HDTV)
  - Server systems

3/21 Objective
Outstanding performance
  - Overcome the memory wall
  - Improve power efficiency
  - Sustain high frequency without increasing pipeline depth
Real-time response to the user
  - Visual, sound, and other sensory feedback
  - Internet connectivity (able to handle a variety of workloads)
Applicable to a wide range of platforms
  - Next-generation consumer-electronic systems and beyond

4/21 Synergistic Processing Element

5/21 Power Processor Element (PPE)
The PPE is a 64-bit "Power Architecture" core
  - Capable of running POWER or PowerPC binaries
  - Extended Vector Scalar Unit (VSU)
The PPE is
  - In-order
  - Dual-threaded
  - Dual-issue

6/21 PPE components Copyright: IBM

7/21 Synergistic Processing Elements
An SPE is a self-contained vector (SIMD) processor that acts as a co-processor
  - The SPE's ISA is a cross between VMX and the PS2's Emotion Engine
In-order (again, to minimize circuitry and save power)
Statically scheduled (the compiler plays a big role)
  - No dynamic branch prediction hardware either (relies on compiler-generated hints)
Each SPE consists of:
  - 128 x 128-bit register file
  - Local Store (SRAM)
  - DMA unit
  - Floating-point, load/store, permute, and branch units (each pipelined)
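
To make the SIMD point concrete, here is a minimal Python sketch of how one wide register is treated as several lanes. It assumes each register is 128 bits wide (one reading of the "128 x 128" figure on the slide); the names `lanes` and `simd_add` are hypothetical and are not SPE instructions.

```python
import struct

REGISTER_BITS = 128  # assumed register width, read from the slide's "128 x 128"


def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits


def simd_add(a, b):
    """Add two registers lane by lane, the way a single SIMD add would."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


# Four single-precision floats occupy exactly one 128-bit register.
packed = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
packed_bits = len(packed) * 8

single_lanes = lanes(32)   # single precision
double_lanes = lanes(64)   # double precision
total = simd_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

Under this assumption a register holds four single-precision or two double-precision values, which is why code must be organized around whole vectors to keep the SPE busy.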

8/21 SPE Architecture Copyright: IBM

9/21 SPE Local Store
Each SPE has local on-chip memory, a.k.a. the Local Store (LS)
  - Serves as a secondary register file (not as a cache)
  - Avoids the coherence logic that caches need, as well as cache-miss penalties
  - Is mapped into the processor's memory map, which allows LS-to-LS transactions
128-bit instruction fetch, load, and store operations
  - 7 out of every 8 cycles
Data and instructions are transferred between the LS and system memory, or another SPE's LS, using the DMA unit
  - 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
  - DMA transactions are coherent
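
The 128-byte transfer granularity can be illustrated with a short Python sketch. This is a simulation only: `dma_chunks` and `transfer_to_local_store` are hypothetical helpers, system memory is modeled as a `bytes` object, and no real MFC interface is used.

```python
DMA_CHUNK_BYTES = 128  # transfer granularity stated on the slide


def dma_chunks(start_addr, size):
    """Split a transfer into (address, length) pieces of at most 128 bytes."""
    chunks = []
    offset = 0
    while offset < size:
        length = min(DMA_CHUNK_BYTES, size - offset)
        chunks.append((start_addr + offset, length))
        offset += length
    return chunks


def transfer_to_local_store(system_memory, start_addr, size):
    """Copy a region of simulated system memory into a simulated local store."""
    local_store = bytearray()
    for addr, length in dma_chunks(start_addr, size):
        local_store += system_memory[addr:addr + length]
    return bytes(local_store)


memory = bytes(range(256)) * 2           # 512 bytes of simulated system memory
plan = dma_chunks(0, 300)                # a 300-byte transfer needs three pieces
copied = transfer_to_local_store(memory, 0, 300)
```

A 300-byte request becomes two full 128-byte pieces plus one 44-byte remainder, so buffers sized in multiples of 128 bytes avoid the partial final transfer.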

10/21 SPE DMA Unit
Contains the Memory Flow Controller (MFC)
The interface uses the Power Architecture page protection model
  - The MFC has its own Memory Management Unit (MMU), which is a subset of the Power core's MMU
  - This gives all processors a consistent interface to the system storage map despite the chip's heterogeneous structure

11/21 Floating Point Performance
Both the PPE and the SPEs have vector instruction capability
In particular, each SPU can complete
  - 2 double-precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz, or
  - 8 single-precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz
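
The GFLOPS figures follow directly from operations per cycle multiplied by clock rate. A small Python check of that arithmetic, using only the numbers on the slide (the function name `peak_gflops` is hypothetical):

```python
def peak_gflops(ops_per_cycle, clock_ghz):
    """Peak rate in GFLOPS: operations per cycle times cycles per nanosecond."""
    return ops_per_cycle * clock_ghz


CLOCK_GHZ = 3.2

double_precision = peak_gflops(2, CLOCK_GHZ)  # 2 ops/cycle at 3.2 GHz
single_precision = peak_gflops(8, CLOCK_GHZ)  # 8 ops/cycle at 3.2 GHz

print(f"double precision: {double_precision:.1f} GFLOPS per SPU")
print(f"single precision: {single_precision:.1f} GFLOPS per SPU")
```

These are peak per-SPU rates; they assume every cycle retires the maximum number of operations, which real code rarely sustains.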

12/21 Element Interconnect Bus
Connects the various on-chip elements
  - PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
Data-ring structure with the control of a bus
  - 4 unidirectional rings, with 2 rings running counter to the other 2
  - Worst-case latency is only half the distance around the ring
Each ring is 16 bytes wide and runs at half the core clock frequency (core clock frequency ~3.2 GHz)
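
The "half the distance" claim and the per-ring data rate can both be checked with a short Python sketch. Two assumptions are not on the slide: the ring is modeled with 12 stops (PPE, 8 SPEs, MIC, and two I/O interfaces), and each ring is assumed to move one 16-byte transfer per bus cycle. All function names are hypothetical.

```python
def ring_hops(src, dst, stops):
    """Fewest hops between two stops when data may travel in either direction."""
    clockwise = (dst - src) % stops
    counter_clockwise = (src - dst) % stops
    return min(clockwise, counter_clockwise)


def worst_case_hops(stops):
    """Largest shortest-path distance from stop 0 to any other stop."""
    return max(ring_hops(0, dst, stops) for dst in range(stops))


def ring_bandwidth_gb_per_s(width_bytes, core_clock_ghz):
    """Peak rate of one ring, assuming one transfer per bus cycle at half the core clock."""
    bus_clock_ghz = core_clock_ghz / 2
    return width_bytes * bus_clock_ghz


STOPS = 12  # assumed number of ring stops, not stated on the slide

longest = worst_case_hops(STOPS)
per_ring = ring_bandwidth_gb_per_s(16, 3.2)
```

Because rings run in both directions, the longest route is 6 of the 12 assumed stops rather than 11, and under the one-transfer-per-cycle assumption a single ring peaks at 16 bytes x 1.6 GHz = 25.6 GB/s.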

13/21 Memory and I/O
Cell needs a tremendous amount of memory and I/O bandwidth
Memory technology: Rambus XDR DRAM
  - Supports a total bandwidth of 25.6 GB/s
I/O: Rambus FlexIO

14/21 Programming the Cell is challenging
Issues
  - Dividing the program among the different cores
  - Writing code for a different instruction set on the 8 SPEs than on the PowerPC core
  - Thinking in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
  - An SPU works out of its local store and must perform coherent DMA transfers to access system memory

15/21 IBM Approach
Manually partition the application into separate code segments and compile each with the compiler that targets the appropriate ISA
For the SPUs, SIMD code can be generated by a parallelizing compiler with auto-SIMDization
Allocate SPE program data in system memory (shared-memory view) and have the SPE compiler automatically manage the movement of data
  - Naive: the compiler inserts an explicit DMA transfer for each access to shared memory
  - Optimized: employ a software cache mechanism that permits reuse of the temporary buffers in the LS
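
The difference between the naive and the optimized approach can be sketched as a small Python simulation. This is not IBM's compiler runtime: the class names, the 128-byte line size, the four-buffer capacity, and the oldest-first eviction policy are all assumptions chosen for illustration.

```python
LINE_BYTES = 128  # assumed buffer size, matching the 128-byte DMA granularity


class NaiveAccessor:
    """Issues one simulated DMA transfer for every access to shared memory."""

    def __init__(self, system_memory):
        self.memory = system_memory
        self.dma_transfers = 0

    def read(self, addr):
        self.dma_transfers += 1
        return self.memory[addr]


class SoftwareCache:
    """Keeps recently fetched lines in local-store buffers and reuses them."""

    def __init__(self, system_memory, num_lines=4):
        self.memory = system_memory
        self.num_lines = num_lines
        self.lines = {}  # line base address -> bytes buffered in the local store
        self.dma_transfers = 0

    def read(self, addr):
        base = addr - addr % LINE_BYTES
        if base not in self.lines:
            if len(self.lines) >= self.num_lines:
                # Evict the oldest buffer (dicts preserve insertion order).
                self.lines.pop(next(iter(self.lines)))
            self.lines[base] = self.memory[base:base + LINE_BYTES]
            self.dma_transfers += 1
        return self.lines[base][addr - base]


memory = bytes(range(256)) * 2
naive = NaiveAccessor(memory)
cached = SoftwareCache(memory)
naive_values = [naive.read(a) for a in range(256)]
cached_values = [cached.read(a) for a in range(256)]
```

Reading 256 consecutive bytes costs 256 simulated transfers in the naive version and 2 with the software cache, since each fetched line serves 128 accesses.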

16/21 IBM Approach (contd.)
Using the SPE linker and an embedding tool
  - Generate a PPE object that contains the SPE binary embedded within its data section
The PPE object is then linked, using a PPE linker
  - With the runtime libraries required for thread creation and management, to create a bound executable for the Cell BE program

17/21 Compiling and Binding of a program on CELL Copyright: IBM

18/21 Programming Models
Stream processing
  - Serial or parallel pipelines can be set up
  - Example: a set-top box pipeline consists of reading, video and audio encoding, and display
Serial: chain SPEs so that each SPE does one subtask
Parallel: partition the same subtask among SPEs
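
The two arrangements can be sketched in Python, with plain functions standing in for SPE subtasks and a thread pool standing in for the SPEs. Everything here is hypothetical scaffolding; it shows only how work is divided, not how SPE threads are actually created.

```python
from concurrent.futures import ThreadPoolExecutor


def run_serial_pipeline(stages, items):
    """Serial model: each stage stands in for one SPE doing one subtask.

    Items pass through the stages in order. A real pipeline would overlap
    stages on different items; this sketch shows only the chaining.
    """
    results = []
    for item in items:
        for stage in stages:
            item = stage(item)
        results.append(item)
    return results


def run_parallel_partition(task, items, workers):
    """Parallel model: the same subtask is partitioned among several workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, items))


def add_one(x):
    return x + 1


def double(x):
    return x * 2


def square(x):
    return x * x


serial_out = run_serial_pipeline([add_one, double], [1, 2, 3])
parallel_out = run_parallel_partition(square, range(8), workers=4)
```

The serial form suits workloads whose subtasks differ, while the parallel form suits one heavy subtask whose input can be split evenly.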

19/21 Programming Model
Function offload model
  - The application executes on the PPE
  - Complex library functions invoked by the main application are offloaded onto one or more SPEs
  - The library functions are optimized and recompiled for the SPE environment
  - The SPE executable is linked into the PPE object module together with a small remote function invocation stub
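
A rough Python analogy of the function offload model: the main program calls an ordinary-looking function, which is only a stub that forwards the work to a worker and waits for the result. The thread pool stands in for the SPEs, and `spe_dot_product` and `dot_product` are hypothetical names, not part of any Cell library.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the pool of SPEs available to the application.
_spe_pool = ThreadPoolExecutor(max_workers=8)


def spe_dot_product(a, b):
    """Stand-in for the library function recompiled for the SPE environment."""
    return sum(x * y for x, y in zip(a, b))


def dot_product(a, b):
    """Stub called by the main application: submit the work, then wait."""
    future = _spe_pool.submit(spe_dot_product, a, b)
    return future.result()


# The caller sees a normal function call and does not manage the offload.
result = dot_product([1, 2, 3], [4, 5, 6])
```

The appeal of this model is that the main application stays unchanged on the PPE; only the library behind the stub needs to know about the SPEs.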

20/21 Current/Future Applications
Sony PlayStation 3
  - Significant improvement over the PS2
IBM blade server
  - Blade server prototype containing two Cell processors
  - Ran at 2.4 GHz (current systems run at 3.2 GHz), providing 200 GFLOPS of single-precision floating-point performance per CPU
Mercury
  - Incorporates Cell-based systems into military vehicles
  - Used for target recognition, tracking geo-location, mapping, video processing, etc.
