J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy IBM Systems and Technology Group IBM Journal of Research and Development.

Presentation transcript:

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy
IBM Systems and Technology Group
IBM Journal of Research and Development, Vol. 49, No. 4/5, p. 589 (Jul-Sep 2005)
Presented by John Ingalls, ECE, April 8, 2010

• ISA: 64-bit IBM Power Architecture with SIMD.
• 1 PPE, 8 SPEs, 1 memory controller, and 1 I/O controller, all on a coherent bus (single address space).
• Power Processor Element (PPE): 2-issue, in-order, 2-thread SMT; 32 KB L1 instruction and data caches; 512 KB L2 cache with software management hooks; 128-bit total SIMD width; Vector/SIMD issue queue separate from scalar execution.
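To make the 128-bit SIMD width concrete, the following is a minimal sketch of the lane arithmetic, not Cell code. The names `lanes` and `simd_add` are hypothetical helpers introduced for illustration; only the 128-bit width comes from the slide.

```python
# Illustrative arithmetic only; this does not model real Cell instructions.
SIMD_WIDTH_BITS = 128  # total SIMD width stated on the slide


def lanes(element_bits: int) -> int:
    """Return how many elements of the given width fit in one SIMD register."""
    if element_bits <= 0 or SIMD_WIDTH_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return SIMD_WIDTH_BITS // element_bits


def simd_add(a, b):
    """Element-wise add of two equal-length vectors, standing in for one SIMD add."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    for bits in (8, 16, 32, 64):
        print(f"{bits}-bit elements: {lanes(bits)} lanes per register")
    print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))
```

With 32-bit elements, one register holds four values, so a single SIMD operation does the work of four scalar operations.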

• Synergistic Processor Element (SPE): in-order SIMD, 128-bit total width, like the PPE.
• Local Store (LS): 256 KB, single-ported, serving either 128-bit SIMD-word accesses or 128-byte instruction fetches and DMA transfers.
• 128-entry register file to support static (compiler) instruction reordering.
• Area efficient: 15% control; the rest is execution units and Local Store.
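Because the Local Store is small and managed by software, a program has to decide how much data to bring in at a time. The sketch below simulates one common approach, double buffering, in plain Python. It is an assumption-laden illustration: `chunk_bytes`, `process_stream`, and the `reserved_bytes` budget for code and stack are hypothetical, and the slice that fills the "next" buffer only stands in for an asynchronous DMA transfer.

```python
LOCAL_STORE_BYTES = 256 * 1024  # Local Store size from the slide
DMA_LINE_BYTES = 128            # fetch/DMA access width from the slide


def chunk_bytes(reserved_bytes: int, buffers: int = 2) -> int:
    """Largest per-buffer size that fits, rounded down to a multiple of 128 bytes.

    reserved_bytes is a hypothetical budget for code, stack, and other data.
    """
    if buffers <= 0:
        raise ValueError("need at least one buffer")
    free = LOCAL_STORE_BYTES - reserved_bytes
    per_buffer = free // buffers
    size = per_buffer - per_buffer % DMA_LINE_BYTES
    if size <= 0:
        raise ValueError("no room left in the Local Store for data buffers")
    return size


def process_stream(data: bytes, reserved_bytes: int, kernel):
    """Apply kernel to data chunk by chunk, alternating between two buffers."""
    size = chunk_bytes(reserved_bytes)
    buffers = [b"", b""]
    results = []
    current = 0
    buffers[current] = data[:size]  # first "DMA in"
    offset = size
    while buffers[current]:
        nxt = 1 - current
        # On real hardware this transfer would overlap with the kernel below.
        buffers[nxt] = data[offset:offset + size]
        offset += size
        results.append(kernel(buffers[current]))
        current = nxt
    return results


if __name__ == "__main__":
    # Reserve 64 KB for code and stack (an assumed figure), then stream 200,000 bytes.
    print(chunk_bytes(64 * 1024))
    print(process_stream(bytes(200_000), 64 * 1024, len))
```

The point of using two buffers is that the next chunk can be in flight while the current one is being processed, which hides transfer latency at the cost of halving the usable chunk size.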

• I/O supports direct connection to another Cell, making it easy to build a cache-coherent multiprocessor.
• Native binary compatibility with Power ISA applications.
• Modular design, but still fully custom.
• Extensive test and monitoring circuitry.

Challenges:
• The SPE Local Store is software managed.
• Each SPE supports one thread context, and context switches are expensive.

Models:
• Function Offload: function call from the PPE.
• Device Extension: SPE isolated, like a device.
• Compute Acceleration: PPE aggregates SPE results.
• Streaming: each SPE is a stage in a software pipeline.
• Shared Memory Multiprocessor: conventional.
• Asymmetric Thread Runtime: pthreads.
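Two of the models listed above, Compute Acceleration and Streaming, can be sketched in ordinary Python. This is only an analogy under stated assumptions: a thread pool of eight workers stands in for the eight SPEs, the main thread plays the PPE, and `spe_kernel`, `split`, `ppe_sum_of_squares`, and `stream` are hypothetical names, not part of any Cell SDK.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SPES = 8  # number of SPEs from the slides; workers here are ordinary threads


def spe_kernel(block):
    """Work done by one worker on its block (sum of squares as a toy kernel)."""
    return sum(x * x for x in block)


def split(data, parts):
    """Split data into `parts` contiguous blocks of near-equal size."""
    base, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def ppe_sum_of_squares(data):
    """Compute Acceleration: the 'PPE' hands out blocks and aggregates the results."""
    with ThreadPoolExecutor(max_workers=NUM_SPES) as pool:
        partials = list(pool.map(spe_kernel, split(data, NUM_SPES)))
    return sum(partials)


def stream(items, stages):
    """Streaming: each stage is one step of a software pipeline, applied in order."""
    for stage in stages:
        items = map(stage, items)
    return list(items)


if __name__ == "__main__":
    print(ppe_sum_of_squares(list(range(10))))
    print(stream([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]))
```

In the first function the control core only partitions and aggregates; in the second, data flows through a fixed chain of stages, which is how the Streaming model assigns one pipeline step to each SPE.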

• The paper is easy to follow and does not overload the reader with detail.
• Built and shipped on time by a joint venture of IBM, Sony, and Toshiba.
• Many applications in media and supercomputing.
• The authors repeatedly present static limitations imposed by their models, such as explicitly managed caches, as advantages.
• No hard performance data or comparison with competitors; only “anecdotal evidence” shows that it is possible to fully utilize Cell.

Keywords:
• Heterogeneous multi-core SIMD processor.
• Single address space across all cores on chip.
• 1 conventional PPE for control.
• 8 SPEs for streaming SIMD; very fast and power efficient if used.
• Several programming models are feasible.

Questions:
• How could the programming models be made easier?
• What direction should this architecture grow in?