Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported.

Slides:

Advertisements

Similar presentations

MEMOCode 2007 Design Contest – MIT Submission N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan.

Advertisements

Datorteknik F1 bild 1 Higher Level Parallelism The PRAM Model Vector Processors Flynn Classification Connection Machine CM-2 (SIMD) Communication Networks.

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.

A Seamless Communication Solution for Hybrid Cell Clusters Natalie Girard Bill Gardner, John Carter, Gary Grewal University of Guelph, Canada.

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

Optimization on Kepler Zehuan Wang

Computer Abstractions and Technology

The University of Adelaide, School of Computer Science

Cell Broadband Engine. INF5062, Carsten Griwodz & Pål Halvorsen University of Oslo Cell Broadband Engine Structure SPE PPE MIC EIB.

Ido Tov & Matan Raveh Parallel Processing ( ) January 2014 Electrical and Computer Engineering DPT. Ben-Gurion University.

Parallell Processing Systems1 Chapter 4 Vector Processors.

Presented by Performance and Productivity of Emerging Architectures Jeremy Meredith Sadaf Alam Jeffrey Vetter Future Technologies.

Using Cell Processors for Intrusion Detection through Regular Expression Matching with Speculation Author: C˘at˘alin Radu, C˘at˘alin Leordeanu, Valentin.

Data Parallel Algorithms Presented By: M.Mohsin Butt

Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.

Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.

Development of a Ray Casting Application for the Cell Broadband Engine Architecture Shuo Wang University of Minnesota Twin Cities Matthew Broten Institute.

ELEC 6200, Fall 07, Oct 29 McPherson: Vector Processors1 Vector Processors Ryan McPherson ELEC 6200 Fall 2007.

Analysis and Performance Results of a Molecular Modeling Application on Merrimac Erez, et al. Stanford University 2004 Presented By: Daniel Killebrew.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.

Embedded Computer Architecture 5KK73 MPSoC Platforms Part2: Cell Bart Mesman and Henk Corporaal.

Cell Broadband Processor Daniel Bagley Meng Tan. Agenda  General Intro  History of development  Technical overview of architecture  Detailed technical.

RAM and Parallel RAM (PRAM). Why models? What is a machine model? – A abstraction describes the operation of a machine. – Allowing to associate a value.

Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy IBM Systems and Technology Group IBM Journal of Research and Development.

Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.

Cell Architecture. Introduction The Cell concept was originally thought up by Sony Computer Entertainment inc. of Japan, for the PlayStation 3 The architecture.

Cell Systems and Technology Group. Introduction to the Cell Broadband Engine Architecture  A new class of multicore processors being brought to the consumer.

Evaluation of Multi-core Architectures for Image Processing Algorithms Masters Thesis Presentation by Trupti Patil July 22, 2009.

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick Lawrence Berkeley National Laboratory ACM International Conference.

Basics and Architectures

1 Chapter 1 Parallel Machines and Computations (Fundamentals of Parallel Processing) Dr. Ranette Halverson.

High Performance Linear Transform Program Generation for the Cell BE

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Cell processor implementation of a MILC lattice QCD application.

High Performance Computing on the Cell Broadband Engine

© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.

RAM, PRAM, and LogP models

Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.

Summary Background –Why do we need parallel processing? Moore’s law. Applications. Introduction in algorithms and applications –Methodology to develop.

Optimization of Collective Communication in Intra- Cell MPI Optimization of Collective Communication in Intra- Cell MPI Ashok Srinivasan Florida State.

Cell Processor Programming: An introduction Pascal Comte Brock University, Fall 2007.

High Performance Computing Group Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE TM Architecture Feasibility Study of MPI.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung Wong Chung Hoi Supervised by Prof. Michael R. Lyu Department of Computer.

Copyright © 2008 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Data Manipulation Brookshear, J.G. (2012) Computer Science: an Overview.

Weekly Report- Reduction Ph.D. Student: Leo Lee date: Oct. 30, 2009.

Optimizing Ray Tracing on the Cell Microprocessor David Oguns.

Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.

Chapter 2: Data Manipulation

Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.

FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.

CPS 258 Announcements –Lecture calendar with slides –Pointers to related material.

Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 28, 2005 Session 29.

IBM Cell Processor Ryan Carlson, Yannick Lanner-Cusin, & Cyrus Stoller CS87: Parallel and Distributed Computing.

1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)

LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.

EECE571R -- Harnessing Massively Parallel Processors ece

Microarchitecture.

Morgan Kaufmann Publishers

Vector Processing => Multimedia

Array Processor.

STUDY AND IMPLEMENTATION

Coe818 Advanced Computer Architecture

Large data arrays processing on Cell Broadband Engine

Multicore and GPU Programming

Presentation transcript:

Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported by the Department of Energy’s Office of Science Office of Advanced Scientific Computing Research Programming the Cell Processor: Achieving High Performance and Efficiency

2 Meredith_Cell_SC07  One POWER architecture processing element (PPE)  Eight synergistic processing elements (SPEs)  All connected through a high-bandwidth element interconnect bus (EIB)  Over 200 gigaflops (single precision) on one chip Cell broadband engine processor: An overview

3 Meredith_Cell_SC07 Cell broadband engine processor: Details Eight SPEs  Dual-threaded  Vector instructions  Dual-issue pipeline  Simple instruction set heavily focused on single instruction multiple data (SIMD)  Capability for double precision, but optimization for single  Uniform 128-bit 128-register file  256-K fixed-latency local store  Memory flow controller with direct memory access (DMA) engine to access main memory or other SPE local stores One 64-bit PPE

4 Meredith_Cell_SC07 Genetic algorithm, traveling salesman: Single- vs. double-precision performance  The cell processor has a much higher latency for double-precision results.  The “ if ” test in the sorting predicate is highly penalized for double precision.

5 Meredith_Cell_SC07 Genetic algorithm, Ackley’s function: Using SPE-optimized math libraries Ackley’s function involves cos, exp, sqrt OriginalFast cosine Fast exp/sqrt SIMD Runtime (sec) [log scale] Switching to a math library optimized for SPEs results in a more than 10x improvement for single precision

6 Meredith_Cell_SC07  The cell processor can overlap communication with computation.  Covariance matrix creation has a low ratio of communication to computation.  However, even with an SIMD-optimized implementation, the high bandwidth of the cell’s EIB makes this overhead negligible ScalarSIMD Optimizations Runtime using 1 SPE (sec) Synchronous DMA Overlapped DMA 6 Meredith_Cell_CS07 Covariance matrix creation: DMA communication overhead

7 Meredith_Cell_SC07 Stochastic Boolean SAT solver: Hiding latency in logic-intensive apps PPE: attempting to hide latency manually works against the compiler SPE: manual loop unrolling and instruction reordering achieved speedups although no computational SIMD optimizations were possible

8 Meredith_Cell_SC07 Support vector machine: Parallelism and concurrent bandwidth  As more simultaneous SPE threads are added, the total runtime decreases.  Total DMA time also decreases, showing that the concurrent bandwidth to all SPEs is higher than to any one SPE.  Thread launch time increases, but the latest Linux kernel for the cell system does reduce this overhead. Full execution Launch + DMA Thread launch 8 Meredith_Cell_SC07

9 Meredith_Cell_SC07 Threads launched only on first call Molecular dynamics: SIMD intrinsics and PPE-to-SPE signaling  Using SIMD intrinsics easily achieved 2x speedups in total runtime.  Using PPE-SPE mailboxes, SPE threads can be reused across iterations.  Thus, thread launch overheads will be completely amortized on longer runs SPE, original 1 SPE, SIMDized 1 SPE2 SPEs4 SPEs8 SPEs Runtime (sec) Full runtime Thread launch overhead

10 Meredith_Cell_SC07 Conclusion  Be aware of arithmetic costs.  Use the optimized math libraries from the SDK if it helps.  Double precision requires different kinds of optimizations.  The cell has a very high bandwidth to the SPEs.  Use asynchronous DMA to overlap communication and computation for applications that are still bandwidth bound.  Amortize expensive SPE thread launch overheads.  Launch once, and signal SPEs to start the next iteration.  Use of SIMD intrinsics can result in large speedups.  Manual loop unrolling and instruction reordering can help even if no other SIMDization is possible.

11 Meredith_Cell_SC07 Contact Jeremy Meredith Future Technologies Group Computer Science and Mathematics Division (865) Meredith_Cell_SC07