Optimizing Ray Tracing on the Cell Microprocessor
David Oguns


Agenda
  Goals and Motivation
  What is SIMD?
  Cell Architectural Overview
  Implementation & Challenges
  Demo
  Performance Results
  Future Improvements
  Questions?

Goals and Motivation
  <3 Performance
  Cell is a radically new architecture
    Not just multicore programming
  Get my hands dirty with SIMD
  Learning

Single Instruction Multiple Data
  Normal processors execute a single instruction for a single result (Single Instruction, Single Data).
  SIMD processors execute a single instruction across multiple data for multiple results.
  SIMD processors are sometimes called vector, stream, or DSP processors.
  Useful in graphics acceleration, signal processing, and simulation applications.

SISD vs SIMD Approach
  SISD: A = 1 + 2; B = 3 + 4; C = 5 + 6; D = 7 + 8;
  SIMD: V = { 1, 3, 5, 7 } + { 2, 4, 6, 8 };
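The comparison on this slide can be sketched in portable C. This is only an illustration, not the author's code: the `v4i` type relies on the GCC/Clang `vector_size` extension as a stand-in for the AltiVec and SPU vector types, and the function names `add_scalar` and `add_simd` are made up for this example.

```c
#include <stdint.h>

/* Four 32-bit integers packed into one 16-byte vector.
   vector_size is a GCC/Clang extension, used here as a stand-in
   for the native AltiVec / SPU vector types (an assumption). */
typedef int32_t v4i __attribute__((vector_size(16)));

/* SISD style: four separate additions, one result each. */
static void add_scalar(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    out[0] = x[0] + y[0];   /* A = 1 + 2 */
    out[1] = x[1] + y[1];   /* B = 3 + 4 */
    out[2] = x[2] + y[2];   /* C = 5 + 6 */
    out[3] = x[3] + y[3];   /* D = 7 + 8 */
}

/* SIMD style: one vector addition produces all four results. */
static v4i add_simd(v4i x, v4i y)
{
    return x + y;           /* V = {1,3,5,7} + {2,4,6,8} */
}
```

With the slide's inputs, both versions yield 3, 7, 11 and 15; the SIMD version simply gets there with a single element-wise operation.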

SIMD Continued...
  Desktop CPUs usually have SIMD units
    MMX, SSE, 3DNow!, AltiVec
  Hardware likely to make heavy use of SIMD
    GPUs
    Ageia PhysX
    Supercomputers
    Signal processors in multimedia devices or sensors
  Potential to be order(s) of magnitude faster in pure math...

It's not all peachy
  Previous example was contrived to show data and instruction parallelism.
    Sum of 1 + 2 was not dependent on 3 + 4 or vice versa.
    SIMDizing A + B + C + D is less efficient.
    SIMD can't help with A + B + C.
  Data must be aligned properly in vectors.
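One way to read the two caveats above: a sum across the lanes of a single vector is a chain of dependent additions, so it gains nothing from a 4-wide add, and vector loads expect 16-byte-aligned data. The sketch below illustrates both points under that reading. `sum_lanes` and `is_aligned16` are hypothetical helper names, and `v4i` again uses the GCC/Clang `vector_size` extension rather than real Cell types.

```c
#include <stdint.h>

typedef int32_t v4i __attribute__((vector_size(16)));

/* Horizontal sum: A + B + C + D where all four values sit in ONE vector.
   Each addition depends on the previous one, so this runs as a serial
   chain of scalar adds rather than one vector operation. */
static int32_t sum_lanes(v4i v)
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Returns 1 when the address is a multiple of 16 bytes, the alignment
   a 128-bit vector load expects. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}
```

In C11, declaring a buffer as `_Alignas(16) int32_t buf[4];` is one portable way to satisfy the alignment check.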

Cell Overview
  Asymmetric multicore processor designed by the STI group (Sony/Toshiba/IBM) for high performance computing applications.
  1 Power Processing Element (PPE)
    Much like Intel/AMD desktop CPUs but cheaper.
  8 Synergistic Processing Elements (SPEs)
    SIMD processors
  Element Interconnect Bus (EIB)
    Four unidirectional rings, each 16 bytes wide
    Connects PPE, SPEs, main memory, FlexIO.
    Many ways to accomplish inter-element communication

PPE
  64-bit
  3.2 GHz
  Dual threaded
  512 KB L2 cache
  In-order execution
  Transparent access to main memory
  SIMD unit called AltiVec

SPE
  SIMD core clocked at 3.2 GHz
    Even pipeline for most execution instructions
    Odd pipeline for load/store/DMA/branch hints
  128 x 128-bit register file
  256 KB Local Store
    Very, very low latency (7 cycles)
  Memory Flow Controller
    No direct access to main memory!
  No branch prediction hardware

EIB
  Arbitrates all communication between elements in the Cell.
  Runs at half the clock speed of the PPE and SPEs
  ~300 GB/s theoretical bandwidth
    200 GB/s observed
  Main memory is connected to the EIB, not the PPE.
  FlexIO

Intercore Communication
  DMA (Direct Memory Access)
    Like memcpy on crack.
    Used primarily to keep data moving
    Latency ~ hundreds of cycles
    Non-blocking calls
  Mailboxes
    Short messages to and from SPEs
  Signals and Interrupts

Implementation
  Can use stock ray tracer and simply run on PPE
  Let's make it run faster!
    Multithreading (partitioning, synchronization)
      Pthreads, libspe2
    SIMDizing (both on PPE AltiVec and SPUs)
      IBM Toolchain
      Language intrinsics
  General C multiplatform development

Multithreading or Partitioning?
  How do we divide up work between N SPEs?
    Goals are to load balance and minimize synchronization
  Data driven design

Data Partitioning
  Each SPE is given basic information
    Scene address, frame buffer address, ray buffer address, number of pixels to process, sqrt(samples per pixel), numSpes, depth
  Entire scene copied over to LS via DMA.
    Very small.
  SPE(N) processes every Nth pixel
    Load balances very well
    No synchronization with PPE necessary*
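The interleaved assignment described above can be sketched as index arithmetic. The sketch assumes that pixel p goes to SPE (p mod numSpes), which is one natural reading of "every Nth pixel"; the function names are hypothetical and do not come from the original ray tracer.

```c
#include <stdint.h>

/* Number of pixels handled by SPE `spe_id` when pixel p is assigned to
   SPE (p % num_spes). Returns 0 for an invalid SPE id. */
static uint32_t pixels_for_spe(uint32_t total_pixels, uint32_t num_spes,
                               uint32_t spe_id)
{
    if (spe_id >= num_spes || spe_id >= total_pixels) {
        return 0;
    }
    /* Ceiling division over the pixels remaining after this SPE's offset. */
    return (total_pixels - spe_id + num_spes - 1) / num_spes;
}

/* Global index of the k-th pixel (k = 0, 1, 2, ...) processed by `spe_id`. */
static uint32_t pixel_index(uint32_t num_spes, uint32_t spe_id, uint32_t k)
{
    return spe_id + k * num_spes;
}
```

Because neighbouring pixels tend to cost about the same to trace, spreading them round-robin gives each SPE a similar mix of cheap and expensive pixels, which is consistent with the slide's remark that this scheme load balances well. For a 1024x768 image on 6 SPEs, each SPE gets exactly 131072 pixels.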

First Approach
  PPE
    Set up scene
    Generate primary rays in ray buffers
    Initialize workloads for N SPEs
    Launch each SPE thread
    Wait for SPEs to finish running. When the SPEs are done, the frame buffer is ready.
  SPE
    Receive workload
    Use multibuffered DMA to transfer and process primary rays
    Output pixels using DMA list to main memory.
    Terminate when DMA for outgoing pixels is complete.

First Approach = FAIL
  DMA is hard to use for streaming data...
    Must implement multibuffered solution for streaming rays
    Outputting scattered pixels is a pain
  Generating all primary rays at once uses a lot of memory.
    Ray size: 48 bytes (36 bytes -> 48 bytes with padding)
    Pixel size: 4 bytes. 3 MB frame buffer -> 36 MB ray buffer (1024x768)
    Super sampling makes it even worse! 3 MB frame buffer -> 144 MB ray buffer at 4x super sampling
    Removed ray buffer from normal ray tracer as well.
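The memory figures on this slide follow from simple arithmetic, reproduced below as a sketch. The ray and pixel sizes come from the slide; the helper names are hypothetical, and rounding 36 bytes up to a multiple of 16 is an assumption about how the padding was done.

```c
#include <stdint.h>

enum {
    RAY_BYTES_UNPADDED = 36, /* from the slide */
    PIXEL_BYTES        = 4   /* from the slide */
};

/* Round a size up to the next multiple of 16 bytes (one quadword). */
static uint32_t pad_to_16(uint32_t bytes)
{
    return (bytes + 15u) & ~15u;
}

/* Bytes needed for a frame buffer of w x h pixels. */
static uint64_t frame_buffer_bytes(uint32_t w, uint32_t h)
{
    return (uint64_t)w * h * PIXEL_BYTES;
}

/* Bytes needed to store every primary ray up front. */
static uint64_t ray_buffer_bytes(uint32_t w, uint32_t h,
                                 uint32_t samples_per_pixel)
{
    return (uint64_t)w * h * samples_per_pixel
           * pad_to_16(RAY_BYTES_UNPADDED);
}
```

At 1024x768 this gives 3 MiB for the frame buffer, 36 MiB for the rays at one sample per pixel, and 144 MiB at four samples per pixel. That is a 12x overhead per sample compared with the pixels themselves, which is why generating rays on the fly is attractive.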

Second Approach
  PPE
    Set up scene
      Extra view plane information in scene
    Initialize workloads for N SPEs
    Launch each SPE thread
    Poll each SPE's outbound mailbox for outgoing pixels.
    Write pixel to frame buffer.
    Done when last pixel received.
  SPE
    Receive info about workload
    Generate and process primary rays on the fly
    Output pixel using mailbox. (blocking call)
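The PPE-side polling loop can be simulated in plain C to show the control flow. This is a sketch under several assumptions, not the author's implementation: `fake_spe` and `poll_mailbox` are hypothetical stand-ins for a real SPE context and a non-blocking mailbox read, each mailbox message is taken to be one 32-bit pixel, and the PPE is assumed to recover the pixel position from the interleaved partitioning (the k-th message from SPE s is pixel s + k * numSpes).

```c
#include <stdint.h>

#define MAX_SPES 8

/* Hypothetical stand-in for one SPE and its outbound mailbox:
   a list of 32-bit pixel values it will emit, in order. */
typedef struct {
    const uint32_t *pixels;
    uint32_t count;
    uint32_t next;
} fake_spe;

/* Non-blocking poll: returns 1 and stores a message if one is waiting. */
static int poll_mailbox(fake_spe *spe, uint32_t *msg)
{
    if (spe->next == spe->count) {
        return 0;
    }
    *msg = spe->pixels[spe->next++];
    return 1;
}

/* PPE side: round-robin over the SPEs until every pixel has arrived.
   Returns the number of messages received. */
static uint32_t gather_pixels(fake_spe *spes, uint32_t num_spes,
                              uint32_t *frame, uint32_t total_pixels)
{
    uint32_t seen[MAX_SPES] = {0};  /* messages received per SPE */
    uint32_t received = 0;

    if (num_spes == 0 || num_spes > MAX_SPES) {
        return 0;
    }
    while (received < total_pixels) {
        int progress = 0;
        for (uint32_t s = 0; s < num_spes; ++s) {
            uint32_t msg;
            if (poll_mailbox(&spes[s], &msg)) {
                uint32_t idx = s + seen[s] * num_spes;
                if (idx < total_pixels) {
                    frame[idx] = msg;   /* write pixel to frame buffer */
                }
                seen[s]++;
                received++;
                progress = 1;
            }
        }
        if (!progress) {
            break;  /* simulation only: no SPE has data left */
        }
    }
    return received;
}
```

Routing every pixel through the PPE keeps the SPE code simple, at the cost of making the PPE a synchronization point, which matches the later wish to eliminate synchronization.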

SIMDizing
  Implemented my own vector library
    Modified using conditional compilation
    Later removed entire calls to my own functions
  IBM Toolchain allows for some clean syntax.
  Some of the compiler intrinsics work for both PPE AltiVec and SPEs.

General C Multiplatform Development
  Loading scene was identical (PPE)
  Rewrote startup code for PPE and SPE
  Ray tracing / shading algorithms
    Identical on SPE
    C math not available. SIMD Math instead.

Challenges
  First time doing something like this.
  IBM SDK does not install easily.
    Using IBM SDK libraries was a pain.
  Indirect nesting in data structures
    Keeping data structures the same size on PPE and SPE.
    PPE pointer: 64 bits; SPE pointer: 32 bits
  Aligning memory...
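A common way to keep a shared structure the same size on both sides is to store main-memory addresses as fixed-width 64-bit integers instead of pointers. The sketch below assumes that approach. The `work_block` layout and its field names are hypothetical, loosely based on the workload fields listed earlier, and are not taken from the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical workload descriptor shared by PPE and SPE code.
   Addresses are uint64_t, never pointers, so sizeof() is identical
   whether pointers are 64 bits (PPE) or 32 bits (SPE). */
typedef struct {
    _Alignas(16) uint64_t scene_ea;  /* main-memory address of scene */
    uint64_t frame_buffer_ea;        /* main-memory address of frame */
    uint32_t num_pixels;
    uint32_t sqrt_samples;
    uint32_t num_spes;
    uint32_t depth;
} work_block;

/* Fail the build on either side if the layout drifts. */
_Static_assert(sizeof(work_block) == 32, "work_block must be 32 bytes");
_Static_assert(sizeof(work_block) % 16 == 0, "size must be a multiple of 16");
_Static_assert(_Alignof(work_block) == 16, "work_block must be 16-byte aligned");

/* Convert a host pointer to a fixed-width address for the descriptor. */
static uint64_t ea_from_ptr(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

Compiling the same header for both targets with these static assertions turns a silent size mismatch into a compile error.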

This explains most of it...
  “Another common error in SPU programs is a DMA that specifies an invalid combination of Local Store address, effective address, and transfer size. The alignment rules for DMAs specify that transfers for less than 16 bytes must be "naturally aligned," meaning that the address must be divisible by the size. Transfers of 16 bytes or more must be 16-byte aligned. The size can have a value of 1, 2, 4, 8, 16, or a multiple of 16 bytes to a maximum of 16KB. In addition, the low-order four bits of the Local Store address must match the low-order four bits of effective address (in other words, they must have the same alignment within a quadword). Any DMA that violates one of these rules will generate an alignment exception which is presented to the user as a bus error.”
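The rules in the quote can be encoded as a small pre-flight check to run before issuing a transfer. This is a sketch that follows the quoted text only; `dma_args_valid` is a hypothetical helper, not part of the IBM SDK.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if (ls_addr, ea, size) satisfies the quoted DMA rules:
   - size is 1, 2, 4, 8, or a multiple of 16 up to 16 KB
   - transfers under 16 bytes are naturally aligned
   - transfers of 16 bytes or more are 16-byte aligned
   - the low four bits of both addresses match */
static int dma_args_valid(uint32_t ls_addr, uint64_t ea, size_t size)
{
    /* Both addresses must share the same offset within a quadword. */
    if ((ls_addr & 0xF) != (ea & 0xF)) {
        return 0;
    }
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Naturally aligned: address divisible by the transfer size. */
        return (ls_addr % size == 0) && (ea % size == 0);
    }
    if (size >= 16 && size % 16 == 0 && size <= 16 * 1024) {
        return (ls_addr % 16 == 0) && (ea % 16 == 0);
    }
    return 0;   /* any other size is not allowed */
}
```

Calling a check like this in debug builds reports the offending arguments directly, which is easier to diagnose than a bus error.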

Demo (sort of...)

Performance Results
  Recursion depth of 4
  PS3 Cell implies use of 6 SPEs
    No PPE multithreading
  Pentium 4 trials used no hyperthreading.

Performance Results

Run Time versus Number of SPEs (chart)
  Test run at 1600x1200x4

Future Improvements
  Partition data differently
    Eliminate all synchronization
  More efficient use of SIMD
    Colors were essentially vectors.
  Eliminate branches, add branch hints
  Learn to use IBM's profiling tools
  Real time rendering is possible!
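Since the SPE has no branch prediction hardware, one usual way to eliminate a branch is to compute both outcomes and pick one with a bit mask. The sketch below shows the idea on scalars in portable C. `select_u32` and `clamp255` are hypothetical names, and a compiler may still choose its own instruction sequence for them.

```c
#include <stdint.h>

/* Returns a when cond is nonzero, otherwise b, using a mask instead
   of an if/else. The mask is all ones or all zeros. */
static uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Example use: clamp a color channel to 255 without branching. */
static uint32_t clamp255(uint32_t c)
{
    return select_u32(c > 255, 255u, c);
}
```

The same pattern applies lane by lane to whole vectors, which fits the observation above that colors were essentially vectors.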

Questions?