Feeding the Multicore Beast: It's All About the Data!
Michael Perrone, IBM Master Inventor, Mgr., Cell Solutions Dept.
IBM Research © 2008
Outline
– History: the data challenge
– Motivation for multicore
– Implications for programmers
– How Cell addresses these implications
– Examples:
  2D/3D FFT: medical imaging, petroleum, general HPC…
  Green's functions: seismic imaging (petroleum)
  String matching: network processing (DPI & intrusion detection)
  Neural networks: finance
Chapter 1: The Beast is Hungry!
The Hungry Beast
[Diagram: Data ("food") feeding a Processor ("beast") through a Data Pipe]
– Pipe too small = starved beast
– Pipe big enough = well-fed beast
– Pipe too big = wasted resources
If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer
Example: Intel Tulsa
– Xeon MP 7100 series
– 65 nm, 349 mm², 2 cores
– 3.4 GHz @ 150 W
– ~54.4 SP GFLOPS
– http://www.intel.com/products/processor/xeon/index.htm
Large cache on chip
– ~50% of die area
– Keeps data close for efficient access
If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
If the data set doesn't fit in cache:
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don't fit:
– Graph-searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
Cache sizes are steadily increasing. Implications:
– Chip real estate is reserved for cache
– Less space on chip for compute
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore is even more demanding on cache than uni-core
Chapter 2: The Beast Has Babies
Power Density – The fundamental problem
What's causing the problem?
The gate dielectric in the gate stack is approaching a fundamental limit (a few atomic layers), bringing power, signal jitter, etc.
[Chart: power density (W/cm²) vs. gate length (µm), marking the 65 nm node]
Diminishing Returns on Frequency
In a power-constrained environment, raising the chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
[Chart: frequency-driven design points]
Power vs. Performance Trade-offs
[Chart: relative power vs. relative performance for frequency/voltage-scaled design points]
We need to adapt our algorithms to get performance out of multicore.
Implications of Multicore
There are more mouths to feed:
– Data movement will take center stage
Complexity of cores will stop increasing…
– …and has started to decrease in some cases
– Complexity increases will center around communication
Assumption: achieving a significant % of peak performance is important
Chapter 3: The Proper Care and Feeding of Hungry Beasts
Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70 W
Feeding the Cell Processor
– 8 SPEs, each with a local store (LS), memory flow controller (MFC), and SXU
– PPE: 64-bit Power Architecture with VMX; handles OS functions, disk I/O, network I/O
– EIB (up to 96 B/cycle) connects the PPE, the SPEs, the MIC (dual XDR), and the BIC (FlexIO); most ports are 16 B/cycle (some 2x), L2 is 32 B/cycle
Cell Approach: Feed the beast more efficiently
Explicitly "orchestrate" the data flow between main memory and each SPE's local store:
– Use the SPE's DMA engine to gather and scatter data between main memory and local store
– Enables detailed programmer control of data flow: get/put data when and where you want it
– Hides latency: simultaneous reads, writes, and computes
– Avoids restrictive HW cache management, which is unlikely to determine the optimal data flow and can be very inefficient
– Allows more efficient use of the existing bandwidth
BOTTOM LINE: It's all about the data!
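The get/compute/put overlap can be sketched in portable Python. A thread pool stands in for the MFC's asynchronous DMA, and the `process` kernel is a hypothetical placeholder; real SPE code would issue `mfc_get`/`mfc_put` with tag groups instead.

```python
from concurrent.futures import ThreadPoolExecutor

def process(block):
    # Stand-in for the SPE compute kernel.
    return [x * 2 for x in block]

def double_buffered(blocks):
    """Overlap the 'DMA get' of block i+1 with the compute on block i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(list, blocks[0])        # prefetch first block
        for i in range(len(blocks)):
            current = pending.result()               # wait for the get to land
            if i + 1 < len(blocks):
                pending = dma.submit(list, blocks[i + 1])  # kick off next get
            results.append(process(current))         # compute overlaps transfer
    return results
```

The same pattern extends to multibuffering and to overlapping puts; the point is that the programmer, not a hardware cache, decides when each transfer starts.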
Cell Comparison: ~4× the FLOPS @ ~½ the power
Both are 65 nm technology (die photos shown to scale).
Memory Managing Processor vs. Traditional General Purpose Processor
[Die images: IBM, AMD, and Intel general-purpose processors vs. Cell/B.E.]
Examples of Feeding Cell
– 2D and 3D FFTs
– Seismic imaging
– String matching
– Neural networks (function approximation)
Feeding FFTs to Cell
SIMDized data; DMAs are double buffered.
Pass 1, for each buffer:
– DMA get buffer
– Do four 1D FFTs in SIMD
– Transpose tiles
– DMA put buffer
Pass 2: repeat the same steps on the transposed image
[Diagram: input image → buffer → transposed tiles → transposed image]
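The two-pass structure above can be sketched in Python, with a naive O(n²) DFT standing in for the SIMDized 1D FFT kernel and a whole-matrix transpose standing in for the tile transposes:

```python
import cmath

def dft(row):
    # Naive 1D DFT; a stand-in for the SIMD FFT kernel on the SPE.
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft2d(image):
    # Pass 1: 1D transforms along the rows, then transpose.
    pass1 = transpose([dft(r) for r in image])
    # Pass 2: 1D transforms along the former columns, then transpose back.
    return transpose([dft(r) for r in pass1])
```

Transposing between passes is what keeps every 1D transform operating on contiguous (DMA-friendly) data.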
3D FFTs
Long strides thrash the cache; Cell DMA allows prefetch.
[Diagram: single element vs. data envelope; stride-1 vs. stride-N² access patterns]
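A sketch of the gather step: copying a stride-N² "pencil" into a contiguous buffer, as a DMA-list get would, before running the 1D FFT along z. The flattening order used here is an illustrative assumption, not taken from the slides.

```python
def gather_pencil(volume_flat, n, x, y):
    # Gather the z-pencil at (x, y) from an n^3 volume flattened as
    # index = z*n*n + x*n + y, so consecutive z elements sit a long
    # stride (n*n) apart in memory. Copying them into a contiguous
    # buffer keeps the 1D transform along z from thrashing the cache.
    base = x * n + y
    return [volume_flat[base + z * n * n] for z in range(n)]
```

On Cell, a DMA list performs this gather in the background while the previous pencil is being transformed.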
Feeding Seismic Imaging to Cell
A new Green's function G at each (x, y); the radial symmetry of G reduces bandwidth requirements.
[Diagram: data grid with the Green's function applied at (x, y)]
Feeding Seismic Imaging to Cell
[Diagram: the data partitioned into columns across SPE 0 through SPE 7]
Feeding Seismic Imaging to Cell
For each X:
– Load next column of data
– Load next column of indices
– For each Y:
  Load Green's functions
  SIMDize Green's functions
  Compute convolution at (X, Y)
– Cycle buffers
[Diagram: data buffer of height H and Green's index buffer; convolution window of width 2R+1 around (X, Y)]
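The innermost step, the convolution at a single (X, Y), can be sketched as follows; the DMA buffering and SIMDization from the slide are omitted, and the plain-2D-list layout is an assumption for illustration:

```python
def image_point(data, greens, x, y, r):
    # Convolve the Green's function window of radius r centered at (x, y).
    # 'greens' is the (2r+1) x (2r+1) SIMDized window from the slide,
    # here just a 2D list.
    acc = 0.0
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            acc += data[x + dx][y + dy] * greens[dx + r][dy + r]
    return acc
```

The outer X/Y loops then sweep this kernel across the volume while the next column of data and indices stream in.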
Feeding String Matching to Cell
Find (lots of) substrings in a (long) string.
Build a graph of the words and represent it as a DFA.
Problem: the graph doesn't fit in the LS.
Sample word list: "the", "that", "math"
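The word-graph DFA described above is essentially an Aho-Corasick automaton. A minimal sketch (trie plus failure links), ignoring the local-store partitioning problem the slide raises:

```python
from collections import deque

def build_dfa(words):
    # Trie of the word list, then BFS to add failure links.
    goto, out, fail = [{}], [set()], [0]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(w)
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] |= out[fail[t]]   # inherit matches ending at the fallback
    return goto, out, fail

def match(text, dfa):
    # Single pass over the text; reports (start, word) for every hit.
    goto, out, fail = dfa
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits
```

Each input character triggers exactly one state transition, which is why the transition table (the "graph") is what must be streamed through the LS.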
Feeding String Matching to Cell
[Diagram only]
Hiding Main Memory Latency
Software Multithreading
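Software multithreading hides main-memory latency on a single core: when one logical thread stalls waiting for its next DFA node to arrive by DMA, the scheduler resumes another. A sketch using Python generators as the logical threads; the `transitions` table and round-robin scheduler are illustrative assumptions, not the actual Cell implementation:

```python
def lookup_coro(transitions, text):
    # One logical thread: yields before each state-transition lookup,
    # simulating a thread suspended while a DFA node is DMA'd in.
    s = 0
    for ch in text:
        yield                              # suspend until the "DMA" lands
        s = transitions.get((s, ch), 0)
    return s                               # delivered via StopIteration.value

def run_round_robin(coros):
    # While one walker waits on its "DMA", the others make progress,
    # so the core stays busy and the latency is hidden.
    results = [None] * len(coros)
    active = list(enumerate(coros))
    while active:
        still_running = []
        for i, c in active:
            try:
                next(c)                    # resume: its data has arrived
                still_running.append((i, c))
            except StopIteration as done:
                results[i] = done.value
        active = still_running
    return results
```

With enough logical threads in flight per SPE, every DMA round-trip is overlapped with useful work on other packets.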
Feeding Neural Networks to Cell
Neural net function F(X): RBF, MLP, k-NN, etc.
– N basis functions: dot product + nonlinearity
– D input dimensions
– D×N matrix of parameters
If the parameters are too big for the LS, the computation is bandwidth bound.
[Diagram: input X → network → output F]
Convert BW Bound to Compute Bound
Split the function over multiple SPEs:
– Avoids unnecessary memory traffic
– Reduces compute time per SPE
– Minimal merge overhead
[Diagram: per-SPE partial results merged into the final output]
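The split-and-merge idea can be sketched as follows; `tanh` is an assumed nonlinearity and the plain-list layout is illustrative, but the partitioning of the N basis functions across workers is the point of the slide:

```python
import math

def forward(x, weights):
    # F(x) = sum over the N basis functions of tanh(w_j . x);
    # 'weights' holds N rows of D parameters each.
    return sum(math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
               for w in weights)

def forward_split(x, weights, n_workers):
    # Partition the N basis functions across workers (stand-ins for SPEs);
    # each worker needs only its slice of the DxN parameter matrix in
    # local store, and the merge is a cheap scalar sum.
    chunks = [weights[i::n_workers] for i in range(n_workers)]
    return sum(forward(x, c) for c in chunks)
```

Because each parameter now stays resident on one SPE instead of streaming repeatedly from main memory, the kernel becomes compute bound rather than bandwidth bound.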
Moral of the Story: It's All About the Data!
The data problem is growing with multicore.
Intelligent software prefetching:
– Use DMA engines
– Don't rely on HW prefetching
Efficient data management:
– Multibuffering: hide the latency!
– BW utilization: make every byte count!
– SIMDization: make every vector count!
– Problem/data partitioning: make every core work!
– Software multithreading: keep every core busy!
Backup
Abstract
Technological obstacles have prevented the microprocessor industry from continuing to increase performance through higher chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor's design addresses these challenges. As examples, I will review one or two applications that highlight the strengths of Cell.