IBM Research © 2008

Feeding the Multicore Beast: It’s All About the Data!
Michael Perrone
IBM Master Inventor
Mgr, Cell Solutions Dept.
Outline
History: Data challenge
Motivation for multicore
Implications for programmers
How Cell addresses these implications
Examples
– 2D/3D FFT – Medical Imaging, Petroleum, general HPC…
– Green’s Functions – Seismic Imaging (Petroleum)
– String Matching – Network Processing: DPI & Intrusion Detection
– Neural Networks – Finance
Chapter 1: The Beast is Hungry!
The Hungry Beast
[Diagram: Processor (“beast”) fed from Data (“food”) through a Data Pipe]
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer
Example: Intel Tulsa
– Xeon MP 7100 series
– 65nm, 349mm², 2 cores
– W
– ~54.4 SP GFlops
– /processor/xeon/index.htm
Large cache on chip
– ~50% of area
– Keeps data close for efficient access
If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
Cache size steadily increasing
Implications
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore even more demanding on cache than uni-core
Chapter 2: The Beast Has Babies
Power Density – The fundamental problem
What’s causing the problem?
Gate dielectric approaching a fundamental limit (a few atomic layers)
Power, signal jitter, etc…
[Figure: gate stack; power density (W/cm²) vs. gate length (microns), 65 nm marked]
Diminishing Returns on Frequency
In a power-constrained environment, raising chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
[Figure: frequency-driven design points]
Power vs. Performance Trade-Offs
We need to adapt our algorithms to get performance out of multicore
Implications of Multicore
There are more mouths to feed
– Data movement will take center stage
Complexity of cores will stop increasing
– … and has started to decrease in some cases
– Complexity increases will center around communication
Assumption
– Achieving a significant % of peak performance is important
Chapter 3: The Proper Care and Feeding of Hungry Beasts
Cell/B.E. Processor: 200 GFLOPS, ~70 W
Feeding the Cell Processor
8 SPEs, each with
– LS (local store)
– MFC (memory flow controller)
– SXU
PPE: 64-bit Power Architecture with VMX
– OS functions
– Disk IO
– Network IO
[Diagram: EIB (up to 96B/cycle) connecting the PPE (PXU, L1, L2; 32B/cycle to L2), the 8 SPEs (SPU/SXU, LS, MFC; 16B/cycle each), the MIC (dual XDR), and the BIC (FlexIO, (2x)16B/cycle)]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
– Use the SPE’s DMA engine to gather & scatter data between main memory and local store
– Enables detailed programmer control of data flow
 Get/Put data when & where you want it
 Hides latency: simultaneous reads, writes & computes
– Avoids restrictive HW cache management
 Unlikely to determine optimal data flow
 Potentially very inefficient
– Allows more efficient use of the existing bandwidth
BOTTOM LINE: It’s all about the data!
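The Get/Put orchestration above is usually coded as a double-buffered loop: fetch the next chunk while computing on the current one. A minimal sketch in plain C, where `dma_get`/`dma_put` are hypothetical stand-ins (memcpy here; on a real SPE they would be tag-based `mfc_get`/`mfc_put` calls that complete asynchronously):

```c
#include <string.h>

#define CHUNK 256

/* Stand-ins for the SPE's asynchronous DMA. Here they complete
 * immediately via memcpy; on Cell each would be issued with a tag
 * and waited on later, so the transfer of one buffer overlaps the
 * computation on the other. */
static void dma_get(float *ls, const float *mem, int n) { memcpy(ls, mem, n * sizeof *ls); }
static void dma_put(float *mem, const float *ls, int n) { memcpy(mem, ls, n * sizeof *ls); }

/* Double-buffered streaming: scale every element by 2 in place. */
void process(float *data, int nchunks)
{
    float buf[2][CHUNK];                        /* two local-store buffers */
    int cur = 0;

    dma_get(buf[cur], data, CHUNK);             /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                    /* prefetch the next chunk */
            dma_get(buf[nxt], data + (i + 1) * CHUNK, CHUNK);
        for (int j = 0; j < CHUNK; j++)         /* compute on current chunk */
            buf[cur][j] *= 2.0f;
        dma_put(data + i * CHUNK, buf[cur], CHUNK);
        cur = nxt;                              /* cycle buffers */
    }
}
```

With real tag-based DMA, the wait on the prefetched buffer happens just before it is used, which is what hides the memory latency.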
Cell Comparison: ~4x the performance at ~½ the power
Both 65nm technology (to scale)
Memory Managing Processor vs. Traditional General Purpose Processor
[Die photos: IBM, AMD, and Intel general-purpose processors vs. Cell BE]
Examples of Feeding Cell
2D and 3D FFTs
Seismic Imaging
String Matching
Neural Networks (function approximation)
Feeding FFTs to Cell
SIMDized data; DMAs double buffered
Pass 1: For each buffer
– DMA Get buffer
– Do four 1D FFTs in SIMD
– Transpose tiles
– DMA Put buffer
Pass 2: For each buffer
– DMA Get buffer
– Do four 1D FFTs in SIMD
– Transpose tiles
– DMA Put buffer
[Diagram: input image → buffer → tiles transposed → transposed image]
3D FFTs
Long stride trashes cache
Cell DMA allows prefetch
[Diagram: data envelope of single elements in an N³ volume; stride-1 vs. stride-N² access]
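The long-stride problem is concrete: the transform along the slowest axis of an N³ volume touches one element every N² words. On Cell, a DMA list can gather that strided line into contiguous local store before the FFT runs. A sketch of the gather with a plain loop standing in for the DMA list (`gather_zline` and the 8³ volume are hypothetical names/sizes):

```c
#define N 8   /* N x N x N volume, x fastest-varying */

/* Gather one z-line (stride N*N in the flat array) into a contiguous
 * buffer, as a DMA list would on Cell, so the 1D FFT along z reads
 * unit-stride data instead of taking a cache miss per element. */
void gather_zline(const float *vol, int x, int y, float *line)
{
    for (int z = 0; z < N; z++)
        line[z] = vol[(z * N + y) * N + x];
}
```

Because the gather is issued ahead of time as an explicit DMA, the latency of the strided reads overlaps the compute on the previous line.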
Feeding Seismic Imaging to Cell
New G at each (x,y)
Radial symmetry of G reduces BW requirements
[Diagram: data volume with Green’s function centered at (X,Y)]
Feeding Seismic Imaging to Cell
[Diagram: data partitioned across SPE 0 – SPE 7]
Feeding Seismic Imaging to Cell
For each X
– Load next column of data
– Load next column of indices
– For each Y
 Load Green’s functions
 SIMDize Green’s functions
 Compute convolution at (X,Y)
– Cycle buffers
[Diagram: H×(2R+1) data buffer and Green’s index buffer; radius R around (X,Y)]
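The bandwidth saving from radial symmetry can be sketched directly: because G depends only on r² = dx² + dy², a 1D table of R²+1 values replaces a full (2R+1)² stencil. The kernel `green` below is a made-up placeholder (1/(1+r²)), and the grid sizes are illustrative, not from the actual seismic code:

```c
#define NX 8
#define NY 8
#define R  2

/* Radially symmetric Green's function: indexed by r^2, so only
 * R*R + 1 values are loaded per (X,Y) instead of (2R+1)^2.
 * The 1/(1+r2) form is a placeholder kernel. */
static double green(int r2) { return 1.0 / (1.0 + r2); }

/* Convolve interior points of data with the radial kernel. */
void convolve(double data[NY][NX], double out[NY][NX])
{
    for (int y = R; y < NY - R; y++)
        for (int x = R; x < NX - R; x++) {
            double acc = 0.0;
            for (int dy = -R; dy <= R; dy++)
                for (int dx = -R; dx <= R; dx++) {
                    int r2 = dx * dx + dy * dy;
                    if (r2 <= R * R)           /* stay inside the disc */
                        acc += green(r2) * data[y + dy][x + dx];
                }
            out[y][x] = acc;
        }
}
```

On the SPEs the inner loops would be SIMDized and the data/index columns cycled through double buffers, as on the slide.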
Feeding String Matching to Cell
Find (lots of) substrings in (long) string
Build graph of words & represent as DFA
Problem: Graph doesn’t fit in LS
Sample word list: “the” “that” “math”
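One standard way to turn a word list into such a DFA is the Aho-Corasick construction: a trie whose failure links are folded into the transition table, so matching costs exactly one table lookup per input byte. A sketch for the sample word list (lowercase ASCII only; table sizes and the reset-on-non-letter simplification are illustrative, not the DPI-grade implementation):

```c
#include <string.h>

#define MAXSTATES 64
#define ALPHA 26

static int go[MAXSTATES][ALPHA];  /* DFA transition table */
static int fail[MAXSTATES];       /* failure links (build-time only) */
static int outcnt[MAXSTATES];     /* # of patterns ending at each state */
static int nstates;

static void add_word(const char *w)
{
    int s = 0;
    for (; *w; w++) {
        int c = *w - 'a';
        if (!go[s][c]) go[s][c] = nstates++;
        s = go[s][c];
    }
    outcnt[s]++;
}

/* Build the DFA: trie insertion, then a BFS that computes failure
 * links and fills every missing transition, so matching never backs up. */
void build(const char *words[], int n)
{
    memset(go, 0, sizeof go); memset(fail, 0, sizeof fail);
    memset(outcnt, 0, sizeof outcnt);
    nstates = 1;
    for (int i = 0; i < n; i++) add_word(words[i]);

    int queue[MAXSTATES], qh = 0, qt = 0;
    for (int c = 0; c < ALPHA; c++)
        if (go[0][c]) queue[qt++] = go[0][c];
    while (qh < qt) {
        int s = queue[qh++];
        for (int c = 0; c < ALPHA; c++) {
            int t = go[s][c];
            if (t) {
                fail[t] = go[fail[s]][c];
                outcnt[t] += outcnt[fail[t]];  /* count suffix matches too */
                queue[qt++] = t;
            } else {
                go[s][c] = go[fail[s]][c];     /* fold failure into table */
            }
        }
    }
}

/* Count occurrences of all words in text (lowercase letters only;
 * any other byte resets to the root state). */
int match_count(const char *text)
{
    int s = 0, hits = 0;
    for (; *text; text++) {
        if (*text < 'a' || *text > 'z') { s = 0; continue; }
        s = go[s][*text - 'a'];
        hits += outcnt[s];
    }
    return hits;
}
```

The LS problem on the slide is exactly that `go[][]` grows with the word list; for large dictionaries the table must be tiled and fetched on demand.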
Feeding String Matching to Cell
Hiding Main Memory Latency
Software Multithreading
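The idea behind software multithreading here is to keep several independent “threads” of work in flight on one core, so that while one waits on a long-latency fetch another computes. A structural sketch, with latency simulated by a `ready_at` tick (on an SPE the wait would be on a DMA tag group); all names and sizes are illustrative:

```c
#define NTHREADS 4
#define NITEMS   8
#define LATENCY  3   /* simulated fetch latency, in scheduler ticks */

typedef struct {
    int item;      /* next item index to process */
    int ready_at;  /* tick when the pending fetch completes */
    long sum;      /* per-thread accumulator */
} ctx_t;

/* Round-robin software scheduler: skip any thread whose fetch hasn't
 * "landed" yet, so fetch latency overlaps other threads' compute. */
long run(void)
{
    ctx_t t[NTHREADS];
    int tick = 0, done = 0;
    for (int i = 0; i < NTHREADS; i++)
        t[i] = (ctx_t){ .item = 0, .ready_at = LATENCY, .sum = 0 };

    while (done < NTHREADS) {
        for (int i = 0; i < NTHREADS; i++, tick++) {
            if (t[i].item >= NITEMS || tick < t[i].ready_at) continue;
            t[i].sum += i * NITEMS + t[i].item;  /* compute on fetched item */
            t[i].item++;
            t[i].ready_at = tick + LATENCY;      /* issue the next fetch */
            if (t[i].item == NITEMS) done++;
        }
    }
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) total += t[i].sum;
    return total;
}
```

With enough contexts in flight, the core never stalls on any single fetch.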
Feeding Neural Networks to Cell
Neural net function F(X) – RBF, MLP, KNN, etc.
If too big for LS, BW bound
[Diagram: D input dimensions X → D×N matrix of parameters → N basis functions (dot product + nonlinearity) → output F]
Convert BW Bound to Compute Bound
Split function over multiple SPEs
– Avoids unnecessary memory traffic
– Reduce compute time per SPE
– Minimal merge overhead
[Diagram: partial networks on each SPE, merged into F]
Moral of the Story: It’s All About the Data!
The data problem is growing: multicore
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Backup
Abstract
Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore processor path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor’s design addresses these challenges. As examples, I will review one or two applications that highlight the strengths of Cell.