Feeding the Multicore Beast: It's All About the Data!
Michael Perrone, IBM Master Inventor, Mgr., Cell Solutions Dept.
IBM Research © 2008
Outline
– History: the data challenge
– Motivation for multicore
– Implications for programmers
– How Cell addresses these implications
– Examples:
  2D/3D FFT: medical imaging, petroleum, general HPC…
  Green's functions: seismic imaging (petroleum)
  String matching: network processing (DPI & intrusion detection)
  Neural networks: finance
Chapter 1: The Beast is Hungry!
The Hungry Beast
[Diagram: Data ("food") feeding a Processor ("beast") through a Data Pipe]
– Pipe too small = starved beast
– Pipe big enough = well-fed beast
– Pipe too big = wasted resources
If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer
Example: Intel Tulsa
– Xeon MP 7100 series
– 65 nm, 349 mm², 2 cores
– 3.4 GHz @ 150 W
– ~54.4 SP GFLOPS
– http://www.intel.com/products/processor/xeon/index.htm
Large cache on chip
– ~50% of die area
– Keeps data close for efficient access
If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
If the data set doesn't fit in cache:
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don't fit:
– Graph-searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
Cache sizes are steadily increasing. Implications:
– Chip real estate is reserved for cache
– Less space on chip for compute
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore is even more demanding on cache than uni-core
Chapter 2: The Beast Has Babies
Power Density – The fundamental problem
What's causing the problem?
The gate dielectric in the gate stack is approaching a fundamental limit (a few atomic layers), bringing power, signal jitter, etc.
[Chart: power density (W/cm²) vs. gate length (µm), marking the 65 nm node]
Diminishing Returns on Frequency
In a power-constrained environment, raising the chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
[Chart: frequency-driven design points]
Power vs. Performance Trade-offs
[Chart: relative power vs. relative performance for frequency/voltage-scaled design points]
We need to adapt our algorithms to get performance out of multicore.
Implications of Multicore
There are more mouths to feed:
– Data movement will take center stage
Complexity of cores will stop increasing…
– …and has started to decrease in some cases
– Complexity increases will center around communication
Assumption: achieving a significant % of peak performance is important
Chapter 3: The Proper Care and Feeding of Hungry Beasts
Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70 W
Feeding the Cell Processor
– 8 SPEs, each with a local store (LS), memory flow controller (MFC), and SXU
– PPE: 64-bit Power Architecture with VMX; handles OS functions, disk I/O, network I/O
– EIB (up to 96 B/cycle) connects the PPE, the SPEs, the MIC (dual XDR), and the BIC (FlexIO); most ports are 16 B/cycle (some 2x), L2 is 32 B/cycle
Cell Approach: Feed the beast more efficiently
Explicitly "orchestrate" the data flow between main memory and each SPE's local store:
– Use the SPE's DMA engine to gather and scatter data between main memory and local store
– Enables detailed programmer control of data flow: get/put data when and where you want it
– Hides latency: simultaneous reads, writes, and computes
– Avoids restrictive HW cache management, which is unlikely to determine the optimal data flow and can be very inefficient
– Allows more efficient use of the existing bandwidth
BOTTOM LINE: It's all about the data!
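The get/compute/put overlap can be sketched in portable Python. A thread pool stands in for the MFC's asynchronous DMA, and the `process` kernel is a hypothetical placeholder; real SPE code would issue `mfc_get`/`mfc_put` with tag groups instead.

```python
from concurrent.futures import ThreadPoolExecutor

def process(block):
    # Stand-in for the SPE compute kernel.
    return [x * 2 for x in block]

def double_buffered(blocks):
    """Overlap the 'DMA get' of block i+1 with the compute on block i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(list, blocks[0])        # prefetch first block
        for i in range(len(blocks)):
            current = pending.result()               # wait for the get to land
            if i + 1 < len(blocks):
                pending = dma.submit(list, blocks[i + 1])  # kick off next get
            results.append(process(current))         # compute overlaps transfer
    return results
```

The same pattern extends to multibuffering and to overlapping puts; the point is that the programmer, not a hardware cache, decides when each transfer starts.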
Cell Comparison: ~4× the FLOPS @ ~½ the power
Both are 65 nm technology (die photos shown to scale).
Memory Managing Processor vs. Traditional General Purpose Processor
[Die images: IBM, AMD, and Intel general-purpose processors vs. Cell/B.E.]
Examples of Feeding Cell
– 2D and 3D FFTs
– Seismic imaging
– String matching
– Neural networks (function approximation)
Feeding FFTs to Cell
SIMDized data; DMAs are double buffered.
Pass 1, for each buffer:
– DMA get buffer
– Do four 1D FFTs in SIMD
– Transpose tiles
– DMA put buffer
Pass 2: repeat the same steps on the transposed image
[Diagram: input image → buffer → transposed tiles → transposed image]
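The two-pass structure above can be sketched in Python, with a naive O(n²) DFT standing in for the SIMDized 1D FFT kernel and a whole-matrix transpose standing in for the tile transposes:

```python
import cmath

def dft(row):
    # Naive 1D DFT; a stand-in for the SIMD FFT kernel on the SPE.
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft2d(image):
    # Pass 1: 1D transforms along the rows, then transpose.
    pass1 = transpose([dft(r) for r in image])
    # Pass 2: 1D transforms along the former columns, then transpose back.
    return transpose([dft(r) for r in pass1])
```

Transposing between passes is what keeps every 1D transform operating on contiguous (DMA-friendly) data.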
3D FFTs
Long strides thrash the cache; Cell DMA allows prefetch.
[Diagram: single element vs. data envelope; stride-1 vs. stride-N² access patterns]
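A sketch of the gather step: copying a stride-N² "pencil" into a contiguous buffer, as a DMA-list get would, before running the 1D FFT along z. The flattening order used here is an illustrative assumption, not taken from the slides.

```python
def gather_pencil(volume_flat, n, x, y):
    # Gather the z-pencil at (x, y) from an n^3 volume flattened as
    # index = z*n*n + x*n + y, so consecutive z elements sit a long
    # stride (n*n) apart in memory. Copying them into a contiguous
    # buffer keeps the 1D transform along z from thrashing the cache.
    base = x * n + y
    return [volume_flat[base + z * n * n] for z in range(n)]
```

On Cell, a DMA list performs this gather in the background while the previous pencil is being transformed.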
Feeding Seismic Imaging to Cell
A new Green's function G at each (x, y); the radial symmetry of G reduces bandwidth requirements.
[Diagram: data grid with the Green's function applied at (x, y)]
Feeding Seismic Imaging to Cell
[Diagram: the data partitioned into columns across SPE 0 through SPE 7]
Feeding Seismic Imaging to Cell
For each X:
– Load next column of data
– Load next column of indices
– For each Y:
  Load Green's functions
  SIMDize Green's functions
  Compute convolution at (X, Y)
– Cycle buffers
[Diagram: data buffer of height H and Green's index buffer; convolution window of width 2R+1 around (X, Y)]
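The innermost step, the convolution at a single (X, Y), can be sketched as follows; the DMA buffering and SIMDization from the slide are omitted, and the plain-2D-list layout is an assumption for illustration:

```python
def image_point(data, greens, x, y, r):
    # Convolve the Green's function window of radius r centered at (x, y).
    # 'greens' is the (2r+1) x (2r+1) SIMDized window from the slide,
    # here just a 2D list.
    acc = 0.0
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            acc += data[x + dx][y + dy] * greens[dx + r][dy + r]
    return acc
```

The outer X/Y loops then sweep this kernel across the volume while the next column of data and indices stream in.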
Feeding String Matching to Cell
Find (lots of) substrings in a (long) string.
Build a graph of the words and represent it as a DFA.
Problem: the graph doesn't fit in the LS.
Sample word list: "the", "that", "math"
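The word-graph DFA described above is essentially an Aho-Corasick automaton. A minimal sketch (trie plus failure links), ignoring the local-store partitioning problem the slide raises:

```python
from collections import deque

def build_dfa(words):
    # Trie of the word list, then BFS to add failure links.
    goto, out, fail = [{}], [set()], [0]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(w)
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] |= out[fail[t]]   # inherit matches ending at the fallback
    return goto, out, fail

def match(text, dfa):
    # Single pass over the text; reports (start, word) for every hit.
    goto, out, fail = dfa
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits
```

Each input character triggers exactly one state transition, which is why the transition table (the "graph") is what must be streamed through the LS.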
Feeding String Matching to Cell
[Diagram only]
Hiding Main Memory Latency
Software Multithreading
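Software multithreading hides main-memory latency on a single core: when one logical thread stalls waiting for its next DFA node to arrive by DMA, the scheduler resumes another. A sketch using Python generators as the logical threads; the `transitions` table and round-robin scheduler are illustrative assumptions, not the actual Cell implementation:

```python
def lookup_coro(transitions, text):
    # One logical thread: yields before each state-transition lookup,
    # simulating a thread suspended while a DFA node is DMA'd in.
    s = 0
    for ch in text:
        yield                              # suspend until the "DMA" lands
        s = transitions.get((s, ch), 0)
    return s                               # delivered via StopIteration.value

def run_round_robin(coros):
    # While one walker waits on its "DMA", the others make progress,
    # so the core stays busy and the latency is hidden.
    results = [None] * len(coros)
    active = list(enumerate(coros))
    while active:
        still_running = []
        for i, c in active:
            try:
                next(c)                    # resume: its data has arrived
                still_running.append((i, c))
            except StopIteration as done:
                results[i] = done.value
        active = still_running
    return results
```

With enough logical threads in flight per SPE, every DMA round-trip is overlapped with useful work on other packets.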
Feeding Neural Networks to Cell
Neural net function F(X): RBF, MLP, k-NN, etc.
– N basis functions: dot product + nonlinearity
– D input dimensions
– D×N matrix of parameters
If the parameters are too big for the LS, the computation is bandwidth bound.
[Diagram: input X → network → output F]
Convert BW Bound to Compute Bound
Split the function over multiple SPEs:
– Avoids unnecessary memory traffic
– Reduces compute time per SPE
– Minimal merge overhead
[Diagram: per-SPE partial results merged into the final output]
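The split-and-merge idea can be sketched as follows; `tanh` is an assumed nonlinearity and the plain-list layout is illustrative, but the partitioning of the N basis functions across workers is the point of the slide:

```python
import math

def forward(x, weights):
    # F(x) = sum over the N basis functions of tanh(w_j . x);
    # 'weights' holds N rows of D parameters each.
    return sum(math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
               for w in weights)

def forward_split(x, weights, n_workers):
    # Partition the N basis functions across workers (stand-ins for SPEs);
    # each worker needs only its slice of the DxN parameter matrix in
    # local store, and the merge is a cheap scalar sum.
    chunks = [weights[i::n_workers] for i in range(n_workers)]
    return sum(forward(x, c) for c in chunks)
```

Because each parameter now stays resident on one SPE instead of streaming repeatedly from main memory, the kernel becomes compute bound rather than bandwidth bound.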
Moral of the Story: It's All About the Data!
The data problem is growing with multicore.
Intelligent software prefetching:
– Use DMA engines
– Don't rely on HW prefetching
Efficient data management:
– Multibuffering: hide the latency!
– BW utilization: make every byte count!
– SIMDization: make every vector count!
– Problem/data partitioning: make every core work!
– Software multithreading: keep every core busy!
Backup
Abstract
Technological obstacles have prevented the microprocessor industry from continuing to increase performance through higher chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor's design addresses these challenges. As examples, I will review one or two applications that highlight the strengths of Cell.