High Performance Linear Transform Program Generation for the Cell BE

Name: High Performance Linear Transform Program Generation for the Cell BE
Uploaded: 2017-12-15T10:46:25+00:00
Duration: PTM11S37
Channel: Blake Dennis
Description: High Performance Linear Transform Program Generation for the Cell BE

High Performance Linear Transform Program Generation for the Cell BE
Vas Chellappa, Franz Franchetti, Markus Püschel
Electrical & Computer Engineering, Carnegie Mellon University
Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.

How do we harness the Cell’s impressive peak performance?
Cell Broadband Engine: multicore CPU (8 SPEs + 1 PPE)
SPEs: SIMD cores designed for numerical computing
256 KB “local store” (LS) per SPE (scratchpad-like)
Programmer-driven DMA
204 Gflop/s peak
[Diagram: Cell BE chip with SPEs and their local stores (LS), the EIB, and main memory]

DFT on the Cell BE
[Plot: DFT performance on the Cell BE; series: Spiral generated (this paper), FFTC, FFTW, Numerical Recipes; annotated gap: 350x]
Platform-tuned code is 350x faster, but hard to write!

Overview
Background, Spiral overview
Generating DFTs for the Cell
Performance results
Concluding remarks
Reference: Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

“Fitting” Dataflow to Hardware
To “fit” the DFT to an architecture:
  Various traversals
  Various factorizations
[Diagram: staged DFT dataflow (Stage 1, Stage 2, ...) shown as an iterative algorithm (programming ease), as a recursive algorithm (memory hierarchy), and as parallel execution on Core 0 and Core 1 (multicore)]
How do we map dataflow to architecture automatically?

“Fitting” Dataflow to Platform (contd.)
[Diagram: numbered dataflow stages mapped onto Core 0 and Core 1]
Intuition: rewrite formulas to obtain a suitable dataflow.

Program Generation in Spiral
Transform (user specified)
Fast algorithm in SPL (many choices)
∑-SPL
C code
Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, ...
Iteration of this process to search for the fastest
But that’s not all ...

Common Abstraction: SPL
SPL (signal processing language): tensor-product representation
Mathematical notation exposes structure
Example: the Cooley-Tukey fast Fourier transform (FFT)
Algorithms in SPL: products of structured sparse matrices
Algorithms reduce arithmetic cost: O(n^2) → O(n log n)
Tensor products in SPL represent loop structures


Mapping DFTs to the Cell
Objective: high-performance transform library for the Cell BE
[Diagram: DFT mapped onto the Cell BE chip: SPEs with local stores (LS), EIB, main memory]
Cell’s architectural paradigms and the corresponding tags:
  Vectorization: vectorize the DFT for a given vector length
  Parallelization: parallelize the DFT across p SPEs, using a given DMA packet size
  Multibuffering: optimize the DFT for throughput (s DFTs required)
Tags guide formula rewriting
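
To illustrate what vectorizing for a fixed vector length can mean in tensor-product terms, the sketch below assumes a vector length of four floats and computes y = (DFT_2 ⊗ I_4) x. Every scalar operation of the size-2 butterfly becomes one operation on a whole vector. The vec4 type and its helpers are hypothetical, portable stand-ins for SIMD registers and intrinsics. Whether the generator described here uses exactly this construct is not stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

#define NU 4   /* assumed vector length: four floats per SIMD register */

/* Portable stand-in for a SIMD register (hypothetical type). */
typedef struct { float lane[NU]; } vec4;

static vec4 vload(const float *p)
{
    vec4 v;
    for (size_t i = 0; i < NU; i++) v.lane[i] = p[i];
    return v;
}

static void vstore(float *p, vec4 v)
{
    for (size_t i = 0; i < NU; i++) p[i] = v.lane[i];
}

static vec4 vadd(vec4 a, vec4 b)
{
    vec4 r;
    for (size_t i = 0; i < NU; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

static vec4 vsub(vec4 a, vec4 b)
{
    vec4 r;
    for (size_t i = 0; i < NU; i++) r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

/* y = (DFT_2 (x) I_NU) x on 2*NU real values: the butterfly
   y0 = x0 + x1, y1 = x0 - x1 is applied to whole vectors. */
void dft2_tensor_I_nu(const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    vec4 a = vload(x);        /* x[0..3] */
    vec4 b = vload(x + NU);   /* x[4..7] */
    vstore(y, vadd(a, b));
    vstore(y + NU, vsub(a, b));
}
```

On real SPE code the helper functions would be replaced by SIMD intrinsics. The point of the sketch is only that no data shuffling is needed for this construct, which is why a rewriting system would try to reach it.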

SPL to Parallel Code
Natural parallel construct in SPL: independent, load-balanced, communication-free operation
[Diagram: input x, operator A, and output y partitioned across Processor 0 to Processor 3]
Parallelizing other constructs in SPL: permutations require message exchange (on-chip DMA communication)
Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
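
The contrast between the two kinds of construct can be sketched in portable C. The first function assumes the communication-free construct has the block-diagonal form y = (I_p ⊗ A) x, where each of p processors applies the same kernel A to its own contiguous block. The second applies a stride permutation and counts how many of a processor's outputs come from another processor's block, which is the data that would have to move by DMA. All names here (kernel_fn, butterfly2, tensor_I_A_worker, stride_perm_worker) are hypothetical, and the processors are simulated by calling the workers in turn.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*kernel_fn)(const float *in, float *out);

/* Hypothetical kernel A: a size-2 butterfly on real data. */
static void butterfly2(const float *in, float *out)
{
    out[0] = in[0] + in[1];
    out[1] = in[0] - in[1];
}

/* Work of processor spuid (0 <= spuid < p) for y = (I_p (x) A) x.
   Each processor touches only its own block: no communication. */
void tensor_I_A_worker(size_t p, size_t spuid, size_t block,
                       kernel_fn A, const float *x, float *y)
{
    assert(spuid < p);
    A(x + spuid * block, y + spuid * block);
}

/* Work of processor spuid for the stride permutation
   y[j1*n + j2] = x[j2*m + j1], with the m*n elements split into
   p equal chunks. Returns how many inputs lie in another
   processor's chunk, i.e. would need a message exchange. */
size_t stride_perm_worker(size_t p, size_t spuid, size_t m, size_t n,
                          const float *x, float *y)
{
    assert(p > 0 && spuid < p && m * n >= p && (m * n) % p == 0);
    size_t chunk = (m * n) / p;
    size_t remote = 0;
    for (size_t k = spuid * chunk; k < (spuid + 1) * chunk; k++) {
        size_t src = (k % n) * m + (k / n);
        y[k] = x[src];
        if (src / chunk != spuid)
            remote++;
    }
    return remote;
}
```

With m = 2, n = 4, and p = 2, each simulated processor reports that half of its inputs are remote, while the block-diagonal worker never reads outside its own block.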

SPL to Streaming Code
Streaming: overlapping computation with communication
  On-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)
Idea: tensor loops become multi-buffered loops
  i-th iteration: write A_{i-1}, compute A_i, read A_{i+1}
  (Trickier for other SPL constructs)
Useful for:
  Throughput-optimized code
  Large, out-of-chip sizes
Idea: rewrite algorithm at SPL level to achieve largest DMA packets
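
The multi-buffered loop schedule (write the previous block, compute the current one, read the next one) can be sketched as follows. This is a minimal portable model, not Cell code: dma_get and dma_put are hypothetical stand-ins built on memcpy, so they complete immediately, whereas real DMA transfers are asynchronous and would need a wait on their tags before a buffer is reused. BLOCK, kernel, and stream_blocks are also assumed names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4   /* elements per block; hypothetical packet size */

/* Hypothetical compute kernel: scales one block held in local store. */
static void kernel(const float *in, float *out)
{
    for (size_t i = 0; i < BLOCK; i++) out[i] = 2.0f * in[i];
}

/* Stand-ins for asynchronous DMA transfers (memcpy finishes at once). */
static void dma_get(float *local, const float *main_mem)
{
    memcpy(local, main_mem, BLOCK * sizeof(float));
}

static void dma_put(float *main_mem, const float *local)
{
    memcpy(main_mem, local, BLOCK * sizeof(float));
}

/* Applies kernel to nblocks blocks of x, writing y, with two input
   and two output buffers so that transfers can overlap compute. */
void stream_blocks(size_t nblocks, const float *x, float *y)
{
    float in_buf[2][BLOCK], out_buf[2][BLOCK];
    assert(x != NULL && y != NULL);
    if (nblocks == 0) return;

    dma_get(in_buf[0], x);                       /* prologue: read block 0 */
    for (size_t i = 0; i < nblocks; i++) {
        size_t cur = i % 2, other = (i + 1) % 2;
        if (i + 1 < nblocks)                     /* read block i+1 */
            dma_get(in_buf[other], x + (i + 1) * BLOCK);
        if (i > 0)                               /* write block i-1 */
            dma_put(y + (i - 1) * BLOCK, out_buf[other]);
        kernel(in_buf[cur], out_buf[cur]);       /* compute block i */
    }
    /* epilogue: write the last block */
    dma_put(y + (nblocks - 1) * BLOCK, out_buf[(nblocks - 1) % 2]);
}
```

In iteration i the compute step only touches the buffers with index i % 2, while both transfers use the other pair, so with truly asynchronous DMA the three activities would not conflict.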

Generating Cell Code
Transform (user specified)
Rewriting (tag guided)
Fast algorithm in SPL:
  Streamed from memory for throughput
  Load balanced across p SPEs
  SIMD kernel optimized for memory hierarchy
  All-to-all communication (on-chip)
Loop operations in ∑-SPL
Cell-specific optimized C code (intrinsics, DMA, etc.)

Generated Code Sample
[Slide annotations on the code: vectorized, DMA, parallelized]
DFT 2^16: 4,000+ lines of code!

    /* Complex-to-complex DFT size 64 on 2 SPEs */
    dft_c2c_64(float *X, float *Y, int spuid)
    {
        // Block 1 (IxA)L
        for (i = 0; i <= 7; i++)                 // Rightmost gather
        { DMA_GATHER(gath_func(X,i), gath_func(T1,i), 4) }   // uses spu_mfcdma()
        spu_mfcstat(MFC_TAG_UPDATE_ALL);         // Wait on gather
        // compute vectorized DFT kernel of size m
        for (i = 0; i <= 7; i++)                 // Scatter at interface
        { DMA_SCATTER(scat_func(T1,i), scat_func(T2,i), 4) }
        all_to_all_synchronization_barrier();    // uses mailbox msgs

        // Block 2 (AxI)
        /* Gather is a no operation since the scatter above accounted for it */
        // compute vectorized DFT kernel of size n
        for (i = 0; i <= 7; i++)                 // Leftmost scatter
        { DMA_SCATTER(scat_func(T1,i), scat_func(Y,i), 4) }
        all_to_all_synchronization_barrier();
    }

Problem Space: Options
Base (vectorized) SPE DFT: vectorization assumed
Parallelization: single DFT parallelized across multiple SPEs
Main memory operations (only for small DFTs)
Multiple independent DFTs on multiple SPEs
Latency optimized (default)
Multiple parallelized independent DFTs
Throughput, multibuffered
[Diagrams: “SPE DFT” blocks illustrating each option]

Problem Space: Combinations
Throughput-optimized usage scenarios
Latency-optimized usage scenarios
Parallel, multibuffered DFT
Single DFT from main memory
Independent DFTs multibuffered in parallel
[Diagrams: “SPE DFT” blocks illustrating each combination]
Devise rewrite rules for tags. Nestings describe all scenarios.

[Plot: DFT performance, single precision, IBM QS22; series: 1 SPE, 2 SPEs, 4 SPEs, 8 SPEs]

4.5x faster than FFTW, 1.63x faster than FFTC
[Plot: DFT performance; series: Spiral (1 SPE), Spiral (8 SPEs), FFTC, FFTW]

More Performance Results
Single-SPE DFT code
Split/interleaved complex formats
Non-2-power sizes
Double precision (PowerXCell 8i)
[Plots: series labeled Spiral, Mercury, Chow, IBM SDK]

Other Linear Transforms
Discrete sine and cosine transforms, DFT with real inputs (single SPE)
2-D DFTs
Out-of-core sizes
Limited to 2-D DFTs on 1 SPE (for now)
More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009.

Conclusion
Automatic generation of transform libraries
  High performance
  Variety of scenarios, formats
High performance on Cell requires: vectorization, multi-core parallelization, streaming, DMA code
Future processors likely to have similar paradigms, tradeoffs
Spiral approach:
  Common abstraction of transform, algorithm, architecture (SPL)
  Rewrite rules to go from transform to architecture
[Diagram labels: architecture space, algorithm]
