A Quantitative Analysis of Stream Algorithms on Raw Fabrics

Henry Hoffmann, Anant Agarwal
MIT CSAIL
Boston Area Architecture Conference, 21 January 2005

Introduction
- Raw is a tiled microarchitecture characterized by:
  - Low-latency, high-bandwidth networks
  - Relatively small local memories, far from large backing memories
  - Scalable hardware design allowing large Raw fabrics to be built
- Raw is one of many single-chip, tiled microarchitectures
  - Address growing concerns of wire delay and power consumption
- The Decoupled Systolic Architecture captures key features
  - Provides a theoretical tool to explore performance on tiled architectures
  - Allows performance characterization of algorithms
- This talk explores practical applications of the theoretical framework

Outline
- Decoupled Systolic Architecture and Stream Algorithms
- Stream Algorithms on Raw
- Experimental Methodology
- Results
- Conclusion

Stream Algorithms
- Decoupled Systolic Architecture
  - M(R) memory tiles: memory management units, the only tiles that can access memory other than registers
  - P(R) compute tiles: perform systolic computations, accessing only registers and networks
- Decoupled Systolic Algorithms
  - Efficiency: E(N, R) = C(N) / (T(N, R) * (P(R) + M(R)))
  - where N = problem size, R = length of array side, C = total number of operations, T = total number of time steps, P(R) + M(R) = total number of tiles
- Stream Algorithms: the class of decoupled systolic algorithms whose efficiency approaches 1 for large N and R
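The efficiency definition can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, the tile counts P(R) = R * R and M(R) = 4R are taken from the experimental setup described in this talk, and the upper bound assumes that each compute tile performs at most one operation per time step (an assumption, not something the slides state).

```python
def efficiency(total_ops, time_steps, compute_tiles, memory_tiles):
    """E = C / (T * (P + M)), the efficiency definition from the slide."""
    return total_ops / (time_steps * (compute_tiles + memory_tiles))


def tile_counts(r):
    """Tile counts for the talk's setup: R x R compute tiles, 4R memory tiles."""
    return r * r, 4 * r


def efficiency_upper_bound(r):
    """Bound on E under the ASSUMPTION of at most one op per compute tile per step.

    Then C <= T * P, so E <= P / (P + M) = R / (R + 4) for this setup.
    """
    p, m = tile_counts(r)
    return p / (p + m)


# The bound climbs toward 1 as the array side R grows.
for r in (4, 8, 16, 32):
    print(r, round(efficiency_upper_bound(r), 3))
```

Under these assumptions the memory tiles become a vanishing fraction of the fabric as R grows, which is what allows the efficiency to approach 1.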

Methodology
- We use the cycle-accurate Raw simulator
  - Assume a 425 MHz clock (maximum Raw clock speed)
- Raw emulates the decoupled systolic architecture
  - Raw tiles act as compute tiles and do not use the local data cache (D$)
  - Augment the Raw simulator with memory tiles on the periphery
  - These memory tiles access all data
- Implement stream algorithms for:
  - Matrix multiplication
  - Triangular solver
  - LU factorization
  - QR factorization
- Measure performance as a function of:
  - N: problem size (N x N matrices)
  - R: array dimensions (R x R array of compute tiles + 4R memory tiles)

Results on Raw Prototype
- Fix R = 4 and measure computation rate for kernels
- Peak flop rate: 6.8 GFLOPS
[Figure: computation rate (GFLOPS) vs. problem size N]
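The quoted peak rate is consistent with a simple back-of-the-envelope calculation. The sketch below assumes one floating-point operation per compute tile per cycle; that rate is an assumption inferred from the 6.8 GFLOPS figure and the 425 MHz clock, and the names are hypothetical.

```python
CLOCK_HZ = 425e6              # clock speed stated in the methodology
FLOPS_PER_TILE_PER_CYCLE = 1  # ASSUMPTION: one flop per compute tile per cycle


def peak_gflops(r):
    """Peak rate of an R x R array of compute tiles, in GFLOPS."""
    return r * r * FLOPS_PER_TILE_PER_CYCLE * CLOCK_HZ / 1e9


print(peak_gflops(4))  # 6.8 for the 4 x 4 prototype
```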

Results for Large Raw Fabrics
- Scale matrix multiplication and QR factorization, N = 1024
- Examine computation rate and speedup vs. R = 4
[Figures: computation rate (GFLOPS) vs. R; speedup vs. R = 4, plotted against R]

Conclusions
- Raw provides scalable hardware
- Stream algorithms provide scalable software
- Together yield high-performance implementations
  - Matrix multiply
    - Close to ideal speedup, rapidly approaches peak performance
    - On 1024 Raw tiles, sustained throughput of 414 GFLOPS
  - QR factorization
    - Parallel efficiency of 75% on 1024 Raw tiles
    - Sustained throughput of 294 GFLOPS

Future Work
- Automatic generation of stream algorithms
  - Experimenting with a template-based approach
- Implementation of an entire application
  - Candidate apps: MPEG encode/decode, DSP, scientific simulation
- Extend stream algorithm framework
  - Develop a robust, formal notion of stream algorithms
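To put the sustained throughput figures from the conclusions in context, the sketch below compares them with an estimated peak for 1024 tiles. It assumes that all 1024 tiles are compute tiles running at 425 MHz and that each completes one floating-point operation per cycle; both are assumptions, and the function name is hypothetical.

```python
CLOCK_GHZ = 0.425  # 425 MHz clock from the methodology


def fraction_of_peak(sustained_gflops, compute_tiles):
    """Sustained rate divided by peak, ASSUMING one flop per tile per cycle."""
    peak_gflops = compute_tiles * CLOCK_GHZ
    return sustained_gflops / peak_gflops


# Sustained rates quoted in the conclusions for 1024 Raw tiles.
for kernel, gflops in (("matrix multiply", 414), ("QR factorization", 294)):
    print(kernel, round(fraction_of_peak(gflops, 1024), 3))
```

Under these assumptions matrix multiply reaches roughly 95% of peak and QR factorization roughly 68%. The fraction of peak is not the same measure as the 75% parallel efficiency quoted for QR, which appears to be based on speedup relative to R = 4.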
