A Quantitative Analysis of Stream Algorithms on Raw Fabrics

1 A Quantitative Analysis of Stream Algorithms on Raw Fabrics
Henry Hoffmann, Anant Agarwal
MIT CSAIL
Boston Area Architecture Conference, 21 January 2005

2 Introduction
Raw is a tiled microarchitecture characterized by:
  Low latency, high bandwidth networks
  Relatively small local memories, far from large backing memories
  Scalable hardware design allowing large Raw fabrics to be built
Raw is one of many single-chip, tiled microarchitectures
  Address growing concerns of wire delay and power consumption
The Decoupled Systolic Architecture captures key features
  Provides a theoretical tool to explore performance on tiled architectures
  Allows performance characterization of algorithms
This talk explores practical applications of the theoretical framework

3 Outline
Decoupled Systolic Architecture and Stream Algorithms
Stream Algorithms on Raw
Experimental Methodology
Results
Conclusion

4 Stream Algorithms
Decoupled Systolic Architecture
  P(R) compute tiles: perform systolic computations, accessing only registers and networks
  M(R) memory tiles: memory management units, the only tiles that can access memory other than registers
Decoupled Systolic Algorithms
  Efficiency:
  $$E(N, R) = \frac{C(N)}{T(N, R) \cdot (P(R) + M(R))}$$
  where N = problem size, R = length of array side, C = total number of operations, T = total number of time steps, P(R) + M(R) = total number of tiles
Stream Algorithms: the class of decoupled systolic algorithms whose efficiency approaches 1 for large N and R
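As a rough illustration of the efficiency definition on this slide, the Python sketch below evaluates E(N, R) from operation, time-step, and tile counts. The tile counts P(R) = R^2 and M(R) = 4R follow the methodology slide (an R x R array of compute tiles plus 4R memory tiles). The function names are invented for this example, and the upper bound assumes each compute tile completes at most one operation per time step, which the slides do not state explicitly.

```python
def efficiency(ops, steps, compute_tiles, memory_tiles):
    """E = C / (T * (P + M)), following the slide's definition."""
    return ops / (steps * (compute_tiles + memory_tiles))


def tile_counts(r):
    """Tile counts for the simulated fabric: R x R compute tiles, 4R memory tiles."""
    return r * r, 4 * r


def efficiency_bound(r):
    """Upper bound on E from tile counts alone.

    Assumption (not from the slides): a compute tile finishes at most one
    operation per time step, so T >= C / P and therefore E <= P / (P + M).
    """
    p, m = tile_counts(r)
    return p / (p + m)


if __name__ == "__main__":
    for r in (4, 8, 16, 32):
        print(r, efficiency_bound(r))
```

Under these assumptions the bound simplifies to R / (R + 4): 0.5 for R = 4 and about 0.89 for R = 32, which is one way to see why efficiency can approach 1 only as R grows.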

5 Methodology
We use the cycle-accurate Raw simulator
  Assume a 425 MHz clock, the maximum Raw clock speed
Raw emulates the decoupled systolic architecture
  Raw tiles act as compute tiles and do not use the local D$
  Augment the Raw simulator with memory tiles on the periphery
  These memory tiles access all data
Implement stream algorithms for
  Matrix multiplication
  Triangular solver
  LU factorization
  QR factorization
Measure performance as a function of
  N: problem size (N x N matrices)
  R: array dimensions (R x R array of compute tiles + 4R memory tiles)

6 Results on Raw Prototype
Fix R = 4 and measure computation rate for kernels
Peak flop rate: 6.8 GFLOPS
[Plot: computation rate (GFLOPS) vs. N]
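The 6.8 GFLOPS peak is consistent with a simple back-of-the-envelope calculation, sketched below in Python. It assumes that each of the R x R compute tiles retires one floating-point operation per cycle at the 425 MHz clock given in the methodology; the one-flop-per-cycle figure is an assumption chosen because it reproduces the slide's number, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(r, clock_mhz=425.0, flops_per_tile_per_cycle=1.0):
    """Peak rate of an R x R array of compute tiles, in GFLOPS.

    flops_per_tile_per_cycle = 1.0 is an assumption, not a figure
    stated on the slides.
    """
    compute_tiles = r * r
    return compute_tiles * clock_mhz * flops_per_tile_per_cycle / 1000.0


if __name__ == "__main__":
    print(peak_gflops(4))  # 16 tiles at 425 MHz -> 6.8
```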

7 Results for Large Raw Fabrics
Scale Matrix Multiplication and QR Factorization, N = 1024
Examine computation rate and speedup vs. R = 4
[Plots: speedup vs. R = 4 as a function of R; computation rate (GFLOPS) as a function of R]
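A minimal Python sketch of how speedup relative to the R = 4 baseline could be computed from measured rates is given below. The helper names are hypothetical, and the ideal speedup is assumed to scale with the number of compute tiles only (R^2), ignoring memory tiles; the slides do not spell out the exact definition used for the plots.

```python
def speedup(rate_gflops, baseline_rate_gflops):
    """Measured speedup of a larger fabric over the baseline fabric."""
    return rate_gflops / baseline_rate_gflops


def ideal_speedup(r, baseline_r=4):
    """Ratio of compute-tile counts (assumed definition of ideal scaling)."""
    return (r * r) / (baseline_r * baseline_r)


def parallel_efficiency(measured_speedup, r, baseline_r=4):
    """Fraction of the ideal speedup actually achieved."""
    return measured_speedup / ideal_speedup(r, baseline_r)


if __name__ == "__main__":
    for r in (8, 16, 32):
        print(r, ideal_speedup(r))
```

Under this assumption, a 32 x 32 fabric has an ideal speedup of 64 over the 4 x 4 prototype.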

8 Conclusions
Raw provides scalable hardware
Stream algorithms provide scalable software
Together yield high-performance implementations
  Matrix multiply
    Close to ideal speedup, rapidly approaches peak performance
    On 1024 Raw tiles, sustained throughput of 414 GFLOPS
  QR Factorization
    Parallel efficiency of 75% on 1024 Raw tiles
    Sustained throughput of 294 GFLOPS
Future Work
  Automatic generation of stream algorithms
    Experimenting with template-based approach
  Implementation of an entire application
    Candidate apps: MPEG encode/decode, DSP, scientific simulation
  Extend stream algorithm framework
    Develop a robust, formal notion of stream algorithms
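To put the sustained-throughput figures in context, the Python sketch below expresses them as a fraction of a nominal peak. It assumes that "1024 Raw tiles" means 1024 compute tiles, each retiring one floating-point operation per cycle at 425 MHz; both points are assumptions rather than statements from the slides, and `fraction_of_peak` is a hypothetical helper.

```python
CLOCK_GHZ = 0.425  # 425 MHz clock, from the methodology slide


def fraction_of_peak(sustained_gflops, tiles, flops_per_tile_per_cycle=1.0):
    """Sustained rate divided by nominal peak (assumed 1 flop/tile/cycle)."""
    peak = tiles * CLOCK_GHZ * flops_per_tile_per_cycle
    return sustained_gflops / peak


if __name__ == "__main__":
    # Sustained rates reported on the slide for 1024 tiles.
    print("matrix multiply:", fraction_of_peak(414.0, 1024))
    print("QR factorization:", fraction_of_peak(294.0, 1024))
```

With these assumptions, matrix multiply runs at roughly 95% of nominal peak and QR factorization at roughly 68%. The 75% parallel efficiency quoted for QR is a different measure and is not recomputed here.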

