Mapping DSP algorithms to a general purpose out-of-order processor

ECE 734
Ilhyun Kim, Donghyun Baik

Outline
  Introduction
  Out-of-order execution overview
  Dependence graph change for mapping to GPP
  To-do list
  Expected results

Introduction
Why DSP applications are implemented on general-purpose processors (GPPs)
  Lower development cost
    Commodity part, lower maintenance cost, faster turn-around
  GPP meets the performance requirement
    GPPs have become faster; in the past only DSP chips could do it
Problems with algorithm transformations
  Used for faster operations / efficient hardware
  On a GPP, there is no control over hardware configurations
  Some of them are effective while others are not
Problems with extracting parallelism
  Duplicated efforts on each of the layers (source code, compiler, processor)
  Narrow machine scope over independent operations
What are the efficient ways to map an algorithm to a GPP?
  Understanding how a GPP executes instructions
  How can software improve the performance?

Out-of-order execution overview
Dynamic parallelism extraction
  Instruction re-ordering
  Dynamically searching for independent operations within a limited scope
  Trying to keep all available hardware resources busy
[Figure: the loop nest below mapped onto the functional units]
  for(i=1~m)
    for(j=1~n)
      c(i,j)=c(i,j-1) + c(i-1,j)
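A minimal Python sketch of the parallelism the hardware is searching for in this loop nest. It assumes unit latency and unlimited functional units, which are idealizations and not from the slides; the function names are hypothetical. For each c(i,j) it computes the earliest step at which the addition can execute, then counts how many additions are independent at each step.

```python
def dependence_levels(m, n):
    """Earliest step at which each c(i, j) can be computed.

    Assumes unit latency and unlimited functional units (idealization).
    Row 0 and column 0 hold boundary values that are ready at step 0.
    """
    level = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # c(i,j) = c(i,j-1) + c(i-1,j): wait for both operands.
            level[i][j] = 1 + max(level[i][j - 1], level[i - 1][j])
    return level


def parallelism_profile(m, n):
    """Number of independent additions available at each step."""
    level = dependence_levels(m, n)
    profile = [0] * level[m][n]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            profile[level[i][j] - 1] += 1
    return profile


if __name__ == "__main__":
    print(parallelism_profile(3, 4))
```

The independent operations lie on anti-diagonals of the iteration space, while the program issues them row by row. Out-of-order hardware can only exploit this if operations from neighbouring rows are in scope at the same time.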

Understanding DG change for mapping to GPP
[Figure: dependence graph of the loop nest for(i=1~m) for(j=1~n) c(i,j)=c(i,j-1) + c(i-1,j), with instruction windows overlaid. Legend: computation, mem access, control-related.]
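To make the effect of a limited instruction window concrete, here is a small Python sketch of window-limited issue for the same loop nest. It is a toy model under stated assumptions, not the authors' simulator: only the additions are modelled (no memory or control instructions), every instruction has unit latency, and a window entry is freed as soon as its instruction issues. The names simulate and recurrence_deps are hypothetical.

```python
def recurrence_deps(m, n):
    """Producers of each addition, with instructions in row-major order."""
    deps = []
    for i in range(m):
        for j in range(n):
            d = []
            if j > 0:
                d.append(i * n + (j - 1))    # c(i, j-1)
            if i > 0:
                d.append((i - 1) * n + j)    # c(i-1, j)
            deps.append(d)
    return deps


def simulate(deps, window, width):
    """Cycles needed to issue every instruction.

    window: number of un-issued instructions the hardware can look at.
    width:  number of instructions that can issue per cycle.
    """
    n = len(deps)
    issue_cycle = [None] * n
    next_fetch = 0
    in_window = []          # un-issued instructions, in program order
    cycle = 0
    issued = 0
    while issued < n:
        # Refill the window in program order.
        while len(in_window) < window and next_fetch < n:
            in_window.append(next_fetch)
            next_fetch += 1
        # An instruction is ready if all producers issued in earlier cycles.
        ready = [k for k in in_window
                 if all(issue_cycle[p] is not None and issue_cycle[p] < cycle
                        for p in deps[k])]
        for k in ready[:width]:
            issue_cycle[k] = cycle
            in_window.remove(k)
            issued += 1
        cycle += 1
    return cycle


if __name__ == "__main__":
    deps = recurrence_deps(4, 4)
    for w in (1, 4, 16):
        print(w, simulate(deps, window=w, width=4))
```

For the 4x4 case, a window that holds all 16 additions reaches the 7-cycle dependence bound, and a 1-entry window degenerates to 16 sequential cycles. Intermediate window sizes fall in between, which is the "narrow machine scope" problem in miniature.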

To-do list
Infrastructure
  Building a perfect machine model simulator that keeps track of only computations in the algorithm, measuring ideal execution time of a compiled binary assuming perfect parallelism
  Building a profile tool that locates an instruction that we are interested in among instructions in the binary
The effect of single assignment
Characterizations on various machine configurations
The effect of unfolding
The effect of SIMD parallelism
Optimization techniques for the Alpha architecture based on characterization data
Optimizing an existing DSP application: MPEG-2 decoder
[Figure: machine configuration. 20-instruction integer window, 15-instruction FP window; 2 int, 2 agen/mem, 2 FP units]
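One way to read the "perfect machine model" item is as a dataflow-limit calculation: the ideal execution time is the length of the longest chain of true dependences. The Python sketch below illustrates that idea under assumptions that are not in the slides: the trace format (dest, sources, latency) and the name ideal_cycles are hypothetical, and the model assumes unlimited functional units, an unbounded window, and perfect renaming.

```python
def ideal_cycles(trace):
    """Dataflow-limit execution time of a trace.

    trace: list of (dest, sources, latency) tuples in program order.
    Only true (read-after-write) dependences are honoured, so reusing
    a destination name never delays a later, unrelated instruction.
    """
    ready_at = {}    # name -> cycle at which its latest value is available
    finish = 0
    for dest, sources, latency in trace:
        # Inputs never written in the trace are treated as ready at cycle 0.
        start = max((ready_at.get(s, 0) for s in sources), default=0)
        ready_at[dest] = start + latency
        finish = max(finish, ready_at[dest])
    return finish


if __name__ == "__main__":
    # Same computation twice: once recycling the temporary "t",
    # once in single-assignment form with t1 and t2.
    recycled = [("t", ["a", "b"], 1), ("x", ["t"], 1),
                ("t", ["c", "d"], 1), ("y", ["t"], 1)]
    single = [("t1", ["a", "b"], 1), ("x", ["t1"], 1),
              ("t2", ["c", "d"], 1), ("y", ["t2"], 1)]
    print(ideal_cycles(recycled), ideal_cycles(single))
```

Under this model the recycled and single-assignment traces take the same number of cycles, because only true dependences are tracked. A real simulator would also have to extract the computation instructions from the compiled binary, which is what the profile tool in the list is for.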

Expected Results
Single assignment transformation doesn’t work
  Rather, try to recycle storage space whenever possible
  HW-based single assignment (register renaming) is already performed
Unfolding transformation
  It works on iteration-independent loops w/ trivial computations
  It reduces the loop index overhead (even on iteration-dependent loops)
  Do not unfold loops w/ non-trivial computations
SIMD parallelism to reduce memory communications
  Alpha doesn’t support SIMD instruction sets, but it has a 64-bit datapath, and instructions can read/write 64 bits at a time by splitting/merging narrow words
  There are more computing units (4) than memory units (2): reducing memory operations helps performance
Performance improvement of MPEG-2 decoder based on the optimizations that we applied
[Figure: machine configuration. 20-instruction integer window, 15-instruction FP window; 2 int, 2 agen/mem, 2 FP units]
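The splitting/merging of narrow words can be sketched as "SIMD within a register". The Python model below packs four 16-bit values into one 64-bit word and adds them lane by lane with ordinary integer operations. The lane width, the mask constants, and the function names are assumptions chosen for illustration; on the real machine the same idea would be written with 64-bit loads, logical operations, and adds.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1
HIGH = 0x8000_8000_8000_8000    # top bit of every 16-bit lane
LOW = 0x7FFF_7FFF_7FFF_7FFF     # remaining 15 bits of every lane


def pack(values):
    """Merge four 16-bit values into one 64-bit word (lane 0 is lowest)."""
    word = 0
    for k, v in enumerate(values):
        word |= (v & LANE_MASK) << (k * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into four 16-bit values."""
    return [(word >> (k * LANE_BITS)) & LANE_MASK for k in range(LANES)]


def packed_add(a, b):
    """Add four 16-bit lanes at once, wrapping modulo 2**16 per lane.

    The low 15 bits are added normally; the top bit of each lane is
    fixed up with XOR so no carry leaks into the neighbouring lane.
    """
    return ((a & LOW) + (b & LOW)) ^ ((a ^ b) & HIGH)


if __name__ == "__main__":
    x = [1, 0xFFFF, 0x8000, 1234]
    y = [2, 1, 0x8000, 4321]
    print(unpack(packed_add(pack(x), pack(y))))
```

With this layout, one 64-bit load and one 64-bit store move four elements, so the number of memory operations per element drops by a factor of four while the extra masking work lands on the more plentiful integer units.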
