Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability (University of Michigan, Electrical Engineering and Computer Science)

1 Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability
Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan, Electrical Engineering and Computer Science
April 8, 2008

2 Introduction
Emerging applications have high performance, cost, and energy demands
–H.264, wireless, software radio, signal processing
–10-100 Gops required
–200 mW power budget
Applications are dominated by tight loops processing large amounts of streaming data
[Image: iPhone board]

3 Loop Accelerators
[Figure: a loop in C code mapped to hardware; unit labels LD, +/-, *]

4 Hardware Implementations
Customization gets order-of-magnitude performance and efficiency wins
–Viterbi: 100x speedup vs. ARM9
[Figure: design space with axes "Efficiency, Performance" and "Flexibility"; labeled points: General Purpose Processors, DSPs, CGRAs, FPGAs, Multifunction Loop Accelerators, Loop Accelerators/ASICs]

5 What About Programmability?
Software changes – bug fixes, evolving standards
dct_8x8() from H.264 reference implementation, shown for Version 13.0, Version 13.1, and Version 13.2

Version 13.0 (excerpt; the transcript ends before the loop's closing brace):

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i=pos_scan[coeff_ctr][0];
        j=pos_scan[coeff_ctr][1];
        run++;
        ilev=0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs (m7[i]);
        if (img->AdaptiveRounding) {
            fadjust8x8[j][block_x+i] = 0;
        }
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost += MAX_VALUE;
                img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
                img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff]=-1;
            } else {
                *coeff_cost += MAX_VALUE;
                ACLevel[scan_pos  ] = isignab(level,m7[i]);
                ACRun  [scan_pos++] = run;
                run=-1; // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }

Versions 13.1 and 13.2, as transcribed, match Version 13.0 except for the two img->cofAC stores, which index with pl_off instead of b8+pl_off:

    img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]  ] = isignab(level,m7[i]);
    img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];

6 Programmable Loop Accelerator
Reusable hardware → reduced NRE costs
Generalize accelerator without losing efficiency
[Figure: design space with axes "Efficiency, Performance" and "Flexibility"; labeled points: General Purpose Processors, DSPs, CGRAs, FPGAs, Multifunction Loop Accelerators, Programmable Loop Accelerators, Loop Accelerators/ASICs]

7 Flexible Accelerators
Generalize accelerator architecture
Map new loops to existing hardware
[Figure: labels Loop 1, Synthesis System, Hardware, Loop 2, Compiler]

8 Loop Accelerator Architecture
Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity
[Figure: datapath diagram with labels Point-to-point Connections, +, &, MEM, Local Mem, FSM, Control signals, CRF, BR]

9 Programmable Accelerator Architecture
Generalize architectural features that limit programmability
~50% area overhead vs. non-programmable accelerator
[Figure: datapath diagram with labels Point-to-point Connections, +/-, &/|, MEM, Local Mem, Control Memory, Control signals, CRF, BR, RR, Literals, Bus]

10 Mapping Loops onto Hardware

              Processor               Accelerator
FUs           General-purpose         Customized
Storage       Central register file   Distributed registers
Connectivity  Homogeneous             Point-to-point

[Figure: processor and accelerator diagrams; labels ALU, CRF, LD, +/-, *, and widths 8, 8, 16]

11 Scheduling Example
[Figure: step-by-step modulo scheduling with II=2; resource columns ADDER1, ADDER2, MEM; time rows 0 to 4; operations LD1, +2, +3, +4, +5 are placed over successive steps, with one remaining placement marked "?"]
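The resource side of the scheduling example can be sketched in a few lines. This is an illustrative sketch only, not the authors' scheduler: the function name `modulo_conflicts` and the specific placements are assumptions, with unit and operation names borrowed from the slide. It shows the core rule of a modulo reservation table: two operations collide when they use the same unit in the same slot, where the slot is the issue time modulo II.

```python
from typing import Dict, List, Tuple


def modulo_conflicts(placements: Dict[str, Tuple[str, int]], ii: int) -> List[Tuple[str, str]]:
    """Return (earlier_op, later_op) pairs that collide in the modulo reservation table.

    placements maps an op name to (functional unit, absolute issue time).
    Two ops collide when they share a unit and the same slot (time mod II).
    """
    table: Dict[Tuple[str, int], str] = {}  # (unit, slot) -> first op placed there
    conflicts: List[Tuple[str, str]] = []
    for op, (unit, time) in placements.items():
        key = (unit, time % ii)
        if key in table:
            conflicts.append((table[key], op))
        else:
            table[key] = op
    return conflicts


# Hypothetical placement (not taken from the slide's figure).
placements = {
    "LD1": ("MEM", 0),
    "+2": ("ADDER1", 1),
    "+3": ("ADDER2", 1),
    "+4": ("ADDER1", 2),
    "+5": ("ADDER2", 3),  # 3 % 2 == 1 on ADDER2, the slot already held by +3
}
print(modulo_conflicts(placements, ii=2))
```

With II=2, an operation issued at time 3 reuses slot 1 of its unit in every iteration, which is why +5 collides with +3 in this made-up placement even though their absolute times differ.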

12 Modulo Scheduling for LAs
Large search space, few solutions
Op-centric approaches unable to find solutions
Satisfiability Modulo Theories (SMT) formulation to solve linear and SAT constraints simultaneously
[Figure: compilation flow with labels Loop, Machine description, Move Insertion, SMT Scheduling, Register Allocation, Control Signals, and an arc labeled Increment II]
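The "Increment II" arc suggests an outer loop that retries scheduling at successively larger initiation intervals. A minimal sketch of such a driver follows, assuming a hypothetical `try_schedule(ii)` callback that returns a schedule or `None`; the function names and the `max_ii` cutoff are illustrative assumptions, not details from the slides.

```python
from typing import Callable, Optional, Tuple, TypeVar

Schedule = TypeVar("Schedule")


def schedule_with_increasing_ii(
    try_schedule: Callable[[int], Optional[Schedule]],
    min_ii: int,
    max_ii: int,
) -> Optional[Tuple[int, Schedule]]:
    """Try each II from min_ii to max_ii and return the first (ii, schedule) that succeeds."""
    for ii in range(min_ii, max_ii + 1):
        schedule = try_schedule(ii)  # hypothetical stand-in for the SMT scheduling step
        if schedule is not None:
            return ii, schedule
    return None  # no feasible schedule within the allowed II range


# Toy stand-in scheduler: pretends the loop only fits once II reaches 3.
def toy_scheduler(ii: int) -> Optional[dict]:
    return {"ii": ii} if ii >= 3 else None


print(schedule_with_increasing_ii(toy_scheduler, min_ii=2, max_ii=8))
```

Starting from the smallest candidate II means the first success is also the best-performing schedule the search can find within the given range.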

13 SMT Formulation
Boolean variables $X_{i,f,t}$ are true if operation $i$ is scheduled on FU $f$ at time slot $t$.
Integer variables $S_i$ represent the stage of operation $i$.
For a dependence edge $i \to j$ with latency $lat(i)$ and distance $dist(i,j)$:
$sched\_time(j) \ge sched\_time(i) + lat(i) - dist(i,j) \times II$
$(X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + lat(i) - dist(i,j) \times II)$
More details in paper
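The dependence constraint on this slide can be checked directly for a candidate placement. The sketch below is plain Python rather than an SMT encoding; the function name `dependence_ok` and its parameter names are assumptions chosen to mirror the slide's symbols (stage S, slot t, lat, dist, II).

```python
def dependence_ok(stage_i: int, slot_i: int, stage_j: int, slot_j: int,
                  lat_i: int, dist_ij: int, ii: int) -> bool:
    """Check the dependence i -> j for one candidate placement.

    Absolute schedule time is stage * II + slot; the consumer j must start no
    earlier than the producer i plus its latency, relaxed by dist * II for
    dependences that cross loop iterations.
    """
    time_i = stage_i * ii + slot_i
    time_j = stage_j * ii + slot_j
    return time_j >= time_i + lat_i - dist_ij * ii


# Producer at stage 0, slot 0 with latency 2, II = 2.
print(dependence_ok(0, 0, 1, 0, lat_i=2, dist_ij=0, ii=2))  # consumer at time 2
print(dependence_ok(0, 0, 0, 1, lat_i=2, dist_ij=0, ii=2))  # consumer at time 1
print(dependence_ok(0, 0, 0, 1, lat_i=2, dist_ij=1, ii=2))  # same, but loop-carried
```

The three calls print True, False, and True: a consumer at time 1 is too early for a latency-2 producer within one iteration, but becomes legal when the dependence spans one iteration, because the bound drops by one II.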

14 Measuring Programmability
How well can different loops be mapped onto the same hardware?
Performance matters – how much does II increase?
Need set of loops with different degrees of similarity
[Figure: labels Loop, ?, Hardware, FU]

15 Graph Perturbation
Synthetically generated graphs
More perturbations → less similar to original graph
Iteratively apply random transformations:
–Add edge between existing operations
–Add edge with new producer
–Add edge with new consumer
–Remove edge
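The four transformations listed above can be sketched as a small random-rewrite loop over a dataflow graph. This is an illustration under stated assumptions, not the authors' generator: the graph is a list of node names plus a set of (producer, consumer) edges, each transformation is picked uniformly, new nodes get hypothetical names of the form `opN`, and no check is made for cycles or operation types.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (producer, consumer)


def fresh_name(nodes: List[str]) -> str:
    """Pick a node name of the form opN that is not already in use."""
    k = len(nodes)
    while f"op{k}" in nodes:
        k += 1
    return f"op{k}"


def perturb(nodes: List[str], edges: Set[Edge], steps: int, seed: int = 0):
    """Apply `steps` random transformations to a non-empty graph; return the new (nodes, edges)."""
    rng = random.Random(seed)
    nodes, edges = list(nodes), set(edges)
    for _ in range(steps):
        kind = rng.choice(["add_edge", "new_producer", "new_consumer", "remove_edge"])
        if kind == "add_edge" and len(nodes) >= 2:
            src, dst = rng.sample(nodes, 2)  # edge between two existing operations
            edges.add((src, dst))
        elif kind == "new_producer":
            dst = rng.choice(nodes)
            new = fresh_name(nodes)
            nodes.append(new)
            edges.add((new, dst))  # new operation feeds an existing one
        elif kind == "new_consumer":
            src = rng.choice(nodes)
            new = fresh_name(nodes)
            nodes.append(new)
            edges.add((src, new))  # existing operation feeds a new one
        elif kind == "remove_edge" and edges:
            edges.remove(rng.choice(sorted(edges)))
    return nodes, edges


base_nodes = ["ld", "add", "mul"]
base_edges = {("ld", "add"), ("add", "mul")}
print(perturb(base_nodes, base_edges, steps=5, seed=1))
```

Raising `steps` yields graphs that drift further from the original, which matches the slide's point that more perturbations mean less similarity.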

16 Results – Perturbed Graphs
[Figure: average II increase for perturbed graphs, with loops grouped as MPEG4, Signal processing, Image, and Math; each loop is annotated with its base II, which survives in the transcript only as the fused digit string 4872444469]

17 Results – Restricted Datapath

18 Conclusion
Increase flexibility of customized hardware without sacrificing performance, efficiency
Successfully map loops to heterogeneous hardware
Compile times of 5 minutes – 1 hour
Software changing faster than hardware → patchable ASIC

19 Questions?

21 Results – Cross Compilation

