
1 Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek*
*Seoul National University, **UNIST (Ulsan National Institute of Science & Technology)
ARC, March 21, 2012, Hong Kong

2 Reconfigurable Architecture
 Reconfigurable architecture
 High performance
 Flexible (cf. ASIC)
 Energy efficient (cf. GPU)
Source: ChipDesignMag.com

3 Coarse-Grained Reconfigurable Architecture
 Coarse-Grained RA
 Word-level granularity
 Dynamic reconfigurability
 Simpler to compile
 Execution model
[Figure: main processor and CGRA sharing main memory through a DMA controller; examples: MorphoSys, ADRES]

4 Application Mapping
 Place and route the DFG on the PE-array mapping space
 Several constraints must be satisfied:
 Nodes must be mapped to PEs with the right functionality
 Data transfer between nodes must be guaranteed
 Resource consumption should be minimized for performance
[Figure: compilation flow — the front-end lowers the application to IR; a partitioner separates sequential code from loops; sequential code follows conventional C compilation to assembly, while loops go through DFG generation and place & route (guided by architecture parameters) to produce configurations; an extended assembler merges both into the executable plus configuration]

5 Software Pipelining
 Modulo scheduling-based mapping
[Figure: a 5-node DFG reading A[i] and B[i] and writing C[i], mapped onto PE0-PE3; a new iteration starts every II = 2 cycles]
II: Initiation Interval
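The overlap on this slide can be sketched numerically: under modulo scheduling, iteration i of a node scheduled at time t issues at cycle i * II + t. A minimal sketch, assuming a hypothetical per-node schedule for the 5-node DFG (the node start times below are illustrative, not taken from the paper):

```python
# Toy illustration of software pipelining with II = 2 (from the slide).
# node_time maps each DFG node to its start cycle within one iteration
# (hypothetical schedule, for illustration only).
II = 2
node_time = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3}

def issue_cycle(node, iteration, ii=II):
    """Cycle at which `node` of `iteration` issues under modulo scheduling."""
    return iteration * ii + node_time[node]

# Iteration 1 starts only II cycles after iteration 0, so the two
# iterations overlap: iteration 0 occupies cycles 0-3, iteration 1 cycles 2-5.
print(issue_cycle(0, 0))  # 0
print(issue_cycle(0, 1))  # 2
print(issue_cycle(4, 0))  # 3
```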

6 Problem: Scalability
 Modulo scheduling suffers several problems on a large-scale CGRA:
 Lack of parallelism: general applications have limited ILP
 Configuration size (in the unrolling case)
 Placement and routing must search a very large mapping space
 Skyrocketing compilation time
As a result, CGRAs remain at 4x4 or 8x8 at the most.

7 Overview
 Background
 SIMD Reconfigurable Architecture (SIMD RA)
 Mapping on SIMD RA
 Evaluation

8 SIMD Reconfigurable Architecture
 Consists of multiple identical parts, called cores
 Identical, so configurations can be reused across cores
 At least one load-store PE in each core
[Figure: cores 1-4 connected to memory banks 1-4 through a crossbar switch]

9 Advantages of SIMD RA
 More iterations executed in parallel
 Scales with the PE-array size
 Short compilation time thanks to the small mapping space
 Achieves denser scheduled configurations
 Higher utilization and performance
 The loop must not have loop-carried dependences
[Figure: timeline comparison — one large core pipelines iterations 0-5, while four cores run iterations 0-11 in parallel in the same time]
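The parallelism argument can be sketched with assumed numbers: four small cores, each running its own pipelined copy of the loop, can finish sooner than one large core even if each small core needs a larger II (the IIs below are illustrative, not measured):

```python
import math

# Rough cycle count for a pipelined loop: each core runs its share of
# iterations, completing one iteration every II cycles (prologue/epilogue
# ignored for simplicity — a sketch, not a performance model).
def total_cycles(n_iter, ii, n_cores=1):
    per_core = math.ceil(n_iter / n_cores)
    return per_core * ii

# 12 iterations: one large core at II=2 vs four small cores at II=3 each.
print(total_cycles(12, 2))             # large core: 24 cycles
print(total_cycles(12, 3, n_cores=4))  # four cores: 9 cycles
```

Even with a worse per-core II, running iterations in parallel across cores wins once the iteration count is large enough.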

10 Overview
 Background
 SIMD Reconfigurable Architecture (SIMD RA)
 Bank Conflict Minimization in SIMD RA
 Evaluation

11 Problems of SIMD RA Mapping
 New mapping problem: iteration-to-core mapping
 The iteration mapping affects performance:
 it is tied to the data mapping
 it affects the number of bank conflicts

for(i=0 ; i<15 ; i++) {
  B[i] = A[i] + B[i];
}

[Figure: 15 iterations distributed over cores 1-4]
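The two iteration-to-core assignments examined on the next slides can be sketched as index functions (cores are 0-indexed here; this is an illustrative sketch, not the paper's implementation):

```python
import math

NUM_CORES = 4
ITERATIONS = 15  # the 15-iteration loop from the slide

def sequential_core(i, n_iter=ITERATIONS, n_cores=NUM_CORES):
    """Blocked assignment: iterations 0-3 -> core 0, 4-7 -> core 1, ..."""
    per_core = math.ceil(n_iter / n_cores)
    return i // per_core

def interleaved_core(i, n_cores=NUM_CORES):
    """Interleaved assignment: iterations 0,4,8,12 -> core 0; 1,5,9,13 -> core 1; ..."""
    return i % n_cores

print([sequential_core(i) for i in range(ITERATIONS)])
print([interleaved_core(i) for i in range(ITERATIONS)])
```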

12 Mapping Schemes
 Iteration-to-core mapping: sequential (iterations 0-3 / 4-7 / 8-11 / 12-14 per core) or interleaved (iterations 0,4,8,12 / 1,5,9,13 / 2,6,10,14 / 3,7,11 per core)
 Data mapping: interleaved (A[0], A[4], A[8], A[12] in one bank, and so on) or sequential (A[0]-A[3] in one bank, and so on)

for(i=0 ; i<15 ; i++) {
  B[i] = A[i] + B[i];
}

13 Interleaving Data Placement
 With interleaved data placement, interleaved iteration assignment is better than sequential iteration assignment
 Weak in strided accesses (e.g., Load A[2i] instead of Load A[i]):
 reduces the number of utilized banks
 increases bank conflicts
[Figure: interleaved banks (A[0], A[4], A[8], A[12] together, etc.) under interleaved vs sequential iteration assignment]
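The stride weakness can be illustrated with the bank function of interleaved placement (bank = index mod number of banks; a sketch under that assumed model):

```python
NUM_BANKS = 4

def bank_interleaved(index, n_banks=NUM_BANKS):
    """Interleaved data placement: A[i] lives in bank i mod NUM_BANKS."""
    return index % n_banks

# Unit-stride accesses A[i] spread across all four banks,
# but stride-2 accesses A[2i] only ever touch banks 0 and 2,
# so cores contend for half the banks and conflicts rise.
unit_stride = {bank_interleaved(i) for i in range(8)}
stride_two = {bank_interleaved(2 * i) for i in range(8)}
print(sorted(unit_stride))  # all 4 banks utilized
print(sorted(stride_two))   # only 2 banks utilized
```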

14 Sequential Data Placement
 Does not work well with SIMD mapping as-is: causes frequent bank conflicts
 Data tiling:
 i) modify array base addresses
 ii) rearrange data in the local memory
 Sequential iteration assignment with data tiling suits SIMD mapping
[Figure: tiled banks holding A[0]-A[3], A[4]-A[7], A[8]-A[11], A[12]-A[14], so each core accesses its own bank]
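Data tiling for sequential iteration assignment can be sketched as a bank function that gives each core's iteration block its own bank (tile size and bank count assumed from the 15-iteration example; an illustrative sketch, not the paper's algorithm):

```python
import math

NUM_BANKS = 4
N = 15
TILE = math.ceil(N / NUM_BANKS)  # 4 elements per bank, one tile per core

def bank_tiled(index, tile=TILE, n_banks=NUM_BANKS):
    """Tiled placement: A[0]-A[3] -> bank 0, A[4]-A[7] -> bank 1, ..."""
    return (index // tile) % n_banks

def sequential_core(i, tile=TILE):
    """Core executing iteration i under sequential (blocked) assignment."""
    return i // tile

# Every core now reads and writes only its own bank: no conflicts.
print(all(bank_tiled(i) == sequential_core(i) for i in range(N)))
```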

15 Summary of Mapping-Combination Analysis
 Two out of the four combinations have strong advantages:
 Interleaved iteration + interleaved data mapping
 Weak in strided accesses
 Simple data management
 Sequential iteration + sequential data mapping (with data tiling)
 More robust against bank conflicts
 Data-rearranging overhead

16 Experimental Setup
 Sets of loop kernels from OpenCV, multimedia, and SPEC2000 benchmarks
 Target system:
 Two CGRA sizes: 4x4 and 8x4
 2x2 cores, each with one load-store PE and one multiplier PE
 Mesh + diagonal connections between PEs
 Full crossbar switch between PEs and local memory banks
 Compared with non-SIMD mapping:
 Original: previous non-SIMD mapping
 SIMD: our approach (interleaved-interleaved mapping)

17 Configuration Size
Configuration size is reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA.

18 Runtime
Runtime is improved by 29% and 32% (on the 4x4 and 8x4 CGRA, respectively).

19 Conclusion
 Presented the SIMD reconfigurable architecture
 Exploits data-level parallelism and instruction-level parallelism at the same time
 Advantages of the SIMD reconfigurable architecture:
 Scales well to a large number of PEs
 Alleviates the growing compilation time
 Increases performance and reduces configuration size

20 Thank you!

21 Core Size
 For a large loop, a small core might not be a good match
 Merge multiple cores ⇒ macrocore
 No hardware modification required
[Figure: cores 1-2 merged into macrocore 1, cores 3-4 into macrocore 2, still connected to banks 1-4 through the crossbar switch]

22 SIMD RA Mapping Flow
 Check the SIMD requirement; if it fails, fall back to traditional mapping
 Select the core size
 Iteration mapping: interleaved-interleaved (with implicit array placement) or sequential-tiling (with data tiling)
 Operation mapping via modulo scheduling
 If scheduling fails, increase II and repeat
 If scheduling fails and MaxII < II, increase the core size
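The flow above can be sketched as a search loop over II and core size (`try_schedule` is a hypothetical stand-in for the modulo scheduler, not an API from the paper):

```python
# Sketch of the SIMD RA mapping flow: try modulo scheduling at
# increasing II; once II exceeds MaxII, grow the core size
# (e.g., merge cores into a macrocore) and retry.
def map_loop(min_ii, max_ii, core_sizes, try_schedule):
    for core in core_sizes:          # e.g. 2x2 core, then macrocores
        ii = min_ii
        while ii <= max_ii:
            schedule = try_schedule(core, ii)
            if schedule is not None:
                # Success: iteration mapping and data tiling follow.
                return core, ii, schedule
            ii += 1                  # scheduling failed: increase II
    return None                      # fall back to traditional mapping

# Toy stand-in scheduler: succeeds once II reaches 3 on any core.
result = map_loop(2, 8, ["2x2", "4x4"],
                  lambda core, ii: "ok" if ii >= 3 else None)
print(result)  # ('2x2', 3, 'ok')
```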

