
1 Resource Recycling: Putting Idle Resources to Work on a Composable Accelerator
Yongjun Park (1), Hyunchul Park (2), Scott Mahlke (1), and Sukjin Kim (3)
October 25, 2010
(1) University of Michigan, Ann Arbor; (2) Texas Instruments, Inc.; (3) Samsung Advanced Institute of Technology
University of Michigan Electrical Engineering and Computer Science

2 Convergence of Functionalities
Convergence of functionalities demands a flexible solution
Applications have different characteristics
[Figure labels: Anatomy of iPhone; 4G Wireless; Navigation; Audio; Video; 3D; Flexible Accelerator!]

3 Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration

4 Execution Model of CGRAs
Exploiting loop-level parallelism
[Figure: timeline of Host and CGRA execution for a loop, for ( …… ) { }]

5 Polymorphic Pipeline Array
Multi-core accelerator: each 2x2 array becomes a processor
Composability: cores can be combined to form a larger logical core
Exploit both coarse-grain and fine-grain pipeline parallelism
No dynamic routing logic: all communications are statically generated
[Figure: cores grouped into Logical Core0, Logical Core1, and Logical Core2 over time]

6 Where PPA stands
[Figure: energy vs. performance for an MPEG-4 decoder; axes are frames/sec (24 fps min.) and cell-phone battery life (hours)]

7 Resource Waste (1): Multi-Apps
Different applications require different PPA array sizes
–AAC audio decoding (128K bit-rate): 4 cores
–H.264 decoding (30 fps @ 640x480): 6 cores
–3D rendering: 8+ cores
Solution: selective dynamic voltage scaling
200 MHz @ IBM 0.65nm Technology

8 Resource Waste (2): Intra-App
Different code regions require different logical core sizes
–Acyclic region: 1 core
–Cyclic region (loop): 2~6 cores
Solution 1: resource sharing
Solution 2: grouping by region type
[Figure: 3D rendering on cores 0 to 7, by region: Acyclic0 14%, Cyclic0 31%, Acyclic1 32%, Cyclic1 68%; Cyclic0 + Cyclic1 99%; Group 0 and Group 1: 100%, 45%]

9 Task Graph Software Pipelining
[Figure: resource vs. time schedules of tasks 0 and 1 on Core 0 alone (time T1) and software pipelined across Core 0 and Core 1 with prologue and epilogue (time T2); T2 ≈ ½T1]
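A rough sketch of the timing argument behind T2 ≈ ½T1, assuming two equal-length stages. The helper names, stage times, and iteration count are illustrative assumptions, not values from the slides.

```python
def sequential_time(stage_times, iterations):
    """One core runs every stage of every iteration back to back."""
    return sum(stage_times) * iterations


def pipelined_time(stage_times, iterations):
    """Each stage has its own core; the slowest stage sets the pipeline period.

    Prologue, steady state, and epilogue together are modeled as
    (number of stages + iterations - 1) periods.
    """
    period = max(stage_times)
    return (len(stage_times) + iterations - 1) * period


stages = [10, 10]   # assumed: two balanced tasks, 10 time units each
n = 1000            # assumed iteration count
t1 = sequential_time(stages, n)
t2 = pipelined_time(stages, n)
ratio = t2 / t1     # approaches 1/2 as n grows
```

With unbalanced stages the period is set by the slowest one, so the ratio moves away from ½; that imbalance is the waste the later slides target.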

10 Goal
Let’s software pipeline a given stream-style task graph on a composable accelerator.
–Achieve an efficient schedule of tasks
–Collect unused resources and put them back to work
Notable observation
–The task graph is fixed by the application
–The graph itself cannot be modified
–Acceleration is performed by merging cores

11 Compilation Challenges
Resource requirement variance
–Task-specific characteristic: fixed!
–Inter-task variance
Execution time variance
–Static variance between tasks
Geometry
–Sparse core connectivity
Reconfiguration overhead
–Not a critical problem on PPA, but it still incurs resource waste

12 Example of Resource Waste
Sources of waste: static execution time variance, inter-task resource requirement variance, reconfiguration overhead
[Figure: schedule of tasks A, B, C, and D on cores 0 to 2 over time, for iterations n-2 to n+1, annotated with memory, resource conflict, reconfiguration deadline, and load imbalance]

13 PPA Compilation Procedure
–Prepass: static partitioning (grouping by resource assignment)
–Core allocation: physical core mapping
–Postpass: dynamic partitioning (resource sharing)
[Table: compilation challenges (resource requirement, execution time variance, geometry, reconfiguration, load imbalance) checked against the prepass, core allocation, and postpass steps]

14 Prepass: Static Partitioning
Grouping tasks with the same resource assignment
–Goal: avoid resource conflicts!
–Keep the core-merging configuration at runtime
–Give more resources to the slowest task
–Minimize stall time due to resource conflicts or reconfiguration
[Figure: two schedules on cores 0 to 2 over time. The 2-stage pipeline (AB->CD; labels A1 B1 D0 C0) has a resource conflict; the 3-stage pipeline (A->BC->D; labels A2 B1 D0 C1) has no resource conflict]
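The bullet "give more resources to the slowest task" can be illustrated with a simple greedy loop. This is only a sketch, not the authors' prepass: it assumes a task on k merged cores runs in time / k, and the function name and task times are made up.

```python
import heapq


def allocate_cores(task_times, total_cores):
    """Give every task one core, then hand each spare core to the slowest task.

    Assumes (for illustration only) that a task on k merged cores takes time / k.
    """
    if total_cores < len(task_times):
        raise ValueError("need at least one core per task")
    cores = {name: 1 for name in task_times}
    # Max-heap on current execution time (negated, since heapq is a min-heap).
    heap = [(-t, name) for name, t in task_times.items()]
    heapq.heapify(heap)
    for _ in range(total_cores - len(task_times)):
        _, name = heapq.heappop(heap)
        cores[name] += 1
        heapq.heappush(heap, (-task_times[name] / cores[name], name))
    return cores


times = {"A": 10, "B": 20, "C": 40, "D": 10}   # made-up single-core times
alloc = allocate_cores(times, total_cores=8)
stage_time = max(times[t] / alloc[t] for t in times)
```

In this toy case the spare cores go to C and B until every task takes the same time, so no stage stalls waiting on a slower one.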

15 Example
[Figure: task graph of tasks A to F scheduled on cores 0 to 7 over time against a deadline; tasks are partitioned into Group 0 to Group 3, with E shown on up to four cores and D on up to two]

16 Core Allocation
Mapping virtual cores to the real PPA structure
–Goal: avoid geometric failure
–Cores running the same task must be physically adjacent
–Core groups with the highest workload are preferably placed next to the core groups with the lowest workload (for the postpass)
Physical core: 0 1 2 3 4 5 6 7
Virtual core: 1 2 0 3 4 5 6 7
Filter: A B F C D D E E E E
[Figure: tasks A to F placed on cores 0 to 7, from low workload to high workload]
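The adjacency requirement can be checked mechanically. The sketch below assumes a simplified one-dimensional row of physical cores, which is not the actual PPA interconnect; `is_contiguous` and the task labels are hypothetical.

```python
def is_contiguous(placement):
    """Return True if every task occupies one unbroken run of physical cores.

    placement[i] is the task assigned to physical core i in an assumed 1-D row.
    """
    seen = set()
    previous = None
    for task in placement:
        if task != previous:
            if task in seen:   # the task reappears after a gap: not adjacent
                return False
            seen.add(task)
            previous = task
    return True


good = ["X", "Y", "Y", "Z", "Z", "Z"]   # each task sits on neighboring cores
bad = ["Y", "X", "Y", "Z", "Z", "Z"]    # Y is split by X
```

A real allocator would test adjacency on the mesh topology rather than a row, but the failure mode is the same: a merged logical core cannot span cores that are not connected.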

17 Postpass: Dynamic Partitioning
Final performance tuning
–Goal: minimize the highest execution time by reusing a neighbor core’s resources.
–Software pipelining for a fraction of the task, using both its own and the neighbor core’s resources.
[Figure: tasks A to F on cores 0 to 7 over time against the deadline, with task C split into pieces C0 to C4]
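One way to picture the postpass is as shifting part of the slowest group's work onto a less loaded neighbor. The sketch below is a simplified model, not the authors' method: it assumes moved work runs at the same speed on the neighbor and ignores communication cost, and the function name and times are invented.

```python
def balance_with_neighbor(slow_time, neighbor_time):
    """Move work from the slow group to its neighbor until both finish together.

    Returns (fraction of the slow group's work moved, new finish time).
    Assumes equal speed on both sides and no communication overhead.
    """
    if slow_time <= neighbor_time:
        return 0.0, float(max(slow_time, neighbor_time))
    moved = (slow_time - neighbor_time) / 2.0   # work shifted to the neighbor
    return moved / slow_time, slow_time - moved


fraction, new_time = balance_with_neighbor(slow_time=100, neighbor_time=60)
```

Here one fifth of the slow group's work moves over and both groups finish at 80 time units, which lowers the pipeline deadline set by the slowest group.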

18 Experiment
Prepass: max 2.44x
Postpass: max 1.11x
Overall performance gain: max 2.50x

19 Dynamic Partitioning (3D)
[Figure: execution time per iteration for Group 1, Group 2, and Group 3, showing the original pipeline deadline, the new pipeline deadline, and the performance gain]

20 Conclusion
Software pipelining of a stream graph on a composable architecture poses many compilation challenges
The proposed 3-step process can effectively map a stream task graph onto a composable accelerator with high resource utilization
Up to 2.50x performance improvement with both static and dynamic partitioning
To enhance performance further, dynamic variance should be considered

21 Questions?
For more information: http://cccp.eecs.umich.edu

22 Dynamic Partitioning (H.264)
[Figure: Group 1 and Group 2 against the pipeline deadline; both annotations read "Performance Loss!"]

23 Performance-Power of PPA
Tensilica Diamond Core: 12 MIPS/mW
PPA: 9.6 MIPS/mW
TI C6x: 5 MIPS/mW
Itanium2: 0.08 MIPS/mW
ARM11: 3.9 MIPS/mW

24 Performance-Power of PPA
Processor                 Power Consumption (mW)   Performance (MIPS)   MIPS/mW
Tensilica Diamond Core    34.2                     410.4                12
TI C6x                    340                      1700                 5
ARM11                     192.4                    740                  3.846154
XScale                    2400                     1000                 0.416667
Itanium2                  51000                    4080                 0.08
PPA                       255.06                   2450                 9.605583
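The last figure in each row appears to be performance divided by power (MIPS/mW), matching the previous slide. A short check, with the power and performance values read from the slide's table and the variable names chosen here for illustration:

```python
# (power in mW, performance in MIPS), as read from the slide's table
processors = {
    "Tensilica Diamond Core": (34.2, 410.4),
    "TI C6x": (340, 1700),
    "ARM11": (192.4, 740),
    "XScale": (2400, 1000),
    "Itanium2": (51000, 4080),
    "PPA": (255.06, 2450),
}

# Efficiency in MIPS per mW for each processor.
efficiency = {name: mips / mw for name, (mw, mips) in processors.items()}

# Processors ordered from most to least efficient.
ranking = sorted(efficiency, key=efficiency.get, reverse=True)
```

Under this reading PPA ranks second in efficiency, behind the Tensilica Diamond Core, while delivering roughly six times its MIPS.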

25 ASIC Alternatives
[Figure: design space with axes efficiency/performance and flexibility, placing ASICs, FPGAs, general-purpose processors, DSPs, and domain-specific accelerators; a region marked ??? is labeled "Highly efficient, programmability"]

