Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke


1 Efficient Execution of Augmented Reality Applications on Mobile Programmable Accelerators
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan
December 10, 2013

2 Augmented Reality
Physical world + computer-generated inputs
- Commerce

3 Augmented Reality
Physical world + computer-generated inputs
- Commerce
- Information

4 Augmented Reality
Physical world + computer-generated inputs
- Commerce
- Information
- Games
Compared to multimedia applications:
- User interactive
- Computationally intensive

5 Application Characteristics
- 69% in data parallel loops (DLP loops) => SIMD / Coarse-Grained Reconfigurable Architecture (CGRA)
- 15% in software pipelinable loops (SWP loops) => CGRA
[Figure labels: Feature Extracting Kernels; Virtual Object Rendering Kernel; Video Conferencing with Virtual Object Manipulation]
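
To make the two loop classes concrete, here is a minimal Python sketch. Both functions are hypothetical illustrations, not kernels from the application on the slide. In the first loop every iteration is independent, so iterations could be spread across SIMD lanes. In the second, each iteration needs the previous result, so iterations can only be overlapped, which is the case software pipelining targets.

```python
def dlp_scale_add(a, b, alpha):
    """Data-parallel loop: every iteration is independent of the others."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = alpha * a[i] + b[i]
    return out


def swp_running_filter(x, alpha):
    """Loop with a cross-iteration dependence: each output needs the previous one."""
    y = []
    prev = 0.0
    for sample in x:
        prev = alpha * prev + sample  # recurrence carried between iterations
        y.append(prev)
    return y
```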

6 SIMD vs. CGRA
SIMD
- Identical lanes
- Shared instruction fetch (same schedule across PEs)
- SIMD memory access
[Figure: one instruction shared by PE0 to PE15; PE# = SIMD lane]

7 SIMD vs. CGRA
Homogeneous CGRA
- Identical units
- Mesh-like interconnects
- Software pipelining
Heterogeneous CGRA
- More energy efficient than homogeneous CGRA
- Lower performance than homogeneous CGRA
- Software pipelining
[Figure: PE0 to PE15; legend: PE with all units, PE with multipliers, PE with memory units, PE without complex units]

8 SIMD vs. CGRA
- In DLP loops, SIMD > CGRA
- In total execution time and energy, CGRA > SIMD (due to SWP loops)
- In energy consumption, heterogeneous CGRA > homogeneous CGRA (20% less energy with only 4% performance loss)
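
One way to weigh the 20% energy saving against the 4% performance loss is the energy-delay product (EDP). The sketch below assumes that "4% performance loss" means 4% longer execution time, which the slide does not spell out. The function name is hypothetical.

```python
def relative_edp(energy_ratio, time_ratio):
    """Energy-delay product relative to a baseline (1.0 means equal to baseline)."""
    return energy_ratio * time_ratio


# Heterogeneous vs. homogeneous CGRA, using the figures on the slide:
# 20% less energy -> 0.80; 4% longer run time (assumed reading) -> 1.04.
hetero_vs_homo = relative_edp(0.80, 1.04)
print(f"relative EDP: {hetero_vs_homo:.3f}")
```

Under that reading, the heterogeneous design comes out at roughly 0.83 of the homogeneous design's EDP.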

9 Adding SIMD Support for CGRA
Heterogeneous CGRA
- Grouping multiple PEs to form an identical SIMD core
[Figure: PE0 to PE15 grouped into SIMD cores]
- How do we obtain the efficiency of single instruction fetch?
- How do we achieve the efficiency of SIMD memory access?

10 Efficient Instruction Fetch
- Fetch instruction once from memory
- Pass around the instruction to the next SIMD core
- Last SIMD core stores the instruction in a recycle buffer
[Figure: SIMD Core 0 to SIMD Core 3 passing instructions around; labels: iteration 0, iteration 1, iteration 2, iteration 3, iteration 4]
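
The fetch scheme above can be sketched as a small simulation that counts instruction-memory fetches. This is a rough model under stated assumptions, not the authors' hardware: iterations are assumed to be dealt round-robin to the SIMD cores, and a loop body that fits in the recycle buffer is assumed to be replayed from it rather than refetched. The 4 cores, 1-cycle hop, and 16-entry buffer follow the deck's experimental setup. The function name and the instruction strings are hypothetical.

```python
def simulate_ring_fetch(loop_body, num_iterations, num_cores=4, buffer_entries=16):
    """Count instruction-memory fetches when one fetch feeds every SIMD core.

    Assumptions (not from the slides): round-robin iteration assignment, and
    replay from the recycle buffer whenever the loop body fits in it.
    """
    fits_in_buffer = len(loop_body) <= buffer_entries
    rounds = -(-num_iterations // num_cores)  # ceiling division
    memory_fetches = 0
    trace = []  # (cycle, core, iteration, instruction)
    cycle = 0
    for rnd in range(rounds):
        for instr in loop_body:
            if rnd == 0 or not fits_in_buffer:
                memory_fetches += 1  # fetched from instruction memory
            for core in range(num_cores):
                iteration = rnd * num_cores + core
                if iteration < num_iterations:
                    # core k sees the instruction k one-cycle hops after core 0
                    trace.append((cycle + core, core, iteration, instr))
            cycle += 1
    return memory_fetches, trace


body = ["load", "mul", "store"]
fetches, trace = simulate_ring_fetch(body, num_iterations=8)
independent = len(body) * 8  # cost if every iteration fetched for itself
```

In this model, `fetches` stays at the loop-body length while `independent` grows with the iteration count, which is the saving the ring plus recycle buffer is meant to capture.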

11 SIMD Memory Access
- Single memory request, multiple responses
- SIMD mode flag, stride information
- Split transaction: enables forwarding (Request ID)
[Figure: MemUnit 0 to MemUnit 3 connected to Bank 0 to Bank 3]
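
A minimal sketch of how a single request carrying a SIMD mode flag and stride could fan out into one response per SIMD core. The field names, the `expand_request` helper, and the word-interleaved bank mapping are all assumptions made for illustration. The slides only state that the request carries a request ID, a SIMD mode flag, and stride information.

```python
from dataclasses import dataclass


@dataclass
class MemRequest:
    request_id: int          # lets responses be matched and forwarded later
    base_addr: int           # address for the first SIMD core
    simd_mode: bool = False  # when set, one request serves every SIMD core
    stride: int = 0          # address step between consecutive SIMD cores


def expand_request(req, num_cores=4, num_banks=4):
    """Turn one request into (request_id, core, address, bank) tuples.

    Word-interleaved banking (bank = address % num_banks) is an assumption.
    """
    count = num_cores if req.simd_mode else 1
    responses = []
    for core in range(count):
        addr = req.base_addr + core * req.stride
        responses.append((req.request_id, core, addr, addr % num_banks))
    return responses


responses = expand_request(
    MemRequest(request_id=7, base_addr=8, simd_mode=True, stride=1)
)
```

With a unit stride, the four addresses in this model fall in four different banks, so all responses can be produced without conflicts.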

12 Experimental Setup
Baseline
- Heterogeneous CGRA with 16 PEs
- 4 PEs with memory units, 4 PEs with multipliers
Our solution
- Baseline + SIMD support
- 1-cycle latency ring network, 16-entry recycle buffer
Compiler
- IMPACT frontend compiler
- Edge-centric modulo scheduler
- ADRES framework
Power
- 65 nm
- 200 MHz / 1 V
- CACTI

13 Evaluation for DLP Loops
16 cores (SIMD) vs. 4 SIMD cores (Our solution)
- Our solution is 14.1% slower compared to SIMD.
- Our solution achieves nearly the same energy efficiency as SIMD.
[Figure annotation: ILP within the loops]

14 Evaluation for Total Execution
- Our solution achieves 17.6% speedup with 16.9% less energy compared to baseline heterogeneous CGRA.

15 Conclusion
Best performing / energy-efficient solution
- DLP loops: SIMD
- Whole application: CGRA
Two techniques to implement SIMD support efficiently for CGRAs
- Efficient instruction fetch: ring network + recycle buffer
- SIMD memory access: split transaction + stride information in header
- Results in 3.4% power saving
CGRAs with SIMD support improve overall performance by 17.6% with 16.9% less energy.

16 Questions?
For more information: http://cccp.eecs.umich.edu

17 CGRA Memory Access
- Resolve bank conflicts through buffering
- Compiler accounts for additional buffering delay
[Figure: MemUnit 0 to MemUnit 3 connected to Bank 0 to Bank 3]
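
To illustrate why buffering adds a delay the compiler has to budget for, the sketch below computes how long each of several simultaneous requests would wait. It assumes word-interleaved banks (bank = address % number of banks) and one request served per bank per slot. Neither detail is given on the slide, and the function name is hypothetical.

```python
from collections import defaultdict


def buffering_delays(addresses, num_banks=4):
    """Extra wait, in bank-service slots, for requests issued together.

    Assumes bank = address % num_banks and one request served per bank per
    slot; later requests to a busy bank wait in that bank's buffer.
    """
    pending = defaultdict(int)  # requests already queued at each bank
    delays = []
    for addr in addresses:
        bank = addr % num_banks
        delays.append(pending[bank])
        pending[bank] += 1
    return delays


delays = buffering_delays([0, 4, 1, 8])
worst_case = max(delays)  # the slack a scheduler would need to allow for
```

Here addresses 0, 4, and 8 all map to bank 0 under the assumed mapping, so the second and third of them are buffered while address 1 is served immediately.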

18 Compilation Flow
[Flowchart nodes: Program; Loop Classification (DLP, SWP); ILP Matching (High ILP, Low ILP); Acyclic Scheduling; Modulo Scheduling; Code Generation; Executable]

19 Power Analysis
- SIMD mode further saves power by 3.4%.
  (-) Savings from memory
  (+) Overheads from ring network, recycle buffer, and SIMD memory access

20 Resource Utilization in DLP Loops
- SIMD mode can utilize 13.6% more resources in DLP loops.
- Compiler generates a more efficient schedule with fewer resources. (Less routing, less exploration)

21 [Figure: PE0 to PE15]

