Efficient Execution of Augmented Reality Applications on Mobile Programmable Accelerators
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan
December 10, 2013
Augmented Reality
- Physical world + computer-generated inputs
- Example domains: commerce, information, games
- Compared to multimedia applications:
  - User interactive
  - Computationally intensive
Application Characteristics
- 69% of execution in data-parallel loops (DLP loops) => SIMD / Coarse-Grained Reconfigurable Architecture (CGRA)
- 15% in software-pipelinable loops (SWP loops) => CGRA
- Benchmarks: feature-extraction kernels, a virtual-object rendering kernel, and video conferencing with virtual object manipulation
SIMD vs. CGRA
SIMD:
- Identical lanes
- Shared instruction fetch (same schedule across PEs)
- SIMD memory access
(Figure: 4x4 grid of SIMD lanes, PE0-PE15)
SIMD vs. CGRA
Homogeneous CGRA:
- Identical units
- Mesh-like interconnects
- Software pipelining
Heterogeneous CGRA:
- More energy efficient than homogeneous CGRA
- Lower performance than homogeneous CGRA
- Software pipelining
(Figure: 4x4 grid of PEs; legend distinguishes PEs with all units, with multipliers, with memory units, and without complex units)
SIMD vs. CGRA
- In DLP loops, SIMD outperforms CGRA.
- In total execution time and energy, CGRA beats SIMD (due to the SWP loops).
- Heterogeneous CGRA consumes 20% less energy than homogeneous CGRA with only a 4% performance loss.
Adding SIMD Support to CGRA
- Heterogeneous CGRA: group multiple PEs to form identical SIMD cores
- How do we obtain the efficiency of a single instruction fetch?
- How do we achieve the efficiency of SIMD memory access?
(Figure: 4x4 grid of PEs with groups of four PEs outlined as SIMD cores)
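The grouping idea above can be sketched as code. This is an illustrative sketch, not the authors' algorithm: it assumes PEs can be dealt out round-robin by unit type so that every SIMD core ends up with the same mix of memory-unit, multiplier, and simple PEs.

```python
# Hypothetical sketch: partition the 16 heterogeneous PEs into 4 SIMD
# cores such that every core holds an identical multiset of unit types.
from collections import defaultdict

def group_into_cores(pe_types, num_cores=4):
    """Deal PEs round-robin by unit type so every core gets the same mix."""
    by_type = defaultdict(list)
    for pe, kind in enumerate(pe_types):
        by_type[kind].append(pe)
    cores = [[] for _ in range(num_cores)]
    for kind in sorted(by_type):            # deterministic order of types
        for i, pe in enumerate(by_type[kind]):
            cores[i % num_cores].append(pe)  # spread each type evenly
    return cores

# 16 PEs as in the evaluated design: 4 with memory units, 4 with
# multipliers, 8 without complex units.
pes = ["mem"] * 4 + ["mul"] * 4 + ["simple"] * 8
cores = group_into_cores(pes)
# Each of the 4 cores holds one "mem" PE, one "mul" PE, two "simple" PEs.
```

Because the cores are identical, one schedule can drive all of them, which is what makes the shared instruction fetch on the next slide possible.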
Efficient Instruction Fetch
- Fetch each instruction from memory once
- Pass the instruction along a ring to the next SIMD core
- The last SIMD core stores the instruction in a recycle buffer
(Figure: ring of SIMD Cores 0-3 executing iterations 0-3; iteration 4 returns to SIMD Core 0)
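The fetch scheme can be sketched as a small simulation. This is a sketch under my own assumptions (not the hardware): core 0 fetches each instruction from memory once, the ring forwards it to the other cores, and the last core deposits it in a recycle buffer so later iterations need no memory fetch at all.

```python
# Sketch: count memory instruction fetches under the fetch-once,
# pass-around scheme with a recycle buffer at the last SIMD core.
from collections import deque

RECYCLE_ENTRIES = 16  # 16-entry recycle buffer, as in the evaluated design

def run_dlp_loop(instructions, iterations):
    """Return the number of memory instruction fetches for a DLP loop."""
    recycle_buffer = deque(maxlen=RECYCLE_ENTRIES)
    memory_fetches = 0
    for _ in range(iterations):
        for inst in instructions:
            if inst in recycle_buffer:
                continue                 # served from the recycle buffer
            memory_fetches += 1          # core 0 fetches from memory once
            # ...instruction is forwarded core-to-core over the ring...
            recycle_buffer.append(inst)  # last core stores it for reuse
    return memory_fetches

# A 10-instruction loop body run for 100 iterations needs only 10 memory
# fetches, instead of 10 per iteration (or 10 per iteration per PE).
```

The deque with `maxlen` models the fixed-capacity buffer: loop bodies larger than 16 instructions would evict entries and fetch again, which is why the buffer is sized to the common kernel size.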
SIMD Memory Access
- Single memory request, multiple responses
- Split transaction enables forwarding (request ID)
- Request header carries a SIMD mode flag and stride information
(Figure: MemUnits 0-3 connected to Banks 0-3)
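The single-request, multiple-response idea can be illustrated in code. The details here are assumptions for illustration, not the actual hardware protocol: one request carries a request ID and a stride, the memory system expands it into one response per lane, and the ID lets each response be forwarded to the right memory unit.

```python
# Sketch: expand one SIMD memory request (with stride information from
# the request header) into one tagged response per lane.
NUM_LANES = 4

def simd_load(memory, base, stride, request_id):
    """Return (request_id, lane, value) tuples, one per SIMD lane."""
    return [(request_id, lane, memory[base + lane * stride])
            for lane in range(NUM_LANES)]

# Toy memory: each word holds ten times its address.
memory = {addr: addr * 10 for addr in range(64)}

# One request (ID 7) with stride 2 serves all four memory units:
responses = simd_load(memory, base=8, stride=2, request_id=7)
# responses -> [(7, 0, 80), (7, 1, 100), (7, 2, 120), (7, 3, 140)]
```

Compared with issuing four independent loads, the single strided request saves the address computation and request traffic in three of the four memory units.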
Experimental Setup
- Baseline: heterogeneous CGRA with 16 PEs (4 PEs with memory units, 4 PEs with multipliers)
- Our solution: baseline + SIMD support, 1-cycle-latency ring network, 16-entry recycle buffer
- Compiler: IMPACT frontend compiler, edge-centric modulo scheduler, ADRES framework
- Power: 65nm technology @ 200MHz/1V, CACTI
Evaluation for DLP Loops
- 16 lanes (SIMD) vs. 4 SIMD cores (our solution), which exploit ILP within the loops
- Our solution is 14.1% slower than SIMD.
- Our solution achieves nearly the same energy efficiency as SIMD.
Evaluation for Total Execution
- Our solution achieves a 17.6% speedup with 16.9% less energy than the baseline heterogeneous CGRA.
Conclusion
- Best-performing / most energy-efficient solution:
  - DLP loops: SIMD
  - Whole application: CGRA
- Two techniques implement SIMD support efficiently on CGRAs:
  - Efficient instruction fetch: ring network + recycle buffer
  - SIMD memory access: split transaction + stride information in the request header
  - Together these save 3.4% power.
- A CGRA with SIMD support improves overall performance by 17.6% with 16.9% less energy.
Questions?
For more information:
- http://cccp.eecs.umich.edu
- jasonjk@umich.edu
CGRA Memory Access
- Resolve bank conflicts through buffering
- The compiler accounts for the additional buffering delay
(Figure: MemUnits 0-3 connected to Banks 0-3)
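The conflict-buffering behavior can be sketched as follows. The banking scheme here (simple address interleaving, one extra cycle per buffered request) is my assumption for illustration; the point is only that conflicting requests serialize and the compiler must schedule around the worst-case delay.

```python
# Sketch: requests issued in the same cycle to the same bank are
# buffered and served one cycle apart; conflict-free requests complete
# together.
NUM_BANKS = 4

def completion_cycles(addresses):
    """Cycle in which each same-cycle request completes."""
    next_free = [0] * NUM_BANKS      # earliest cycle each bank is free
    done = []
    for addr in addresses:           # requests issued in the same cycle
        bank = addr % NUM_BANKS      # assumed interleaved banking
        done.append(next_free[bank])
        next_free[bank] += 1         # conflict -> buffered an extra cycle
    return done

# Four conflict-free requests finish together; two to the same bank don't:
# completion_cycles([0, 1, 2, 3]) -> [0, 0, 0, 0]
# completion_cycles([0, 4, 1, 2]) -> [0, 1, 0, 0]  (0 and 4 share bank 0)
```

Because the compiler knows this delay, it can pad load latencies in the schedule instead of requiring the hardware to stall.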
Compilation Flow
- Program -> Loop Classification
- DLP loops -> ILP Matching: high ILP -> Acyclic Scheduling; low ILP -> Modulo Scheduling
- SWP loops -> Modulo Scheduling
- Scheduled code -> Code Generation -> Executable
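The routing decision in the flow above can be sketched as a small dispatch function. This is a hypothetical sketch: the `ilp_threshold` knob and the string labels are my own names, standing in for the ILP-matching heuristic.

```python
# Sketch of the classification step: route each loop to a scheduler
# based on its parallelism type and the ILP each iteration exposes.
def choose_scheduler(loop_kind, ilp, ilp_threshold=4):
    """Pick a scheduler; 'ilp_threshold' is an assumed tuning knob."""
    if loop_kind == "DLP":
        # ILP matching: iterations wide enough to fill a SIMD core on
        # their own are scheduled acyclically across that core's PEs.
        return "acyclic" if ilp >= ilp_threshold else "modulo"
    if loop_kind == "SWP":
        return "modulo"          # software-pipelinable loops on the CGRA
    return "sequential"          # remaining code runs unaccelerated

# choose_scheduler("DLP", ilp=6) -> "acyclic"
# choose_scheduler("DLP", ilp=2) -> "modulo"
# choose_scheduler("SWP", ilp=3) -> "modulo"
```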
Power Analysis
- (-) Savings from memory
- (+) Overheads from the ring network, recycle buffer, and SIMD memory access
- SIMD mode further saves power by 3.4%.
Resource Utilization in DLP Loops
- SIMD mode utilizes 13.6% more resources in DLP loops.
- With fewer resources per core, the compiler generates a more efficient schedule (less routing, less exploration).