Efficient Execution of Augmented Reality Applications on Mobile Programmable Accelerators
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan
December 10, 2013
Augmented Reality
- Physical world + computer-generated inputs
- Example domains: commerce, information, games
- Compared to multimedia applications:
  - User interactive
  - Computationally intensive
Application Characteristics
- 69% of execution in data-parallel loops (DLP loops) => SIMD / Coarse-Grained Reconfigurable Architecture (CGRA)
- 15% in software-pipelinable loops (SWP loops) => CGRA
- Benchmarks: feature-extraction kernels, a virtual-object rendering kernel, and video conferencing with virtual object manipulation
SIMD vs. CGRA
SIMD:
- Identical lanes
- Shared instruction fetch (same schedule across PEs)
- SIMD memory access
(Figure: 4x4 grid of SIMD lanes, PE0-PE15)
SIMD vs. CGRA
Homogeneous CGRA:
- Identical units
- Mesh-like interconnects
- Software pipelining
Heterogeneous CGRA:
- More energy efficient than homogeneous CGRA
- Lower performance than homogeneous CGRA
- Software pipelining
(Figure: 4x4 grid of PEs; legend distinguishes PEs with all units, with multipliers, with memory units, and without complex units)
SIMD vs. CGRA
- In DLP loops, SIMD outperforms CGRA.
- In total execution time and energy, CGRA beats SIMD (due to the SWP loops).
- Heterogeneous CGRA consumes 20% less energy than homogeneous CGRA with only a 4% performance loss.
Adding SIMD Support to CGRA
- Heterogeneous CGRA: group multiple PEs to form identical SIMD cores
- How do we obtain the efficiency of a single instruction fetch?
- How do we achieve the efficiency of SIMD memory access?
(Figure: 4x4 grid of PEs with groups of four PEs outlined as SIMD cores)
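The grouping idea above can be sketched as code. This is an illustrative sketch, not the authors' algorithm: it assumes PEs can be dealt out round-robin by unit type so that every SIMD core ends up with the same mix of memory-unit, multiplier, and simple PEs.

```python
# Hypothetical sketch: partition the 16 heterogeneous PEs into 4 SIMD
# cores such that every core holds an identical multiset of unit types.
from collections import defaultdict

def group_into_cores(pe_types, num_cores=4):
    """Deal PEs round-robin by unit type so every core gets the same mix."""
    by_type = defaultdict(list)
    for pe, kind in enumerate(pe_types):
        by_type[kind].append(pe)
    cores = [[] for _ in range(num_cores)]
    for kind in sorted(by_type):            # deterministic order of types
        for i, pe in enumerate(by_type[kind]):
            cores[i % num_cores].append(pe)  # spread each type evenly
    return cores

# 16 PEs as in the evaluated design: 4 with memory units, 4 with
# multipliers, 8 without complex units.
pes = ["mem"] * 4 + ["mul"] * 4 + ["simple"] * 8
cores = group_into_cores(pes)
# Each of the 4 cores holds one "mem" PE, one "mul" PE, two "simple" PEs.
```

Because the cores are identical, one schedule can drive all of them, which is what makes the shared instruction fetch on the next slide possible.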
Efficient Instruction Fetch
- Fetch each instruction from memory once
- Pass the instruction along a ring to the next SIMD core
- The last SIMD core stores the instruction in a recycle buffer
(Figure: ring of SIMD Cores 0-3 executing iterations 0-3; iteration 4 returns to SIMD Core 0)
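The fetch scheme can be sketched as a small simulation. This is a sketch under my own assumptions (not the hardware): core 0 fetches each instruction from memory once, the ring forwards it to the other cores, and the last core deposits it in a recycle buffer so later iterations need no memory fetch at all.

```python
# Sketch: count memory instruction fetches under the fetch-once,
# pass-around scheme with a recycle buffer at the last SIMD core.
from collections import deque

RECYCLE_ENTRIES = 16  # 16-entry recycle buffer, as in the evaluated design

def run_dlp_loop(instructions, iterations):
    """Return the number of memory instruction fetches for a DLP loop."""
    recycle_buffer = deque(maxlen=RECYCLE_ENTRIES)
    memory_fetches = 0
    for _ in range(iterations):
        for inst in instructions:
            if inst in recycle_buffer:
                continue                 # served from the recycle buffer
            memory_fetches += 1          # core 0 fetches from memory once
            # ...instruction is forwarded core-to-core over the ring...
            recycle_buffer.append(inst)  # last core stores it for reuse
    return memory_fetches

# A 10-instruction loop body run for 100 iterations needs only 10 memory
# fetches, instead of 10 per iteration (or 10 per iteration per PE).
```

The deque with `maxlen` models the fixed-capacity buffer: loop bodies larger than 16 instructions would evict entries and fetch again, which is why the buffer is sized to the common kernel size.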
SIMD Memory Access
- Single memory request, multiple responses
- Split transaction enables forwarding (request ID)
- Request header carries a SIMD mode flag and stride information
(Figure: MemUnits 0-3 connected to Banks 0-3)
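The single-request, multiple-response idea can be illustrated in code. The details here are assumptions for illustration, not the actual hardware protocol: one request carries a request ID and a stride, the memory system expands it into one response per lane, and the ID lets each response be forwarded to the right memory unit.

```python
# Sketch: expand one SIMD memory request (with stride information from
# the request header) into one tagged response per lane.
NUM_LANES = 4

def simd_load(memory, base, stride, request_id):
    """Return (request_id, lane, value) tuples, one per SIMD lane."""
    return [(request_id, lane, memory[base + lane * stride])
            for lane in range(NUM_LANES)]

# Toy memory: each word holds ten times its address.
memory = {addr: addr * 10 for addr in range(64)}

# One request (ID 7) with stride 2 serves all four memory units:
responses = simd_load(memory, base=8, stride=2, request_id=7)
# responses -> [(7, 0, 80), (7, 1, 100), (7, 2, 120), (7, 3, 140)]
```

Compared with issuing four independent loads, the single strided request saves the address computation and request traffic in three of the four memory units.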
Experimental Setup
- Baseline: heterogeneous CGRA with 16 PEs (4 PEs with memory units, 4 PEs with multipliers)
- Our solution: baseline + SIMD support, 1-cycle-latency ring network, 16-entry recycle buffer
- Compiler: IMPACT frontend compiler, edge-centric modulo scheduler, ADRES framework
- Power: 65nm technology @ 200MHz/1V, CACTI
Evaluation for DLP Loops
- 16 lanes (SIMD) vs. 4 SIMD cores (our solution), which exploit ILP within the loops
- Our solution is 14.1% slower than SIMD.
- Our solution achieves nearly the same energy efficiency as SIMD.
Evaluation for Total Execution
- Our solution achieves a 17.6% speedup with 16.9% less energy than the baseline heterogeneous CGRA.
Conclusion
- Best-performing / most energy-efficient solution:
  - DLP loops: SIMD
  - Whole application: CGRA
- Two techniques implement SIMD support efficiently on CGRAs:
  - Efficient instruction fetch: ring network + recycle buffer
  - SIMD memory access: split transaction + stride information in the request header
  - Together these save 3.4% power.
- A CGRA with SIMD support improves overall performance by 17.6% with 16.9% less energy.
Questions?
For more information:
- http://cccp.eecs.umich.edu
- jasonjk@umich.edu
CGRA Memory Access
- Resolve bank conflicts through buffering
- The compiler accounts for the additional buffering delay
(Figure: MemUnits 0-3 connected to Banks 0-3)
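The conflict-buffering behavior can be sketched as follows. The banking scheme here (simple address interleaving, one extra cycle per buffered request) is my assumption for illustration; the point is only that conflicting requests serialize and the compiler must schedule around the worst-case delay.

```python
# Sketch: requests issued in the same cycle to the same bank are
# buffered and served one cycle apart; conflict-free requests complete
# together.
NUM_BANKS = 4

def completion_cycles(addresses):
    """Cycle in which each same-cycle request completes."""
    next_free = [0] * NUM_BANKS      # earliest cycle each bank is free
    done = []
    for addr in addresses:           # requests issued in the same cycle
        bank = addr % NUM_BANKS      # assumed interleaved banking
        done.append(next_free[bank])
        next_free[bank] += 1         # conflict -> buffered an extra cycle
    return done

# Four conflict-free requests finish together; two to the same bank don't:
# completion_cycles([0, 1, 2, 3]) -> [0, 0, 0, 0]
# completion_cycles([0, 4, 1, 2]) -> [0, 1, 0, 0]  (0 and 4 share bank 0)
```

Because the compiler knows this delay, it can pad load latencies in the schedule instead of requiring the hardware to stall.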
Compilation Flow
- Program -> Loop Classification
- DLP loops -> ILP Matching: high ILP -> Acyclic Scheduling; low ILP -> Modulo Scheduling
- SWP loops -> Modulo Scheduling
- Scheduled code -> Code Generation -> Executable
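The routing decision in the flow above can be sketched as a small dispatch function. This is a hypothetical sketch: the `ilp_threshold` knob and the string labels are my own names, standing in for the ILP-matching heuristic.

```python
# Sketch of the classification step: route each loop to a scheduler
# based on its parallelism type and the ILP each iteration exposes.
def choose_scheduler(loop_kind, ilp, ilp_threshold=4):
    """Pick a scheduler; 'ilp_threshold' is an assumed tuning knob."""
    if loop_kind == "DLP":
        # ILP matching: iterations wide enough to fill a SIMD core on
        # their own are scheduled acyclically across that core's PEs.
        return "acyclic" if ilp >= ilp_threshold else "modulo"
    if loop_kind == "SWP":
        return "modulo"          # software-pipelinable loops on the CGRA
    return "sequential"          # remaining code runs unaccelerated

# choose_scheduler("DLP", ilp=6) -> "acyclic"
# choose_scheduler("DLP", ilp=2) -> "modulo"
# choose_scheduler("SWP", ilp=3) -> "modulo"
```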
Power Analysis
- (-) Savings from memory
- (+) Overheads from the ring network, recycle buffer, and SIMD memory access
- SIMD mode further saves power by 3.4%.
Resource Utilization in DLP Loops
- SIMD mode utilizes 13.6% more resources in DLP loops.
- With fewer resources per core, the compiler generates a more efficient schedule (less routing, less exploration).