Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke

Presentation transcript:

Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures
Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan

Coarse-Grained Reconfigurable Architecture (CGRA)
- Array of PEs connected in a mesh-like interconnect
- Characterized by array size, node functionalities, interconnect, register file configurations
- Execute compute intensive kernels in multimedia applications
[Figure labels: FU, LRF]

CGRA: Attractive Alternative to ASICs
- Suitable for running multimedia applications on embedded systems
  - High computation throughput
  - Low power consumption and scalability
  - High flexibility with fast configuration
- MorphoSys: 8x8 array with RISC processor
  - SIMD style execution of loops
- PipeRench: 1-D reconfigurable hardware
  - Virtualize hardware pipeline
- ADRES: 8x8 array with tightly coupled VLIW
  - Modulo scheduling with simulated annealing

Scheduling in CGRA
- Different from conventional VLIW
  - Sparse interconnect and distributed register files
  - No dedicated routing resources
- Need a good compiler to exploit the abundance of computing resources
[Figure: conventional VLIW (FU0 to FU3 with a central RF) vs. CGRA (FU0 to FU3, each with an LRF)]

Objectives of This Work
- Modulo scheduling technique for CGRAs
- Exploit loop-level parallelism by overlapping execution of iterations
- Targeting low-cost CGRAs
- Achieve quality schedule under restriction of hardware
- Fast compilation time

Modulo Scheduling Basics
- Expose loop-level parallelism by overlapping execution of iterations
- Initiation interval (II)
  - A new iteration starts every II cycles
[Figure: overlapped execution of iterations, each consisting of operations A, B, C, with successive iterations offset by II]
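
A minimal Python sketch of the overlap described on this slide. The three-operation body (A, B, C), the one-operation-per-cycle timing, and II = 1 are illustrative assumptions rather than values read from the figure; `start_cycle` and `total_cycles` are hypothetical helper names.

```python
def start_cycle(iteration: int, ii: int) -> int:
    """Cycle at which a given iteration begins when one starts every ii cycles."""
    return iteration * ii


def total_cycles(num_iterations: int, ii: int, schedule_length: int) -> int:
    """Cycles needed to finish all iterations under overlapped execution."""
    if num_iterations <= 0:
        return 0
    # The last iteration starts at (n - 1) * ii and then runs to completion.
    return (num_iterations - 1) * ii + schedule_length


# Assumed loop body: three operations, one per cycle, with II = 1.
ops = ["A", "B", "C"]
ii = 1
timeline = {}
for it in range(4):
    for offset, op in enumerate(ops):
        cycle = start_cycle(it, ii) + offset
        timeline.setdefault(cycle, []).append(f"{op}{it}")

for cycle in sorted(timeline):
    print(cycle, timeline[cycle])
```

With these assumed numbers, four iterations finish in 6 cycles, where a loop without overlap would need 12.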

Modulo Scheduling for CGRA
- Mapping DFG onto 3-D scheduling space
  - Limited number of scheduling slots: (number of PEs) x II
- Minimize routing cost (number of slots used for routing)
- Sparse interconnect and distributed register files
  - Ensure routability of operands
[Figure: DFG mapped onto the scheduling space of a 4x4 CGRA, with time (II) as the third axis]
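
The slot budget of (number of PEs) x II can be sketched with a modulo reservation table, a standard structure in modulo schedulers. The class and method names below are hypothetical, and the slides do not say how the authors' scheduler stores its slots.

```python
from typing import Dict, Tuple

Slot = Tuple[int, int]  # (PE index, time modulo II)


class ModuloReservationTable:
    """Tracks which (PE, time mod II) slots are already taken."""

    def __init__(self, num_pes: int, ii: int) -> None:
        self.num_pes = num_pes
        self.ii = ii
        self.table: Dict[Slot, str] = {}

    def total_slots(self) -> int:
        # Every PE offers exactly II slots, because the schedule repeats every II cycles.
        return self.num_pes * self.ii

    def reserve(self, pe: int, time: int, op: str) -> bool:
        """Claim a slot; return False if an earlier operation already holds it."""
        slot = (pe, time % self.ii)
        if slot in self.table:
            return False
        self.table[slot] = op
        return True


# Assumed example: a 4x4 CGRA (16 PEs) with II = 4.
mrt = ModuloReservationTable(num_pes=16, ii=4)
print(mrt.total_slots())
print(mrt.reserve(0, 1, "a"))  # free slot
print(mrt.reserve(0, 5, "b"))  # 5 mod 4 == 1, so this collides with "a"
```

The collision in the last call is the reason a slot spent on routing is no longer available for computation in any iteration.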

Our Approach
- Systematic approach to generate good schedule in reasonable time
- Minimize routing cost
  - Convert scheduling problem into graph embedding
  - Leverage graph embedding algorithm
- Ensure routability of operands
  - Skewed scheduling space
  - Create a narrow, but tall scheduling space

1: Minimize Routing Cost
- Routing cost: number of PEs used for routing
  - Determined by positions of producer and consumer
  - Minimize distance between producers and consumers
- Height-based list scheduling
  - Schedule operations in the order of dependence height
  - Place consumers close to producers
  - Need to carefully place operations in the same height
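
A short Python sketch of ordering operations by dependence height. It assumes height means the longest path from an operation down to an operation with no consumers, counted so that such a sink has height 1. The six-operation DFG and the function names are made up for illustration.

```python
from functools import lru_cache
from typing import Dict, List


def dependence_heights(consumers: Dict[str, List[str]]) -> Dict[str, int]:
    """Height of each operation: 1 for sinks, else 1 + the tallest consumer."""

    @lru_cache(maxsize=None)
    def height(op: str) -> int:
        succs = consumers.get(op, [])
        return 1 + max((height(s) for s in succs), default=0)

    return {op: height(op) for op in consumers}


def schedule_order(consumers: Dict[str, List[str]]) -> List[str]:
    """Operations sorted tallest first, ties broken by name."""
    heights = dependence_heights(consumers)
    return sorted(consumers, key=lambda op: (-heights[op], op))


# Hypothetical DFG: producer -> list of consumers.
dfg = {
    "1": ["4"],
    "2": ["5"],
    "3": ["4"],
    "4": ["6"],
    "5": ["6"],
    "6": [],
}

print(dependence_heights(dfg))
print(schedule_order(dfg))
```

Operations 1, 2, and 3 share the same height here, which is the case the last bullet warns about: the ordering alone does not say which of them should sit next to each other.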

Scheduling Example – Routing Cost
[Figure: a DFG with operations 1 to 6 scheduled onto a 1x4 CGRA (PE 0 to PE 3) in two ways. One placement has Routing Cost = 2; the other has Routing Cost = 0.]
- Common consumer information is important!
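
The effect of placement on routing cost can be sketched in Python under a simplified model. The model assumes a 1-D array in which a PE reads directly from itself or an adjacent PE, so only the PEs strictly between producer and consumer are spent on routing; timing constraints are ignored. The DFG edges and both placements are made up for illustration and do not reproduce the figure.

```python
from typing import Dict, Iterable, Tuple


def routing_slots(producer_pe: int, consumer_pe: int) -> int:
    """PEs a value passes through between producer and consumer (assumed model)."""
    return max(0, abs(producer_pe - consumer_pe) - 1)


def routing_cost(placement: Dict[str, int], edges: Iterable[Tuple[str, str]]) -> int:
    """Total routing slots over all producer -> consumer edges."""
    return sum(routing_slots(placement[p], placement[c]) for p, c in edges)


# Hypothetical DFG edges (producer, consumer) on a 1x4 array, PEs numbered 0..3.
edges = [("1", "4"), ("3", "4"), ("2", "5"), ("4", "6"), ("5", "6")]

# Operation -> PE index; operations on the same PE run in different cycles.
spread_out = {"1": 0, "2": 1, "3": 2, "4": 3, "5": 0, "6": 1}
clustered = {"1": 0, "2": 1, "3": 2, "4": 1, "5": 2, "6": 1}

print(routing_cost(spread_out, edges))
print(routing_cost(clustered, edges))
```

In the clustered placement, operation 4 sits between its two producers, so no PE has to be spent on forwarding values.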

Affinity Graph Heuristic
- Consider placement of operations with same height together
  - Use common consumer information
- Affinity value between operations
  - Measured by the distance of common consumers in DFG
- Construct affinity graph
  - Nodes: operations, edges: affinity values
- Place operations with affinity edges close to each other
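
One way to turn "distance of common consumers" into a number is sketched below in Python. The weighting, where a common consumer at distance d contributes 2^(max_dist - d), is an assumption chosen so that nearer common consumers count more; the slides do not give the exact formula. The DFG and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def descendant_distances(consumers: Dict[str, List[str]], op: str) -> Dict[str, int]:
    """Shortest distance in edges from op to every operation downstream of it."""
    dist: Dict[str, int] = {}
    queue = deque([(op, 0)])
    while queue:
        node, d = queue.popleft()
        for succ in consumers.get(node, []):
            if succ not in dist:
                dist[succ] = d + 1
                queue.append((succ, d + 1))
    return dist


def affinity(consumers: Dict[str, List[str]], a: str, b: str, max_dist: int = 2) -> int:
    """Assumed weighting: closer common consumers contribute more."""
    da = descendant_distances(consumers, a)
    db = descendant_distances(consumers, b)
    score = 0
    for common in da.keys() & db.keys():
        d = max(da[common], db[common])
        if d <= max_dist:
            score += 2 ** (max_dist - d)
    return score


def affinity_graph(
    consumers: Dict[str, List[str]], ops: Iterable[str], max_dist: int = 2
) -> Dict[Tuple[str, str], int]:
    """Edges between same-height operations, weighted by affinity."""
    edges = {}
    for a, b in combinations(sorted(ops), 2):
        weight = affinity(consumers, a, b, max_dist)
        if weight > 0:
            edges[(a, b)] = weight
    return edges


# Hypothetical DFG: operations 1, 2, 3 have the same height.
dfg = {"1": ["4"], "2": ["5"], "3": ["4"], "4": ["6"], "5": ["6"], "6": []}
print(affinity_graph(dfg, ["1", "2", "3"]))
```

In this example operations 1 and 3 feed the same consumer directly and get the heaviest edge, so a placement step would try to keep them adjacent.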

Affinity Graph Example
[Figure: a DFG with operations at heights 3, 2, and 1; the affinity graph for operations 1 to 5; and a bad mapping and a good mapping of those operations onto the PEs of a 2x4 CGRA]
- Drawing affinity graph onto scheduling space

Leveraging Graph Embedding
- Drawing a graph onto a target space
- Grid layout algorithm by Li & Kurata
  - Embed complicated biochemical networks onto 2-D grid space
  - Simulated annealing
- Our scheduling problem is a graph embedding problem
  - Draw affinity graph onto scheduling space minimizing edge length
[Figure: process flow of grid layout [Li 2005]]
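
A generic simulated-annealing grid embedding can be sketched in Python as follows. This is not the Li & Kurata algorithm or the authors' implementation; it is a plain annealer that swaps the contents of two grid cells and minimizes weighted Manhattan edge length. The cooling schedule, step count, seed, and edge weights are all assumptions.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]
Edges = Dict[Tuple[str, str], int]


def layout_cost(placement: Dict[str, Pos], edges: Edges) -> int:
    """Sum of edge weight times Manhattan distance between the endpoints."""
    return sum(
        w * (abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1]))
        for (a, b), w in edges.items()
    )


def anneal(nodes: List[str], edges: Edges, rows: int, cols: int, seed: int = 0,
           steps: int = 2000, t0: float = 2.0, cooling: float = 0.995):
    """Return the best placement found and its cost."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if len(nodes) > len(cells):
        raise ValueError("more nodes than grid cells")
    rng = random.Random(seed)
    occupant: List[Optional[str]] = list(nodes) + [None] * (len(cells) - len(nodes))
    rng.shuffle(occupant)

    def placement() -> Dict[str, Pos]:
        return {n: cells[i] for i, n in enumerate(occupant) if n is not None}

    cost = layout_cost(placement(), edges)
    best, best_cost = placement(), cost
    temp = t0
    for _ in range(steps):
        # Propose swapping the contents of two cells (either may be empty).
        i, j = rng.randrange(len(cells)), rng.randrange(len(cells))
        occupant[i], occupant[j] = occupant[j], occupant[i]
        new_cost = layout_cost(placement(), edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = placement(), cost
        else:
            occupant[i], occupant[j] = occupant[j], occupant[i]  # undo the swap
        temp = max(temp * cooling, 1e-3)
    return best, best_cost


# Hypothetical affinity edges between three same-height operations.
affinity_edges = {("1", "2"): 1, ("1", "3"): 3, ("2", "3"): 1}
best, best_cost = anneal(["1", "2", "3"], affinity_edges, rows=1, cols=4)
print(best, best_cost)
```

Because heavier edges cost more per unit of distance, the annealer is pushed toward layouts that keep high-affinity operations next to each other.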

2: Ensure Routability of Operands
- Resources are repeatedly used every II cycles
- Routing can fail due to previously scheduled operations
- Backtracking: hard to make forward progress for CGRA
- Take preventative approach
[Figure: a DFG with operations 1 to 7 modulo scheduled onto a 1x3 CGRA (PE 0 to PE 2); routing failed for Op 7]

Skewed Scheduling Space
- Should prevent routing failures in advance
- Skew scheduling space
  - Staggering down to the right
  - Create a narrow, but tall scheduling space
  - Operations can be routed to the right
- Dynamically adjust scheduling space
[Figure: operations 1 to 7 placed on PE 0 to PE 2 in a scheduling space that staggers down to the right]
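
One possible reading of "staggering down to the right" is sketched below in Python: each PE's window of usable time slots starts a fixed number of cycles later than the window of the PE to its left. The window size, the skew of one cycle, and the function names are assumptions; the slides do not define the skew precisely, and the dynamic adjustment is not modeled.

```python
from typing import List, Tuple


def skewed_slots(num_pes: int, window: int, skew: int = 1) -> List[Tuple[int, int]]:
    """(pe, time) slots of a window that starts `skew` cycles later on each PE to the right."""
    return [
        (pe, t)
        for pe in range(num_pes)
        for t in range(pe * skew, pe * skew + window)
    ]


def can_route_right(slots: List[Tuple[int, int]], pe: int, time: int, skew: int = 1) -> bool:
    """True if the slot one PE to the right, `skew` cycles later, lies inside the space."""
    return (pe + 1, time + skew) in set(slots)


# Assumed example: 1x3 array, two slots per PE, skew of one cycle.
space = skewed_slots(num_pes=3, window=2)
print(space)
print(all(can_route_right(space, pe, t) for pe, t in space if pe < 2))
```

Under this reading, every slot except those on the rightmost PE has a later slot available on its right-hand neighbor, which matches the bullet that operations can be routed to the right.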

System Flow

Experimental Setup
- Twelve innermost loop kernels from various domains
- Three designs with different RF configurations
  - Evaluate the impact of register file sharing
  - Dedicated RF, Shared RF, Central RF

Evaluation of Affinity Heuristic
- Results of acyclic scheduling
- Average of 59% reduction in routing cost

Modulo Graph Embedding vs. Simulated Annealing
- Utilization = (# slots used for computation) / (# total slots)
- Compilation time: about 5 sec (modulo graph embedding) vs. 5 min to 3 hours (simulated annealing)
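
The utilization metric can be written out as a small Python function, taking the total slot count to be (number of PEs) x II as on the earlier slide. The numbers in the example are hypothetical and are not results from the talk.

```python
def utilization(compute_slots: int, num_pes: int, ii: int) -> float:
    """Fraction of scheduling slots that perform computation rather than routing or idling."""
    total_slots = num_pes * ii
    if total_slots <= 0:
        raise ValueError("num_pes and ii must be positive")
    return compute_slots / total_slots


# Hypothetical case: 40 operations on a 4x4 CGRA (16 PEs) scheduled at II = 4.
print(utilization(compute_slots=40, num_pes=16, ii=4))
```

Slots spent on routing count in the denominator but not the numerator, so a lower routing cost shows up directly as higher utilization at the same II.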

Impact of Register File Configurations

Conclusions
- Modulo scheduler targeting low-cost CGRAs
  - Provide high computation throughput, scalability, power efficiency
- Two heuristics to generate a good schedule
  - Affinity graph heuristic
  - Skewed scheduling space
- Average utilizations of 56-68% for three designs
- Systematic approach allows fast compilation time
  - All benchmarks finished within 5s

Questions?