Ph.D. in Computer Science

Presentation transcript:

Compiler and Architecture Design for Coarse-Grained Programmable Accelerators. Mahdi Hamzeh, Ph.D. in Computer Science, School of Computing, Informatics, and Decision Systems Engineering. June 26, 2015.

Trends in Silicon Computing. [Figure: layers of improvement added over time: technology, μ-architecture, multi-threading, multi-cores, heterogeneity.] 6/26/15 Compiler and Architecture Design for CGRAs

Why Heterogeneous Computing? Each kind of core exhibits interesting features for one class of computation. Applications execute in phases, and each phase is a different class of computation. A significant silicon area will be dark, so resources should be allocated efficiently based on run-time information. [Figure: power vs. performance for LP Core, HP Core, DSP, GPU, FPGA, and HW ACC.] LP Core: low-power in-order general-purpose core. HP Core: high-performance out-of-order general-purpose core. HW ACC: hardware accelerator.

HW Accelerators are Expensive! High design, test, and verification cost (HW ACC and FPGA). Engineering cost and time to market (HW ACC). Building a specialized HW ACC is expensive and time-consuming. [Figure: system design cost vs. performance for LP Core, HP Core, DSP, GPU, FPGA, and HW ACC.]

HW Accelerators: Low Utilization, Limited Programmability. HW ACC: specialized for one application. DSP, GPU: specialized for a class of computation. FPGA: run-time configuration overhead. A HW ACC performs well in only one application and cannot be reused in another, even when the other application has a phase from a similar class of computation. [Figure: flexibility vs. performance for LP Core, HP Core, DSP, GPU, FPGA, and HW ACC.]

Software Programmable Accelerators: Opportunities and Challenges. Programmability. Compiler support drives down costs. A software-programmable accelerator (SW ACC) is meant to close the cost gap. [Figures: flexibility vs. performance and system design cost vs. performance for LP Core, HP Core, DSP, GPU, FPGA, HW ACC, and SW ACC.]

Coarse-Grained Reconfigurable Architectures

CGRA Designs in Literature: ADRES, 60 GOPS/W

CGRA Designs in Literature: TilePro64, 192 GOPS @ 23 W

Problems Addressed in This Dissertation. CGRA compiler problems: problem definition, complexity analysis. CGRA design. CGRA system integration.

CGRA accelerates loops using modulo scheduling. The target application is specified in C. [Figure: the application, with serial regions around a loop, next to its execution trace, with a prolog, a repetitive region, and an epilog.]

Modulo Scheduling: II is the performance metric. [Figure: operations a through g of iterations 1 to 4 overlapped along the time axis.]
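A minimal sketch of why the initiation interval (II) governs performance: under modulo scheduling a new iteration starts every II cycles, so the total cycle count grows with II rather than with the length of a single iteration. The function names and the example numbers are assumptions for illustration and do not come from the slides.

```python
def total_cycles(num_iterations: int, ii: int, schedule_length: int) -> int:
    """Cycles needed for a modulo-scheduled loop.

    Iteration k starts at cycle k * ii, and each iteration takes
    schedule_length cycles from its first to its last operation.
    """
    if num_iterations <= 0:
        return 0
    return (num_iterations - 1) * ii + schedule_length


def modulo_slot(start_cycle: int, ii: int) -> int:
    """Row of the modulo reservation table occupied by an operation."""
    return start_cycle % ii


# Hypothetical loop: one iteration is 4 cycles long.
print(total_cycles(100, ii=2, schedule_length=4))  # 202
print(total_cycles(100, ii=4, schedule_length=4))  # 400
```

Halving II roughly halves the execution time for long-running loops, which is why a mapper tries to minimize it.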

CGRA Modulo Scheduling: Problem Definition. Define what a valid mapping is: every operation is mapped to a resource from a subset of the CGRA resources, every data dependency is mapped to a path that satisfies certain conditions, and II is minimized.
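The conditions above can be sketched as a checker. This is a simplified illustration, not the formal definition from the dissertation: it assumes a mesh of PEs, a hypothetical `placement` dictionary from operation name to `(row, col, cycle)`, and that every dependency is consumed exactly one cycle after it is produced by the same or a neighboring PE (no registers and no multi-hop routing).

```python
from typing import Dict, List, Tuple

Slot = Tuple[int, int, int]  # (row, col, cycle); a hypothetical encoding


def is_valid_mapping(
    edges: List[Tuple[str, str]],
    placement: Dict[str, Slot],
    ii: int,
) -> bool:
    """Check simplified mapping conditions for a mesh CGRA."""
    # Condition 1: no two operations share a PE in the same modulo time slot.
    used = set()
    for row, col, cycle in placement.values():
        key = (row, col, cycle % ii)
        if key in used:
            return False
        used.add(key)

    # Condition 2 (assumed simplification): each dependency goes to the same
    # PE or a mesh neighbor, exactly one cycle later.
    for src, dst in edges:
        r1, c1, t1 = placement[src]
        r2, c2, t2 = placement[dst]
        if t2 - t1 != 1:
            return False
        if abs(r1 - r2) + abs(c1 - c2) > 1:
            return False
    return True


edges = [("a", "b"), ("a", "c")]
good = {"a": (0, 0, 0), "b": (0, 1, 1), "c": (1, 0, 1)}
bad = {"a": (0, 0, 0), "b": (0, 1, 1), "c": (0, 1, 1)}  # b and c collide
print(is_valid_mapping(edges, good, ii=2))  # True
print(is_valid_mapping(edges, bad, ii=2))   # False
```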


Problem Definition. Important characteristics: routing, re-computing, or both. Epimorphism between the computation graph and the resource graph. Identified the list of necessary conditions that a scheduled computation graph must satisfy. Mapping is NP-complete (reduction from the 3-partition problem).

Problems Addressed in This Dissertation. CGRA compiler problems: problem definition, complexity analysis, mapping algorithm. CGRA design. CGRA system integration.

CGRA Modulo Scheduling Policies. The existing literature addresses this problem using the following policies. Integrated methods: brute force, edge-centric, node-centric, nature-inspired. Decomposition methods: partitioning, nature-inspired.

Assumptions and Limitations. A memory miss stops the execution. A ld/st queue resolves memory dependencies. Only single-assignment instructions are supported. No system calls. No function calls. Single exit condition.

EPIMap. Decomposition: scheduling and placement. Constructive. Evolve the computation graph based on the resource graph. Adjust the resource graph (MII). Efficient placement. How do we address the problem, and why do we do it better?
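A hedged sketch of the outer loop that modulo schedulers commonly use to approach the best II: start at the lower bound (MII) and retry with a larger II whenever scheduling or placement fails. The `try_map` callback is a hypothetical stand-in for the scheduling and placement steps; EPIMap itself also transforms the computation graph, which this sketch does not model.

```python
from typing import Callable, Optional, Tuple, TypeVar

M = TypeVar("M")


def find_mapping(
    mii: int,
    max_ii: int,
    try_map: Callable[[int], Optional[M]],
) -> Optional[Tuple[int, M]]:
    """Return the first (ii, mapping) for which try_map succeeds."""
    for ii in range(mii, max_ii + 1):
        mapping = try_map(ii)  # hypothetical: schedule + place at this II
        if mapping is not None:
            return ii, mapping
    return None  # no mapping found within the II budget


# Toy stand-in: pretend the loop only fits once II reaches 3.
def toy_try_map(ii: int) -> Optional[str]:
    return f"mapped at II={ii}" if ii >= 3 else None


print(find_mapping(mii=2, max_ii=8, try_map=toy_try_map))
# (3, 'mapped at II=3')
```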

EPIMap notable features and policies

Re-Scheduling

Resource Allocation Problem

Resource Allocation: Supporting Multi-cycle Operation

Resource Allocation: Supporting Pipelined Resources

Register Allocation

Rotating and Non-Rotating Register Files

Problems Addressed in This Dissertation. CGRA compiler problems: problem definition, complexity analysis, mapping algorithm, control flow acceleration. CGRA design. CGRA system integration.

Control Flow Acceleration

Partial Predication. [Figure: example graph with nodes a, b, c, e, f, h, et, ef and its mapping; numeric label: 3.]

Full Predication. [Figure: example graph with nodes a, b, c, e, f, h and its mapping; numeric label: 4.]
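To make the two predication schemes concrete, here is a small sketch of how a conditional in a loop body can be turned into straight-line operations. It follows the usual textbook description (partial predication executes both paths and selects a result; full predication guards each operation with a predicate). The function names and the arithmetic are assumptions for illustration and are not taken from the slides' example graph.

```python
def branchy(x: int, a: int, b: int) -> int:
    # Original loop body with control flow.
    if x > 0:
        return a + b
    return a - b


def partial_predication(x: int, a: int, b: int) -> int:
    # Both paths execute unconditionally; a select picks the live result.
    cond = x > 0
    t = a + b  # true-path operation
    f = a - b  # false-path operation
    return t if cond else f  # select operation


def full_predication(x: int, a: int, b: int) -> int:
    # Each operation carries a predicate; only the operation whose
    # predicate is true is allowed to write the shared result.
    cond = x > 0
    result = 0
    guarded_ops = ((cond, lambda: a + b), (not cond, lambda: a - b))
    for predicate, op in guarded_ops:
        if predicate:
            result = op()
    return result


for x in (-1, 0, 5):
    assert branchy(x, 7, 3) == partial_predication(x, 7, 3) == full_predication(x, 7, 3)
```

All three versions compute the same value; they differ in how many operations occupy CGRA resources and in whether an extra select is needed.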

Dual-Issue. [Figure: example graph with nodes a, b, c, e, f, h, et, ef and its dual-issue form.]

Mapping with Dual-Issue. [Figure: mapping of the example graph with nodes a, b, c, e, f, h; numeric label: 2.]

Hardware Support

CGRA Compiler Flow

State of the Art before EPIMap/REGIMap. DRESC: a simulated-annealing-based mapping algorithm. Integrated mapping policy. Supports multi-cycle operations. Supports pipelined PEs. Extended with register allocation. Has been shown to generate better mappings than other mapping algorithms.

EPIMap. DRESC: simulated annealing based. MII = max(ResMII, RecMII). 4 × 4 CGRA, mesh interconnect, 1-cycle latency.
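A short sketch of how the lower bound on II is conventionally computed in modulo scheduling: the resource bound (ResMII) and the recurrence bound (RecMII) must both be respected, so the minimum II is the larger of the two. The function names, the per-cycle `(latency, distance)` encoding, and the example numbers are assumptions for illustration.

```python
import math
from typing import List, Tuple


def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource bound: each PE can issue one operation per cycle."""
    return math.ceil(num_ops / num_pes)


def rec_mii(recurrences: List[Tuple[int, int]]) -> int:
    """Recurrence bound over dependence cycles.

    Each entry is (total latency, total iteration distance) of one cycle
    in the data-dependence graph. With no recurrences the bound is 1.
    """
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)


def mii(num_ops: int, num_pes: int, recurrences: List[Tuple[int, int]]) -> int:
    """Minimum II: both bounds must hold, so take the larger one."""
    return max(res_mii(num_ops, num_pes), rec_mii(recurrences))


# Hypothetical loop: 40 operations on a 4 x 4 CGRA (16 PEs), with one
# recurrence of latency 4 carried across 1 iteration.
print(mii(40, 16, [(4, 1)]))  # 4
print(mii(40, 16, []))        # 3
```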

Mapping and Register Allocation: Single Cycle

Mapping and Register Allocation: Pipelined PEs

Summary of EPIMap/REGIMap vs. DRESC

Configuration            Performance Ratio   Compilation Time Ratio
Single cycle (NO-RA)     1.31X               138X
Single cycle, 2 Regs     1.73X               240X
Single cycle, 4 Regs     1.6X                209X
Single cycle, 8 Regs     1.5X                163X
Pipelined (NO-RA)        1.45X               192X
Pipelined, 2 Regs        1.83X               317X
Pipelined, 4 Regs        1.81X               289X
Pipelined, 8 Regs        1.68X               227X

Mapping Loops With Conditional Instructions

CGRA Research Framework

Summary. Problem definition: supports routing and re-computation. Complexity analysis: reduction from the 3-partition problem. Counterintuitive discovery: re-computation can improve performance. Computation graph and necessary conditions. EPIMap: approximates II progressively; effective iterative scheduling algorithm.

Summary. Placement problem formulation: support of multi-cycle operations, support of pipelined resources, constructive method. REGIMap: integrated placement and register allocation. Support of conditionals: full predication, partial predication, dual-issue. Integration with the LLVM compiler framework.

Summary. CGRA design: ISA, rotating and non-rotating register files, dual-issue support, RTL implementation and synthesis. CGRA simulation framework: CGRA model in gem5.

Future Directions. Support for system calls. Mapping with memory optimization. Software prefetching in mapping. Just-in-time compilation of kernels. Offload decisions at run time. Speculative execution support for CGRAs.

Backup

Backup: Scheduling Success

Clique-Resource Allocation Attempts

Step by Step Example