CML REGISTER FILE ORGANIZATION FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURES (CGRAs) Dipal Saluja Compiler Microarchitecture Lab, Arizona State University,

Slides:

Advertisements

Similar presentations

Computer Organization, Bus Structure

Advertisements

Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.

Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.

5th International Conference, HiPEAC 2010 MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS Yongjoo Kim, Jongeun Lee *, Aviral Shrivastava.

Course-Grained Reconfigurable Devices. 2 Dataflow Machines General Structure:  ALU-computing elements,  Programmable interconnections,  I/O components.

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.

University of Michigan Electrical Engineering and Computer Science 1 Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution.

CML Enabling Multithreading on CGRAs Reiley Jeyapaul, Aviral Shrivastava 1, Jared Pager 1, Reiley Jeyapaul, 1 Mahdi Hamzeh 12, Sarma Vrudhula 2 Compiler.

11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.

PipeRench: A Coprocessor for Streaming Multimedia Acceleration Seth Goldstein, Herman Schmit et al. Carnegie Mellon University.

University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott.

A High Performance Application Representation for Reconfigurable Systems Wenrui GongGang WangRyan Kastner Department of Electrical and Computer Engineering.

LCTES 2010, Stockholm Sweden OPERATION AND DATA MAPPING FOR CGRA’S WITH MULTI-BANK MEMORY Yongjoo Kim, Jongeun Lee *, Aviral Shrivastava ** and Yunheung.

Programmable logic and FPGA

HW/SW Co-Synthesis of Dynamically Reconfigurable Embedded Systems HW/SW Partitioning and Scheduling Algorithms.

Chapter 6 Memory and Programmable Logic Devices

03/12/20101 Analysis of FPGA based Kalman Filter Architectures Arvind Sudarsanam Dissertation Defense 12 March 2010.

Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2013.

Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.

A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,

Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

Efficient FPGA Implementation of QR

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

Chapter One Introduction to Pipelined Processors.

Automated Design of Custom Architecture Tulika Mitra

Efficient Mapping onto Coarse-Grained Reconfigurable Architectures using Graph Drawing based Algorithm Jonghee Yoon, Aviral Shrivastava *, Minwook Ahn,

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

Array Synthesis in SystemC Hardware Compilation Authors: J. Ditmar and S. McKeever Oxford University Computing Laboratory, UK Conference: Field Programmable.

Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor Hamid Noori †, Farhad Mehdipour ‡, Kazuaki Murakami †, Koji.

CML RESIDUE NUMBER SYSTEM ENHANCEMENTS FOR PROGRAMMABLE PROCESSORS Arizona State University Rooju Chokshi 7 th November, 2008 Compiler-Microarchitecture.

Reconfigurable Computing Using Content Addressable Memory (CAM) for Improved Performance and Resource Usage Group Members: Anderson Raid Marie Beltrao.

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.

Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.

CML SSDM: Smart Stack Data Management for Software Managed Multicores Jing Lu Ke Bai, and Aviral Shrivastava Compiler Microarchitecture Lab Arizona State.

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki.

COARSE GRAINED RECONFIGURABLE ARCHITECTURES 04/18/2014 Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku DAY

Computer Organization. This module surveys the physical resources of a computer system.  Basic components  CPU  Memory  Bus  I/O devices  CPU structure.

Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Stored Program A stored-program digital computer is one that keeps its programmed instructions, as well as its data, in read-write,

OPTIMIZING DSP SCHEDULING VIA ADDRESS ASSIGNMENT WITH ARRAY AND LOOP TRANSFORMATION Chun Xue, Zili Shao, Ying Chen, Edwin H.-M. Sha Department of Computer.

A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.

Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke

An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis, George Theodoridis and Spiridon.

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.

Hyunchul Park†, Kevin Fan†, Scott Mahlke†,

Penn ESE534 Spring DeHon 1 ESE534 Computer Organization Day 9: February 13, 2012 Interconnect Introduction.

CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.

1 3 Computing System Fundamentals 3.2 Computer Architecture.

My Coordinates Office EM G.27 contact time:

Institute of Software,Chinese Academy of Sciences An Insightful and Quantitative Performance Optimization Chain for GPUs Jia Haipeng.

Scalable Register File Architectures for CGRA Accelerators

Ph.D. in Computer Science

CGRA Express: Accelerating Execution using Dynamic Operation Fusion

CDA 3101 Spring 2016 Introduction to Computer Organization

CSE-591 Compilers for Embedded Systems Code transformations and compile time data management techniques for application mapping onto SIMD-style Coarse-grained.

Anne Pratoomtong ECE734, Spring2002

EPIMap: Using Epimorphism to Map Applications on CGRAs

Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke

Summary of my Thesis Power-efficient performance is the need of the future CGRAs can be used as programmable accelerators Power-efficient performance can.

URECA: A Compiler Solution to Manage Unified Register File for CGRAs

Multivector and SIMD Computers

1. Arizona State University, Tempe, USA

Mattan Erez The University of Texas at Austin

RAMP: Resource-Aware Mapping for CGRAs

Chapter 4 The Von Neumann Model

Presentation transcript:

CML REGISTER FILE ORGANIZATION FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURES (CGRAs) Dipal Saluja Compiler Microarchitecture Lab, Arizona State University, Tempe, Arizona, USA

CML Web page: aviral.lab.asu.edu CML Need for Power Efficient Computing 2  Power Efficient Computing Required at:  Micro-architecture level  Chip level  Data Center Level  Accelerators help achieve power efficient computing  Specialized Hardware for Application Specific Operations  Improve performance while reducing power  Scales from mobile devices to super computers  Qualcomm’s Adreno  Titan at Oak Ridge National Laboratory (NVIDIA Tesla)

CML Web page: aviral.lab.asu.edu CML Coarse-Grained Reconfigurable Architectures (CGRAs) 3  2D array of Processing Elements (PEs)  ALU + Local register file → PE  Mesh interconnection  Shared data bus  Data memory  PE inputs:  4 Neighboring PEs  Output Register (Self)  Local register file

CML Web page: aviral.lab.asu.edu CML a b c d e f g Time What to Map and How?

CML Web page: aviral.lab.asu.edu CML 5 What to Map and How? Time b cd e f g 2

CML Web page: aviral.lab.asu.edu CML Comparison with GPUs and FPGAs 6 GPGPUFPGACGRA Data Parallel Loops Non-Parallel Loops Setup Time

CML Web page: aviral.lab.asu.edu CML Time P 1 2 Q 1 2 Mapping w/o and with Registers for(…) { a=1; b=a-1; c=b*2; d=c-a; }

CML Web page: aviral.lab.asu.edu CML Rotating Register File Structure 8  Rotation performed every II cycles  Proposed in [Essen et al.] Divide by II clockIncrement every II cycles Offset Counter Modulo #regs Counter

CML Web page: aviral.lab.asu.edu CML Rotating Register File in Action 9  Rotation performed every II cycles

CML Web page: aviral.lab.asu.edu CML Introduction to the Problem 10  Loop kernels have constants  E.g., global variables, base address of arrays  Constants are difficult to propagate using Rotating Registers  We need both Rotating RF, as well as Non-Rotating RF in CGRA My Question: What kind of structures should we use to support both kind of RFs for CGRAs? How to efficiently utilize these structures in the compiler? Fixed RF Shared RF Programmable RF My Question: What kind of structures should we use to support both kind of RFs for CGRAs? How to efficiently utilize these structures in the compiler? Fixed RF Shared RF Programmable RF

CML Web page: aviral.lab.asu.edu CML State of the Art in RF Organizations for CGRA 11  Most work has explored how to use Rotating RFs in the CGRA  Architecture: RaPiD [Ebeling et al.], ADRES [Bouwens et al.]  Compiler: EMS [Hynchul et al.], [Sutter et al.], REGIMap [Hamzeh et al.]  Hardware impact of Shared Register File configurations studied in [Essen et al., Kwok et al.] in terms of  Degree of connectivity, Number of ports, Number of registers in the RF  Estimate the power, area and frequency of the designs  Compilation for Hybrid Architecture (with rotating and non-rotating RF) is largely overlooked

CML Web page: aviral.lab.asu.edu CML Fixed RF (FRF): Architecture 12  Physically partitioned into Rotating and Non-Rotating Region

CML Web page: aviral.lab.asu.edu CML Shared RF: Architecture 13  Rotating Registers per PE and Shared Non-Rotating Registers/Row

CML Web page: aviral.lab.asu.edu CML Pros and Cons of each RF Structure 14 Shared RFFixed RF Partition Fixed at design time Does not scale with the size of CGRA Scales with the size of CGRA Limited to 1 write/cycle/row All PEs in a row can write to their local RF Increase in Instruction SizeNo increase in Instruction Size High Area and Frequency Overhead Low Area and Frequency Overhead Desired RF Partition Reconfigurable at run-time Scales with the size of CGRA All PEs in a row can write to their local RF No increase in Instruction Size Minimal Area and Frequency Overhead

CML Web page: aviral.lab.asu.edu CML Increase in Instruction Size 15  Fixed RF  PEs input limited to neighbors, local RF and itself  No increase in Instruction word size  Shared RF  PEs now also read from a Shared RF  Need new bits in Instruction word to read from Shared RF  Increase in Instruction word size

CML Web page: aviral.lab.asu.edu CML Programmable RF (PRF): Architecture 16  Logical boundary between Rotating and Non-Rotating Region T T T

CML Web page: aviral.lab.asu.edu CML Pros and Cons of each RF Structure 17 Shared RFFixed RF Partition Fixed at design time Does not scale with the size of CGRA Scales with the size of CGRA Limited to 1 write/cycle/row All PEs in a row can write to their local RF Increase in Instruction SizeNo increase in Instruction Size No configuration instructions needed Maximum Area and Frequency Overhead Least Area and Frequency overhead 1 instance of a variable amongst PEs in the same row. Multiple copies of the same variable amongst PEs in the same row Programmable RF Reconfigurable at run-time Scales with the size of CGRA All PEs in a row can write to their local RF No increase in Instruction Size 1 configuration instruction/PE introduced Minimal Area and Frequency Overhead Multiple copies of the same variable amongst PEs in the same row

CML Web page: aviral.lab.asu.edu CML Compiler Support 18 for(…) { l=p1[i]; a[i]=l+d[i-2]; b=a[i]+1; c=b-1; d[i]=a[i]-c; p2[i]=d[i]; } 2 DFG contains:  Nodes:  Operation being performed  Type of operation [Arithmetic, Memory, Constant Operand]  Weighted Edges:  Depict data dependencies  Weight signifies loop carried dependency distance

CML Web page: aviral.lab.asu.edu CML Compiler Support 19  Easily integrates with any Mapping Algorithm.  Input: Kernel DFG and CGRA configuration  Output: A Valid Mapping of operations from the Input DFG to a Time Extended CGRA  Steps: do{ mapping = getMapping (DFG, CGRA) CheckRFconstraints(mapping, CGRA) if(RF Constraints violated) MappingNotFound }while(MappingNotFound);

CML Web page: aviral.lab.asu.edu CML Compiler Support 20 CheckRFconstraints(mapping, CGRA) { RFconstraintsSatisfied = true; for each operation op in mapping { pe_number = op.getMappedPE() usesRF = op.isRFUsed() if(usesRF) { usesNRF = op.isNRFUsed() writeToRegister = op.performRegisterWrite(); RFconstraintsSatisfied = CheckPRFconstraints(pe_number, usesNRF, writeToRegister) } if(RFconstraintsSatisfied == false) break; } return RFconstraintsSatisfied ; }

CML Web page: aviral.lab.asu.edu CML Compiler Support 21 CheckPRFconstraints(pe_number:int, usesNRF:boolean, writeToRegister:boolean) { if(writeToRegister == true) { if(PE[pe_number].utilizedRegisters >= CGRA.registersPerPE) return false; else PE[pe_number].utilizedRegisters++ } else //read from RF { if(usesNRF == false) //uses Rotating RF PE[pe_number].utilizedRegisters-- } return true; }

CML Web page: aviral.lab.asu.edu CML PRF is Best in a Register-Constrained CGRA (Lower II is better) 22 Mapping with 64 Registers Mapping with 32 Registers

CML Web page: aviral.lab.asu.edu CML PRF Requires the Minimum no. of Registers to get a Mapping 23

CML Web page: aviral.lab.asu.edu CML Number of registers required to achieve the minimum II 24

CML Web page: aviral.lab.asu.edu CML PRF Imposes Minimal Area Overhead 25 Courtesy: Mahdi Hamzeh and Shri Hari

CML Web page: aviral.lab.asu.edu CML PRF Imposes Minimal Frequency Overhead 26 Courtesy: Mahdi Hamzeh and Shri Hari

CML Web page: aviral.lab.asu.edu CML Conclusion 27  Coarse-Grained Reconfigurable Architectures are promising accelerators  Specially to speedup Non-Parallel Loops  Registers can be used to improve the performance of loops on CGRA  Most existing work focused on Rotating Registers with CGRA  However, both Rotating and Non-Rotating registers are needed for efficient execution of loops on CGRA  Programmable Register File provides the highest flexibility in terms of application mapping to CGRAs  Flexible partitioning between rotating and non-rotating regions  Can map with minimal number of registers  Scalable architecture  Minimal area and power overhead

CML Web page: aviral.lab.asu.edu CML SNRRF CONFIGURATION LIMITS MAPPING 28