CML Path Selection based Branching for CGRAs. ShriHari RajendranRadhika. Thesis Committee: Prof. Aviral Shrivastava (Chair), Prof. Jennifer Blain Christen.

Presentation transcript:

CML Path Selection based Branching for CGRAs
ShriHari RajendranRadhika
Thesis Committee: Prof. Aviral Shrivastava (Chair), Prof. Jennifer Blain Christen, Prof. Yu (Kevin) Cao

CML Web page: aviral.lab.asu.edu

Accelerators for Energy Efficiency
- There is demand for high performance at low power consumption.
- Accelerators help achieve power-efficient computing.
- They are specialized hardware that efficiently executes the dominant computations of a program.
- They scale from mobile devices to supercomputers.
[Figure: spectrum of compute platforms (general-purpose processors, GPGPUs, FPGAs, CGRAs, hardware accelerators) trading flexibility for power efficiency; the goal is both.]
Source: Fine- and Coarse-Grain Reconfigurable Computing, Springer.

Coarse-Grained Reconfigurable Architectures (CGRAs)
- 2D array of Processing Elements (PEs)
- PE = ALU + local register file
- Torus interconnection
[Figure: a processor coupled with the CGRA accelerator through shared memory.]

Acceleration of loops using CGRAs
- Programs spend the majority of their execution time in loops [2].
- Research on CGRAs has therefore focused on accelerating loops.
- Accelerating loops can result in faster overall execution time.

for(…) {
  a = a + X;
  b = b - X;
  c = a * b;
  d = c - b;
  e = d + X;
}

[2] B. R. Rau, "Iterative modulo scheduling: An algorithm for software pipelining loops," in Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27).

Data Flow Graph Generation
- Create a DFG from a simple loop kernel:

for(…) {
  a = a + X;
  b = b - X;
  c = a * b;
  d = c - b;
  e = d + X;
}
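The DFG on this slide can be built mechanically: each statement becomes a node, and an edge is drawn from the most recent definition of each operand to its use. A minimal sketch for the kernel above (intra-iteration dependences only):

```python
# Build a data-flow graph (DFG) for the loop body:
#   a = a + X; b = b - X; c = a * b; d = c - b; e = d + X
# Each statement is a node; an edge (u, v) means node v consumes a value node u defines.
stmts = [
    ("a", ["a", "X"]),  # a = a + X
    ("b", ["b", "X"]),  # b = b - X
    ("c", ["a", "b"]),  # c = a * b
    ("d", ["c", "b"]),  # d = c - b
    ("e", ["d", "X"]),  # e = d + X
]

edges = []
last_def = {}  # variable -> index of the node that most recently defined it
for i, (dst, srcs) in enumerate(stmts):
    for s in srcs:
        if s in last_def:                 # intra-iteration dependence
            edges.append((last_def[s], i))
    last_def[dst] = i

print(edges)  # [(0, 2), (1, 2), (2, 3), (1, 3), (3, 4)]
```

The edges match the arrows on the slide: the multiply consumes a and b, the second subtract consumes c and b, and the final add consumes d.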

Mapping DFG to CGRA
[Figure: DFG operations placed on PE 1 through PE 4 over time.]

Mapping DFG to CGRA using Modulo Scheduling
[Figure: modulo-scheduled placement of the DFG operations on PE 1 through PE 4 over time.]
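Modulo scheduling starts a new loop iteration every II (initiation interval) cycles. A quick sanity bound is the resource-constrained minimum II: each PE can execute at most one operation per cycle, so the kernel's operations must fit into the PE slots available every II cycles. A sketch (the 4-PE row below is illustrative, matching the figures on these slides):

```python
from math import ceil

def res_mii(num_ops, num_pes):
    """Resource-constrained lower bound on the initiation interval:
    num_ops operations must fit into num_pes slots every II cycles."""
    return ceil(num_ops / num_pes)

# The 5-node DFG from the previous slide on a 4-PE CGRA:
print(res_mii(5, 4))    # -> 2
# A 16-op kernel exactly fills a 4x4 CGRA each cycle:
print(res_mii(16, 16))  # -> 1
```

Lower II means more iterations in flight per cycle, i.e. better acceleration; this is the metric the rest of the talk optimizes.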

One of the major challenges in CGRAs
- How do we efficiently accelerate the execution of loops with if-then-else structures?

Why accelerate loops with control flow?
- 40% of the loops in the SPEC2006 benchmarks that could be accelerated by CGRAs have control flow (if-then-else structures) in them [3].
- On average, 50.1% of the instructions in a loop with control flow are in the conditional path.
- Relatively few compiler solutions exist to accelerate loops with control flow.
[3] "Branch-aware loop mapping on CGRAs," in Proceedings of the 51st Annual Design Automation Conference (DAC '14).

Inefficiency of Existing Techniques
- First, instructions from both paths of the branch are fetched and issued unconditionally to the CGRA.
[Figure: mappings under partial predication, full predication, and dual issue, showing if-path and else-path nodes guarded by predicates z_t and z_f, and the resulting IIs (II = 3 vs. II = 5).]

Inefficiency of Existing Techniques (continued)
- The predicate value needs to be communicated to the nodes handling instructions in the control-flow block.
[Figure: predicate communication under partial predication, full predication, and dual issue (II = 3 vs. II = 5).]

Proposed Solution: Path Selection based Branching (PSB)
- PSB executes the branch operation as early as possible.
- The branch outcome is communicated to the Instruction Fetch Unit.
- Only the instructions from the path taken by the branch are issued to the CGRA.
- This is very much like how processors execute branches.
- However, it needs compiler support on CGRAs.
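The core PSB mechanism can be summarized behaviorally: the resolved branch outcome steers the fetch unit, so only the taken path's instructions reach the CGRA. A simplified sketch (the instruction strings below are hypothetical, not the CGRA's ISA):

```python
def psb_issue(branch_taken, if_path, else_path):
    """Path Selection based Branching: the branch outcome steers the
    Instruction Fetch Unit, so only the taken path's instructions are
    issued to the CGRA. Predication and dual issue, by contrast, fetch
    and issue both paths every iteration."""
    return if_path if branch_taken else else_path

# Hypothetical instruction streams for the two sides of a branch:
if_path = ["add r1, r2, r3", "mul r4, r1, r1"]
else_path = ["sub r1, r2, r3", "add r4, r1, r2"]

print(psb_issue(True, if_path, else_path))   # only the if-path is issued
print(psb_issue(False, if_path, else_path))  # only the else-path is issued
```

This is why PSB avoids both the wasted issue slots and the predicate-communication traffic of the existing schemes.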

Arrangement of Instructions for the PSB Approach

Architecture Support for PSB

What must the compiler do?
- Map operations from the if-path and else-path onto the time-extended CGRA.
- The set of PEs required to execute the branch is the union of the PEs required to map the if-path and the else-path operations.
- To improve resource utilization, each operation from the if-path must be "paired" with an operation from the else-path and mapped to the same PE.

Pairing of operations
- Achieved the lowest II so far!

Why do we need to pair operations?
- If pairing is not done, the resources required to execute operations from the conditional path are the sum of the resources required for the if-path and the else-path.
- Such a mapping results in poor resource utilization.

Problem Formulation
- Input: a data flow graph with if-path and else-path operations.
- Output: a data flow graph with fused nodes, each fused node holding two operations: one from the if-path and one from the else-path.
- Valid output: a transformation/pairing is valid iff the dependence order of both the if-path operations and the else-path operations is preserved in the output graph.
- Optimization: minimize the number of nodes in the output data flow graph while maintaining validity.
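The validity condition can be checked mechanically: fuse each pair into a single node and verify the fused graph is still acyclic, since a cycle would make it impossible to respect the original dependence order of either path. A sketch, assuming DFGs are given as edge lists (node names are hypothetical):

```python
def is_valid_pairing(edges, pairs):
    """A pairing is valid iff fusing each (if-op, else-op) pair into one
    node leaves the dependence graph acyclic, i.e. the dependence order
    of both paths can still be respected."""
    rep = {}                       # node -> its fused representative
    for a, b in pairs:
        rep[a] = rep[b] = (a, b)
    fused_edges, nodes = set(), set()
    for u, v in edges:
        fu, fv = rep.get(u, u), rep.get(v, v)
        nodes.update((fu, fv))
        if fu != fv:
            fused_edges.add((fu, fv))
    # Cycle check: repeatedly remove nodes with no incoming edges (Kahn).
    while nodes:
        free = {n for n in nodes if not any(v == n for _, v in fused_edges)}
        if not free:
            return False           # a cycle remains -> invalid pairing
        nodes -= free
        fused_edges = {(u, v) for u, v in fused_edges if u not in free}
    return True

# Hypothetical if-path (i1 -> i2) and else-path (e1 -> e2):
edges = [("i1", "i2"), ("e1", "e2")]
print(is_valid_pairing(edges, [("i1", "e1"), ("i2", "e2")]))  # True
print(is_valid_pairing(edges, [("i1", "e2"), ("i2", "e1")]))  # False: crossed pairing creates a cycle
```

The crossed pairing fails because each fused node would have to execute both before and after the other.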

Are all possible pairings correct?
[Figure: an example of a valid pairing and an example of an invalid pairing.]

Optimization: Minimize the number of nodes
- We minimize the number of nodes by eliminating eligible Phi nodes.
[Figure: examples of eligible and non-eligible Phi nodes.]
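The intuition can be sketched: a phi that merely selects between an if-path value and an else-path value becomes redundant once its two producers are fused into one node, because under PSB only one of them executes and the fused node can write the result directly. This is an illustrative model of eligibility, not the thesis's exact rule (which is given in this slide's figure):

```python
def eliminable_phis(phis, pairs):
    """Illustrative rule: a phi merging (if_value, else_value) is redundant
    when its two producers are fused into the same node, since only one
    path executes and the fused node can commit the merged value itself."""
    fused = {frozenset(p) for p in pairs}
    return [phi for phi, (v_if, v_else) in phis.items()
            if frozenset((v_if, v_else)) in fused]

# Hypothetical phis and pairings:
phis = {"phi_a": ("a_if", "a_else"), "phi_b": ("b_if", "b_else")}
pairs = [("a_if", "a_else"), ("b_if", "c_else")]
print(eliminable_phis(phis, pairs))  # -> ['phi_a']
```

phi_b survives because its producers were paired with other operations, so its merge still has to be materialized.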

Our Heuristic

Performance Evaluation Model
- The CGRA is implemented in the gem5 system simulation framework.
- We integrated our PSB compiler technique as a separate pass in the LLVM compiler framework.
- Computational loops with control flow are extracted from the SPEC2006 and BioBench benchmarks.
- We use the REGIMap mapping algorithm to obtain a mapping for all approaches.
- We map the loops on a 4 × 4 regular torus-interconnected CGRA.

PSB achieves the best acceleration of loops
- PSB achieves better acceleration (lower II) than existing techniques for accelerating control-flow loops.

Why are we able to achieve a better II?

Hardware Implementation
- We implemented an RTL model of a 4x4 CGRA with a torus interconnect network, including the Instruction Fetch Unit, for all CGRA architectures.
- The models were synthesized with a 65nm technology library using the RTL Compiler tool and verified for functionality after synthesis.
- To obtain the accurate impact of predicate communication in a PSB architecture on the overall frequency and area of the CGRA, place and route was performed using the Cadence Encounter tool.

PSB Architecture has comparable Area and Frequency
- The PSB architecture has comparable area, frequency, and power to existing solutions.
[Table: Area (sq. um) and Frequency (MHz) of the CGRA+IFU* under Partial Predication, Full Predication, Dual Issue, and PSB.]
*IFU: Instruction Fetch Unit

Energy Model
- Total energy to execute the loop kernel = energy spent by the PEs per cycle over the kernel + dynamic energy spent on instruction fetch operations per PE over the kernel.
- The energy spent by a PE per cycle is estimated separately for ALU operations, routing operations, and idle cycles.
- The energy expenditure for instruction access is estimated for each architecture using the CACTI 5.3 tool.
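The slide's model can be written out directly. All numbers below are placeholders for illustration, not measured values from the thesis:

```python
def kernel_energy(cycles, num_pes, e_pe_cycle, fetches_per_pe, e_fetch):
    """Total kernel energy per the slide's model:
    PE energy over all cycles + dynamic instruction-fetch energy per PE.
    e_pe_cycle would be estimated per operation class (ALU / routing / idle);
    e_fetch would come from a memory model such as CACTI 5.3."""
    pe_energy = cycles * num_pes * e_pe_cycle
    fetch_energy = num_pes * fetches_per_pe * e_fetch
    return pe_energy + fetch_energy

# Placeholder numbers for a 4x4 CGRA running a kernel for 100 cycles:
print(kernel_energy(cycles=100, num_pes=16, e_pe_cycle=1.0,
                    fetches_per_pe=100, e_fetch=0.5))  # -> 2400.0
```

The fetch term is where PSB wins: issuing only the taken path reduces fetches_per_pe relative to schemes that fetch both paths.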

Relative Energy Consumption
- Energy consumed executing the kernel of each benchmark, relative to our PSB technique.
[Figure: per-benchmark relative energy consumption.]

Conclusion
- PSB issues instructions only from the path taken by the branch at run time.
- It utilizes the branch outcome, which is available at run time.
- It alleviates the predicate-communication overhead.
- It achieves a lower II.
- It achieves better energy efficiency.

Publications
- ShriHari RajendranRadhika, Aviral Shrivastava, and Mahdi Hamzeh, "Path Selection Based Acceleration of Conditionals in CGRAs," DATE 2015 (under review).

Questions?

Backup slides

Percentage of instructions in the conditional path

Instruction memory overhead

Related Work
- Control-flow execution is commonly handled by two techniques:
- Predication:
  - In a predication scheme, both paths of the branch are executed in parallel at run time.
  - The final result is selected between the outputs of both paths based on the branch condition's outcome.
- Dual issue (state of the art):
  - In the dual-issue scheme, one instruction from the if-path and one from the else-path are issued together to a processing element.
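Predication's select step can be illustrated behaviorally: both paths compute every iteration, and a select then commits the value from the taken path. A sketch (plain Python standing in for the CGRA's select instruction):

```python
def predicated_select(pred, val_if, val_else):
    """Under (partial) predication both paths execute every iteration;
    a select instruction then commits the taken path's value."""
    return val_if if pred else val_else

# Both paths run regardless of the condition -- the cost predication pays:
x = 7
val_if = x + 1      # if-path work, always executed
val_else = x - 1    # else-path work, always executed
print(predicated_select(x > 0, val_if, val_else))  # -> 8
```

Contrast this with PSB, where the untaken path's work is never even issued to the array.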

Consider an example of a loop with control flow
- SSA transformation

Partial Predication Scheme
- Need a new DFG for loops with control flow
- Add select instructions

Hardware Support

Obtained II after pairing of operations

Full Predication Scheme
- Restricts where the nodes updating the same variable can be mapped.

All PEs connected to IFU
- Area =
- Power = 141 mW
- Frequency = 458 MHz

Dual Issue Scheme (state of the art)
- Create a new DFG with packed nodes.
- Better II than the predication schemes.

Synthesis Incremental Optimization
[Figure: area versus delay plot.]

IFU synthesis results

Algorithm

Create fused nodes

Create DFG with fused nodes
[Figure: the DFG with fused nodes highlighted.]

Mapping DFG onto a CGRA
[Figure: the DFG mapped onto the CGRA over time.]

Initiation Interval
[Figure: kernel schedule on the CGRA over time, illustrating the initiation interval.]