Summary of my Thesis
- Power-efficient performance is the need of the future.
- CGRAs can be used as programmable accelerators:
  - Power-efficient performance can be achieved.
  - Usability is limited by compilation difficulties.
- Multi-threaded applications need multi-threading capabilities in the CGRA.
- Proposed: a two-step runtime methodology
  - Non-restrictive compile-time constraints to schedule the application into pages
  - A dynamic transformation procedure to shrink or expand the resources used by a schedule
- Features:
  - No additional hardware required
  - Improved CGRA resource usage
  - Improved system performance
- Publications:
  - "Enabling Multi-threading on CGRAs", in ICPP'11
  - "Increasing CGRA Utilization through Multi-threading for Power-efficient Embedded Systems", journal article to be submitted to TECS
[Draft note: Numbers?]
12/1/2018

Increasing CGRA Utilization through Multi-threading for Power-efficient Embedded Systems
Jared Pager
Supervisory Committee: Prof. Aviral Shrivastava (Chair), Prof. Sandeep Gupta, Prof. Gil Speyer

Need for High Performance Embedded Systems
- Advanced user interface
- One-size-fits-all design
- Ubiquitous computing

Need for Power-efficient Embedded Systems
- Battery constraints: capacity is approaching a maximum value within a limited volume.
- Cooling constraints: more cooling increases both volume and power consumption.

Accelerators: Can Help Achieve Power-efficient Performance
- Power-critical computations can be off-loaded to accelerators.
- Accelerators perform application-specific operations.
- They achieve high throughput without loss of CPU programmability.
Existing examples:
- Hardware accelerator: Intel SSE
- Reconfigurable accelerator: FPGA
- Graphics accelerator: NVIDIA Tesla (Fermi GPU)
[Draft note, technical question: How is the IBM Cell an accelerator?]
[Draft note, flow question: Having presented the existing processor accelerators that help achieve power-efficient performance, what is the motivation for yet another solution in this direction? How do I highlight that CGRAs, when used as accelerators, help in this cause?]

Coarse Grained Reconfigurable Array (CGRA): Power-efficient Accelerator
- PEs communicate through an interconnect network.
- CGRAs have been shown to operate at 100s of GOps/W.
Distinguishing characteristics:
- Flexible programming (as compared to SSE)
- Power-efficient computing (as compared to FPGAs)
- More general purpose (as compared to GPUs)
- High performance
Cons:
- Compiling a program for a CGRA is difficult.
- Not all applications are efficiently compiled.
- There is no standard CGRA architecture.
- Extensive compiler support is required for general-purpose computing.
Example CGRAs: ADRES, MorphoSys, KressArray, RSPA, DART
[Figure: CGRA block diagram. An array of PEs with local instruction memory and local data memory, connected to main system memory. Each PE contains an FU and an RF, takes inputs from neighbors and memory, and sends outputs to neighbors and memory.]

Mapping a Kernel onto a CGRA
Given the kernel's DDG:
- Unroll and software-pipeline the loop for mapping onto the given CGRA.
- Schedule the unrolled loop for minimum II* (= 1).
- Map nodes onto the PE array:
  - Place dependent nodes close to their sources.
  - Ensure dependent nodes have interconnects connecting them to their sources.
Loop:
    t1 = (a[i] + b[i]) * c[i]
    d[i] = ~t1 & 0xFFFF
[Figure: data-dependency graph with nodes 1 to 9, and its spatial mapping and temporal scheduling onto the PE array. Node (iteration) executing in one time slot: 1(i), 2(i), 3(i-1), 4(i-2), 5(i-2), 6(i-3), 7(i-4), 8(i-5), 9(i-6).]
* Explained later.
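The loop on this slide can be written out and checked with a small sketch. The function `kernel`, the node names in `DDG`, and the unit-latency assumption are illustrative only; the slide's own graph has nine numbered nodes whose operations are not recoverable from the transcript, so this eight-node decomposition is a hypothetical stand-in, not the author's DDG.

```python
def kernel(a, b, c):
    """Reference semantics of the slide's loop body, one output per index i."""
    return [~((x + y) * z) & 0xFFFF for x, y, z in zip(a, b, c)]


# Hypothetical data-dependency graph: node -> list of predecessor nodes.
# The node names are assumptions made for this sketch.
DDG = {
    "ld_a": [], "ld_b": [], "ld_c": [],
    "add": ["ld_a", "ld_b"],
    "mul": ["add", "ld_c"],
    "not": ["mul"],
    "and": ["not"],
    "st_d": ["and"],
}


def asap_levels(ddg):
    """Earliest cycle for each node, assuming every operation takes one cycle."""
    levels = {}

    def level(node):
        if node not in levels:
            # A node with no predecessors lands on level 0.
            levels[node] = 1 + max((level(p) for p in ddg[node]), default=-1)
        return levels[node]

    for node in ddg:
        level(node)
    return levels
```

The ASAP levels give a lower bound on the schedule depth; a real mapper must additionally respect PE count and interconnect limits, which this sketch ignores.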

Mapped Kernel Executed on the CGRA
Loop:
    t1 = (a[i] + b[i]) * c[i]
    d[i] = ~t1 & 0xFFFF
- Entire kernel can be mapped onto the CGRA by unrolling 6 times.
- After cycle 6, one iteration of the loop completes execution every cycle.
- Iteration Interval (II) is a measure of mapping quality. Here, Iteration Interval = 1.
[Figure: data-dependency graph with nodes 1 to 9, and the PE array contents per execution time slot (cycle), written as node(iteration):]
    Slot 0: 1(0) 2(0)
    Slot 1: 1(1) 2(1) 3(0)
    Slot 2: 1(2) 2(2) 3(1) 4(0) 5(0)
    Slot 3: 1(3) 2(3) 3(2) 4(1) 5(1) 6(0)
    Slot 4: 1(4) 2(4) 3(3) 4(2) 5(2) 6(1) 7(0)
    Slot 5: 1(5) 2(5) 3(4) 4(3) 5(3) 6(2) 7(1) 8(0)
    Slot 6: 1(6) 2(6) 3(5) 4(4) 5(4) 6(3) 7(2) 8(1) 9(0)
    Slot 7: 1(7) 2(7) 3(6) 4(5) 5(5) 6(4) 7(3) 8(2) 9(1)
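The pipelined execution described here can be reproduced with a short sketch. The per-node offsets are read off the schedule shown on the mapping slide (node 3 runs one iteration behind nodes 1 and 2, and so on); the 0-based cycle numbering and the function names are assumptions made for illustration.

```python
# Cycle offset of each DDG node relative to the start of its iteration,
# taken from the schedule 1(i), 2(i), 3(i-1), ..., 9(i-6).
OFFSETS = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2, 6: 3, 7: 4, 8: 5, 9: 6}


def active_iterations(cycle, ii=1):
    """Return {node: iteration} for the nodes that execute in `cycle`.

    Iteration k starts at cycle k * ii, so node n runs iteration k at
    cycle k * ii + OFFSETS[n]. Cycles are numbered from 0 (an assumption).
    """
    active = {}
    for node, offset in OFFSETS.items():
        if cycle >= offset and (cycle - offset) % ii == 0:
            active[node] = (cycle - offset) // ii
    return active


def total_cycles(n_iterations, ii=1):
    """Cycles needed to finish n_iterations of the software-pipelined loop."""
    depth = max(OFFSETS.values()) + 1  # 7 stages for this schedule
    return depth + (n_iterations - 1) * ii
```

With II = 1, the first iteration finishes in cycle 6 and every later cycle retires one more iteration, so 100 iterations take 106 cycles under these assumptions.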

Traditional Use of CGRAs
- CGRAs are traditionally used for streaming applications.
- The application kernel is mapped onto the CGRA.
- System inputs are given to the application, one after the other.
- The mapped kernel processes the inputs with improved power efficiency.
- The power efficiency achieved depends on the spatial and temporal mapping produced by the "compiler".
[Figure: a stream application mapped onto a 16-PE array (E0 to E15), with application input entering and application output leaving the array.]

Envisioned Use of CGRAs
- Specific kernels in a thread can be power/performance critical.
- The kernel can be mapped and scheduled for execution on the CGRA.
- The CGRA is used as a co-processor (accelerator):
  - Power-consuming processor execution time is saved.
  - The thread performs better.
  - Overall system throughput is increased.
Program thread, with the kernel to accelerate:
    A = getMatrix();
    B = getMatrix();
    for (i = 0; i < size; i++)
      for (j = 0; j < size; j++)
        for (k = 0; k < size; k++)
          C[j][i] += A[k][i] * B[j][k];
    useMatrix(C);
Program thread, with the kernel off-loaded to the CGRA:
    A = getMatrix();
    B = getMatrix();
    cgra_lib(schedule_1);
    useMatrix(C);
[Figure: processor and CGRA sharing system memory through a data buffer; matrices A and B are copied in and C is copied back.]

Compiler Flow
Compiler stages that enable run-time execution:
- Identification
- Transformation
- Compilation
- Communication
[Figure: the kernel in the source code is replaced by a cgra_lib() call. GCC and the assembler produce the CPU binary; the CGRA mapper/compiler produces the CGRA binary. A CGRA driver copies data and instructions.]

Need for Multi-threading
Application: single thread
- The entire CGRA is used to schedule each kernel of the thread.
- Only a single thread is accelerated at a time.
Application: multiple threads
- The entire CGRA is used to accelerate each individual kernel.
- If multiple threads require simultaneous acceleration:
  - Threads must be stalled.
  - Kernels are queued to be run on the CGRA.
- Existing CGRA compilers can be used for efficient single-threaded use.
- Not all PEs are used in each schedule.
- Thread stalls create a performance bottleneck.
[Figure: schedules S1, S2, and S3 queued for a 16-PE array (E0 to E15).]

Proposed Multi-threading Solution
Through program compilation and scheduling:
- Map the application onto groups of PEs (abstract software view).
- Facilitate multiple kernel schedules to execute simultaneously.
Enable runtime multi-threading without recompilation:
- Shrink and expand multiple schedules at runtime.
Advantages:
- Multi-threaded system throughput is improved.
- Overall power consumption is reduced.
[Figure: schedules S1, S2, S3 and their shrunk versions S2', S3' on a 16-PE array (E0 to E15), in four states:]
- Thread 3: schedule expanded to increase performance
- Threads 2, 3: expand to maximize CGRA utilization and performance
- Threads 1, 2, 3: shrink-to-fit mapping maximizing performance
- Threads 1, 2: maximum CGRA utilization

Our Multithreading Technique
Static compile-time constraints to enable schedule transformations:
- Have minimal effect on overall performance (II)
- May increase compile time
Fast runtime transformations:
- Complete in linear time
- Treat all schedules independently
Features:
- Runtime multithreading enabled in linear runtime
- No additional hardware modifications
- Works with current CGRA mapping algorithms
  - The algorithm must allow for custom PE interconnects.
  - Experimentally demonstrated using EMS
[Draft note: Confirm and clarify the features.]

Step 1, Compiler Constraints: Hardware Abstraction
CGRA paging:
- A page is a grouping of PEs from the software perspective.
- A page has symmetrical connections to each of the neighboring pages.
- No additional hardware "modification" is required.
- Page-level interconnects follow a ring topology, clockwise or counter-clockwise.
[Figure: a 16-PE array (e0 to e15) with local instruction memory, local data memory, and main system memory, grouped into four pages P0 to P3 connected in a ring.]
[Draft note: Page definition, clarify and confirm.]
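A minimal sketch of the paging abstraction, assuming a square array with row-major PE indices e0 to e15 and four quadrant-shaped pages. The quadrant-to-page numbering below is inferred from the PE labels that appear on the later transformation slide and should be treated as an assumption; `quadrant_pages` and `ring_neighbor` are hypothetical names.

```python
def quadrant_pages(n=4):
    """Group an n x n array of row-major PE indices into four quadrant pages.

    Pages are returned in ring order, starting at the top-left quadrant and
    walking counter-clockwise (an assumed numbering).
    """
    half = n // 2

    def tile(row0, col0):
        return [r * n + c
                for r in range(row0, row0 + half)
                for c in range(col0, col0 + half)]

    return [tile(0, 0), tile(half, 0), tile(half, half), tile(0, half)]


def ring_neighbor(page, n_pages, reverse=False):
    """Index of the single neighboring page a page may feed in the ring."""
    step = -1 if reverse else 1
    return (page + step) % n_pages
```

Because a page is only a software grouping, changing the page size means changing this partition, not the hardware.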

Step 1, Compiler Constraints: Mapping a Kernel onto Pages
Compile-time constraints:
- The CGRA is a collection of pages.
- Each page can interact with only one topologically neighboring page.
- Inter-PE connections within a page are unmodified.
- The data flow of the kernel is maintained across pages through topological assignment of page schedules.
Notes:
- Multiple solutions exist for each kernel that provide equal performance.
- Naive mapping could result in under-used CGRA resources.
- Page-constrained mapping aggregates the CGRA PEs used, making room for additional schedules to execute simultaneously.
- Constraining the EMS mapping algorithm to fewer pages may not be optimal, but it still helps improve resource usage.
[Figure: the 9-node kernel mapped onto a 16-PE array (e0 to e15) divided into pages P0 to P3, without and with the page constraints.]

Step 2: Runtime Transformation Enabling Multi-threading in CGRA
Example: an application mapped to 3 pages is shrunk to execute on 1 page.
Transformation procedure:
- Isolate the mapped schedule.
- Split pages in topological order.
Constraint:
- Inter-page dependencies must be maintained at all times.
[Figure: a 16-PE array (e0 to e15) with pages P0 to P3. Nodes 1, 2, 3 are on P0; nodes 4, 5, 6 on P1; nodes 7, 8, 9 on P2.]

Step 2: Runtime Transformation Enabling Multi-threading in CGRA
Transformation procedure:
- Isolate the mapped schedule.
- Split pages in topological order.
- Execute the schedule on modified time schedules (only 1 page).
- Mirror pages to facilitate shrinking (to ensure inter-node dependency).
[Figure: the schedule on pages P0 (nodes 1, 2, 3), P1 (nodes 4, 5, 6), and P2 (nodes 7, 8, 9) is folded onto page P0 as P0,0, P0,1, and P0,2 in time slots t0, t1, and t2. Without mirroring, there is no CGRA interconnect to feed output to node 7.]
[Draft note: To be fixed.]
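The 3-pages-to-1-page example can be illustrated with a deliberately simplified model: each logical page's schedule is replayed on the single physical page in consecutive time slots, in ring order. This is not the thesis's actual transformation (it ignores page mirroring, interconnect checks, and register pressure), and the assumption that the effective II grows by the number of folded pages is mine; `shrink_to_one_page` is a hypothetical name.

```python
def shrink_to_one_page(page_schedules, ii=1):
    """Fold a schedule spanning several pages onto one physical page.

    page_schedules: list indexed by logical page (ring order); each entry
    is the list of DDG node ids mapped to that page.
    Returns (slots, new_ii), where slots is a list of
    (time_slot, physical_page, nodes) tuples.
    """
    if not page_schedules:
        raise ValueError("schedule must use at least one page")
    # Logical page p runs in time slot p on physical page 0, so data still
    # flows from page p to page p + 1, only later in time.
    slots = [(t, 0, list(nodes)) for t, nodes in enumerate(page_schedules)]
    # Assumed cost model: one page does the work of len(page_schedules) pages.
    new_ii = ii * len(page_schedules)
    return slots, new_ii


slots, new_ii = shrink_to_one_page([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for time_slot, page, nodes in slots:
    print(f"t{time_slot}: P{page},{time_slot} runs nodes {nodes}")
```

Expanding a schedule back onto more pages would be the inverse of this folding under the same assumptions.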

Experimental Evaluation
- CGRA-side analysis
  - Compiler constraints are non-restrictive.
  - Performance is improved by multi-threading.
- System case study

Constraint Cost Analysis
- CGRA configurations used: 4x4, 6x6, 8x8
- Page configurations: 2, 4, 8 PEs per page
- 20 kernels
- Single threading only
Experiment:
- Compile the kernels for the CGRA using:
  - EMS without constraints
  - EMS with our compiler constraints
- Compare kernel IIs (performance) to determine the cost of enabling multi-threading.

Low Cost for Enabling Multi-threading
- A correctly chosen page size shows minimal cost for enabling multi-threading.
- Page size greatly affects performance.
- A poorly chosen page size can prevent successful mapping.
[Draft note: Redo.]

Multi-threading Benefit Analysis
- CGRA configurations used: 4x4, 6x6, 8x8
- Page configurations: 2, 4, 8 PEs per page
- 20 kernels
- Multi-threaded workload
Experiment:
- Compare the throughput of the CGRA alone for multiple threads:
  - Single-threaded CGRA
  - Paging CGRA enabling multi-threading
- Compare to determine the benefits of enabling multi-threading.
- Performance ≈ (performance per page) × (number of pages in the CGRA structure)
[Draft note: Performance improvement numbers?]
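The linear performance estimate on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the estimate holds exactly and that enough threads are waiting to fill every page; both function names are hypothetical, and the numbers it produces are not results from the thesis.

```python
def estimated_performance(perf_per_page, pages):
    """Slide's approximation: performance scales linearly with pages used."""
    return perf_per_page * pages


def utilization_gain(total_pages, pages_per_kernel):
    """Aggregate paged throughput relative to running one kernel at a time.

    Assumes each kernel occupies `pages_per_kernel` pages and that, with
    multi-threading, all `total_pages` pages are kept busy.
    """
    if not 0 < pages_per_kernel <= total_pages:
        raise ValueError("pages_per_kernel must be in 1..total_pages")
    single = estimated_performance(1.0, pages_per_kernel)
    paged = estimated_performance(1.0, total_pages)
    return paged / single
```

For instance, an 8x8 array with 4 PEs per page has 16 pages, so a kernel that fits in 4 pages leaves room for an estimated 4x aggregate gain under this idealized model.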

CGRA Throughput Increase
- Page size plays a crucial role in multi-threading performance.
- Performance benefits of multi-threading increase with CGRA size, regardless of page size.
[Chart legend: Orig = performance of the single-threaded CGRA; New = performance of the multi-threaded CGRA.]
[Draft note: Performance improvement numbers?]

Experimental Evaluation
- CGRA-side analysis
  - Compiler constraints are non-restrictive.
  - Performance is improved by multi-threading.
- System case study
  - CGRAs improve overall system performance.
  - Multi-threaded CGRAs improve multi-threaded workloads.

Experimental Setup
- CGRA sizes: 4x4, 6x6, 8x8
- Page size: 4 PEs
- Threads: composed of serial and CGRA-eligible code
- Number of threads: 1, 2, 4, 8, or 16
- Data buffer: 64 KB
- Systems:
  - DMA-constrained CGRA system
  - Non-DMA-constrained CGRA system
  - CPU-only system
[Figure: multiple threads on a single-threaded CGRA. The CPU and system memory connect to the CGRA through a DMA-fed data buffer. Each thread alternates CPU/serial code with CGRA code; while one thread uses the CGRA, the other is stalled.]
[Draft note: Add multi-threading.]

Experimental Setup
- CGRA sizes: 4x4, 6x6, 8x8
- Page size: 4 PEs
- Threads: composed of serial and CGRA-eligible code
- Number of threads: 1, 2, 4, 8, or 16
- Data buffer: 64 KB
- Systems:
  - DMA-constrained CGRA system
  - Non-DMA-constrained CGRA system
  - CPU-only system
[Figure: multiple threads on a multi-threading CGRA. The CPU and system memory connect to the CGRA through a DMA-fed data buffer; Thread 1 and Thread 2 share the CGRA.]
[Draft note: Add multi-threading.]

CGRAs Are Suitable as a Co-processor
- CGRAs provide significant performance improvement.
- Implicit resource parallelization between the CGRA and the processor provides improved performance.
- Performance is lost as the number of threads increases, especially as more CPU cores exist.
[Chart legend, compared to a single CPU: Orig = 1 / run time of the single-CPU system; New = 1 / run time of the single-threaded CGRA system.]

Multi-threading Is Appropriate and Beneficial for CGRAs
- Simulation shows that single-threaded performance is unaffected by enabling multi-threading.
- Larger CGRA structures benefit more from enabling multi-threading.
[Chart legend: Orig = 1 / run time of the single-CPU system; New = 1 / run time of the multi-threaded CGRA.]
[Draft note: Performance improvement numbers?]

Creating a Power-efficient System
Original system:
- 4 CPUs, 2.5 GHz
- 500 MHz 8x8 CGRA
- 8 GB/s DMA bandwidth
Optimized system:
- 2 CPUs, 800 MHz
- 600 MHz 8x8 CGRA
- 4 GB/s DMA bandwidth
Observations:
- The multi-threaded CGRA maintains superior performance as threading increases.
- The single-threaded CGRA does not scale in performance with threading.
[Chart legend, compared to a single CPU: Orig = 1 / run time of a single 2.5 GHz CPU; New = 1 / run time of the described system.]

Stall Percentages (DMA-constrained System)
[Charts: single-threaded CGRA versus multi-threaded CGRA, for 4x4, 6x6, and 8x8 CGRAs.]

Stall Percentages (Non-DMA-constrained System)
[Charts: single-threaded CGRA versus multi-threaded CGRA, for 4x4, 6x6, and 8x8 CGRAs.]

Stall Percentages (Optimized System)
[Charts: single-threaded CGRA versus multi-threaded CGRA, for 4x4, 6x6, and 8x8 CGRAs.]

Future Work
- Enabling multi-threading capabilities for other kernel mapping algorithms and CGRA structures
- A sophisticated runtime scheduling technique to maximize system throughput, which takes into account:
  - Thread priority
  - Workload
  - Memory bandwidth

Thank you!
For more information: http://aviral.lab.asu.edu