Summary of my Thesis Power-efficient performance is the need of the future CGRAs can be used as programmable accelerators Power-efficient performance can.

Presentation on theme: "Summary of my Thesis Power-efficient performance is the need of the future CGRAs can be used as programmable accelerators Power-efficient performance can."— Presentation transcript:

1 Summary of my Thesis
Power-efficient performance is the need of the future.
CGRAs can be used as programmable accelerators:
  Power-efficient performance can be achieved.
  Usability is limited by compilation difficulties.
  Multi-threaded applications need multi-threading capabilities in the CGRA.
Proposed two-step runtime methodology:
  Non-restrictive compile-time constraints to schedule the application into pages.
  Dynamic transformation procedure to shrink/expand the resources used by a schedule.
Features:
  No additional hardware required.
  Improved CGRA resource usage.
  Improved system performance.
Publications:
  "Enabling Multi-threading on CGRAs", in ICPP’11.
  "Increasing CGRA Utilization through Multi-threading for Power-efficient Embedded Systems", journal article to be submitted to TECS.
Numbers?
12/1/2018

2 Increasing CGRA Utilization through Multi-threading for Power-efficient Embedded Systems
Jared Pager
Supervisory Committee: Prof. Aviral Shrivastava (Chair), Prof. Sandeep Gupta, Prof. Gil Speyer

3 Need for High Performance Embedded Systems
Advanced user interfaces
One-size-fits-all design
Ubiquitous computing

4 Need for Power-efficient Embedded Systems
Battery constraints: capacity is approaching a maximum value within a limited volume.
Cooling constraints: more cooling increases both volume and power consumption.

5 Accelerators: Can Help Achieve Power-efficient Performance
Power-critical computations can be off-loaded to accelerators, which:
  Perform application-specific operations.
  Achieve high throughput without loss of CPU programmability.
Existing examples:
  Hardware accelerator: Intel SSE
  Reconfigurable accelerator: FPGA
  Graphics accelerator: NVIDIA Tesla (Fermi GPU)
Technical question: How is the IBM Cell an accelerator?
Flow question: Having presented the existing processor accelerators that help achieve power-efficient performance, what is the motivation for yet another solution in this direction? How do I highlight that CGRAs, when used as accelerators, help in this cause?

6 Coarse Grained Reconfigurable Array (CGRA): Power-efficient Accelerator
PEs communicate through an interconnect network.
Distinguishing characteristics:
  Flexible programming (compared to SSE)
  Power-efficient computing (compared to FPGAs)
  More general purpose (compared to GPUs)
  High performance
Cons:
  Compiling a program for a CGRA is difficult.
    Not all applications compile efficiently.
  No standard CGRA architecture exists.
  Extensive compiler support is required for general-purpose computing.
Example CGRAs: ADRES, MorphoSys, KressArray, RSPA, DART
[Figure: array of PEs with local instruction memory, local data memory, and main system memory. Each PE contains an FU and an RF, takes inputs from neighbors and memory, and sends outputs to neighbors and memory.]
CGRAs have been shown to operate at 100s of GOps/W.

7 Mapping a Kernel onto a CGRA
Given the kernel’s DDG:
  Unroll and software-pipeline the loop for mapping onto the given CGRA.
  Schedule the unrolled loop for minimum II* (= 1).
  Map nodes onto the PE array:
    Place dependent nodes close to their sources.
    Ensure dependent nodes have interconnects to their sources.
Loop:
  t1 = (a[i] + b[i]) * c[i]
  d[i] = ~t1 & 0xFFFF
[Figure: data-dependency graph with nodes 1 to 9, and its spatial mapping and temporal scheduling onto the PE array.]
Steady-state schedule, written as node(iteration): 1(i), 2(i), 3(i-1), 4(i-2), 5(i-2), 6(i-3), 7(i-4), 8(i-5), 9(i-6)
*Explained later.
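For reference, the loop on this slide can be written as an ordinary C function, which is the form a CPU would run before the kernel is mapped onto the PE array. This is a minimal sketch: the function name, the 32-bit unsigned element type, and the array-length parameter are assumptions, since the slide gives only the two statements of the loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain-CPU form of the slide's loop body.
 * Assumption: 32-bit unsigned elements; the slide does not state types. */
void kernel_reference(const uint32_t *a, const uint32_t *b,
                      const uint32_t *c, uint32_t *d, size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++) {
        uint32_t t1 = (a[i] + b[i]) * c[i]; /* add, then multiply */
        d[i] = ~t1 & 0xFFFFu;               /* invert, keep the low 16 bits */
    }
}
```

With a[0] = 1, b[0] = 3, c[0] = 5, the intermediate t1 is 20 and d[0] becomes 0xFFEB.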

8 Mapped Kernel Executed on the CGRA
Loop:
  t1 = (a[i] + b[i]) * c[i]
  d[i] = ~t1 & 0xFFFF
[Figure: data-dependency graph with nodes 1 to 9, and the PE array contents at execution time slots (cycles) 1 to 7, showing nodes from successive loop iterations overlapping.]
After cycle 6, one iteration of the loop completes execution every cycle.
The entire kernel can be mapped onto the CGRA by unrolling 6 times.
Iteration Interval (II) is a measure of mapping quality; here, Iteration Interval = 1.
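The effect of the iteration interval on run time can be illustrated with the usual software-pipelining cycle count: the first iteration takes the full schedule length, and each further iteration adds II cycles. The sketch below assumes that standard model; the function name and parameters are illustrative and not taken from the slides.

```c
#include <assert.h>

/* Cycles needed to finish n iterations of a software-pipelined loop.
 * schedule_len: cycles from the first to the last operation of one iteration.
 * ii:           iteration interval (cycles between successive iteration starts).
 * Assumption: the standard model, total = schedule_len + (n - 1) * ii. */
unsigned long pipelined_cycles(unsigned long schedule_len,
                               unsigned long ii, unsigned long n)
{
    assert(schedule_len > 0 && ii > 0);
    if (n == 0)
        return 0;
    return schedule_len + (n - 1) * ii;
}
```

Reading the slide's seven time slots as a schedule length of 7 (an assumption), 100 iterations at II = 1 take 106 cycles, while the same schedule at II = 2 takes 205.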

9 Traditional Use of CGRAs
Traditionally used for streaming applications:
  The application kernel is mapped onto the CGRA.
  System inputs are given to the application one after the other.
  The mapped kernel processes the inputs with improved power efficiency.
The power efficiency achieved depends on the spatial and temporal mapping by the “compiler”.
[Figure: stream application; application input flows through a 16-PE CGRA (E0 to E15) to application output.]

10 Envisioned Use of CGRAs
Specific kernels in a thread can be power/performance critical.
  The kernel can be mapped and scheduled for execution on the CGRA.
Using the CGRA as a co-processor (accelerator):
  Power-consuming processor execution time is saved.
  The thread performs better.
  Overall system throughput is increased.
Program thread (the loop nest is the kernel to accelerate):
  A = getMatrix();
  B = getMatrix();
  for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
      for (k = 0; k < size; k++)
        C[j][i] += A[k][i] * B[j][k];
  useMatrix(C);
Program thread with the kernel off-loaded to the CGRA:
  A = getMatrix();
  B = getMatrix();
  cgra_lib(schedule_1);
  useMatrix(C);
[Figure: processor, system memory, and CGRA data buffer holding matrices A, B, and C.]
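The loop nest that this slide marks as the kernel to accelerate can be made self-contained as follows. This is a sketch under assumptions: the function name, the int element type, and flat row-major storage are illustrative choices, and getMatrix, useMatrix, and cgra_lib from the slide are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* CPU version of the slide's loop nest: C[j][i] += A[k][i] * B[j][k].
 * Assumption: matrices are size x size, stored flat in row-major order,
 * so X[r][col] is x[r * size + col]. The caller zeroes c beforehand,
 * because the kernel accumulates into it. */
void matmul_kernel(const int *a, const int *b, int *c, size_t size)
{
    assert(a && b && c);
    for (size_t i = 0; i < size; i++)
        for (size_t j = 0; j < size; j++)
            for (size_t k = 0; k < size; k++)
                c[j * size + i] += a[k * size + i] * b[j * size + k];
}
```

For A = {{1, 2}, {3, 4}} and B = {{5, 6}, {7, 8}} with C zeroed, the result is C = {{23, 34}, {31, 46}}.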

11 Compiler Flow
Compiler stages: Identification, Transformation, Compilation, Communication.
Enabling run-time execution: a CGRA driver copies data and instructions.
[Figure: compiler flow from source code to a CPU binary (via cgra_lib(), GCC, and the assembler) and to a CGRA binary (via the CGRA mapper/compiler).]

12 Need for Multi-threading
Application with a single thread:
  The entire CGRA is used to schedule each kernel of the thread.
  Only a single thread is accelerated at a time.
Application with multiple threads:
  The entire CGRA is used to accelerate each individual kernel.
  If multiple threads require simultaneous acceleration:
    Threads must be stalled.
    Kernels are queued to run on the CGRA.
Existing CGRA compilers can be used for efficient single-threaded use.
Not all PEs are used in each schedule.
Thread stalls create a performance bottleneck.
[Figure: schedules S1, S2, and S3 queued for a 16-PE CGRA (E0 to E15).]

13 Proposed Multi-threading Solution
Through program compilation and scheduling:
  Map the application onto groups of PEs (abstract software view).
  Facilitate simultaneous execution of multiple kernel schedules.
Enable runtime multi-threading without re-compilation:
  Shrink and expand multiple schedules at runtime.
Advantages:
  Multi-threaded system throughput is improved.
  Overall power consumption is reduced.
[Figure: schedules S1, S2, S3 and their shrunk forms S2’, S3’ on a 16-PE CGRA (E0 to E15), in four states:
  Thread 3: schedule expanded to increase performance.
  Threads 2, 3: expand to maximize CGRA utilization and performance.
  Threads 1, 2, 3: shrink-to-fit mapping maximizing performance.
  Threads 1, 2: maximum CGRA utilization.]

14 Our Multithreading Technique
Static compile-time constraints enable schedule transformations:
  Minimal effect on overall performance (II).
  May increase compile time.
Fast runtime transformations:
  Complete in linear time.
  All schedules are treated independently.
Features:
  Runtime multi-threading enabled in linear runtime.
  No additional hardware modifications.
  Works with current CGRA mapping algorithms:
    The algorithm must allow for custom PE interconnects.
    Experimentally demonstrated using EMS.
Confirm and clarify the features.

15 Step 1 Compiler Constraints: Hardware Abstraction (CGRA Paging)
Page: a grouping of PEs from the software perspective.
  No additional hardware modification is required.
A page has symmetrical connections to each of its neighboring pages.
Page-level interconnects follow a ring topology, clockwise or counter-clockwise.
[Figure: 16-PE CGRA (e0 to e15) divided into pages P0 to P3 connected in a ring, with local instruction memory, local data memory, and main system memory.]
Page definition: clarify and confirm.
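The paging abstraction can be sketched as two small lookups: which page a PE belongs to, and which pages are its ring neighbors. The layout below is an assumption read from the figure labels elsewhere in the deck (P0 = e0, e1, e4, e5; P1 = e8, e9, e12, e13; P2 = e10, e11, e14, e15), with the remaining quadrant taken as P3; the function names and the fixed 4x4 size are illustrative only.

```c
#include <assert.h>

enum { CGRA_DIM = 4, PAGE_DIM = 2, NUM_PAGES = 4 };

/* Map a PE index (row-major, e0..e15) to its page.
 * Assumed layout: 2x2 quadrants numbered around the ring:
 * P0 top-left, P1 bottom-left, P2 bottom-right, P3 top-right. */
int page_of_pe(int pe)
{
    static const int ring_id[2][2] = { {0, 3}, {1, 2} };
    assert(pe >= 0 && pe < CGRA_DIM * CGRA_DIM);
    int row_half = (pe / CGRA_DIM) / PAGE_DIM; /* 0 = top, 1 = bottom */
    int col_half = (pe % CGRA_DIM) / PAGE_DIM; /* 0 = left, 1 = right */
    return ring_id[row_half][col_half];
}

/* Ring neighbors of a page, one per direction around the ring. */
int next_page(int page)
{
    assert(page >= 0 && page < NUM_PAGES);
    return (page + 1) % NUM_PAGES;
}

int prev_page(int page)
{
    assert(page >= 0 && page < NUM_PAGES);
    return (page + NUM_PAGES - 1) % NUM_PAGES;
}
```

Because pages are purely a software grouping, nothing here touches the hardware description; a different page size would only change the constants and the lookup table.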

16 Step 1 Compiler Constraints: Mapping Kernel onto Pages
Compile-time constraints:
  The CGRA is a collection of pages.
  Each page can interact with only one topologically neighboring page.
  Inter-PE connections within a page are unmodified.
The data flow of the kernel is maintained across pages through topological assignment of page schedules.
Multiple solutions exist for each kernel that provide equal performance.
  A naïve mapping could result in under-used CGRA resources.
The constrained mapping aggregates the CGRA PEs used, making room for additional schedules to execute simultaneously.
Constraining the EMS mapping algorithm to a smaller number of pages may not be optimal, but it still helps improve resource usage.
[Figure: DDG nodes 1 to 9 mapped onto pages P0 to P3 of a 16-PE CGRA (e0 to e15).]

17 Step 2: Runtime Transformation Enabling Multi-threading in CGRA
Example: an application mapped to 3 pages shrinks to execute on 1 page.
Transformation procedure:
  Isolate the mapped schedule.
  Split pages in topological order.
Constraint: inter-page dependencies must be maintained at all times.
[Figure: nodes 1 to 3 on page P0, nodes 4 to 6 on page P1, and nodes 7 to 9 on page P2 of a 16-PE CGRA (e0 to e15) with pages P0 to P3.]

18 Step 2: Runtime Transformation Enabling Multi-threading in CGRA
Transformation procedure:
  Isolate the mapped schedule.
  Split pages in topological order.
  Execute the schedule on modified time schedules (only 1 page).
  Mirror pages to facilitate shrinking (to ensure inter-node dependencies).
[Figure: the schedule on pages P0 (nodes 1 to 3), P1 (nodes 4 to 6), and P2 (nodes 7 to 9) is folded onto page P0 as P0,0, P0,1, and P0,2 in time slots t0, t1, and t2. Without mirroring, there is no CGRA interconnect to feed the output to node 7.]
To be fixed.
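The shrink step in the example (3 pages folded onto 1 page across time slots t0, t1, t2) can be sketched as a fold of logical pages onto fewer physical pages. This is a simplified illustration, not the thesis procedure: it omits the page mirroring that the slide says is needed to preserve inter-node dependencies, the modulo placement is an assumed policy, and the II scaling rule is an assumption consistent with the one-page example. All names are hypothetical.

```c
#include <assert.h>

/* Where a logical page of a schedule ends up after shrinking. */
typedef struct {
    int phys_page; /* physical page index within the pages granted */
    int time_slot; /* extra time slot the page is executed in */
} Placement;

/* Fold logical page `logical` onto `avail` physical pages, taking
 * pages in topological order. Mirroring is NOT modeled here. */
Placement fold_page(int logical, int avail)
{
    assert(logical >= 0 && avail > 0);
    Placement p = { logical % avail, logical / avail };
    return p;
}

/* Assumed II after shrinking: each original cycle is stretched over
 * ceil(used_pages / avail_pages) time slots. */
int shrunk_ii(int ii, int used_pages, int avail_pages)
{
    assert(ii > 0 && used_pages > 0 && avail_pages > 0);
    return ii * ((used_pages + avail_pages - 1) / avail_pages);
}
```

For the slide's example, logical pages 0, 1, and 2 all land on physical page 0 in time slots 0, 1, and 2, and under the assumed rule an II of 1 becomes 3.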

19 Experimental Evaluation
CGRA-side analysis:
  Compiler constraints are non-restrictive.
  Improved performance by multi-threading.
System case study.

20 Constraint Cost Analysis
CGRA configurations used: 4x4, 6x6, 8x8.
Page configurations: 2, 4, 8 PEs per page.
20 kernels, single-threading only.
Experiment:
  Compile the kernels for the CGRA using:
    EMS without constraints.
    EMS with our compiler constraints.
  Compare the kernel IIs (performance) to determine the cost of enabling multi-threading.

21 Low Cost for Enabling Multi-threading
A correctly chosen page size shows minimal cost for enabling multi-threading.
Page size greatly affects performance.
  A poorly chosen page size can prevent successful mapping.
Redo.

22 Multi-threading Benefit Analysis
CGRA configurations used: 4x4, 6x6, 8x8.
Page configurations: 2, 4, 8 PEs per page.
20 kernels, multi-threaded workload.
Experiment:
  Compare CGRA-only throughput for multiple threads on:
    Single-threaded CGRA.
    Paging CGRA enabling multi-threading.
  Compare to determine the benefits of enabling multi-threading.
Performance ≈ (Performance / Page) × (Number of Pages in CGRA Structure)
Performance improvement numbers??

23 CGRA Throughput Increase
Page size plays a crucial role in multi-threading performance.
The performance benefits of multi-threading increase with CGRA size, regardless of page size.
Performance improvement numbers??
Orig = performance of single-threaded CGRA
New = performance of multi-threaded CGRA

24 Experimental Evaluation
CGRA-side analysis:
  Compiler constraints are non-restrictive.
  Improved performance by multi-threading.
System case study:
  CGRAs improve overall system performance.
  Multi-threaded CGRAs improve multi-threaded workloads.

25 Experimental Setup: Single-threaded CGRA
CGRA sizes: 4x4, 6x6, 8x8
Page size: 4 PEs
Threads: composed of serial and CGRA-eligible code
Number of threads: 1, 2, 4, 8, or 16
Data buffer: 64 KB
Systems:
  DMA-constrained CGRA system
  Non-DMA-constrained CGRA system
  CPU-only system
[Figure: CPU, system memory, DMA, and CGRA with its data buffer; timeline of Thread 1 and Thread 2 alternating CPU/serial code and CGRA code, with threads stalled while the single-threaded CGRA is busy.]
Add multi-threading.

26 Experimental Setup: Multi-threading CGRA
CGRA sizes: 4x4, 6x6, 8x8
Page size: 4 PEs
Threads: composed of serial and CGRA-eligible code
Number of threads: 1, 2, 4, 8, or 16
Data buffer: 64 KB
Systems:
  DMA-constrained CGRA system
  Non-DMA-constrained CGRA system
  CPU-only system
[Figure: CPU, system memory, DMA, and CGRA with its data buffer; timeline of Thread 1 and Thread 2 sharing the multi-threading CGRA, with a stall still shown.]
Add multi-threading.

27 CGRAs are Suitable as a Co-processor
Performance is lost as the number of threads increases, especially as more CPU cores exist.
CGRAs provide significant performance improvement.
Implicit resource parallelization between the CGRA and the processor provides improved performance.
Compared to a single CPU:
  Orig = 1 / (run time of single-CPU system)
  New = 1 / (run time of single-threaded CGRA system)
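The Orig and New definitions can be turned into a single relative-performance figure. The sketch below assumes the comparison is the ratio New / Orig, which the slide does not state explicitly; the function name and the run-time units are illustrative.

```c
#include <assert.h>

/* Relative performance of a new system against the original, with
 * performance defined as 1 / run time, as on the slide.
 * Assumption: the comparison of interest is the ratio New / Orig. */
double relative_performance(double runtime_orig, double runtime_new)
{
    assert(runtime_orig > 0.0 && runtime_new > 0.0);
    double perf_orig = 1.0 / runtime_orig;
    double perf_new = 1.0 / runtime_new;
    return perf_new / perf_orig;
}
```

For example, a workload that takes 8 time units on the original system and 2 on the new one gives a relative performance of 4.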

28 Multi-threading is Appropriate and Beneficial for CGRAs
Simulation shows that single-threaded performance is unaffected by enabling multi-threading.
Larger CGRA structures benefit more from enabling multi-threading.
Performance improvement numbers??
Compared to a single CPU:
  Orig = 1 / (run time of single-CPU system)
  New = 1 / (run time of multi-threaded CGRA)

29 Creating a Power-efficient System
Original system:
  4 CPUs, 2.5 GHz
  500 MHz 8x8 CGRA
  8 GB/s DMA bandwidth
Optimized system:
  2 CPUs, 800 MHz
  600 MHz 8x8 CGRA
  4 GB/s DMA bandwidth
The multi-threaded CGRA maintains superior performance as threading increases.
The single-threaded CGRA does not scale in performance with threading.
Compared to a single CPU:
  Orig = 1 / (run time of single 2.5 GHz CPU)
  New = 1 / (run time of described system)

30 Stall Percentages (DMA-constrained System)
[Charts: stall percentages for the single-threaded and multi-threaded CGRA at 4x4, 6x6, and 8x8 CGRA sizes.]

31 Stall Percentages (Non-DMA-constrained System)
[Charts: stall percentages for the single-threaded and multi-threaded CGRA at 4x4, 6x6, and 8x8 CGRA sizes.]

32 Stall Percentages (Optimized System)
[Charts: stall percentages for the single-threaded and multi-threaded CGRA at 4x4, 6x6, and 8x8 CGRA sizes.]

34 Future Work
Enable multi-threading capabilities for other kernel mapping algorithms and CGRA structures.
Develop a sophisticated runtime scheduling technique to maximize system throughput, taking into account:
  Thread priority
  Workload
  Memory bandwidth

35 Thank you! For more information:

