University of Michigan Electrical Engineering and Computer Science. Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability
Yongjun Park (1), Jason Jong Kyu Park (1), Hyunchul Park (2), and Scott Mahlke (1)
December 3
(1) University of Michigan, Ann Arbor
(2) Programming Systems Lab, Intel Labs, Santa Clara, CA

Convergence of Functionalities
The convergence of functionalities demands a flexible solution because of design cost and programmability.
[Figure: anatomy of an iPhone4, with 4G wireless, navigation, audio, video, and 3D functions pointing to a flexible accelerator.]

Current Mobile Solutions & Challenges
Mobile workloads are a mixture of ILP and DLP: legacy workloads, media processing, web browsing, scientific computing, wireless communication, and image processing.
Current chips pair an ILP-based part (good for ILP) with a DLP-based part (good for DLP):
– 1.6 GHz ARM Cortex-A9 with ULP GeForce
– 1.7 GHz Krait with Adreno
– [...] GHz ARM Cortex-A9 with ARM Mali-400 MP4
Goal: design a unified accelerator with:
1. Scalability
2. Flexible execution support
3. Energy efficiency

Traditional Homogeneous SIMD
Standard high-performance machine for embedded systems:
– Industry: IBM Cell, ARM NEON, Intel MIC, etc.
– Research: SODA, AnySP, etc.
Advantages:
– High throughput
– Low fetch-decode overhead
– Easy to scale
Disadvantage:
– Hard to realize high resource utilization
Example SIMD machine: 100 MOps/mW.
Advanced goal: map a broader range of applications onto SIMD!

Exploration of Low Resource Utilization
– High execution ratio on high data-parallel loops (~80%)
– A traditional wide SIMD accelerator is frequently over-designed.
– Performance is limited by the non-high-DLP loops.
[Figure: AAC decoder example (input, Huffman decoding, inverse quantization, IMDCT, output). Execution time on a 1-issue in-order core, split by category: application into acyclic and loop code, loops into non-DLP and DLP, DLP loops into low-DLP and high-DLP.]

Additional Flexibility on SIMD
[Figure: program flow through DLP and non-DLP loops, shown next to two organizations: SIMD (Control, RF, FU) and distributed VLIW (Control, RF, FU).]

Libra: Additional Flexibility on SIMD
Each logical lane has its own ILP capability.
– The ILP capability is decided based on the SIMD capability.
– The total degree of parallelism stays constant.
All resources are utilized.

DLP   Traditional SIMD       Libra                   Libra mode
1     ILP = 1, total = 1     ILP = 16, total = 16    Full ILP mode
2     ILP = 1, total = 2     ILP = 8, total = 16     Hybrid mode
4     ILP = 1, total = 4     ILP = 4, total = 16     Hybrid mode
8     ILP = 1, total = 8     ILP = 2, total = 16     Hybrid mode
16    ILP = 1, total = 16    ILP = 1, total = 16     Full DLP mode
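The mode table on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `libra_modes` and the assumption that lane counts are powers of two are illustrative and do not come from the slides. It shows that every split keeps DLP x ILP equal to the number of physical lanes.

```python
def libra_modes(total_lanes=16):
    """List (DLP, ILP, mode name) splits with DLP * ILP == total_lanes.

    Assumption (not from the slides): total_lanes is a power of two and
    the number of logical lanes doubles from one mode to the next.
    """
    modes = []
    dlp = 1
    while dlp <= total_lanes:
        ilp = total_lanes // dlp  # physical lanes fused into one logical lane
        if dlp == 1:
            name = "full ILP"
        elif ilp == 1:
            name = "full DLP"
        else:
            name = "hybrid"
        modes.append((dlp, ilp, name))
        dlp *= 2
    return modes


if __name__ == "__main__":
    for dlp, ilp, name in libra_modes(16):
        print(f"DLP = {dlp:2d}  ILP = {ilp:2d}  total = {dlp * ilp}  ({name})")
```

A traditional SIMD machine corresponds to fixing ILP at 1, so its total parallelism drops with the DLP of the loop instead of staying at 16.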

Looks Good, but Too Expensive!
[Figure: a row of lanes in which Control, RF, and FU are replicated for every lane.]

Opportunity: Resource Utilization
Resource over-provisioning: lane uniformity incurs inefficiency.
– Each SIMD lane provides the same functionality.
– Memory and multiplication operations account for only 32% and 16% of total dynamic instructions.
– The result is a more complex design and more static power consumption.
High variation in the resource requirements of loops:
– Simple sharing leads to performance degradation.
[Figure: loop distribution over the static ratio of multiply and memory instructions; most loops contain only a small fraction of mul/mem instructions.]

Adapting Heterogeneity (Homogeneous SIMD)
Loop: high DLP, 1 multiplication (A0, A1, A2 are ADD operations; M3 is a Mul operation).
Machine: 4-way SIMD w/ 4 multipliers.
[Figure: SIMD lane versus cycle; lanes 0 to 3 execute A0, A1, A2, M3 together.]
IPC = 4

Adapting Heterogeneity (Heterogeneous SIMD)
Loop: high DLP, 1 multiplication.
Machine: 4-way SIMD w/ 1 multiplier.
[Figure: SIMD lane versus cycle; lanes 0 to 3 execute A0, A1, A2 together, then stall on M3.]
IPC = 2.29. Stall!!

Adapting Heterogeneity (Heterogeneous SIMD + Flexibility)
Loop: high DLP, 1 multiplication.
Machine: 4-way SIMD w/ 1 multiplier.
[Figure: SIMD lane versus cycle, starting from logical lane 0; the lanes run A0, A1, A2, M3 staggered in time, so the M3 operations of different lanes fall in different cycles.]
IPC = 4
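The three IPC figures on these slides (4, 2.29, 4) can be checked with a small cycle-counting model. The model below is a sketch under stated assumptions, not the authors' simulator: every operation takes one cycle, each lane has its own adder, lockstep execution serializes the multiply over the available multipliers, and the flexible mode starts lane k exactly k cycles late. Under those assumptions the lockstep machine with one multiplier needs 3 + 4 = 7 cycles for 16 operations, which gives 16/7 ≈ 2.29, and the staggered schedule approaches IPC = 4 in steady state.

```python
OPS = ["add", "add", "add", "mul"]  # A0, A1, A2, M3 from the slides
LANES = 4


def ipc_lockstep(num_multipliers):
    """IPC when all lanes issue the same operation in the same cycle."""
    cycles = 0
    for op in OPS:
        if op == "mul":
            # Lanes take turns on the shared multipliers (ceiling division).
            cycles += -(-LANES // num_multipliers)
        else:
            cycles += 1
    return LANES * len(OPS) / cycles


def ipc_staggered(iterations):
    """IPC with one multiplier when lane k starts k cycles after lane 0."""
    multiplier_busy = set()
    last_cycle = 0
    for lane in range(LANES):
        for it in range(iterations):
            for slot, op in enumerate(OPS):
                cycle = lane + it * len(OPS) + slot
                if op == "mul":
                    # The stagger must keep multiplies in distinct cycles.
                    assert cycle not in multiplier_busy, "multiplier conflict"
                    multiplier_busy.add(cycle)
                last_cycle = max(last_cycle, cycle)
    total_ops = LANES * len(OPS) * iterations
    return total_ops / (last_cycle + 1)


if __name__ == "__main__":
    print(ipc_lockstep(4))      # homogeneous SIMD
    print(ipc_lockstep(1))      # heterogeneous SIMD, lockstep
    print(ipc_staggered(1000))  # heterogeneous SIMD, staggered lanes
```

For a single iteration the staggered schedule also takes 7 cycles; the benefit comes from overlapping successive iterations, which is why the model reports IPC over many iterations.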

Libra: Loop-adaptive SIMD Accelerator
Region-adaptive execution strategy customization.
Key insights:
– Heterogeneous lane structure: less power/area
– Dynamic configurability: change ILP/DLP capability
– Number of logical lanes: DLP; size of a logical lane: ILP
[Figure: an application made of high-DLP loops, low/no-DLP loops, and ExOp-intensive loops; traditional SIMD lanes (each with an Int unit and an Expensive unit) compared with heterogeneous SIMD lanes.]

Libra Hardware Implementation
– Fully distributed design, including FUs, register files, and interconnections
– No dynamic routing logic: all communication is statically generated
PE group:
1. Integer ALUs in all 4 FUs
2. One multiplier and one memory unit per PE group
Intra-group configurable interconnect: dense 4x8 full crossbar between FUs w/o writeback.
Inter-group configurable interconnect: each FU is connected only to the corresponding neighbors in adjacent PE groups.

Full DLP Mode
– Simple hardware sharing
– Logical lanes execute one cycle apart to avoid resource contention.
[Figure: resource sharing between logical lane 0 (A0, B0, C0, D0) and logical lane 1 (A1, B1, C1, D1), with 2-wide transfer and data bypass.]

Compilation Overview
Stages: compiler front-end, classifying the loop, resource allocation, code generation.
Inputs: generic C program, hardware information, profile information.
Steps shown in the flow: determine SIMDizability; set SIMD mode; set ILP mode; modulo scheduling; list scheduling w/ multi-threading.
Output: executable.
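To give a feel for how the choice of logical-lane size interacts with modulo scheduling, the sketch below computes the textbook resource-constrained lower bound on the initiation interval (ResMII) for a loop mapped onto a logical lane of several PE groups. It is an illustration under assumptions, not the Libra compiler's algorithm: the resource model (4 issue slots, 1 multiplier, and 1 memory unit per PE group, with multiply and memory operations also consuming an issue slot) is one reading of the hardware slide, and the function name and example operation counts are hypothetical.

```python
import math


def res_mii(op_counts, pe_groups):
    """Resource-constrained minimum initiation interval (textbook ResMII).

    op_counts: operations per loop iteration, e.g. {"alu": 10, "mul": 3, "mem": 5}.
    pe_groups: number of PE groups fused into one logical lane.
    Assumed per-group resources: 4 issue slots, 1 multiplier, 1 memory unit.
    """
    total_ops = sum(op_counts.values())
    issue_bound = math.ceil(total_ops / (4 * pe_groups))
    mul_bound = math.ceil(op_counts.get("mul", 0) / pe_groups)
    mem_bound = math.ceil(op_counts.get("mem", 0) / pe_groups)
    # An initiation interval below 1 cycle is not meaningful.
    return max(1, issue_bound, mul_bound, mem_bound)


if __name__ == "__main__":
    loop = {"alu": 10, "mul": 3, "mem": 5}  # hypothetical loop body
    for groups in (1, 2, 4):
        print(groups, res_mii(loop, groups))
```

In this model a larger logical lane lowers the bound for a single loop instance, while a smaller one leaves more logical lanes free for data-parallel iterations, which is the trade-off the mode-setting steps have to resolve.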

Experimental Setup
Target applications:
– Vision applications: SD-VBS [Venkata, IISWC '09]
– Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
– Game physics benchmarks: line of sight, convolution, and conjugate
Target architectures: SIMD, clustered VLIW, and Libra
– 16 to 64 heterogeneous/homogeneous resources
Tools: IMPACT frontend compiler + cycle-accurate simulator
Power measurement:
– IBM SOI 45nm, 500MHz/0.81V

Performance with Heterogeneous Hardware
[Figure: performance on a heterogeneous datapath.]
Libra is 2.04x faster than heterogeneous SIMD and 1.38x faster than heterogeneous VLIW.

Scalability with Heterogeneous Hardware
Libra scales when the application has enough total ILP/DLP parallelism.

Homogeneous SIMD vs. Heterogeneous Libra
Libra performs better than SIMD.
Energy consumption shows a similar trend.
– Fewer expensive functional units reduce the overall power overhead.
– Example: total 11% power, 32 PEs
[Figure: power, performance, and energy consumption; (-) FU power saving, (+) control power overhead.]

Mode Selection
– Every available mode is used for a considerable fraction of loop execution.
– The mode is selected based on application characteristics.
[Figure: distribution of loop execution modes by logical lane size.]

Conclusion
– Mobile applications consist of loops with a wide range of ILP and DLP levels.
– A heterogeneous SIMD lane structure can reduce the power overhead of over-provisioned resources.
– Dynamic configurability enables broader applicability.
– On 32-PE architectures, Libra outperforms traditional SIMD by 1.58x while consuming 29% less energy.

Questions? For more information