University of Michigan Electrical Engineering and Computer Science
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
Yongjun Park¹, Sangwon Seo², Hyunchul Park³, Hyoun Kyu Cho¹, and Scott Mahlke¹
¹ University of Michigan, Ann Arbor
² Qualcomm Incorporated, San Diego, CA
³ Programming Systems Lab, Intel Labs, Santa Clara, CA
March 6

Convergence of Functionalities
Convergence of functionalities demands a flexible solution
–Applications have different characteristics
[Figure: anatomy of an iPhone 4G: wireless, navigation, audio, video, 3D. A flexible accelerator is needed!]

SIMD: Attractive Alternative to ASICs
Suitable for running wireless and multimedia applications for future embedded systems
Advantages
–High throughput
–Low fetch-decode overhead
–Easy to scale
Disadvantages
–Hard to realize high resource utilization
–High SIMDization overhead
Example SIMD architectures: IBM Cell, ARM NEON, Intel MIC architecture, etc.
[Figure: example SIMD machine at 100 MOps/mW, compared against VLIW and SIMD FUs (5.6x, 2x)]

Under-utilization on Wide SIMD
Multimedia applications have various natural SIMD widths
–SIMD width characterization of innermost loops (Intel compiler rule)
–Inside & across applications
How to use idle SIMD resources?
[Figure: execution time distribution at different SIMD widths for AAC, 3D, and H.264; full vs. under resource utilization on a 16-way SIMD]

Traditional Solutions for Under-utilization
Dynamic power gating
–Selectively cut off unused SIMD lanes
–Effective dynamic & leakage power savings
–Transition time & power overhead
–High area overhead
Thread-level parallelism
–Execute multiple threads having separate data
–Different instruction flows
–Input-dependent control flow
–High memory pressure
[Figure: lane power gating (on/on/off) and four threads mapped across SIMD lanes]

Objective of This Work
Beyond loop-level SIMD
–Put idle SIMD lanes to work
–Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient
Possible SIMD instructions inside a vectorized basic block
–Perform the same work
–Same data flow
–More than 50% of total instructions have some opportunities
Challenges
–High data movement overhead between lanes
–Hard to find the best instruction-packing combination

Partial SIMD Opportunity

for (it = 0; it < 4; it++) {
  i = a[it] + b[it];
  j = c[it] + d[it];
  k = e[it] + f[it];
  l = g[it] + h[it];
  m = i + j;
  n = k + l;
  result[it] = m + n;
}

[Figure: SIMD resource usage under (1) loop-level SIMDization and (2) partial SIMDization]
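The loop above exposes both choices. A minimal C sketch of the partial-SIMDization idea, under the assumption that the array names follow the slide; the 4-lane "vector add" is emulated with a fixed-count inner loop that a SIMD backend would collapse into one packed instruction:

```c
#include <assert.h>

#define N 4  /* trip count of the slide's loop */

/* Scalar version: the loop body exactly as shown on the slide. */
static void scalar_loop(const int a[], const int b[], const int c[],
                        const int d[], const int e[], const int f[],
                        const int g[], const int h[], int result[]) {
    for (int it = 0; it < N; it++) {
        int i = a[it] + b[it];
        int j = c[it] + d[it];
        int k = e[it] + f[it];
        int l = g[it] + h[it];
        int m = i + j;
        int n = k + l;
        result[it] = m + n;
    }
}

/* Partial SIMDization: the four adds producing i, j, k, l are independent
 * and isomorphic, so they can be packed into a single 4-lane vector add.
 * The fixed-count inner loop stands in for that one SIMD instruction;
 * the reduction tree (m, n, result) remains scalar. */
static void partial_simd_loop(const int a[], const int b[], const int c[],
                              const int d[], const int e[], const int f[],
                              const int g[], const int h[], int result[]) {
    for (int it = 0; it < N; it++) {
        int lhs[4] = { a[it], c[it], e[it], g[it] };
        int rhs[4] = { b[it], d[it], f[it], h[it] };
        int lane[4];
        for (int s = 0; s < 4; s++)     /* one packed add across 4 lanes */
            lane[s] = lhs[s] + rhs[s];
        result[it] = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    }
}
```

Both versions compute the same results; the packed form keeps four extra lanes busy during the adds even though the loop's natural SIMD width is already consumed.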

Subgraph-Level Parallelism (SGLP)
Data-level parallelism between "identical subgraphs"
–SIMDizable operators
–Isomorphic dataflow
–No dependencies on each other
Advantages
–Minimized overhead: no inter-lane data movement inside a subgraph
–High instruction-packing gain: multiple instructions inside a subgraph increase the packing gain
[Figure: cost vs. gain of packing candidate subgraphs]

Example: Program Order (degree 2)
[Figure: FFT kernel dataflow packed in program order across two SIMD lanes (lane 0, lane 1): loads LD0/LD1 and stores ST0–ST3 are interleaved, requiring many inter-lane moves]
Gain: 1 = 9 (SIMD) − 8 (overhead)

Example: SGLP (degree 2)
[Figure: the same FFT kernel packed by SGLP: each identical subgraph stays within one lane, leaving only two inter-lane moves]
Gain: 7 = 9 (SIMD) − 2 (overhead)
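The two gains above follow the slides' simple cost model: every instruction packed into SIMD saves one issue slot, and every inter-lane data movement costs one shuffle. A minimal sketch (the helper name is mine, not from the paper):

```c
#include <assert.h>

/* Cost model from the slides: packing `packed_ops` instructions into SIMD
 * saves that many issue slots, while each inter-lane data movement costs
 * one shuffle instruction. */
static int sglp_gain(int packed_ops, int inter_lane_moves) {
    return packed_ops - inter_lane_moves;
}
```

For the FFT kernel, program-order packing gives sglp_gain(9, 8) = 1, while SGLP packing gives sglp_gain(9, 2) = 7, matching the two slides.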

Compilation Overview
Application → loop unrolling & vectorization → loop-level vectorized basic block → dataflow generation → dataflow graph, followed by three SGLP phases (guided by hardware information):
1. Subgraph identification → identical subgraphs
2. SIMD lane assignment → lane-assigned subgraphs
3. Code generation

1. Subgraph Identification
Heuristic discovery
–Grow subgraphs from seed nodes and find identical subgraphs
Additional conditions over traditional subgraph search
–Corresponding operators are identical
–Operand types must match: register/constant
–No inter-subgraph dependencies
[Figure: example dataflow graph over inputs a–h producing result]
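Once each candidate subgraph is linearized in matching dataflow order, the first two conditions can be checked mechanically. A sketch under that assumption (the Node encoding and function name are illustrative; the inter-subgraph dependency test is assumed to run separately on the dataflow graph):

```c
#include <stdbool.h>
#include <string.h>

typedef enum { OPND_REG, OPND_CONST } OperandKind;

typedef struct {
    const char *opcode;    /* e.g. "add", "mul" */
    OperandKind kind[2];   /* register/constant type of each input */
} Node;

/* Two subgraphs, given as node lists in matching dataflow order, are
 * 'identical' for SGLP when corresponding operators are the same and
 * operand types (register vs. constant) agree position by position. */
static bool identical_subgraphs(const Node *s0, const Node *s1, int n) {
    for (int i = 0; i < n; i++) {
        if (strcmp(s0[i].opcode, s1[i].opcode) != 0) return false;
        if (s0[i].kind[0] != s1[i].kind[0]) return false;
        if (s0[i].kind[1] != s1[i].kind[1]) return false;
    }
    return true;
}
```

A register/constant mismatch disqualifies a pair even when the opcodes match, because the packed SIMD instruction must use one operand form for all lanes.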

2. SIMD Lane Assignment
Select subgraphs to be packed and assign them to SIMD lane groups
–Pack the maximum number of instructions with minimum overhead
–Safe parallel execution without dependence violations
–Criteria: gain, affinity, partial order
1. Subgraph gain: assign lanes in decreasing order of gain (Gain: A > B > C > D)
2. Affinity cost: accounts for data movement between different subgraphs, using producer/consumer and common producer/consumer relations; the affinity value counts how many related operations exist between subgraphs, and a subgraph is assigned the lane with the highest affinity (Affinity: B0 is closer to A0 than A1)
3. Partial order check: the partial order of identical subgraphs inside the SIMD lanes must be the same, otherwise packing conflicts (Partial order: C0 ≠ C1, conflict!)
[Figure: subgraph copies A0/A1, B0/B1, C0/C1, and D scheduled over time on lanes 0–3 and 4–7]
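The gain criterion alone already suggests a simple greedy pass: visit identical-subgraph pairs in decreasing-gain order and hand each one a lane group. A sketch of only that first criterion (the struct and round-robin placement are mine; the actual pass also applies the affinity and partial-order checks described above before committing a lane):

```c
#include <stdlib.h>

typedef struct {
    char name;        /* e.g. 'A' */
    int  gain;        /* estimated packing gain of this subgraph pair */
    int  lane_group;  /* assigned SIMD lane group, -1 if unassigned */
} SubgraphPair;

static int by_gain_desc(const void *x, const void *y) {
    return ((const SubgraphPair *)y)->gain - ((const SubgraphPair *)x)->gain;
}

/* Greedy sketch of the gain criterion only: highest-gain pairs are placed
 * first. Lane groups are handed out round-robin here for illustration,
 * whereas the real pass consults affinity and partial order instead. */
static void assign_lanes(SubgraphPair *pairs, int n, int num_groups) {
    qsort(pairs, n, sizeof *pairs, by_gain_desc);
    for (int i = 0; i < n; i++)
        pairs[i].lane_group = i % num_groups;
}
```

With gains A > B > C > D and two lane groups (lanes 0–3 and 4–7), A is placed first, matching the ordering on the slide.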

Experimental Setup
144 loops from industry-level optimized media applications
–AAC decoder (MPEG-4 audio decoding, low-complexity profile)
–H.264 decoder (MPEG-4 video decoding, baseline profile, QCIF)
–3D (3D graphics rendering)
Target architecture: wide vector machines
–SIMD width: 16–64
–SODA-style wide vector instruction set
–Single-cycle data shuffle instruction (≈ vperm in VMX, vec_perm in AltiVec)
IMPACT frontend compiler + cycle-accurate simulator
Compared to two other solutions
–SLP: superword-level parallelism (basic-block-level SIMDization) [Larsen, PLDI '00]
–ILP: instruction-level parallelism on same-width VLIW machines
SGLP applied at degrees 2–4

Static Performance
[Figure: static speedup for AAC, 3D, H.264, and average]
–SGLP retains a similar trend to ILP after overhead is considered
–Max at 4-way (SLP: 1.27x)
–See the paper for representative kernels (FFT, DCT, HalfPel, …)

Dynamic Performance on SIMD
[Figure: dynamic speedup for AAC, 3D, H.264, and average]
–The available degree of SGLP is exploited only when the natural SIMD width is insufficient (up to 4-way)
–Max 1.76x speedup (SLP: 1.29x)

[Table: power (mW), cycles (million), and energy (mJ), with percentage gains, for an 8-wide SIMD vs. an 8-wide VLIW, each including control overhead; measured at 200 MHz in IBM 65nm technology]
30% more energy efficient!

Conclusion
–SIMD is an energy-efficient solution for mobile systems.
–SIMD programming of multimedia applications is an interesting challenge due to the varying degrees of SIMD parallelism.
–Subgraph-level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside vectorized basic blocks.
–SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture.

Questions?
For more information

Example 2: High-level View
[Figure: three kernels with natural SIMD widths 8, 4, and 8; in kernel 1, subgraph copies A1 and C1 are moved onto idle lanes 4–7 alongside A0 and C0 on lanes 0–3]
Gain = (A1 + C1) (SIMD) − ((A1→B) + (C1→D)) (overhead)

Static Performance
[Figure: static speedup for kernels FFT, MDCT, MatMul4x4, MatMul3x3, HalfPel, QuarterPel and applications AAC, 3D, H.264, AVG]
–Performance results depend on kernel characteristics (e.g., MatMul4x4 vs. MatMul3x3)
–SGLP retains a similar trend to ILP after overhead is considered
–Max at 4-way (SLP: 1.27x)