Presentation transcript: SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures

Slide 1: SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
Yongjun Park (1), Sangwon Seo (2), Hyunchul Park (3), Hyoun Kyu Cho (1), and Scott Mahlke (1)
March 6, 2012
(1) University of Michigan, Ann Arbor
(2) Qualcomm Incorporated, San Diego, CA
(3) Programming Systems Lab, Intel Labs, Santa Clara, CA

Slide 2: Convergence of Functionalities
– Convergence of functionalities demands a flexible solution.
– Applications have different characteristics.
– [Figure: anatomy of an iPhone: 4G wireless, navigation, audio, video, and 3D workloads all on one device, calling for a flexible accelerator.]

Slide 3: SIMD: An Attractive Alternative to ASICs
– Suitable for running wireless and multimedia applications on future embedded systems.
– Advantages: high throughput, low fetch-decode overhead, easy to scale.
– Disadvantages: hard to realize high resource utilization, high SIMDization overhead.
– Example SIMD architectures: IBM Cell, ARM NEON, Intel MIC architecture, etc.
– [Chart: an example SIMD machine reaching about 100 MOps/mW; annotations compare FUs, VLIW, and SIMD (5.6x, 2x).]

Slide 4: Under-utilization on Wide SIMD
– Multimedia applications have various natural SIMD widths.
  – SIMD width characterization of innermost loops (Intel compiler rule).
  – Widths vary both inside and across applications.
– How can idle SIMD resources be put to use?
– [Chart: execution-time distribution at different SIMD widths for AAC, 3D, and H.264, showing full versus under-utilization of a 16-way SIMD machine.]

Slide 5: Traditional Solutions for Under-utilization
– Dynamic power gating
  – Selectively cuts off unused SIMD lanes.
  – Effective dynamic and leakage power savings.
  – But: transition time and power overhead, high area overhead.
– Thread-level parallelism
  – Executes multiple threads that operate on separate data.
  – But: different instruction flows, input-dependent control flow, high memory pressure.
– [Figure: lane groups gated on/off versus lane groups filled with threads 1 through 4.]

Slide 6: Objective of This Work
– Beyond loop-level SIMD:
  – Put idle SIMD lanes to work.
  – Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient.
– Possible SIMD instructions inside a vectorized basic block:
  – Perform the same work and have the same data flow.
  – More than 50% of total instructions have some packing opportunity.
– Challenges:
  – High data-movement overhead between lanes.
  – Hard to find the best instruction packing combination.

Slide 7: Partial SIMD Opportunity

    for (it = 0; it < 4; it++) {
      i = a[it] + b[it];
      j = c[it] + d[it];
      k = e[it] + f[it];
      l = g[it] + h[it];
      m = i + j;
      n = k + l;
      result[it] = m + n;
    }

– [Figure: occupancy of a 16-lane SIMD resource (lanes 0-15): 1. loop-level SIMDization fills only 4 lanes per instruction; 2. partial SIMDization packs the four independent adds so the remaining 12 lanes also do useful work.]
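For concreteness, here is a minimal sketch of the two steps above using GCC/Clang vector extensions. The 16-wide type, the lane layout, and all input values are illustrative assumptions, not the paper's generated code.

    #include <stdio.h>

    typedef int v16 __attribute__((vector_size(64)));  /* 16 x 32-bit lanes */

    int main(void) {
        int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
        int c[4] = {1, 1, 1, 1}, d[4] = {2, 2, 2, 2};
        int e[4] = {3, 3, 3, 3}, f[4] = {4, 4, 4, 4};
        int g[4] = {5, 5, 5, 5}, h[4] = {6, 6, 6, 6};
        int result[4];

        /* Loop-level SIMDization alone turns each statement (i, j, k, l) into
         * a 4-wide add, so only 4 of 16 lanes do useful work per instruction.
         * Partial SIMDization packs the four independent adds into one
         * 16-wide add: lanes 0-3 compute i, 4-7 j, 8-11 k, 12-15 l. */
        v16 lhs, rhs;
        for (int it = 0; it < 4; it++) {
            lhs[it]      = a[it];  rhs[it]      = b[it];
            lhs[4 + it]  = c[it];  rhs[4 + it]  = d[it];
            lhs[8 + it]  = e[it];  rhs[8 + it]  = f[it];
            lhs[12 + it] = g[it];  rhs[12 + it] = h[it];
        }
        v16 sum = lhs + rhs;

        /* The reduction (m = i + j, n = k + l, result = m + n) still needs
         * values from other lane groups: this is the inter-lane movement
         * overhead that SGLP tries to minimize. */
        for (int it = 0; it < 4; it++)
            result[it] = (sum[it] + sum[4 + it]) + (sum[8 + it] + sum[12 + it]);

        printf("%d %d %d %d\n", result[0], result[1], result[2], result[3]);
        return 0;
    }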

Slide 8: Subgraph Level Parallelism (SGLP)
– Data-level parallelism between 'identical subgraphs':
  – SIMDizable operators.
  – Isomorphic dataflow.
  – No dependencies on each other.
– Advantages:
  – Minimized overhead: no inter-lane data movement inside a subgraph.
  – High instruction packing gain: multiple instructions inside a subgraph increase the packing gain.
– [Figure: cost versus gain of packing identical subgraphs.]
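As a small concrete illustration (my own example, not from the slides), the two subgraphs below perform the same work with the same dataflow shape and have no dependence on each other, so each can occupy its own SIMD lane.

    #include <stdio.h>

    typedef int v2 __attribute__((vector_size(8)));  /* one lane per subgraph */

    int main(void) {
        /* Scalar form: subgraph 0 = (a0 + b0) * c0, subgraph 1 = (a1 + b1) * c1 */
        v2 a = {1, 10}, b = {2, 20}, c = {3, 30};

        /* Packed form: both subgraphs advance in lockstep, and every
         * intermediate value stays in its own lane, so no inter-lane data
         * movement is needed inside the subgraph. */
        v2 t = a + b;   /* lane 0: a0+b0,       lane 1: a1+b1       */
        v2 r = t * c;   /* lane 0: subgraph 0,  lane 1: subgraph 1  */

        printf("%d %d\n", r[0], r[1]);  /* 9 900 */
        return 0;
    }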

Slide 9: Example: Program Order (degree 2)
– [Figure: an FFT kernel's dataflow graph (loads LD0/LD1, operations 0-11, stores ST0-ST3) packed onto two SIMD lanes in program order; most values must cross lanes between steps.]
– Gain: 1 = 9 (SIMD) – 8 (inter-lane move overhead).

Slide 10: Example: SGLP (degree 2)
– [Figure: the same FFT kernel packed by identical subgraphs; only two values have to cross lanes.]
– Gain: 7 = 9 (SIMD) – 2 (inter-lane move overhead).
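Both gains follow the same simple bookkeeping: instructions saved by packing minus the inter-lane moves the packing introduces. A tiny sketch of that model, paraphrasing the two slides above rather than the paper's exact cost function:

    #include <stdio.h>

    /* Gain of a packing = packed instructions saved minus the data-movement
     * (shuffle) instructions needed to cross SIMD lanes. */
    static int packing_gain(int packed_insts, int inter_lane_moves) {
        return packed_insts - inter_lane_moves;
    }

    int main(void) {
        /* Program-order packing of the FFT kernel: 9 packed, 8 moves. */
        printf("program order: %d\n", packing_gain(9, 8));  /* 1 */
        /* SGLP packing: the same 9 packed, only 2 moves. */
        printf("SGLP:          %d\n", packing_gain(9, 2));  /* 7 */
        return 0;
    }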

Slide 11: Compilation Overview
– Inputs: the application and hardware information.
– Loop unrolling & vectorization → loop-level vectorized basic block.
– Dataflow generation → dataflow graph.
– 1. Subgraph identification → identical subgraphs.
– 2. SIMD lane assignment → lane-assigned subgraphs.
– 3. Code generation.

Slide 12: 1. Subgraph Identification
– Heuristic discovery: grow subgraphs from seed nodes and find identical subgraphs.
– Additional conditions over a traditional subgraph search:
  – Corresponding operators are identical.
  – Operand types must match (register/constant).
  – No inter-subgraph dependencies.
– [Figure: an example dataflow graph over inputs a through h, with add and multiply nodes, constants (1, 2, 256), and a single result.]
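A minimal sketch of the seed-and-grow matching idea described above. The node structure and helper names are illustrative assumptions, not the authors' compiler IR, and a real implementation would also verify that the two grown subgraphs carry no dataflow dependence on each other.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef enum { OP_ADD, OP_MUL } Opcode;
    typedef enum { OPND_REG, OPND_CONST } OperandKind;

    typedef struct Node {
        Opcode opcode;             /* operation performed by this node       */
        OperandKind operand_kind;  /* register vs. constant operand          */
        struct Node *preds[2];     /* dataflow predecessors (NULL if unused) */
        int subgraph_id;           /* -1 if not yet assigned to a subgraph   */
    } Node;

    /* Two nodes can be paired if neither is packed yet, their operators are
     * identical, and their operand kinds match. */
    static bool nodes_match(const Node *x, const Node *y) {
        return x != y && x->subgraph_id < 0 && y->subgraph_id < 0 &&
               x->opcode == y->opcode && x->operand_kind == y->operand_kind;
    }

    /* Grow a pair of identical subgraphs from two matching seed nodes by
     * walking their predecessors in lockstep; returns the number of packed
     * node pairs discovered. */
    static int grow_from_seeds(Node *sa, Node *sb, int id_a, int id_b) {
        if (!sa || !sb || !nodes_match(sa, sb))
            return 0;
        sa->subgraph_id = id_a;
        sb->subgraph_id = id_b;
        int pairs = 1;
        for (int i = 0; i < 2; i++)
            pairs += grow_from_seeds(sa->preds[i], sb->preds[i], id_a, id_b);
        return pairs;
    }

    int main(void) {
        /* Two identical 2-node subgraphs: an add feeding a multiply. */
        Node add0 = { OP_ADD, OPND_REG, {NULL, NULL}, -1 };
        Node add1 = { OP_ADD, OPND_REG, {NULL, NULL}, -1 };
        Node mul0 = { OP_MUL, OPND_REG, {&add0, NULL}, -1 };
        Node mul1 = { OP_MUL, OPND_REG, {&add1, NULL}, -1 };
        printf("packed node pairs: %d\n", grow_from_seeds(&mul0, &mul1, 0, 1));  /* 2 */
        return 0;
    }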

Slide 13: 2. SIMD Lane Assignment
– Select subgraphs to be packed and assign them to SIMD lane groups:
  – Pack the maximum number of instructions with minimum overhead.
  – Guarantee safe parallel execution without dependence violations.
  – Criteria: gain, affinity, partial order.
– 1. Subgraph gain: assign lanes in decreasing order of gain (here gain A > B > C > D).
– 2. Affinity cost: accounts for data movement between different subgraphs, using producer/consumer and common producer/consumer relations. The affinity value counts how many related operations exist between subgraphs; a subgraph is assigned to the lane group it has the highest affinity with (e.g., B0 is closer to A0 than to A1).
– 3. Partial order check: the partial order of identical subgraphs inside the SIMD lanes must be the same; a conflict arises when it differs (e.g., C0 ≠ C1).
– [Figure: subgraphs A0/A1, B0/B1, C0/C1, and D placed on lane groups 0-3 and 4-7 over time.]
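A compact sketch of that ordering: visit candidate subgraph pairs in decreasing gain order, then place each on the lane group it shares the most related operations with. The structures, the scores, and the two-lane-group setup are mine for illustration; the partial-order check is only indicated by a comment.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        const char *name;
        int gain;          /* packed instructions minus inter-lane moves       */
        int affinity[2];   /* related operations shared with each lane group   */
    } SubgraphPair;

    /* Sort candidates by decreasing gain. */
    static int by_gain_desc(const void *a, const void *b) {
        return ((const SubgraphPair *)b)->gain - ((const SubgraphPair *)a)->gain;
    }

    int main(void) {
        SubgraphPair cands[] = {
            { "A", 7, {3, 0} },
            { "C", 2, {0, 2} },
            { "B", 5, {2, 1} },
        };
        size_t n = sizeof cands / sizeof cands[0];

        /* 1. Subgraph gain: consider pairs in decreasing order of gain. */
        qsort(cands, n, sizeof cands[0], by_gain_desc);

        /* 2. Affinity: place each pair on the lane group with the highest
         *    affinity. 3. A real implementation would also reject placements
         *    that violate the partial order of already-assigned subgraphs. */
        for (size_t i = 0; i < n; i++) {
            int lane_group = cands[i].affinity[1] > cands[i].affinity[0] ? 1 : 0;
            printf("subgraph pair %s (gain %d) -> lane group %d\n",
                   cands[i].name, cands[i].gain, lane_group);
        }
        return 0;
    }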

Slide 14: Experimental Setup
– 144 loops from industry-level optimized media applications:
  – AAC decoder (MPEG-4 audio decoding, low-complexity profile).
  – H.264 decoder (MPEG-4 video decoding, baseline profile, QCIF).
  – 3D (3D graphics rendering).
– Target architecture: wide vector machines.
  – SIMD width: 16 to 64.
  – SODA-style wide vector instruction set.
  – Single-cycle data shuffle instruction (≈ vperm in VMX, vec_perm in AltiVec).
– IMPACT frontend compiler plus a cycle-accurate simulator.
– Compared against two other solutions:
  – SLP: superword-level parallelism (basic-block-level SIMDization) [Larsen, PLDI '00].
  – ILP: instruction-level parallelism on VLIW machines of the same width.
– SGLP applied at degrees 2 to 4.

Slide 15: Static Performance
– SGLP retains a similar trend to ILP after overhead is taken into account.
– Max 1.66x at 4-way (SLP: 1.27x).
– See the paper for representative kernels (FFT, DCT, HalfPel, ...).
– [Chart: static speedups for AAC, 3D, H.264, and the average.]

Slide 16: Dynamic Performance on SIMD
– The available degree of SGLP (up to 4-way) is exploited only when the natural SIMD width of a loop is insufficient.
– Max 1.76x speedup (SLP: 1.29x).
– [Chart: dynamic speedups for AAC, 3D, H.264, and the average.]

Slide 17: Energy for H.264 Execution
– 200 MHz clock, IBM 65 nm technology.

                      SGLP @ 32-wide SIMD   ILP @ 4-way 8-wide VLIW   gain
    power (mW)        54.40                 93.17                     -31.61%
    cycles (million)  13.07                 10.77                     +21.36%
    energy (mJ)       3.55                  5.02                      -29.14%

– Roughly 30% more energy efficient.
– [Figure: the SIMD configuration, organized as control plus 8-wide SIMD clusters, versus the VLIW baseline.]
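– Sanity check (mine, not on the slide): the energy row is consistent with energy = power × (cycles / 200 MHz). SGLP: 13.07 M cycles / 200 MHz ≈ 65.4 ms, and 54.40 mW × 65.4 ms ≈ 3.56 mJ. ILP: 10.77 M cycles / 200 MHz ≈ 53.9 ms, and 93.17 mW × 53.9 ms ≈ 5.02 mJ.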

Slide 18: Conclusion
– SIMD is an energy-efficient solution for mobile systems.
– SIMD programming of multimedia applications is an interesting challenge because applications exhibit varying degrees of SIMD parallelism.
– Subgraph-level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside the vectorized basic block.
– SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture.

Slide 19: Questions?
– For more information: http://cccp.eecs.umich.edu

Slide 20: Example 2: High-level View
– Kernels have different natural SIMD widths: kernel 0 is 8-wide, kernel 1 is 4-wide, kernel 2 is 8-wide.
– [Figure: over time, kernels 0 and 2 fill lane groups 0-3 and 4-7, while kernel 1 naturally fills only lanes 0-3; SGLP moves its subgraphs A1 and C1 onto the idle lanes 4-7, and their results must move back across lanes to reach B and D.]
– Gain = (A1 + C1) (SIMD) – ((A1→B) + (C1→D)) (inter-lane move overhead).

Slide 21: Static Performance (kernels and applications)
– Performance results depend on kernel characteristics (e.g., MatMul4x4 vs. MatMul3x3).
– SGLP retains a similar trend to ILP after overhead is taken into account.
– Max 1.66x at 4-way (SLP: 1.27x).
– [Chart: static speedups for the kernels FFT, MDCT, MatMul4x4, MatMul3x3, HalfPel, QuarterPel and the applications AAC, 3D, H.264, plus the average.]

