New Algorithms for SIMD Alignment Liza Fireman - Technion Ayal Zaks – IBM Haifa Research Lab Erez Petrank – Microsoft Research & Technion.



Fireman, Petrank & Zaks, SIMD Alignment Alg's
SIMD (Single Instruction Multiple Data)
Support packed vector operations.
[Figure: element-wise addition of two packed vectors]

SIMD (Single Instruction Multiple Data)
Support packed vector operations.
Widely used with multimedia extensions.
– AltiVec (IBM, Motorola), SSE (Intel).
Manual programming for SIMD is error-prone.
Automatically generating optimized code for SIMD (auto-vectorization or simdization) is challenging, but promising.
One challenge: satisfy the alignment constraint imposed by the hardware.
– AltiVec: 16-byte registers are loaded from 16 consecutive, aligned bytes of memory.
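The alignment constraint comes down to simple address arithmetic. The following is a minimal sketch in C, assuming 16-byte vectors as on AltiVec; the names `VECTOR_BYTES`, `misalignment_bytes`, and `stream_alignment` are illustrative and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_BYTES 16 /* assumed register width, as on AltiVec */

/* Byte offset of p within its 16-byte block; 0 means p is aligned. */
size_t misalignment_bytes(const void *p) {
    return (size_t)((uintptr_t)p % VECTOR_BYTES);
}

/* Alignment, in elements, of the stream base[i + c].
 * Assumes base is 16-byte aligned and i advances by whole vectors. */
int stream_alignment(int c, size_t elem_size) {
    assert(elem_size > 0 && VECTOR_BYTES % elem_size == 0);
    int per_vector = (int)(VECTOR_BYTES / elem_size);
    /* The double modulo keeps the result non-negative for negative c. */
    return ((c % per_vector) + per_vector) % per_vector;
}
```

With 4-byte elements, a reference such as `b[i+1]` has alignment 1 under these assumptions, and `b[i+5]` has the same alignment because it sits one element past a vector boundary.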

Misaligned Streams
Example:
for (i = 0; i < 1000; i++) a[i] = b[i+1] + c[i+2];
The above code requires additional realignment operations.
[Figure: the stream b[i+1] ... b[i+4] is realigned from the aligned memory stream b[i] ... b[i+7]]
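What a realignment operation does can be emulated in portable scalar C. This is only a sketch: it assumes 4 floats per 16-byte vector, that `n` is a multiple of the vector length, and that `b` and `c` hold at least `n + VLEN` elements. The helper names `realign` and `add_misaligned` are hypothetical. Real hardware would use a permute or shift instruction instead of the inner loop.

```c
#include <assert.h>

#define VLEN 4 /* assumed: 4 floats per 16-byte vector */

/* Emulates one "shift": from two adjacent aligned vectors lo and hi,
 * extract the VLEN elements that start at offset `shift` inside lo. */
void realign(const float *lo, const float *hi, int shift, float *out) {
    assert(shift >= 0 && shift < VLEN);
    for (int j = 0; j < VLEN; j++) {
        int k = j + shift;
        out[j] = (k < VLEN) ? lo[k] : hi[k - VLEN];
    }
}

/* Computes a[i] = b[i+1] + c[i+2] while touching b and c only in
 * aligned VLEN-sized chunks; each input stream costs one shift. */
void add_misaligned(float *a, const float *b, const float *c, int n) {
    assert(n % VLEN == 0);
    for (int i = 0; i < n; i += VLEN) {
        float vb[VLEN], vc[VLEN];
        realign(b + i, b + i + VLEN, 1, vb); /* b stream has alignment 1 */
        realign(c + i, c + i + VLEN, 2, vc); /* c stream has alignment 2 */
        for (int j = 0; j < VLEN; j++)
            a[i + j] = vb[j] + vc[j];
    }
}
```

The two `realign` calls per iteration are the extra work that the rest of the talk tries to minimize.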

The SIMD Alignment Problem
Given an expression, execute it with the minimum number of “shifts”.
Requirements:
– Input and output operands come with a specified alignment.
– The inputs to each operation have the same alignment.
Shifts are usually executed inside a loop and have a noticeable cost.

A Graph Abstraction
Represent expressions as graphs (standard).
Annotate alignments of inputs and outputs.
A solution assigns an alignment to each inner vertex (operation): the alignment shared by that operation's inputs.
Mapping graph solutions to expression solutions (and vice versa) is easy if each array appears in a single alignment only.
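To make the abstraction concrete, here is a small sketch in C that evaluates a candidate solution on a tree-shaped expression. It assumes the cost model suggested by the slides: one shift for every edge whose two endpoints carry different alignments, plus one if the root disagrees with the store. The type `ExprNode` and the functions `count_shifts` and `example_shifts` are illustrative names. The example encodes the expression a[i+3] = b[i+1]*c[i+3] + d[i+3]*e[i+2] used in the talk.

```c
#include <assert.h>

#define MAX_OPERANDS 2

typedef struct {
    int num_operands;           /* 0 for an array reference (leaf) */
    int operands[MAX_OPERANDS]; /* indices of the operand nodes */
} ExprNode;

/* align[v] is the alignment of node v: given for leaves, chosen for
 * operations. The root is nodes[n - 1]. Valid for trees only. */
int count_shifts(const ExprNode *nodes, int n, const int *align,
                 int store_alignment) {
    assert(n > 0);
    int shifts = 0;
    for (int v = 0; v < n; v++)
        for (int j = 0; j < nodes[v].num_operands; j++)
            if (align[nodes[v].operands[j]] != align[v])
                shifts++; /* operand must be shifted to match v */
    if (align[n - 1] != store_alignment)
        shifts++;         /* result must be shifted before the store */
    return shifts;
}

/* Cost of placing all three operations at the same alignment. */
int example_shifts(int op_alignment) {
    /* nodes 0..3: b, c, d, e; 4: b*c; 5: d*e; 6: + (root) */
    const ExprNode nodes[7] = {
        {0, {0, 0}}, {0, {0, 0}}, {0, {0, 0}}, {0, {0, 0}},
        {2, {0, 1}}, {2, {2, 3}}, {2, {4, 5}}
    };
    int align[7] = {1, 3, 3, 2, op_alignment, op_alignment, op_alignment};
    return count_shifts(nodes, 7, align, 3);
}
```

Under this model, running every operation at alignment 3 costs 2 shifts, while running every operation at alignment 0 costs 5.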

A (Tree) Example
a[i+3] = b[i+1]*c[i+3] + d[i+3]*e[i+2]
[Figure: expression tree with leaves b, c, d, e, two * nodes, and output a]

Previous Heuristics
Several simple heuristics have been proposed to solve the alignment problem.
– The Zero-Shift Policy
– The Eager-Shift Policy
– The Lazy-Shift Policy
– The Majority Policy

Talk Outline
Introduction: SIMD alignment, graph abstraction, heuristics.
Tree expressions: dynamic programming.
Expressions with two alignments: node multiway cuts.
The general case.
Measurements.
Conclusions.

Two Interesting Special Cases
Single-appearance tree expressions
– Each array appears once in the input.
Expressions with only two alignments
– Each array appears with only one of the alignments.
We present two efficient algorithms that solve the problem optimally for these two cases.

A Tree Example
a[i+3] = b[i+1]*c[i+3] + d[i+3]*e[i+2]
[Figure: expression tree with leaves b, c, d, e, two * nodes, a + node, and output a]

Optimal Algorithm for a Tree
Dynamic programming.
– Progressive local computations for the global optimum.
[Figure: a node j with children l and m, each annotated with a table indexed 1, 2, ..., k]

[Figure: the tree example with leaves b, c, d, e and output a, annotated with per-node cost tables containing the entries 0, 2, and ∞]

Complexity – Tree Expressions
Traverse the tree nodes; for each possible alignment, do work for each incoming edge.
Overall O(k|V|), where k is the number of possible alignments.
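The dynamic program can be sketched in C as follows. This is a reconstruction under stated assumptions, not the authors' code: `cost[v][a]` is taken to be the minimum number of shifts in the subtree of `v` when `v` runs at alignment `a`, a leaf costs 0 at its given alignment and is infeasible elsewhere, and each operand either already matches or pays one shift. The names `Node`, `min_shifts_tree`, `K`, and `MAX_NODES` are hypothetical, and `K = 4` assumes four elements per vector.

```c
#include <assert.h>
#include <limits.h>

#define K 4          /* assumed number of possible alignments */
#define MAX_NODES 64 /* size limit of this sketch */
#define INF INT_MAX  /* marks an infeasible alignment for a leaf */

typedef struct {
    int is_leaf;      /* 1 for an array reference */
    int alignment;    /* given alignment; used only for leaves */
    int num_children;
    int children[2];  /* indices of operand nodes */
} Node;

/* Nodes must be listed children-first; the root is nodes[n - 1]. */
int min_shifts_tree(const Node *nodes, int n, int store_alignment) {
    int cost[MAX_NODES][K], best[MAX_NODES];
    assert(n > 0 && n <= MAX_NODES);
    for (int v = 0; v < n; v++) {
        best[v] = INF;
        for (int a = 0; a < K; a++) {
            if (nodes[v].is_leaf) {
                cost[v][a] = (a == nodes[v].alignment) ? 0 : INF;
            } else {
                int sum = 0;
                for (int j = 0; j < nodes[v].num_children; j++) {
                    int c = nodes[v].children[j];
                    int keep = cost[c][a];   /* child already at a */
                    int shift = best[c] + 1; /* child's best, then shift */
                    sum += (keep < shift) ? keep : shift;
                }
                cost[v][a] = sum;
            }
            if (cost[v][a] < best[v]) best[v] = cost[v][a];
        }
    }
    int keep = cost[n - 1][store_alignment];
    int shift = best[n - 1] + 1;
    return (keep < shift) ? keep : shift;
}

/* a[i+3] = b[i+1]*c[i+3] + d[i+3]*e[i+2] from the tree example. */
int example_tree_min_shifts(void) {
    const Node t[7] = {
        {1, 1, 0, {0, 0}}, {1, 3, 0, {0, 0}}, /* b, c */
        {1, 3, 0, {0, 0}}, {1, 2, 0, {0, 0}}, /* d, e */
        {0, 0, 2, {0, 1}}, {0, 0, 2, {2, 3}}, /* b*c, d*e */
        {0, 0, 2, {4, 5}}                     /* + */
    };
    return min_shifts_tree(t, 7, 3);
}
```

Because `best[c]` is computed once per child, each (node, alignment) pair does constant work per incoming edge, which matches the O(k|V|) bound. On the example, the sketch returns 2.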

It Doesn’t Work on DAGs
[Figure: a small DAG with a + node and a * node, with inputs labeled with alignment 1]

Two-Alignments Expressions
Not necessarily trees.
Only two alignments in the expression.

[Figure: expression DAG with inputs a, b, c, d, e, four * nodes, and output f]
for (i = 0; i < 1000; i++) f[i] = (b[i+1]*a[i+1] + c[i+1]*a[i+1]) + (d[i]*a[i+1] + e[i]*a[i+1]);
Only a single shift required here.

Node Multi-way cut
[Figure: a graph over nodes a, b, c, d, e, f whose terminal sets S1, S2, S3 are separated by a node cut]

Using Node Multiway Cuts
Choosing a node for the cut means: shift after executing it.
To make sure that all inputs of an operation get aligned: link them all to each other!
– Moral graphs
[Figure: a * operation with inputs a, b, c, d labeled with alignments 0 and 1]

[Figure: the example expression graph over a, b, c, d, e, f with terminal sets S0 and S1]

[Figure: expression DAG with inputs a, b, c, d, e, four * nodes, and output f, with the cut nodes marked]
for (i = 0; i < 1000; i++) f[i] = (b[i+1]*a[i+1] + c[i+1]*a[i+1]) + (d[i]*a[i+1] + e[i]*a[i+1]);
Cost: 2 shifts.
Each cut node implies one shift.

The Algorithm
Create the modified graph.
Find a 2-way min-cut via max-flow algorithms.
Complexity: min-cut on the modified graph takes O(|V|^4).
Node cuts with three or more terminal sets are NP-hard.
The node-cut and edge-cut problems are different from the SIMD alignment problem, but relations exist.
– Derive approximation algorithms for SIMD alignment from approximation algorithms for node-cut and edge-cut. (See paper.)
No NP-completeness result is known for the SIMD alignment problem.
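The min-cut step can be illustrated with a standard minimum vertex cut computed by node splitting and augmenting paths. The slides do not spell out how the modified graph is built, so the construction below is an assumption. It links the operands of each operation to one another, adds two hypothetical terminal nodes `S0` and `S1` joined to the alignment-0 and alignment-1 arrays, and gives every other node unit cost. The simple depth-first augmentation is chosen for brevity and does not reflect the quoted complexity bound. All names are illustrative.

```c
#include <assert.h>
#include <string.h>

#define MAXV 16  /* maximum number of graph nodes in this sketch */
#define BIG 1000 /* stands in for infinite capacity */

static int cap[2 * MAXV][2 * MAXV]; /* residual capacities, split graph */
static int seen[2 * MAXV];

/* Depth-first search for one augmenting path; returns the flow pushed. */
static int augment(int u, int t, int nn, int limit) {
    if (u == t) return limit;
    seen[u] = 1;
    for (int v = 0; v < nn; v++) {
        if (!seen[v] && cap[u][v] > 0) {
            int pushed = augment(v, t, nn, cap[u][v] < limit ? cap[u][v] : limit);
            if (pushed > 0) {
                cap[u][v] -= pushed;
                cap[v][u] += pushed;
                return pushed;
            }
        }
    }
    return 0;
}

/* Fewest nodes (other than s and t) whose removal separates s from t.
 * Node v is split into 2v (in) and 2v+1 (out) joined by a unit arc. */
int min_vertex_cut(int n, int adj[MAXV][MAXV], int s, int t) {
    assert(n <= MAXV && s != t && !adj[s][t]);
    memset(cap, 0, sizeof cap);
    for (int v = 0; v < n; v++) {
        cap[2 * v][2 * v + 1] = (v == s || v == t) ? BIG : 1;
        for (int w = 0; w < n; w++)
            if (adj[v][w]) cap[2 * v + 1][2 * w] = BIG;
    }
    int flow = 0, pushed;
    do {
        memset(seen, 0, sizeof seen);
        pushed = augment(2 * s, 2 * t + 1, 2 * n, BIG);
        flow += pushed;
    } while (pushed > 0);
    return flow;
}

static void connect_nodes(int adj[MAXV][MAXV], int u, int v) {
    adj[u][v] = adj[v][u] = 1;
}

/* f[i] = (b[i+1]*a[i+1] + c[i+1]*a[i+1]) + (d[i]*a[i+1] + e[i]*a[i+1]) */
int example_two_alignment_cut(void) {
    enum { B, C, A, D, E, F, M1, M2, M3, M4, P1, P2, P3, S0, S1, N };
    static const int edges[][2] = {
        /* operand-to-operation edges */
        {B, M1}, {A, M1}, {C, M2}, {A, M2}, {D, M3}, {A, M3}, {E, M4}, {A, M4},
        {M1, P1}, {M2, P1}, {M3, P2}, {M4, P2}, {P1, P3}, {P2, P3}, {P3, F},
        /* links between operands of the same operation */
        {B, A}, {C, A}, {D, A}, {E, A}, {M1, M2}, {M3, M4}, {P1, P2},
        /* assumed terminals: S0 joins alignment-0 arrays, S1 alignment-1 */
        {S0, D}, {S0, E}, {S0, F}, {S1, B}, {S1, C}, {S1, A}
    };
    int adj[MAXV][MAXV] = {{0}};
    for (size_t k = 0; k < sizeof edges / sizeof edges[0]; k++)
        connect_nodes(adj, edges[k][0], edges[k][1]);
    return min_vertex_cut(N, adj, S0, S1);
}
```

On this assumed graph the minimum cut has size 2 (for instance, the array a and the sum b*a + c*a), which agrees with the cost of 2 shifts stated for the example.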

Measurements
Part 1: How much does it cost to shift?
Part 2: Generate random graphs and check OPT/HEU for:
– Single-appearance trees
– 2-alignment DAGs
OPT = # shifts used by the optimal solution
HEU = # shifts used by the best heuristic

Part 1: cost of shift
[Figure: example expression graph over arrays a, b, c, d, e, f]
Cost of best heuristic: 2. Cost of optimal solution: 1.
6% runtime improvement.

Random Trees: OPT/HEU

Random Layered Graphs: OPT/HEU
w: width of the layered graph
d: depth of the graph

Related Work
[Eichenberger-Wu-O’Brien PLDI’04]:
– Set of alignment heuristics.
– Code generation (can use our algorithms).
[Wu-Eichenberger-Wang CGO 2005]:
– Runtime alignments.
Several compilers (e.g., GCC, VAST, compilers for SSE) use the zero-shift policy.
[Ren-Wu-Padua PLDI 2006] handle strides > 1.
Much literature on distributing data to processors.

Summary
The SIMD alignment problem is important.
Previously only heuristics were used.
We propose optimal algorithms for:
– single-appearance tree expressions
– expressions with only two alignments
Guaranteed approximation ratio for the general case.
Measurements show that the optimizations are effective.
Future work: is SIMD alignment NP-complete, or can you solve it?
– More special cases?