Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms. Presented by Yuanyuan Sun and Feiteng Yang

Source: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures. Authors: Timothy Furtak, José Nelson Amaral, Robert Niewiadomski

Outline: Introduction; Sorting network; Sorting algorithms; Experimental evaluation; Contributions

Introduction: Use SIMD resources to improve the performance of sorting algorithms for short sequences. Initial inspiration: the need for fast sorting of short sequences in the implementation of graphics rendering for interactive video games. SIMD machinery.

Introduction: SIMD machinery. x86-64's SSE2 (Streaming SIMD Extensions 2) and the G5's AltiVec are SIMD instruction sets; both feature 128-bit vector registers.

Sorting network: a comparator network that produces a sorted output for any possible input sequence. COMP(a, b): the inputs a and b are two storage units (memory locations, registers, or vector-register elements), each containing a numerical input.

Sorting network Size: the total number of comparators in the network. Depth: the length of the critical path in its dependence graph.
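The two measures can be made concrete with a small sketch. It assumes a network is given as a list of comparators in execution order, each naming two wire indices; the names `Comparator`, `FOUR_INPUT`, and `network_depth` are illustrative and not from the paper. Size is the length of the list; depth is found by tracking, for every wire, the level of the last comparator that touched it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WIRES 16

/* One comparator acting on wires i and j (hypothetical representation). */
typedef struct { int i, j; } Comparator;

/* A common 4-input network, listed in execution order; its size is 5. */
static const Comparator FOUR_INPUT[5] = {
    {0, 2}, {1, 3},   /* can run in parallel */
    {0, 1}, {2, 3},   /* can run in parallel */
    {1, 2}
};

/* Depth = length of the longest chain of dependent comparators. */
int network_depth(const Comparator *net, size_t count, int wires) {
    assert(wires >= 1 && wires <= MAX_WIRES);
    int level[MAX_WIRES] = {0};   /* level of the last comparator on each wire */
    int depth = 0;
    for (size_t k = 0; k < count; k++) {
        int i = net[k].i, j = net[k].j;
        assert(i >= 0 && i < wires && j >= 0 && j < wires && i != j);
        /* A comparator must wait for both of its input wires. */
        int l = (level[i] > level[j] ? level[i] : level[j]) + 1;
        level[i] = level[j] = l;
        if (l > depth) depth = l;
    }
    return depth;
}
```

For the five-comparator network listed here, `network_depth(FOUR_INPUT, 5, 4)` evaluates to 3: the first two pairs of comparators are independent of each other, so only three steps lie on the critical path.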

Sorting network

A comparator moves the larger value to the left and the smaller value to the right. For instance, the network in Figure 1 has size = 5 and depth = 3. Inputs: a = 7, b = 2, c = 5, d = 9. Outputs: a = 9, b = 7, c = 5, d = 2.
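The example can be traced with a short scalar sketch. The wiring of Figure 1 is not recoverable from this transcript, so the code assumes one standard five-comparator, depth-three network for four inputs; `comp` and `sort4_desc` are illustrative names.

```c
#include <assert.h>

/* Comparator: leaves the larger value in *left and the smaller in *right. */
static void comp(float *left, float *right) {
    if (*left < *right) {
        float t = *left;
        *left = *right;
        *right = t;
    }
}

/* Sorts v[0..3] (a, b, c, d) into descending order with 5 comparators. */
void sort4_desc(float v[4]) {
    assert(v != 0);
    comp(&v[0], &v[2]); comp(&v[1], &v[3]);   /* step 1 */
    comp(&v[0], &v[1]); comp(&v[2], &v[3]);   /* step 2 */
    comp(&v[1], &v[2]);                       /* step 3 */
}
```

With the inputs 7, 2, 5, 9, step 1 yields 7, 9, 5, 2, step 2 yields 9, 7, 5, 2, and step 3 leaves that order unchanged, which matches the outputs stated on the slide.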

Supporting hardware for Sorting Network: min and max instructions. min(a, b) = a if a ≤ b, and b otherwise; max(a, b) = a if a ≥ b, and b otherwise. The comparator required by a sorting network is easily constructed using these two operations, a copy instruction, and a temporary variable.
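A minimal scalar sketch of that construction follows. The helper names `min_f`, `max_f`, and `comp_minmax` are illustrative; the point is only that one copy, one max, and one min replace the compare-and-branch of a conventional swap.

```c
#include <assert.h>

/* min and max exactly as defined on the slide. */
static float min_f(float a, float b) { return a <= b ? a : b; }
static float max_f(float a, float b) { return a >= b ? a : b; }

/* Comparator: afterwards *a holds the larger value and *b the smaller. */
void comp_minmax(float *a, float *b) {
    assert(a != b);
    float t = *a;            /* copy a into a temporary            */
    *a = max_f(*a, *b);      /* a receives the maximum             */
    *b = min_f(t, *b);       /* b receives the minimum of the pair */
}
```

The temporary is needed because the first assignment overwrites `*a` before the minimum has been computed.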

Supporting hardware for Sorting Network: the x86-64 architecture supports SSE2 min and max operations that return the element-wise minimum (maximum) of packed single-precision floating-point values.

Supporting hardware for Sorting Network: Width is the number of vectors being sorted. x86-64 has 16 XMM vector registers, and each register can hold 4 single-precision floating-point values. Sorting the values in n XMM registers using a sorting network produces 4 sorted streams of data of length n, where 1 ≤ n < 16 because one register must be reserved as temporary storage for the swap of values.
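The following portable sketch models that behavior without real SIMD hardware. `Vec4` stands in for one XMM register, and `vmin`/`vmax` stand in for the packed min and max operations (on x86 they would correspond to intrinsics such as `_mm_min_ps` and `_mm_max_ps`). For an arbitrary n the sketch uses odd-even transposition sort as the comparator network, which is an assumption chosen for brevity, not necessarily the network used in the paper.

```c
#include <assert.h>

/* Stand-in for one 128-bit register holding 4 floats. */
typedef struct { float lane[4]; } Vec4;

static Vec4 vmin(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int k = 0; k < 4; k++)
        r.lane[k] = a.lane[k] <= b.lane[k] ? a.lane[k] : b.lane[k];
    return r;
}

static Vec4 vmax(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int k = 0; k < 4; k++)
        r.lane[k] = a.lane[k] >= b.lane[k] ? a.lane[k] : b.lane[k];
    return r;
}

/* One vector comparator: four scalar comparators performed at once. */
static void vcomp(Vec4 *lo, Vec4 *hi) {
    Vec4 t = *lo;            /* the reserved temporary register */
    *lo = vmin(*lo, *hi);
    *hi = vmax(t, *hi);
}

/* Sorts each of the 4 lanes independently across r[0..n-1], ascending. */
void simd_column_sort(Vec4 *r, int n) {
    assert(n >= 1 && n < 16);
    /* Odd-even transposition network: n rounds of neighbor comparators. */
    for (int round = 0; round < n; round++)
        for (int i = round % 2; i + 1 < n; i += 2)
            vcomp(&r[i], &r[i + 1]);
}
```

After the call, lane k of `r[0]`, `r[1]`, ..., `r[n-1]` forms one sorted stream of length n, giving four such streams in total. No comparison ever happens between lanes, which is why a second pass is still needed to combine the streams.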

Three sorting methods: two-pass sorting with insertion sort; two-pass sorting with merge sort; one-pass sorting (register sorting)

Two-pass sorting: In the first phase, the SIMD registers and instructions are used to generate a partially sorted output. In the second phase, a standard sorting algorithm (insertion sort and merge sort are investigated in this paper) finishes the sorting.

First phase: SIMD sort. Vector registers: A1 B1 C1 D1 / A2 B2 C2 D2 / ... / An Bn Cn Dn. After the SIMD sort, each of the four columns (A, B, C, D) is a sorted stream of length n.

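Second phase: insertion sort or merge sort.
Insertion sort: the registers hold A1 A2 A3 A4 / A5 A6 A7 A8 / A9 A10 A11 A12, and after the SIMD sort each column is ordered: A1<A5<A9, A2<A6<A10, A3<A7<A11, A4<A8<A12.
Merge sort: the registers hold A1 A4 A7 A10 / A2 A5 A8 A11 / A3 A6 A9 A12, so the columns form four sorted runs: A1<A2<A3, A4<A5<A6, A7<A8<A9, A10<A11<A12.
Final output: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12.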
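A minimal sketch of the merge-sort variant of the second phase, assuming the four sorted runs have already been written to memory contiguously (run 0 first, then run 1, and so on). The function names and the `MAX_RUN` bound are illustrative; the paper's actual implementation may differ.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RUN 15   /* at most 15 registers hold data (1 <= n < 16) */

/* Standard two-way merge of sorted runs a and b into out. */
static void merge_runs(const float *a, size_t na,
                       const float *b, size_t nb, float *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* in: four ascending runs of length n, back to back; out: 4n sorted values. */
void merge_four_runs(const float *in, size_t n, float *out) {
    assert(n >= 1 && n <= MAX_RUN);
    float tmp[4 * MAX_RUN];
    merge_runs(in, n, in + n, n, tmp);                  /* runs 0 and 1 */
    merge_runs(in + 2 * n, n, in + 3 * n, n, tmp + 2 * n); /* runs 2 and 3 */
    merge_runs(tmp, 2 * n, tmp + 2 * n, 2 * n, out);    /* final merge  */
}
```

Because the first phase already produced runs of length n, the merge tree has only two levels regardless of n.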

One-pass sorting (register sorting): algorithm input; initial state; align a set of comparators; write values back to memory

4-element example: P1 = {comp(a, c), comp(b, d)}, P2 = {comp(a, b), comp(c, d)}, P3 = {comp(b, c)}

One concrete example

SSE2 instructions used

The method is also applied to sorting key-pointer pairs and to d-heaps.

Evaluation

Contributions: effective use of SIMD resources to improve the performance of sorting short sequences through a reduction in memory references and an increase in ILP.

Contributions: 1. Three algorithms that use the SIMD machinery for efficient in-register sorting of short sequences. 2. A method that uses iterative-deepening search to find fast instruction sequences for moving data within the SIMD registers. 3. An extensive experimental study indicating that the elimination of loads, stores, and branches correlates well with improved performance.
