CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Slides:

Advertisements

Similar presentations

David Hansen and James Michelussi

Advertisements

Fast Fourier Transform for speeding up the multiplication of polynomials an Algorithm Visualization Alexandru Cioaca.

Parallel Processing (CS 730) Lecture 7: Shared Memory FFTs*

Parallel Fast Fourier Transform Ryan Liu. Introduction The Discrete Fourier Transform could be applied in science and engineering. Examples: ◦ Voice recognition.

Digital Kommunikationselektronik TNE027 Lecture 5 1 Fourier Transforms Discrete Fourier Transform (DFT) Algorithms Fast Fourier Transform (FFT) Algorithms.

Fast Fourier Transform Lecture 6 Spoken Language Processing Prof. Andrew Rosenberg.

CS 179: GPU Programming Lecture 7. Week 3 Goals: – More involved GPU-accelerable algorithms Relevant hardware quirks – CUDA libraries.

FFT1 The Fast Fourier Transform. FFT2 Outline and Reading Polynomial Multiplication Problem Primitive Roots of Unity (§10.4.1) The Discrete Fourier Transform.

Lecture #17 INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh,

Data Parallel Algorithms Presented By: M.Mohsin Butt

FFT1 The Fast Fourier Transform by Jorge M. Trabal.

Princeton University COS 423 Theory of Algorithms Spring 2002 Kevin Wayne Fast Fourier Transform Jean Baptiste Joseph Fourier ( ) These lecture.

May 29, Final Presentation Sajib Barua1 Development of a Parallel Fast Fourier Transform Algorithm for Derivative Pricing Using MPI Sajib Barua.

Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 16: Application-Driven Hardware Acceleration (1/4)

Lecture #18 FAST FOURIER TRANSFORM INVERSES AND ALTERNATE IMPLEMENTATIONS Department of Electrical and Computer Engineering Carnegie Mellon University.

Introduction to Algorithms

Fast Fourier Transform Irina Bobkova. Overview I. Polynomials II. The DFT and FFT III. Efficient implementations IV. Some problems.

Fast Fourier Transforms

Parallelizing the Fast Fourier Transform David Monismith cs599.

Processor Architecture Needed to handle FFT algoarithm M. Smith.

1 Chapter 5 Divide and Conquer Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.

CALTECH CS137 Winter DeHon 1 CS137: Electronic Design Automation Day 12: February 6, 2006 Sorting.

CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.

FFT USING OPEN-MP Done by: HUSSEIN SALIM QASIM & Tiba Zaki Abdulhameed

CS 179: GPU Programming Lecture 9 / Homework 3. Recap Some algorithms are “less obviously parallelizable”: – Reduction – Sorts – FFT (and certain recursive.

FFT1 The Fast Fourier Transform. FFT2 Outline and Reading Polynomial Multiplication Problem Primitive Roots of Unity (§10.4.1) The Discrete Fourier Transform.

5.6 Convolution and FFT. 2 Fast Fourier Transform: Applications Applications. n Optics, acoustics, quantum physics, telecommunications, control systems,

CUDA Performance Considerations (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

The Fast Fourier Transform

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 12: Application Lessons When the tires.

Mar. 1, 2001Parallel Processing1 Parallel Processing (CS 730) Lecture 9: Distributed Memory FFTs * Jeremy R. Johnson Wed. Mar. 1, 2001 *Parts of this lecture.

ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Under-Graduate Project Case Study: Single-path Delay Feedback FFT Speaker: Yu-Min.

Inverse DFT. Frequency to time domain Sometimes calculations are easier in the frequency domain then later convert the results back to the time domain.

© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.

CS 193G Lecture 5: Parallel Patterns I. Getting out of the trenches So far, we’ve concerned ourselves with low-level details of kernel programming Mapping.

1 Fast Polynomial and Integer Multiplication Jeremy R. Johnson.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 11 Parallel Computation.

Speaker: Darcy Tsai Advisor: Prof. An-Yeu Wu Date: 2013/10/31

Fast Fourier Transforms. 2 Discrete Fourier Transform The DFT pair was given as Baseline for computational complexity: –Each DFT coefficient requires.

BINARY SEARCH CS16: Introduction to Data Structures & Algorithms Thursday February 12,

CS 179: GPU Programming Lecture 9 / Homework 3. Recap Some algorithms are “less obviously parallelizable”: – Reduction – Sorts – FFT (and certain recursive.

EE345S Real-Time Digital Signal Processing Lab Fall 2006 Lecture 17 Fast Fourier Transform Prof. Brian L. Evans Dept. of Electrical and Computer Engineering.

CS 179: GPU Programming Lecture 7. Week 3 Goals: – More involved GPU-accelerable algorithms Relevant hardware quirks – CUDA libraries.

Chapter 9. Computation of Discrete Fourier Transform 9.1 Introduction 9.2 Decimation-in-Time Factorization 9.3 Decimation-in-Frequency Factorization 9.4.

Discrete Fourier Transform

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 12 Parallel Computation.

DIGITAL SIGNAL PROCESSING ELECTRONICS

Data Structures and Algorithms (AT70. 02) Comp. Sc. and Inf. Mgmt

September 4, 1997 Applied Symbolic Computation (CS 300) Fast Polynomial and Integer Multiplication Jeremy R. Johnson.

CS 179: GPU Programming Lecture 7.

CS38 Introduction to Algorithms

Applied Symbolic Computation

Real-time double buffer For hard real-time

Lecture #17 INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM

Real-time 1-input 1-output DSP systems

4.1 DFT In practice the Fourier components of data are obtained by digital computation rather than by analog processing. The analog values have to be.

The Fast Fourier Transform

Parallel Computation Patterns (Scan)

Mattan Erez The University of Texas at Austin

September 4, 1997 Applied Symbolic Computation (CS 567) Fast Polynomial and Integer Multiplication Jeremy R. Johnson.

Chapter 9 Computation of the Discrete Fourier Transform

Applied Symbolic Computation

Parallel Sorting Algorithms

ECE408 Applied Parallel Programming Lecture 14 Parallel Computation Patterns – Parallel Prefix Sum (Scan) Part-2 © David Kirk/NVIDIA and Wen-mei W.

ECE 498AL Lecture 15: Reductions and Their Implementation

Chapter 19 Fast Fourier Transform

The Fast Fourier Transform

Lecture #18 FAST FOURIER TRANSFORM ALTERNATE IMPLEMENTATIONS

Fast Polynomial and Integer Multiplication

Presentation transcript:

CS 179: GPU Programming Lecture 8

Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Today GPU-accelerated Fast Fourier Transform cuFFT (FFT library)

Signals (again)

“Frequency content” What frequencies are present in our signals?

Discrete Fourier Transform (DFT)

Roots of unity

Discrete Fourier Transform (DFT) Number of distinct values: N, not N 2 !

(Proof)

DFT of x n, even n! DFT of x n, odd n!

(Divide-and-conquer algorithm) Recursive-FFT(Vector x): if x is length 1: return x x_even <- (x0, x2,..., x_(n-2) ) x_odd <- (x1, x3,..., x_(n-1) ) y_even <- Recursive-FFT(x_even) y_odd <- Recursive-FFT(x_odd) for k = 0, …, (n/2)-1: y[k] <- y_even[k] + w k * y_odd[k] y[k + n/2] <- y_even[k] - w k * y_odd[k] return y

(Divide-and-conquer algorithm) Recursive-FFT(Vector x): if x is length 1: return x x_even <- (x0, x2,..., x_(n-2) ) x_odd <- (x1, x3,..., x_(n-1) ) y_even <- Recursive-FFT(x_even) y_odd <- Recursive-FFT(x_odd) for k = 0, …, (n/2)-1: y[k] <- y_even[k] + w k * y_odd[k] y[k + n/2] <- y_even[k] - w k * y_odd[k] return y T(n/2) O(n)

Runtime Recurrence relation: – T(n) = 2T(n/2) + O(n) O(n log n) runtime!Much better than O(n 2 ) (Minor caveat: N must be power of 2) – Usually resolvable

Parallelizable? O(n 2 ) algorithm certainly is! for k = 0,…,N-1: for n = 0,…,N-1:... Sometimes parallelization outweighs runtime! – (N-body problem, …)

Recursive index tree x0, x1, x2, x3, x4, x5, x6, x7 x0, x2, x4, x6x1, x3, x5, x7 x0, x4x2, x6x1, x5x3, x7 x0 x4x2x6x1x5x3x7

Recursive index tree x0, x1, x2, x3, x4, x5, x6, x7 x0, x2, x4, x6x1, x3, x5, x7 x0, x4x2, x6x1, x5x3, x7 x0 x4x2x6x1x5x3x7 Order?

Bit-reversal order 0000reverse of…

Iterative approach x0, x1, x2, x3, x4, x5, x6, x7 x0, x2, x4, x6x1, x3, x5, x7 x0, x4x2, x6x1, x5x3, x7 x0 x4x2x6x1x5x3x7 Stage 3 Stage 2 Stage 1

(Divide-and-conquer algorithm review) Recursive-FFT(Vector x): if x is length 1: return x x_even <- (x0, x2,..., x_(n-2) ) x_odd <- (x1, x3,..., x_(n-1) ) y_even <- Recursive-FFT(x_even) y_odd <- Recursive-FFT(x_odd) for k = 0, …, (n/2)-1: y[k] <- y_even[k] + w k * y_odd[k] y[k + n/2] <- y_even[k] - w k * y_odd[k] return y T(n/2) O(n)

Iterative approach x0, x1, x2, x3, x4, x5, x6, x7 x0, x2, x4, x6x1, x3, x5, x7 x0, x4x2, x6x1, x5x3, x7 x0 x4x2x6x1x5x3x7 Stage 3 Stage 2 Stage 1

Iterative approach Bit-reversed access

Iterative approach Bit-reversed access Stage 1Stage 2Stage 3

Iterative approach Iterative-FFT(Vector x): y <- (bit-reversed order x) N <- y.length for s = 1,2,…,log(N): m <- 2 s w n <- e 2πj/m for k: 0 ≤ k ≤ N-1, stride m: for j = 0,…,(m/2)-1: u <- y[k + j] t <- (w n ) j * y[k + j + m/2] y[k + j] <- u + t y[k + j + m/2] <- u - t return y Bit-reversed access rithms/book6/chap32.htm Stage 1Stage 2Stage 3

Iterative approach Iterative-FFT(Vector x): y <- (bit-reversed order x) N <- y.length for s = 1,2,…,log(N): m <- 2 s w n <- e 2πj/m for k: 0 ≤ k ≤ N-1, stride m: for j = 0,…,(m/2)-1: u <- y[k + j] t <- (w n ) j * y[k + j + m/2] y[k + j] <- u + t y[k + j + m/2] <- u - t return y O(log N) iterations O(N/m) iterations O(m) iterations Total: O(N log N) runtime

CUDA approach Bit-reversed access Stage 1Stage 2Stage 3

CUDA approach Bit-reversed access Stage 1Stage 2Stage 3 __syncthreads() barriers!

CUDA approach Bit-reversed access Stage 1Stage 2Stage 3 __syncthreads() barriers! Non-coalesced memory access!!

CUDA approach Bit-reversed access Stage 1Stage 2Stage 3 __syncthreads() barriers! Load into shared memory Non-coalesced memory access!!

CUDA approach Bit-reversed access Stage 1Stage 2Stage 3 __syncthreads() barriers! Load into shared memory Non-coalesced memory access!! Bank conflicts!!

Generalization Above algorithm works for array sizes of 2 n (integer n) Generalize?

“Bit”-reversal order N = 6: (small example) – Factors into 3,2 1 st digit: base 3 2 nd digit: base

Generalization Radix-3 case (followed by radix-2)

Suppose N = N 1 N 2 (factorization…) Algorithm: (recursive) – Take FFT along rows – Do multiply operations (complex roots of unity) – Take FFT along columns “Cooley-Tukey FFT algorithm”

Inverse DFT/FFT Similarly parallelizable! – (Sign change in complex terms)

cuFFT FFT library included with CUDA – Approximately implements previous algorithms (Cooley-Tukey/Bluestein) Also handles higher dimensions

cuFFT 1D example Correction: Remember to use cufftDestroy(plan) when finished with transforms

cuFFT 3D example Correction: Remember to use cufftDestroy(plan) when finished with transforms

Remarks As before, some parallelizable algorithms don’t easily “fit the mold” – Hardware matters more! – Some resources: Introduction to Algorithms (Cormen, et al), aka “CLRS”, esp. Sec 30.5 “An Efficient Implementation of Double Precision 1-D FFT for GPUs Using CUDA” (Liu, et al.)