# Parallel Processing (CS 730) Lecture 7: Shared Memory FFTs*

Jeremy R. Johnson
Wed. Feb. 14, 2001

*Parts of this lecture were derived from Chapter IX in Lipson.

Introduction
Objective: To derive and implement a shared-memory parallel program for computing the fast Fourier transform (FFT).

Topics:
- Derivation of the FFT
- Recursive version
- Iterative version
- A parallel divide & conquer algorithm using threads
- A parallel loop version using OpenMP
- Obtaining additional parallelism

FFT as a Matrix Factorization
Compute y = F_n x, where F_n is the n-point Fourier matrix.
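
For reference, the radix-2 factorization that the recursive code on the following slide annotates in its comments can be written out as below. This is a sketch reconstructed from those comments, not the slide's original equation; it assumes n = 2m and the sign convention ω_n = e^{-2πi/n} used by the iterative codes later in the lecture.

```latex
\begin{align*}
F_n &= \left[\,\omega_n^{jk}\,\right]_{0 \le j,k < n}, \qquad \omega_n = e^{-2\pi i/n},\\
F_n &= (F_2 \otimes I_m)\; T^n_m\; (I_2 \otimes F_m)\; L^n_2, \qquad n = 2m,\\
T^n_m &= I_m \oplus W_m(\omega_n), \qquad
W_m(\omega_n) = \operatorname{diag}\!\left(1, \omega_n, \ldots, \omega_n^{m-1}\right).
\end{align*}
```

Read right to left: L^n_2 separates the even- and odd-indexed samples, I_2 ⊗ F_m transforms each half, T^n_m scales the second half by the twiddle factors, and F_2 ⊗ I_m combines the halves with sums and differences.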

Matrix Factorizations and Algorithms
    function y = fft(x)
    % Recursive radix-2 FFT. x is a row vector whose length n is a power of 2.
    n = length(x);
    if n == 1
        y = x;
    else
        % [x0 x1] = L^n_2 x
        x0 = x(1:2:n-1);
        x1 = x(2:2:n);
        % [t0 t1] = (I_2 tensor F_m)[x0 x1]
        t0 = fft(x0);
        t1 = fft(x1);
        % w = W_m(omega_n), with omega_n = exp(-2*pi*i/n) as in the iterative codes
        w = exp((-2*pi*i/n)*(0:n/2-1));
        % y = [y0 y1] = (F_2 tensor I_m) T^n_m [t0 t1]
        y0 = t0 + w.*t1;
        y1 = t0 - w.*t1;
        y = [y0 y1];
    end
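
The recursion above can be checked against the definition y = F_n x. The following is a minimal Python sketch, not the lecture's code: the names `dft` and `fft_recursive` are made up for this illustration, and it assumes the negative-exponent convention ω_n = exp(-2πi/n) and an input length that is a power of two.

```python
import cmath


def dft(x):
    """Naive O(n^2) evaluation of y = F_n x (assumed convention: exp(-2*pi*i/n))."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def fft_recursive(x):
    """Radix-2 recursive FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    m = n // 2
    # L^n_2 x: even-indexed samples first, then odd-indexed samples.
    t0 = fft_recursive(x[0::2])
    t1 = fft_recursive(x[1::2])
    # Twiddle factors W_m(omega_n).
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(m)]
    # (F_2 tensor I_m) T^n_m [t0 t1]: one butterfly per index k.
    y0 = [t0[k] + w[k] * t1[k] for k in range(m)]
    y1 = [t0[k] - w[k] * t1[k] for k in range(m)]
    return y0 + y1
```

The two recursive calls work on disjoint halves of the data, which is the property a divide and conquer version using threads would rely on.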

Rewrite Rules

FFT Variants
- Cooley-Tukey
- Recursive FFT
- Iterative FFT
- Vector FFT (Stockham)
- Vector FFT (Korn-Lambiotte)
- Parallel FFT (Pease)

Tensor Permutations
A natural class of permutations compatible with the FFT.
- Let σ be a permutation of {1,…,t}.
- σ induces a mixed-radix counting permutation of the vector indices.
- Well-known examples are stride permutations and bit reversal.

Example (Stride Permutation)
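A stride permutation can be illustrated with a few lines of Python. This is a sketch under an assumed convention: L^n_s gathers the input at stride s, which agrees with how the recursive code uses L^n_2 to put the even-indexed samples first. The function name `stride_perm` is hypothetical.

```python
def stride_perm(x, s):
    """Apply L^n_s: take x[0::s], then x[1::s], ..., then x[s-1::s]."""
    n = len(x)
    if s < 1 or n % s != 0:
        raise ValueError("stride must divide the vector length")
    return [x[j] for r in range(s) for j in range(r, n, s)]
```

For n = 8, `stride_perm(list(range(8)), 2)` gives `[0, 2, 4, 6, 1, 3, 5, 7]`, and applying stride 4 to that result restores the natural order, since L^8_4 undoes L^8_2.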

Example (Bit Reversal)
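Bit reversal sends the element at index k to the index whose t-bit binary representation is that of k read backwards. The Python sketch below shows one way to compute it; it is an assumption about what the `bitreversal` helper called by the iterative codes does, since that helper is not defined in these slides.

```python
def bit_reverse_index(k, t):
    """Reverse the t low-order bits of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def bitreversal(x):
    """Return x permuted by bit reversal; len(x) must be a power of two."""
    n = len(x)
    t = n.bit_length() - 1
    if n < 1 or n != 1 << t:
        raise ValueError("length must be a power of two")
    return [x[bit_reverse_index(k, t)] for k in range(n)]
```

For n = 8, `bitreversal(list(range(8)))` gives `[0, 4, 2, 6, 1, 5, 3, 7]`. Applying the permutation twice returns the original vector.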

Iterative Cooley-Tukey Algorithm
[Figure: data-flow diagram of the iterative Cooley-Tukey algorithm, with the bit-reversal permutation R and Stages 0 to 3.]

Modified Pease Algorithm
[Figure: data-flow diagram of the modified Pease algorithm, Stages 0 to 3. Index labels: natural order 1 to 15, then the even indices 2 to 14 followed by the odd indices 1 to 15.]

Iterative Implementation
    function y = ifft2(x)
    % Input:  x, a column vector of length n. n = 2^t, t an integer, t >= 0.
    % Output: y = F_{2^t} x
    % Algorithm: Iterative.
    %   F_{2^t} = { Prod_{c=1}^t (I_{2^{c-1}} (x) F_2 (x) I_{2^{t-c}})
    %                            (I_{2^{c-1}} (x) T^{2^{t-c+1}}_{2^{t-c}}) } R^{2^t}
    % Notation: (x) is the tensor product, (+) the direct sum.
    % bitreversal is a helper that is not defined on this slide.
    n = length(x);
    t = ceil(log2(n));
    xt = bitreversal(x);
    yt = zeros(n,1);
    for c = t:-1:1
        m = 2^(c-1);    % number of blocks in this stage
        p = 2^(t-c);    % half the block length
        % W = W_p(omega_{2p})
        W = exp(-(2*pi*i)/(2*p)*(0:p-1)');
        % yt = (I_m (x) F_2 (x) I_p)(I_m (x) T^{2p}_p) xt
        for j = 0:m-1
            lo = j*2*p+1:(j*2+1)*p;        % first half of block j
            hi = (j*2+1)*p+1:(j+1)*2*p;    % second half of block j
            % y^{2p}_{j*2p+1} = (F_2 (x) I_p) T^{2p}_p x^{2p}_{j*2p+1}
            %                 = (F_2 (x) I_p)(I_p (+) W) x^{2p}_{j*2p+1}
            xt(hi) = W .* xt(hi);
            yt(lo) = xt(lo) + xt(hi);
            yt(hi) = xt(lo) - xt(hi);
        end
        xt = yt;
    end
    y = xt;    % equals yt after the last stage, and x itself when n == 1
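
The same stage structure can be sketched in Python and checked against the definition of the transform. This is an illustrative sketch rather than a port of the slide: it updates the data in place instead of alternating between two buffers, it loops over the half-block length p directly, and the names `bit_reverse`, `dft` and `fft_iterative` are made up for the example.

```python
import cmath


def dft(x):
    """Naive O(n^2) reference: y = F_n x with omega_n = exp(-2*pi*i/n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def bit_reverse(x):
    """Permute x by reversing the bits of each index (length a power of two)."""
    n = len(x)
    t = n.bit_length() - 1
    out = []
    for k in range(n):
        r = 0
        for b in range(t):
            r |= ((k >> b) & 1) << (t - 1 - b)
        out.append(x[r])
    return out


def fft_iterative(x):
    """Bit reversal followed by log2(n) butterfly stages."""
    n = len(x)
    t = n.bit_length() - 1
    if n < 1 or n != 1 << t:
        raise ValueError("length must be a power of two")
    y = bit_reverse(list(x))
    p = 1  # half the block length; doubles after every stage
    while p < n:
        w = [cmath.exp(-2j * cmath.pi * k / (2 * p)) for k in range(p)]
        for base in range(0, n, 2 * p):   # one block of length 2p
            for k in range(p):            # (F_2 tensor I_p)(I_p direct-sum W)
                a = y[base + k]
                b = w[k] * y[base + p + k]
                y[base + k] = a + b
                y[base + p + k] = a - b
        p *= 2
    return y
```

Within a stage, the blocks are independent of one another, so the loop over `base` is the natural candidate for a parallel loop.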

Iterative Implementation
    function y = ipfft2(x)
    % In-place Pease FFT algorithm.
    % Input:  x, a vector of length n. n = 2^t, t an integer, t >= 0.
    % Output: y = F_{2^t} x
    % Algorithm: Conjugated Pease.
    %   F_{2^t} = { Prod_{c=1}^t (... F_2) T_c L^n_{2^c} } R^{2^t}
    % bitreversal is a helper that is not defined on this slide.
    n = length(x);
    t = ceil(log2(n));
    y = bitreversal(x);
    w = exp(-2*pi*i/n);
    for c = t-1:-1:0
        % Every stage runs the same loop over r = 0..n/2-1.
        for r = 0:2^(t-1)-1
            r0 = mod(r,2^c);
            r1 = floor(r/2^c);
            a0 = r0*2^(t-c) + r1;       % address of the first butterfly input
            a1 = a0 + 2^(t-c-1);        % address of the second butterfly input
            y0 = y(a0+1);
            y1 = w^(r1*2^c) * y(a1+1);  % twiddle factor applied to the second input
            y(a0+1) = y0 + y1;
            y(a1+1) = y0 - y1;
        end
    end
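
To see why the loop over r suits a parallel loop, it helps to list the addresses each stage touches. The Python sketch below reuses the index arithmetic of the code above; the function names `butterfly_addresses` and `pease_fft` are hypothetical, and the bit reversal is an assumption about what the `bitreversal` helper does.

```python
import cmath


def butterfly_addresses(t, c):
    """(a0, a1, twiddle exponent) for each r in stage c of a 2^t-point transform."""
    triples = []
    for r in range(2 ** (t - 1)):
        r0 = r % (2 ** c)
        r1 = r // (2 ** c)
        a0 = r0 * 2 ** (t - c) + r1
        a1 = a0 + 2 ** (t - c - 1)
        triples.append((a0, a1, r1 * 2 ** c))
    return triples


def _reverse_bits(k, t):
    """Reverse the t low-order bits of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def pease_fft(x):
    """In-place butterflies driven by butterfly_addresses, stages c = t-1 .. 0."""
    n = len(x)
    t = n.bit_length() - 1
    if n < 1 or n != 1 << t:
        raise ValueError("length must be a power of two")
    y = [x[_reverse_bits(k, t)] for k in range(n)]
    for c in range(t - 1, -1, -1):
        for a0, a1, e in butterfly_addresses(t, c):
            y0 = y[a0]
            y1 = cmath.exp(-2j * cmath.pi * e / n) * y[a1]
            y[a0] = y0 + y1
            y[a1] = y0 - y1
    return y
```

For t = 3, stage c = 2 yields the pairs (0,1), (2,3), (4,5), (6,7) and stage c = 0 yields (0,4), (1,5), (2,6), (3,7). In every stage the n/2 pairs are disjoint and cover all n addresses, so the iterations over r can run concurrently without conflicts; only the stages themselves must run in order.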