Communication Lower Bound for the Fast Fourier Transform
Michael Anderson
Communication-Avoiding Algorithms (CS294), Fall 2011

Sources
J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In STOC '81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326-333, New York, NY, USA, 1981. ACM.
J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In COCOON, pages 270-281, 1995.
CS256 Applied Theory of Computation, Brown University. Lecture 18 (http://www.cs.brown.edu/courses/csci2560/lectures/lect.18.MemoryHierarchyIII.pdf)
John E. Savage. Models of Computation: Exploring the Power of Computing.
A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116-1127, 1988.

Outline
1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems

Discrete Fourier Transform [Figure: input vector mapped to output vector]

Unroll Output Vector [Figure]

Unroll Input Vector [Figure]

Phrase as Matrix-Vector Multiply [Figure: output vector = DFT matrix times input vector]
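A minimal sketch of the matrix-vector view, in pure Python. The slide's formula is not in this transcript, so the sign convention (entries exp(-2*pi*i*j*k/n)) and the names `dft_matrix` and `dft` are assumptions chosen for illustration.

```python
import cmath

def dft_matrix(n):
    """Return the n-by-n DFT matrix F with F[k][j] = exp(-2*pi*i*j*k/n).

    The negative exponent is an assumed (common) sign convention.
    """
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)]
            for k in range(n)]

def dft(x):
    """Compute the DFT of x as a dense matrix-vector product (O(n^2) work)."""
    n = len(x)
    F = dft_matrix(n)
    return [sum(F[k][j] * x[j] for j in range(n)) for k in range(n)]

# A unit impulse transforms to the all-ones vector.
impulse_spectrum = dft([1, 0, 0, 0])
```

The dense product touches all n^2 matrix entries, which is the cost the factorization on the following slides avoids.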

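DFT Factorization [Figure: input vector, DFT, output vector]

Factorization [Figure: input vector with entries x0, x1; a smaller DFT followed by multiply (*) and add (+) stages; output vector]

Factorization [Figure: two smaller DFTs, each followed by multiply (*) and add (+) stages, between input vector and output vector]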

FFT [Figure: the input vector passes through Shuffle and Compute (*, +) stages to produce the output vector]
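The shuffle-then-compute structure can be sketched as a recursive radix-2 FFT. This is an illustrative decimation-in-time version, not necessarily the exact variant drawn on the slide; it assumes the input length is a power of 2 and the same negative-exponent sign convention as a standard forward DFT.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT (decimation in time); len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    # Shuffle: split into even-indexed and odd-indexed halves.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # Compute: one multiply (*) and two adds (+) per butterfly.
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

spectrum = fft([1, 2, 3, 4])
```

Each of the log2(n) recursion levels performs n/2 butterflies, which gives the layered DAG that the pebble game is played on.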

Red Blue (2-level) Pebble Game
Used to analyze communication in straight-line programs (e.g. matrix multiply, FFT, matrix transpose).
Played on a DAG. Vertices represent inputs, intermediate data, and operations. Edges represent data dependencies.
Pebbles represent cache locations. Pebble color represents a distinct level of the cache hierarchy.
Placing a pebble on a specific vertex means storing that data element in cache.

Red Blue (2-level) Pebble Game: a red pebble represents a location in fast memory; a blue pebble represents a location in slow memory.

Rules of the Red Blue Pebble Game
(Initialization) A blue pebble can be placed on any input vertex at any time.
(Input) A red pebble may be placed on any vertex that contains a blue pebble.
(Output) A blue pebble may be placed on any vertex that contains a red pebble.
(Computation) A red pebble can be placed on any vertex if all of its immediate predecessors have red pebbles.
(Deletion) A pebble can be removed at any time.
(Goal) All output vertices contain blue pebbles.

Playing the Game
A pebbling strategy is a sequence of steps in which the rules of the game are used to move pebbles.
The number of red pebbles (size of fast memory) is limited to S (assume infinite blue pebbles).
A communication lower bound (or minimum I/O time) is obtained by bounding from below the number of (Input) and (Output) moves, over all possible pebbling strategies.
The total number of computation steps should also be minimized.
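The rules can be made concrete with a small checker that replays a strategy and counts I/O moves. This is a sketch under assumptions: the function name `play_red_blue`, the move names, and the dictionary encoding of the DAG are all hypothetical, and the "delete" move here only removes red pebbles (blue pebbles are unlimited, so removing them never helps).

```python
def play_red_blue(preds, inputs, outputs, S, moves):
    """Replay `moves` under the red-blue rules and return the number of I/O moves.

    preds maps each non-input vertex to its immediate predecessors.
    Raises ValueError if a move breaks a rule or the goal is not reached.
    """
    red = set()
    blue = set(inputs)  # (Initialization) inputs start in slow memory
    io_moves = 0
    for op, v in moves:
        if op == "input":  # copy from slow to fast memory
            if v not in blue:
                raise ValueError(f"no blue pebble on {v}")
            red.add(v)
            io_moves += 1
        elif op == "output":  # copy from fast to slow memory
            if v not in red:
                raise ValueError(f"no red pebble on {v}")
            blue.add(v)
            io_moves += 1
        elif op == "compute":  # all immediate predecessors must be red
            if v in inputs or not all(u in red for u in preds[v]):
                raise ValueError(f"cannot compute {v}")
            red.add(v)
        elif op == "delete":  # free one fast-memory location
            red.discard(v)
        else:
            raise ValueError(f"unknown move {op}")
        if len(red) > S:
            raise ValueError("more than S red pebbles in use")
    if not set(outputs) <= blue:
        raise ValueError("goal not reached: an output has no blue pebble")
    return io_moves

# A single 2-point butterfly: both outputs depend on both inputs.
butterfly = {"y0": ["x0", "x1"], "y1": ["x0", "x1"]}
strategy = [("input", "x0"), ("input", "x1"),
            ("compute", "y0"), ("output", "y0"), ("delete", "y0"),
            ("compute", "y1"), ("output", "y1")]
io_count = play_red_blue(butterfly, ["x0", "x1"], ["y0", "y1"], S=3, moves=strategy)
```

With S = 3 the strategy uses four I/O moves (two inputs, two outputs); the same move sequence is rejected with S = 2 because computing y0 would need a third red pebble.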

S-span
The S-span of a DAG G, $\rho(S,G)$, is the maximum number of vertices of G that can be pebbled with S red pebbles in the red pebble game, maximized over all initial placements of the S red pebbles.
The red pebble game is like the red-blue game, but blue pebbles cannot be stored on intermediate vertices.
[Figure: example with S = 6, marking the initial red pebbles and the vertices pebbled from them]

Using S-span for Lower Bounds
Divide the computation into h sub-pebblings ($C_1, C_2, \ldots, C_h$) that each communicate no more than S words between level 1 and level 2.
Each sub-pebbling has 2S words available (S words initially in the cache plus S inputs). Therefore, each sub-pebbling can perform no more than $\rho(2S,G)$ operations.
[Figure: timeline of sub-pebblings $C_1, C_2, \ldots, C_h$, each consisting of input, level-1 operations, and output]

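Using S-span for Lower Bounds
Theorem: For every pebbling P of G = (V,E) in the red-blue pebble game with S red pebbles, the I/O time used, $T_2(S,G,P)$, satisfies:
$\lceil T_2(S,G,P)/S \rceil \, \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$
Here $\lceil T_2(S,G,P)/S \rceil$ is the number of words moved (in batches of S words), $\rho(2S,G)$ is an upper bound on arithmetic intensity (number of operations per 2S words), and $|V| - |\mathrm{In}(G)|$ is the total number of operations, where In(G) is the set of input vertices.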

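What is the S-span of the FFT DAG?
Lemma 1: The S-span of the FFT DAG on n inputs is no greater than $2 S \log_2 S$ when S < n.
Proof: Let num(p) denote the number of moves currently allocated to pebble p. In a butterfly with lower-level nodes $u_1$, $u_2$ and upper-level nodes $v_1$, $v_2$, both pebbles $p_1$ and $p_2$ are moved from $u_1$ and $u_2$ to $v_1$ and $v_2$. (Illegal, but an upper bound.) If num($p_1$) = num($p_2$), then increment both. Otherwise increment the smaller. Each butterfly pebbles two vertices and increases the sum of the counters by at least one, so the total number of red pebbling moves is bounded by:
$\rho(S,G) \le 2 \sum_{i=1}^{S} \mathrm{num}(p_i)$
[Figure: butterfly with pebbles $p_1$, $p_2$ on nodes $u_1$, $u_2$ below nodes $v_1$, $v_2$]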

What is the S-span of the FFT DAG?
Lemma 2: For each pebble p on node n in the FFT DAG, the number of nodes, N(p), that contained a red pebble in the initial configuration and that are connected by a directed path to n is at least $2^{\mathrm{num}(p)}$.

What is the S-span of the FFT DAG?
Proof of Lemma 2 (Induction):
Base case: num(p) = 1. In this case, the node n needed 2 inputs, so N(p) is at least 2.
Inductive step: Assume that N(q) is at least $2^{\mathrm{num}(q)}$ for every pebble q before a butterfly operation. Show that N(p) becomes at least $2^e$ when num(p) is incremented to e during the butterfly operation.
Case 1: Pebbles $p_1$ and $p_2$ enter a butterfly operation with num($p_1$) = num($p_2$) = e-1. Since $u_1$ and $u_2$ are roots of disjoint trees with at least $2^{e-1}$ initial pebbles each, the total number of initial pebbles is now $2 \cdot 2^{e-1} = 2^e$.
Case 2: num(p) = e-1 is smaller than the partner's count, so only num(p) is incremented. The partner's count is at least e, therefore the partner must have been connected to at least $2^e$ initial pebbles, and after the butterfly p is connected to the same initial pebbles.

What is the S-span of the FFT DAG?
There are S pebbles and each pebble can only cover one initial placement. Therefore num(p) is at most $\log_2 S$, because there must be at least $2^{\mathrm{num}(p)}$ initial pebbles (Lemma 2).
According to the counting in the proof of Lemma 1, the total number of pebbling moves is bounded by:
$\rho(S,G) \le 2 \sum_{i=1}^{S} \mathrm{num}(p_i) \le 2 S \log_2 S$
So the S-span is at most $2 S \log_2 S$. QED

FFT Two-level Hierarchy Lower Bound
Number of words moved: $\Omega(N \log N / \log S)$
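A sketch of how the bound can be derived, with constants not optimized. It assumes the S-span theorem has the Hong-Kung/Savage form $\lceil T_2/S \rceil \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$, that the N-input FFT DAG has $N \log_2 N$ non-input vertices ($\log_2 N$ levels of N butterfly outputs), and that $2 \le S \le N/8$ so the S-span bound applies to 2S pebbles.

```latex
\begin{align*}
\left\lceil \frac{T_2(S,G,P)}{S} \right\rceil \rho(2S,G)
  &\ge |V| - |\mathrm{In}(G)| = N \log_2 N, \\
\rho(2S,G) &\le 2(2S)\log_2(2S) = 4S\log_2(2S), \\
\left\lceil \frac{T_2(S,G,P)}{S} \right\rceil
  &\ge \frac{N \log_2 N}{4S \log_2(2S)}, \\
T_2(S,G,P) &\ge S\left(\frac{N \log_2 N}{4S\log_2(2S)} - 1\right)
  = \frac{N \log_2 N}{4\log_2(2S)} - S .
\end{align*}
```

For $S \le N/8$ the first term exceeds $N/4 \ge 2S$, so the subtracted S is at most half of it, and $\log_2(2S) \le 2\log_2 S$ for $S \ge 2$. Together these give $T_2(S,G,P) = \Omega(N \log N / \log S)$.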

Transpose FFT

Transpose FFT (Upper Bound)
Suppose the FFT size is a power of 2 ($N = 2^d$). There are $\log_2 N$ levels in the FFT DAG.
Divide the large FFT into many FFTs of size S, where S is the size of fast memory. There are $\log N / \log S$ stages of independent size-S FFTs.
After each stage, store the outputs in slow memory, for a total of $O(N \log N / \log S)$ words moved between fast and slow memory, which matches the lower bound up to a constant factor.
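The staged scheme can be sketched as a blocked iterative FFT that counts words moved. This is an illustration under assumptions, not the slide's exact algorithm: `blocked_fft` is a hypothetical name, N and S are powers of 2 with S at least 2, the initial bit-reversal shuffle is treated as free, and twiddle factors are recomputed rather than stored in fast memory.

```python
import cmath

def blocked_fft(x, S):
    """Radix-2 FFT done in passes of log2(S) butterfly levels each.

    Each pass loads groups of at most S points into `fast`, runs every level
    of the pass on the group, and writes it back. Returns (result, words_moved).
    """
    n = len(x)
    d = n.bit_length() - 1
    s = S.bit_length() - 1
    assert 1 << d == n and 1 << s == S and s >= 1
    # Bit-reversal permutation (assumed free; not counted as communication).
    slow = [x[int(format(i, "b").zfill(d)[::-1], 2)] for i in range(n)]
    words_moved = 0
    for a in range(0, d, s):              # one pass covers levels a .. a+width-1
        width = min(s, d - a)
        mask = ((1 << width) - 1) << a
        for base in range(n):
            if base & mask:
                continue                  # base must have the pass bits cleared
            idx = [base | (m << a) for m in range(1 << width)]
            fast = [slow[i] for i in idx]             # read group from slow memory
            words_moved += len(idx)
            for k in range(a, a + width):
                half_local, half = 1 << (k - a), 1 << k
                for m in range(1 << width):
                    if m & half_local:
                        continue
                    w = cmath.exp(-2j * cmath.pi * (idx[m] % half) / (2 * half))
                    t = w * fast[m | half_local]
                    fast[m], fast[m | half_local] = fast[m] + t, fast[m] - t
            for m, i in enumerate(idx):               # write group back
                slow[i] = fast[m]
            words_moved += len(idx)
    return slow, words_moved

signal = [complex(i % 5, i % 3) for i in range(16)]
result, words = blocked_fft(signal, 4)
```

Every pass reads and writes each of the N words once, so the count is 2N times the number of passes, ceil(log2 N / log2 S). For N = 16 and S = 4 that is two passes and 64 words.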

Multilevel Pebble Game
The red-blue pebble game was for 2 levels (fast and slow).
In the multilevel game, data begins and ends in the highest-level memory (level L) and can be transferred between consecutive levels (l-1 to l or vice versa).
[Figure: hierarchy from Level-1 (registers) through Level-2 (on-chip cache), ..., up to Level-L (main memory)]

Rules of the Multilevel Pebble Game
(Initialization) A level-L pebble can be placed on any input vertex at any time.
(Computation) A first-level pebble can be placed on any vertex if all of its immediate predecessors have first-level pebbles.
(Deletion) Except for level-L pebbles on output vertices, a pebble at any level can be removed at any time.
(Input from level l) For 2 ≤ l ≤ L, a level-(l-1) pebble can be placed on any vertex carrying a level-l pebble.
(Output to level l) For 2 ≤ l ≤ L, a level-l pebble can be placed on any vertex carrying a level-(l-1) pebble.
(Goal) All output vertices contain level-L pebbles.

Terminology
Resource vector $p = (p_1, p_2, p_3, \ldots, p_{L-1})$, where $p_l$ is the number of pebbles at level l. (The highest level is assumed infinite.)
$s_l$ = sum of all available pebbles at level l or less.
A minimal pebbling minimizes the number of I/O operations at the highest level first, then the number of I/O operations at each successively lower level, and finally the number of computation steps.
$T_l$ = number of I/O operations at level l.

Ignore Higher Levels
Lemma: Let $S_{\min}$ be the minimum number of pebbles needed to pebble G = (V,E) in the red pebble game. If $s_k$, the number of pebbles at level k or less (k < L), exceeds $S_{\min} + (k-1)$, a minimal pebbling P(p,G) with resource vector p in the L-level game does not perform I/O operations at level k+1 or higher except on inputs and outputs.
(If the working set fits in some level k of the memory hierarchy, then we only need to communicate inputs and outputs from and to the levels higher than k.)

Multilevel S-Span
Theorem: Consider a minimal pebbling of the DAG G = (V,E) in the standard memory hierarchy game with resource vector p using $s_l$ pebbles at level l or less. The following lower bound must be satisfied:
$\lceil T_l / s_{l-1} \rceil \, \rho(2 s_{l-1}, G) \ge |V| - |\mathrm{In}(G)|$
[Figure: timeline of level-l sub-pebblings $C_1, C_2, \ldots, C_h$, each consisting of input, level l-1 operations, and output]

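Relating Multilevel to 2-level
Theorem: The following inequality holds for 2 ≤ l ≤ L when the graph G is pebbled in the L-level game with resource vector p:
$T_l(p,G) \ge T_2(s_{l-1},G)$
Here the right-hand side is the I/O time of the two-level game with $s_{l-1}$ red pebbles.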
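A rough sketch of how the two-level FFT bound could be applied level by level. It assumes the relation takes the form $T_l \ge T_2(s_{l-1}, G)$ and plugs in the order-of-magnitude two-level bound N log N / log S with all constant factors dropped. The function name `level_bounds` and the example resource vector are hypothetical.

```python
import math
from itertools import accumulate

def level_bounds(n, p):
    """Order-of-magnitude I/O bound for an n-point FFT at each level l = 2..L.

    p = (p_1, ..., p_{L-1}) is the resource vector. s_l is the number of
    pebbles at level l or less; every s_l must be at least 2.
    Constant factors are dropped, so the values only indicate growth.
    """
    s = list(accumulate(p))  # s[l - 1] holds s_l = p_1 + ... + p_l
    return {l: n * math.log2(n) / math.log2(s[l - 2])
            for l in range(2, len(p) + 2)}

# Three-level example: 8 registers, 64 cache words, unbounded main memory.
bounds = level_bounds(1024, (8, 64))
```

Larger cumulative capacity below a level shrinks the bound at that level, which matches the intuition that traffic to main memory should be lower than traffic between registers and cache.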

Review
The minimum I/O time for the FFT in the 2-level case is $\Theta(N \log N / \log S)$.
This was determined by finding the S-span of the FFT graph and using it to bound the number of words transferred between memory levels.
The transpose FFT algorithm achieves this lower bound (so the lower bound is tight).
Two-level lower bounds can be generalized to multi-level memory hierarchies.

Open Problems
Communication lower bounds for 2-D and 3-D FFTs. I suspect that the S-span argument also holds for the 2-D case. What if S is larger than one row?
Determining the FFT lower bound for the parallel model described in this class. Lower bounds for a "parallel hierarchical memory model" using randomized sorting algorithms for communication can be found in J. S. Vitter and E. A. M. Shriver, "Algorithms for Parallel Memory II: Hierarchical Multilevel Memories".
Using the pebble game (S-span) method to analyze new algorithms. Matrix multiply, sorting, and several other examples can be found in the references listed earlier.

Questions?
