FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
4th October, 2007

FPGA: Overview
□ Work done
□ Structure of a sample program
□ Ongoing work
□ Next step

FPGA: work done
□ Register handling and console I/O
□ Modified simple.c
□ Implemented an adder
□ Used the VirtualBase member of ADMXRC2_SPACE_INFO
□ Registers can be indexed using bits (23 downto 2) of the LAD (local address/data) signal when it addresses the FPGA (see the sketch below)
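A minimal sketch (ours, not the project's code) of the C side of this, assuming a regs pointer already set to the VirtualBase member of ADMXRC2_SPACE_INFO; the register indices are invented for illustration:

#include <stdint.h>

/* regs points at the FPGA register window, i.e. the VirtualBase
   member of ADMXRC2_SPACE_INFO cast to a word pointer. Because the
   design decodes LAD(23 downto 2) as the register index, 32-bit
   register i appears at regs[i]. The indices below are made up. */
void adder_demo(volatile uint32_t *regs)
{
    regs[0] = 5u;            /* hypothetical operand A register */
    regs[1] = 7u;            /* hypothetical operand B register */
    uint32_t sum = regs[2];  /* hypothetical result register    */
    (void) sum;              /* expect 12 from the adder        */
}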

Structure of simple.vhd

entity simple is
  port (
    -- all the local bus signals required
  );
end simple;

architecture …

Ongoing work: ZBT
□ The structure of zbt_main seems similar to simple.c
□ zbt.vhd is a wrapper for zbt_main.vhd
□ The same port names are defined in the same way and port-mapped to each other
□ We do not yet understand the reason for this wrapper
□ The C code is not available in the ADMXRC2 demos
□ Lalit's code also uses ZBT and block RAMs, so we are looking at his C and VHDL code

Next Step
□ Work with ZBT and block RAMs
□ FFT implementation on the FPGA

Multiprocessor FFT: Overview
□ Some improvements to the existing code
□ Improve the theoretical model
□ Compare theoretical run time with actual run time
□ Statistics of each processor
□ Further refinement: using the BSP model
□ Pointers for cache analysis

Optimizations to the code (see the sketch below)
□ Removed auxiliary arrays, reducing memory references considerably:
  □ Twiddle factors
  □ Bit-reversal addresses
□ Faster bit reversal using bit operations: O(1) per address calculation
□ All multiplications/divisions by 2 implemented as shift operations: O(1)
□ Powers of two (2^n) computed in constant time using bit operations: O(1)
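A small sketch, in C, of these bit tricks as we would reconstruct them (the deck does not show its code); the Gold–Rader-style counter below is one standard way to get bit-reversed addresses in amortized O(1) with bit operations only, not necessarily the project's exact method:

#include <stdio.h>

/* Multiply/divide by 2 and compute 2^n via shifts: O(1). */
#define MUL2(x)  ((x) << 1)
#define DIV2(x)  ((x) >> 1)
#define POW2(n)  (1u << (n))

/* Advance a bit-reversed counter for a size-N FFT (N a power of 2):
   clear leading 1s from the top bit down, then set the first 0.
   Amortized O(1) per address, using only bit operations. */
unsigned next_bitrev(unsigned rev, unsigned N)
{
    unsigned mask = N >> 1;
    while (rev & mask) {
        rev ^= mask;
        mask >>= 1;
    }
    return rev | mask;
}

int main(void)
{
    unsigned N = POW2(3), rev = 0;   /* prints 0,4,2,6,1,5,3,7 */
    for (unsigned i = 0; i < N; i++) {
        printf("%u -> %u\n", i, rev);
        if (i + 1 < N)
            rev = next_bitrev(rev, N);
    }
    return 0;
}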

Previously…

Now…

Improvement
□ For larger input sizes, our program (radix-2) is comparable to FFTW
□ Our program might surpass FFTW by:
  □ Using SIMD
  □ Using a higher radix (e.g. 4, 8, 16)
  □ Coding in C

Redefining the execution time
□ For p processors, the total execution time is: T_N/p + (1 − 1/p)(2N/B + K_N)
□ p is a power of 2
□ This assumes the "RAM model": a flat memory address space with unit-cost access to any memory location
□ We did not take the memory hierarchy into account
□ E.g. matrix multiplication actually takes O(n^5) instead of the expected O(n^3) [Alpern et al. 1994]

Redefining the execution time: some observations
□ If the number of processors is p, each processor actually computes FFT(N/p), so the time taken is T_{N/p} and NOT T_N/p
□ The time taken to combine (O(n) in the RAM model) should be taken as Σ_{i=1}^{log p} K_{N/2^i}
□ Synchronization time is not included
□ We are currently looking at execution time only from the perspective of the master processor
□ The overheads for establishing sends and receives have been neglected (measured with a ping-pong approach, this time was negligible)

New Theoretical Formula
□ Time taken for parallel execution with p processors: T_{N/p} + (1 − 1/p)(2N/B) + Σ_{i=1}^{log p} K_{N/2^i} (see the sketch below)
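A minimal sketch, in C, of evaluating this formula from measured inputs; T_{N/p} and the K_{N/2^i} are supplied by the caller, since (per the Inference slide) we have no closed forms for them yet, and all names are ours:

#include <math.h>

/* t_fft_Np : measured T_{N/p}
   k        : k[i] = measured K_{N/2^i}, for i = 1 .. log2(p)
   N, B, p  : input size, transfer block size, processor count */
double model_time(double t_fft_Np, const double *k,
                  double N, double B, int p)
{
    /* T_{N/p} + (1 - 1/p)(2N/B) + sum of combine costs */
    double t = t_fft_Np + (1.0 - 1.0 / p) * (2.0 * N / B);
    int logp = (int) lround(log2((double) p));  /* p is a power of 2 */
    for (int i = 1; i <= logp; i++)
        t += k[i];
    return t;
}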

Execution Time:

Input (size elided in transcription), p=2: [timing diagram — P1/P2 event timeline of Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine; timestamps partially lost (recoverable values: T=0, 20.865, 26.591, 35.808, 35.555)]

Load Distribution: Processor 1

Load Distribution: Processor 2

Input (size elided in transcription), p=4: [timing diagram — P1–P4 event timeline of Send(2), Send(3), Send(4), Recv(·), FFT(N/4), Combine steps; timestamps largely lost (T=0 … T≈40.120)]

Load Distribution: Processor 1

Load Distribution: Processor 2

Load Distribution: Processor 3

Load Distribution: Processor 4

Execution Time:

Input (size elided in transcription), p=2: [timing diagram — same event timeline as above; timestamps lost in transcription]

Load Distribution: Processor 1

Load Distribution: Processor 2

Input (size elided in transcription), p=4: [timing diagram — same P1–P4 event timeline as above; timestamps largely lost (recoverable values include T=0, 70.881, 91.281, 97.896)]

Load Distribution: Processor 1

Load Distribution: Processor 2

Load Distribution: Processor 3

Load Distribution: Processor 4

Execution Time:

Input (size elided in transcription), p=2: [timing diagram — same event timeline as above; timestamps lost in transcription]

Load Distribution: Processor 1

Load Distribution: Processor 2

Input (size elided in transcription), p=4: [timing diagram — same P1–P4 event timeline as above; timestamps lost in transcription]

Load Distribution: Processor 1

Load Distribution: Processor 2

Load Distribution: Processor 3

Load Distribution: Processor 4

Inference
□ The idle time is very small (for processor 1)
□ The theoretical model matches the actual results
□ But we need to find closed-form expressions for T_N and K_N

Calculating T_N and K_N
□ These depend upon:
  □ N: size of the input
  □ A: cache associativity
  □ L: cost incurred for a miss
  □ M: size of the cache
  □ B: number of bytes it can transfer at a time

Contd…
□ Cache profilers give us the number of references made to each level of the cache, along with the number of misses
□ We have this table (computed over the summer)
□ We can multiply the total numbers of references and misses by the number of cycles each takes, to get an actual figure (see the sketch below)
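A hypothetical sketch of that accounting in C; the two-level structure and all cycle costs are illustrative assumptions, not the project's measured table:

#include <stdio.h>

typedef struct {
    unsigned long refs;     /* references seen at this cache level */
    unsigned long misses;   /* misses seen at this cache level     */
    unsigned hit_cycles;    /* cycles per hit (assumed)            */
    unsigned miss_cycles;   /* extra cycles per miss (assumed)     */
} cache_level;

/* Weight each level's reference and miss counts by their cycle
   costs and sum, as described on the slide above. */
unsigned long long estimate_cycles(const cache_level *lv, int n)
{
    unsigned long long total = 0;
    for (int i = 0; i < n; i++)
        total += (unsigned long long) lv[i].refs   * lv[i].hit_cycles
               + (unsigned long long) lv[i].misses * lv[i].miss_cycles;
    return total;
}

int main(void)
{
    cache_level lv[2] = {
        { 1000000, 50000, 1,  10 },   /* L1: made-up numbers */
        {   50000,  5000, 5, 100 },   /* L2: made-up numbers */
    };
    printf("estimated cycles: %llu\n", estimate_cycles(lv, 2));
    return 0;
}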

Theoretical Verification
□ S. Sen et al., "Towards a Theory of Cache-Efficient Algorithms"
□ It gives a formal method for analyzing algorithms in the cache model (taking the multi-level memory hierarchy into account)
□ We are still reading it

Modeling using BSP
□ The BSP (Bulk Synchronous Parallel) model treats the whole job as a series of supersteps
□ In each superstep, all processors do local computation and send messages to other processors; these messages are not available until the next synchronization has finished

Modeling using BSP
□ The BSP model uses the following parameters:
  □ p — the number of processors (a power of 2 for us)
  □ w_t — the maximum local work performed by any processor
  □ L — the time the machine needs for a barrier synchronization (determined experimentally)
  □ g — the network bandwidth inefficiency (the reciprocal of B, determined experimentally)
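As a point of reference (not from the slides): in Valiant's BSP model, cited in the references, the cost of superstep t in these parameters is w_t + g·h_t + L, where h_t is the maximum number of data words any one processor sends or receives in that superstep; the step-by-step costs that follow instantiate this pattern for the FFT.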

Modeling using BSP: [superstep diagram — the p=4 run drawn as BSP steps 0–6 separated by barriers: two rounds of Send/Recv, FFT(N/4) on all four processors, then alternating Send/Recv and Combine steps on P1 and P2]

Execution time
□ Initial barrier: L
□ Step 0: L + max(time(Send(2)), time(Recv(1)))
□ Step 1: L + max(time(Send(3)), time(Send(4)), time(Recv(1)), time(Recv(2)))
□ Step 2: L + max_{0 <= i <= p−1} time(FFT_i(N/p))
□ Step 3: L + max(time(Send(2)), time(Send(1)), time(Recv(3)), time(Recv(4)))
□ Step 4: L + max_{i ∈ {1,2}} time(combine_i(N/4))
□ Step 5: L + max(time(Send(1)), time(Recv(2)))
□ Step 6: L + time(combine(N/2))

Generalizing this for p processors (see the sketch below):

event(t)              range of t
communications        0 <= t < log p
compute FFT(N/p)      t = log p
communications        log p < t <= 3 log p, t − log p odd
combine FFTs          log p < t <= 3 log p, t − log p even
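A short C sketch that just enumerates this schedule, mirroring the table row by row (p and the printed labels are illustrative):

#include <stdio.h>

int main(void)
{
    int p = 4;                       /* illustrative; a power of 2 */
    int logp = 0;
    while ((1 << logp) < p)
        logp++;                      /* log2(p) */

    for (int t = 0; t <= 3 * logp; t++) {
        if (t < logp)
            printf("step %d: communications (distribute halves)\n", t);
        else if (t == logp)
            printf("step %d: compute FFT(N/p) locally\n", t);
        else if ((t - logp) % 2 == 1)
            printf("step %d: communications (return results)\n", t);
        else
            printf("step %d: combine FFTs\n", t);
    }
    return 0;
}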

For t < log p
□ At step t there are 2^t sends and 2^t receives
□ Let time(send(N, i)) denote the time taken to send N data points to processor i, and time(recv(N, j)) the time taken to receive N data points from processor j
□ Total time taken for this group:
  Σ_{t=0}^{log p − 1} max{ time(send(N/2^{t+1}, j−1)), time(recv(N/2^{t+1}, i−1)) : 0 < j <= 2^t, 2^t < i <= 2^{t+1} } + L·log p

t = log p
□ Let time(FFT_i(N/p)) denote the time taken to compute an FFT of size N/p on processor i
□ Thus, the time taken to calculate the FFTs of size N/p is max_{0 <= i <= p−1} time(FFT_i(N/p)) + L

For t > log p (t − log p odd)
□ Time is spent only on communications
□ Total time:
  Σ_{t=log p + 1}^{3 log p − 1} max{ time(send(N/h, j−1)), time(recv(N/h, i−1)) : 0 < j <= h/2, h/2 < i <= h } + L·log p   (sum over steps with t − log p odd)
□ where h = 2^{⌊|(t − 3 log p)/2|⌋ + 1}, with |·| denoting absolute value and ⌊·⌋ the greatest-integer (floor) function

For t > log p (t − log p even)
□ Time is spent only on combining; let time(combine_i(N)) denote the time to combine on processor i
□ Total time:
  Σ_{t=log p + 2}^{3 log p} max_{0 < i <= h} time(combine_i(N/2h)) + L·log p − L   (sum over steps with t − log p even)
□ where h = 2^{⌊|(t − 3 log p)/2|⌋ + 1}, as before

Execution Time
□ The total time is the sum of all the above steps
□ In general, there are 3 log p steps
□ The actual time depends upon how well a particular part of the program is scheduled on a particular processor, i.e. the processing time can vary

Further Work
□ Formalize the BSP model for p divisions
□ Combine in place (using realloc)
□ Compare our parallel FFT against parallel FFTW

References
□ S. Sen, S. Chatterjee, N. Dumir. Towards a Theory of Cache-Efficient Algorithms, 2000.
□ Michael J. Quinn. Parallel Programming in C with MPI and OpenMP.
□ L. G. Valiant. A bridging model for parallel computation.

Thank You