Parallel-FFT’s in 3D: Testing different implementation schemes
Luis Acevedo-Arreguin, Benjamin Byington, Erinna Chen, Adrienne Traxler
Department of Applied Mathematics and Statistics, UC Santa Cruz

Acknowledgments: Thanks to Professor Nicholas Brummell from UC Santa Cruz for his help on FFTs after class, to Professor James Demmel from UC Berkeley for his teaching on parallel computing, and to Vasily Volkov for his feedback on our homework.

Introduction

Motivation

Many problems in geophysics or astrophysics can be modeled using the Navier-Stokes equations, e.g. Earth’s magnetic field, mantle convection, the solar convection zone, and ocean circulation. Various numerical methods can be employed to solve these problems. Because of geometry and symmetry, many numerical codes employ a spectral method to solve the fluid dynamics.

Solutions can be written as sums of periodic functions and their corresponding weights (spectral coefficients). This provides solutions that are continuous (solutions can be obtained at positions that are not on the grid).

Particularly challenging computationally is the calculation of the non-linear terms in the equations. Using spectral coefficients, the calculation of non-linear terms for a 3-D space is O(n^3). If non-linear terms are calculated in grid (real) space, calculations are reduced to O(n^2). Fast Fourier transforms (FFTs), at O(n^2 log n), make it numerically tractable to perform frequent transforms from real to spectral space (forward) and from spectral to real space (backward).

Figure 1: (left) Magnetic field polarity reversal caused by fluid motion in the Earth’s core (G. Glatzmaier, UCSC). (right) Layering phenomenon caused by double-diffusion similar to thermohaline staircases in the ocean (S. Stellmach, UCSC).

The Parallel Problem

If the code is serial: FFTW takes all the work out of optimizing the FFT.

In general: We want large domains to resolve the fluid dynamics for realistic geophysical/astrophysical constants, so we use large distributed clusters with large domains. We would like to use FFTW because it is the fastest algorithm (in the West); autotuning is done for us.

Multiple schemes can be employed:
Transpose-based: Data in a single direction is in-processor; once the FFT is completed, the domain is re-decomposed in another direction.
Parallel: Leave data in place and send pieces to other processors as necessary.

Which is faster? Not obvious at the outset…

1. A naïve parallel-FFT

The data are decomposed into vectors in the three directions. Each vector, whether a row, column, or stack, is sent to a slave processor. A master processor distributes tasks to each processor, receives the 1D FFT performed on each vector by the slave processors, and sends more work to those processors that have finished their previous assignments. This scheme aims to compute 3D FFTs of datasets that exhibit clustering, i.e. non-uniform density, which consequently requires more intense computational work in some areas than in others. This scheme is tested on NERSC Franklin by using module fftw

Figure 2: Schematic representation of a naïve implementation of a 3D FFT. Rows, columns, and finally stacks are sent to slave processors, which perform 1D FFTs on the received vectors.

2. Transpose-Based parallel-FFT’s Schemes

The transpose-based parallel-FFT uses FFTW to perform a 2-D FFT (optimized in-processor implementation) and then redistributes all data for the FFT in the third dimension. The optimal communication scheme for transposing the third dimension to perform the final FFT is not clear.

a) Complete all 2-D planes in-processor, call MPI_ALLTOALL to transpose the data, block until the entire transformed data is obtained, perform the final FFT.
b) Complete a single 2-D plane in-processor, call MPI_ALLTOALL to transpose each plane (overlapping communication with computation?), block until the entire transformed data is obtained, perform the final FFT.
c) Complete a single 2-D plane in-processor, use an explicit asynchronous communication scheme, perform the final FFT as data becomes available.

Figure 3: The starting setup: Data is initially decomposed into a stack of x-y planes, so each processor can do a local FFT in two dimensions. The data is then transposed so that each processor now holds a set of y-z slices, allowing the final transform (in the z direction) to be completed. For the reverse procedure, a similar process (transform, transpose, complete transform) is required.

3. Parallel version of FFTW 3.2, alpha

The most recent version of FFTW includes an implementation of a 3-D parallel FFT. It is unclear how this is implemented: whether it completes a fully parallel FFT (3-D domain decomposition) or implements a transpose-based FFT. However, if it autotunes the implementation of the 3-D parallel FFT, it allows for greater portability and hides cluster topology. It potentially provides a parallel implementation where the end user needs no knowledge of parallelism in order to calculate the FFT.

Case for serial FFT: use FFTW on any computer.
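The naïve scheme, in which a master hands out individual rows, columns, and stacks and the slaves return 1D FFTs, can be sketched on a single machine. This is only an illustration under stated assumptions, not the poster's code: a thread pool stands in for the MPI master and slave processors, `fft1d` is a plain radix-2 FFT standing in for FFTW, the array is cubic with a power-of-two side, and the names `fft1d` and `naive_fft3d` are hypothetical.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def naive_fft3d(data, workers=4):
    """3D DFT of data[z][y][x] built from three passes of 1D FFTs."""
    n = len(data)
    a = [[list(row) for row in plane] for plane in data]  # working copy
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Pass 1: rows (vectors along x). The pool plays the master:
        # an idle worker picks up the next vector as soon as it is free.
        keys = [(z, y) for z in range(n) for y in range(n)]
        vecs = [a[z][y] for z, y in keys]
        for (z, y), v in zip(keys, pool.map(fft1d, vecs)):
            a[z][y] = v
        # Pass 2: columns (vectors along y).
        keys = [(z, x) for z in range(n) for x in range(n)]
        vecs = [[a[z][y][x] for y in range(n)] for z, x in keys]
        for (z, x), v in zip(keys, pool.map(fft1d, vecs)):
            for y in range(n):
                a[z][y][x] = v[y]
        # Pass 3: stacks (vectors along z).
        keys = [(y, x) for y in range(n) for x in range(n)]
        vecs = [[a[z][y][x] for z in range(n)] for y, x in keys]
        for (y, x), v in zip(keys, pool.map(fft1d, vecs)):
            for z in range(n):
                a[z][y][x] = v[z]
    return a
```

Because every vector is an independent task, workers that finish early simply take more vectors, which is the load-balancing property the scheme relies on for clustered data. The cost is that every vector crosses the master twice per pass.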
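Option a) of the transpose-based scheme (transform all local x-y planes, exchange data all-to-all, then finish in z) can be sketched without MPI by simulating the ranks with Python lists. This is a hedged illustration: the exchange loop only mimics what an all-to-all call would move between processors, `fft1d` again stands in for FFTW, the grid is cubic with a power-of-two side divisible by the number of ranks, and every function name here is hypothetical.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2d_plane(plane):
    """2D FFT of one x-y plane stored as plane[y][x]."""
    rows = [fft1d(r) for r in plane]             # transform along x
    cols = [fft1d(list(c)) for c in zip(*rows)]  # transform along y
    return [list(r) for r in zip(*cols)]         # back to [y][x] layout


def transpose_fft3d(data, nprocs):
    """3D DFT of data[z][y][x] using a slab decomposition and one transpose."""
    n = len(data)
    chunk = n // nprocs
    # Step 1: rank p owns planes z in [p*chunk, (p+1)*chunk) and
    # transforms them locally in two dimensions.
    local = [[fft2d_plane(data[z]) for z in range(p * chunk, (p + 1) * chunk)]
             for p in range(nprocs)]
    # Step 2: simulated all-to-all. Rank p sends rank q the block with
    # x in [q*chunk, (q+1)*chunk), so rank q ends up with y-z slices.
    recv = [[None] * nprocs for _ in range(nprocs)]  # recv[q][p]
    for p in range(nprocs):
        for q in range(nprocs):
            recv[q][p] = [[row[q * chunk:(q + 1) * chunk] for row in plane]
                          for plane in local[p]]
    # Step 3: rank q assembles full z-columns and finishes with 1D FFTs.
    # The result is gathered into one array here only for inspection.
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for q in range(nprocs):
        for xl in range(chunk):
            x = q * chunk + xl
            for y in range(n):
                col = [recv[q][p][zl][y][xl]
                       for p in range(nprocs) for zl in range(chunk)]
                for kz, v in enumerate(fft1d(col)):
                    out[kz][y][x] = v
    return out
```

Calling `transpose_fft3d(data, nprocs)` with different values of `nprocs` should give the same coefficients, since only the data layout changes. Options b) and c) differ in when the exchange in step 2 is started relative to step 1, not in the arithmetic.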
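The motivation for frequent forward and backward transforms can be made concrete in one dimension. A product of two fields is a pointwise multiplication in grid (real) space, but in spectral space it becomes a sum over pairs of coefficients. The sketch below compares the two routes; it assumes periodic data, applies no dealiasing, and uses a hand-written radix-2 FFT rather than FFTW. All names are hypothetical.

```python
import cmath


def fft1d(x, sign=-1):
    """Radix-2 FFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2], sign)
    odd = fft1d(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft1d(X):
    """Backward transform (spectral to real space), scaled by 1/n."""
    n = len(X)
    return [v / n for v in fft1d(X, sign=+1)]


def product_in_spectral_space(U, V):
    """Coefficients of u*v from a circular convolution: n^2 multiplications in 1D."""
    n = len(U)
    return [sum(U[m] * V[(k - m) % n] for m in range(n)) / n for k in range(n)]


def product_via_real_space(U, V):
    """Same coefficients: backward FFTs, pointwise product, forward FFT."""
    u = ifft1d(U)
    v = ifft1d(V)
    return fft1d([a * b for a, b in zip(u, v)])
```

Both functions return the same coefficients up to rounding error. The second route does n multiplications for the product itself and pays for three FFTs, which is the trade the spectral codes described above make at every time step.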