2009/4/21 Third French-Japanese PAAP Workshop 1 A Volumetric 3-D FFT on Clusters of Multi-Core Processors Daisuke Takahashi University of Tsukuba, Japan

2009/4/21 Third French-Japanese PAAP Workshop 2 Outline –Background –Objectives –Approach –3-D FFT Algorithm –Volumetric 3-D FFT Algorithm –Performance Results –Conclusion

2009/4/21 Third French-Japanese PAAP Workshop 3 Background The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering. Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied. November 2008 TOP500 Supercomputing Sites –Roadrunner: 1,105 TFlops (129,600 cores) –Jaguar (Cray XT5 QC 2.3 GHz): 1,059 TFlops (150,152 cores) Recently, the number of cores has kept increasing.

2009/4/21 Third French-Japanese PAAP Workshop 4 Background (cont'd) A typical decomposition for performing a parallel 3-D FFT is slabwise. –A 3-D array is distributed along the third dimension. –The length of the third dimension must be greater than or equal to the number of MPI processes (e.g., a 256x256x256 FFT can use at most 256 MPI processes). This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors.

2009/4/21 Third French-Japanese PAAP Workshop 5 Related Work Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05] –Based on a volumetric decomposition of data. –Scales well up to 1,024 nodes for 3-D FFTs of size 128x128x128. 3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07] –3-D FFTs of size 128x128x128 can scale well on QCDOC up to 4,096 nodes.

2009/4/21 Third French-Japanese PAAP Workshop 6 Objectives Implementation and evaluation of a highly scalable 3-D FFT on a massively parallel cluster of multi-core processors. Reduce the communication time for larger numbers of MPI processes. A comparison between the 1-D and 2-D distributions for the 3-D FFT.

2009/4/21 Third French-Japanese PAAP Workshop 7 Approach Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05; Fang et al. 07] use a 3-D distribution for the 3-D FFT. –These schemes require three all-to-all communications. We use a 2-D distribution for the volumetric 3-D FFT. –It requires only two all-to-all communications.

2009/4/21 Third French-Japanese PAAP Workshop 8 3-D FFT The 3-D discrete Fourier transform (DFT) is given, in standard notation, by
y(k_1,k_2,k_3) = \sum_{j_3=0}^{n_3-1} \sum_{j_2=0}^{n_2-1} \sum_{j_1=0}^{n_1-1} x(j_1,j_2,j_3)\, \omega_{n_1}^{j_1 k_1} \omega_{n_2}^{j_2 k_2} \omega_{n_3}^{j_3 k_3}, \quad \omega_{n_r} = e^{-2\pi i/n_r}.

2009/4/21 Third French-Japanese PAAP Workshop 9 1-D distribution along z-axis 1. FFTs in x-axis 2. FFTs in y-axis 3. FFTs in z-axis (with a slab decomposition)
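A minimal NumPy sketch (not from the slides; the problem size is arbitrary) of the separability that slides 9 and 10 rely on: applying 1-D FFTs along the x-, y-, and z-axes in turn reproduces the full 3-D DFT.

import numpy as np

# Arbitrary (hypothetical) problem size, for illustration only.
n1, n2, n3 = 8, 8, 8
x = np.random.rand(n1, n2, n3) + 1j * np.random.rand(n1, n2, n3)

y = np.fft.fft(x, axis=0)   # 1. FFTs in x-axis
y = np.fft.fft(y, axis=1)   # 2. FFTs in y-axis
y = np.fft.fft(y, axis=2)   # 3. FFTs in z-axis

# The three 1-D passes agree with the library's full 3-D transform.
assert np.allclose(y, np.fft.fftn(x))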

2009/4/21 Third French-Japanese PAAP Workshop 10 2-D distribution along y- and z-axes 1. FFTs in x-axis 2. FFTs in y-axis 3. FFTs in z-axis (with a volumetric domain decomposition)
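A minimal mpi4py sketch (a hypothetical illustration with dummy data, not the author's FFTE code) of the communication pattern the 2-D distribution implies: the MPI processes form a Py x Pz grid, and each of the two transposes is an all-to-all inside a row or column sub-communicator, so only two all-to-all phases are needed in total.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()
rank = comm.Get_rank()

# Choose a Py x Pz process grid with P = Py * Pz (hypothetical layout).
Py = int(np.sqrt(P))
while P % Py != 0:
    Py -= 1
Pz = P // Py
py, pz = rank % Py, rank // Py

# Row and column sub-communicators of the process grid.
comm_y = comm.Split(color=pz, key=py)   # Py processes along the y-direction
comm_z = comm.Split(color=py, key=pz)   # Pz processes along the z-direction

# Dummy local block standing in for this process's piece of the 3-D array.
a = np.full(4 * Py * Pz, rank, dtype=np.complex128)
b = np.empty_like(a)

# Transpose 1: Pz simultaneous all-to-alls among Py processes each
# (between the x-axis and y-axis FFT stages).
comm_y.Alltoall(a, b)

# Transpose 2: Py simultaneous all-to-alls among Pz processes each
# (between the y-axis and z-axis FFT stages).
comm_z.Alltoall(b, a)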

2009/4/21 Third French-Japanese PAAP Workshop 11 Communication time of 1-D distribution Let us assume, for an n-point FFT: –Latency of communication (sec) –Bandwidth (Byte/s) –The number of processors The 1-D distribution requires one all-to-all communication among all processors; an estimate of its cost is sketched below.
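The formula is not reproduced above; the following is a standard estimate rather than the slide's own expression. Writing L for the latency (sec), W for the bandwidth (Byte/s) and P for the number of processors (symbols chosen for this sketch), and assuming 16-byte double-complex elements, each processor exchanges n/P^2 elements with each of the other P-1 processors in the single all-to-all, giving

T_{1\text{-}D} \approx (P-1)\left(L + \frac{16\,n}{P^{2}\,W}\right) \text{ (sec).}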

2009/4/21 Third French-Japanese PAAP Workshop 12 Communication time of 2-D distribution The 2-D distribution requires two all-to-all communication phases: –Simultaneous all-to-all communications among the processors aligned along the y-axis. –Simultaneous all-to-all communications among the processors aligned along the z-axis. An estimate of their cost is sketched below.
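Under the same assumptions as the 1-D estimate, with the P processors arranged as a Py x Pz grid (P = Py Pz, a layout assumed for this sketch), the two phases cost roughly

T_{2\text{-}D} \approx (P_y-1)\left(L + \frac{16\,n}{P_y^{2} P_z\,W}\right) + (P_z-1)\left(L + \frac{16\,n}{P_y P_z^{2}\,W}\right) \text{ (sec),}

which for a square grid P_y = P_z = \sqrt{P} becomes 2(\sqrt{P}-1)\left(L + \frac{16\,n}{P^{3/2}\,W}\right).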

2009/4/21 Third French-Japanese PAAP Workshop 13 Comparing communication time Comparing the communication time of the 1-D distribution with that of the 2-D distribution, the communication time of the 2-D distribution is less than that of the 1-D distribution for larger numbers of processors and larger latency.
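Concretely, under the hedged estimates above, the latency terms are (P-1)L for the 1-D distribution and about 2(\sqrt{P}-1)L for the 2-D distribution, so the 2-D distribution wins whenever the all-to-all cost is latency-dominated (many processes and therefore small messages). Conversely, the 1-D distribution moves the array through the network once rather than twice, which is why it remains competitive at smaller process counts (see Discussion (2/2)).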

2009/4/21 Third French-Japanese PAAP Workshop 14 Performance Results To evaluate parallel 3-D FFTs, we compared –the 1-D distribution –the 2-D distribution for a smaller and a larger FFT size on 1 to 4,096 cores. Target parallel machine: –T2K-Tsukuba system (256 nodes, 4,096 cores). –The flat MPI programming model was used. –MVAPICH was used as the communication library. –The compiler used was the Intel Fortran compiler 10.1.

2009/4/21 Third French-Japanese PAAP Workshop 15 T2K-Tsukuba System Specification –The number of nodes: 648 (Appro Xtreme-X3 Server) –Theoretical peak performance: 95.4 TFlops –Node configuration: 4 sockets of quad-core AMD Opteron 8356 (Barcelona, 2.3 GHz) –Total main memory size: 20 TB –Network interface: DDR InfiniBand Mellanox ConnectX HCA x 4 –Network topology: Fat tree –Full-bisection bandwidth: 5.18 TB/s
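As a quick consistency check, assuming the usual 4 double-precision flops per cycle per Barcelona core (an assumption not stated on the slide):

648 \text{ nodes} \times 16 \text{ cores/node} \times 2.3 \text{ GHz} \times 4 \text{ flops/cycle} \approx 95.4 \text{ TFlops,}

which matches the quoted theoretical peak.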

2009/4/21 Third French-Japanese PAAP Workshop 16 Computation Node of T2K-Tsukuba (block diagram): a quad-socket Opteron node with 2 GB 667 MHz DDR2 DIMM x 4 per socket, 8 GB/s (full-duplex) HyperTransport links between sockets, NVIDIA nForce 3050/3600 bridges providing PCI-Express x16/x8, PCI-X, SAS and USB, and Mellanox MHGH28-XTC ConnectX HCAs (1.2 µs MPI latency, 4X DDR 20 Gb/s) attached over 4 GB/s (full-duplex) links.

2009/4/21 Third French-Japanese PAAP Workshop 17

2009/4/21 Third French-Japanese PAAP Workshop 18 Discussion (1/2) For the smaller FFT size, we can clearly see that communication overhead dominates the execution time. –In this case, the total working set size is only 1 MB. On the other hand, the 2-D distribution scales well up to 4,096 cores for the larger FFT size. –Performance on 4,096 cores is over 401 GFlops, about 1.1% of theoretical peak. –Performance excluding the all-to-all communications is over 10 TFlops, about 26.7% of theoretical peak.
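These figures are mutually consistent under the same 4-flops-per-cycle assumption used above: 4,096 cores give a peak of 4,096 x 2.3 GHz x 4 ≈ 37.7 TFlops, so 401 GFlops is about 1.1% of peak, and 26.7% of peak is about 10.1 TFlops.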

2009/4/21 Third French-Japanese PAAP Workshop 19

2009/4/21 Third French-Japanese PAAP Workshop 20 Discussion (2/2) For smaller numbers of cores, the performance of the 1-D distribution is better than that of the 2-D distribution. –This is because the total communication amount of the 1-D distribution is half that of the 2-D distribution. However, for larger numbers of cores, the performance of the 2-D distribution is better than that of the 1-D distribution due to the latency.

2009/4/21 Third French-Japanese PAAP Workshop 21

2009/4/21 Third French-Japanese PAAP Workshop 22 Conclusions We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors. We showed that a 2-D distribution improves performance effectively by reducing the communication time for larger numbers of MPI processes. The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors. We successfully achieved a performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores for the larger FFT size.