The first-generation Cell Broadband Engine (BE) processor is a multi-core chip comprised of a 64-bit Power Architecture processor core and eight synergistic.

Presentation transcript:

The first-generation Cell Broadband Engine (Cell BE) processor is a multicore chip comprising a 64-bit Power Architecture processor core and eight synergistic processor cores. It is capable of massive floating-point processing and is optimized for compute-intensive workloads and broadband rich-media applications.

Dense matrix multiplication is one of the most common and important numerical operations. The Cell BE excels at compute-intensive workloads such as single-precision matrix multiplication thanks to its powerful SIMD capabilities.
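
As a rough illustration of how dense matrix multiplication is usually organized for processors with small, fast local memories, the sketch below multiplies two matrices tile by tile. It is plain Python rather than SIMD code, and the function name `blocked_matmul` and the `block` parameter are assumptions made for this example, not names from the presentation.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (n x k) by b (k x m) one tile at a time.

    Illustrative only: a tuned kernel would keep each tile in fast
    local memory and use SIMD instructions for the inner loop.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # tile rows of C
        for j0 in range(0, m, block):        # tile columns of C
            for p0 in range(0, k, block):    # tiles along the shared dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    print(blocked_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

The tile size would normally be chosen so that three tiles fit in the local memory of the target core; the value 2 here is only for demonstration.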

Computational micro-kernels are architecture-specific code. Systematic analysis of the problem, combined with exploitation of low-level features of the Cell BE's synergistic processing unit, leads to dense matrix multiplication kernels that achieve peak performance.

Highly optimized Cell BE implementations of two classic dense linear algebra computations are introduced: Cholesky factorization and QR factorization.
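
For readers unfamiliar with the first of these factorizations, here is a minimal, unoptimized sketch of Cholesky factorization (A = L Lᵀ for a symmetric positive definite A). It is a textbook column-by-column version written for this transcript, not the optimized Cell BE implementation; the function name `cholesky` is an assumption.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    Assumes a is symmetric positive definite; raises ValueError otherwise.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what remains of a[j][j] after earlier columns.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l


if __name__ == "__main__":
    for row in cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]]):
        print(row)
```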

Work has been done to show that a single chip can deliver high performance for compute-intensive scientific workloads by combining short-vector single instruction, multiple data (SIMD) processing with a multicore architecture. The SPEs allow the implementation of complex synchronization mechanisms and task-level parallelism.

Hybrid GPU-based multicore platforms, which combine homogeneous multicores with GPUs, offer an effective answer to growing power demands and to the gap between compute and communication speeds. This is the direction GPUs are taking, and hybrid combinations of GPUs with homogeneous multicores are valued because they can hold clock frequency steady while increasing the number of cores, and because they provide data parallelism and high bandwidth.

Dense linear algebra algorithms for GPUs are developed as hybrid algorithms: in general, small, non-parallelizable tasks are executed on the CPU and data-parallel tasks are executed on the GPU. The approach uses CUDA to develop the low-level kernels, together with high-level libraries such as LAPACK and BLAS.

An approach to developing high-performance BLAS for GPUs is presented; such a BLAS is essential to enable GPU-based hybrid approaches in dense linear algebra. Two important issues in kernel design, blocking and coalesced memory access, are discussed, along with three optimization techniques for BLAS implementations: pointer redirecting, padding, and auto-tuning.
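
Of the three techniques, padding is the simplest to picture: the matrix is extended with zeros until its dimensions are multiples of the kernel's block size, so a blocked kernel never has to handle partial tiles. The sketch below assumes that interpretation; the helper name `pad_to_multiple` is hypothetical and the code is host-side Python, not a GPU kernel.

```python
def pad_to_multiple(matrix, block):
    """Return a zero-padded copy whose dimensions are multiples of block."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows = -(-rows // block) * block   # ceiling division
    padded_cols = -(-cols // block) * block
    padded = [[0.0] * padded_cols for _ in range(padded_rows)]
    for i in range(rows):
        for j in range(cols):
            padded[i][j] = matrix[i][j]
    return padded


if __name__ == "__main__":
    for row in pad_to_multiple([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2):
        print(row)
```

Because the extra rows and columns are zero, they do not change products such as C = A * B; the result is simply read back from the top-left corner of the padded output.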

Sparse matrix-vector multiplication (SpMV) is an interesting computation because it appears in scientific and engineering applications, financial and economic modeling, and information retrieval. Because architectural designs and input matrix characteristics are so diverse, good performance requires a complex combination of architecture-specific and matrix-specific techniques.
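
To make the operation concrete, the following sketch computes y = A x with A stored in compressed sparse row (CSR) form. CSR is one common storage choice and is an assumption here; the presentation does not say which formats were used. The function and parameter names are likewise illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in values
    row_ptr : row r owns entries row_ptr[r] .. row_ptr[r + 1] - 1
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect read of x
        y.append(acc)
    return y


if __name__ == "__main__":
    # A = [[10, 0, 0], [0, 20, 30], [40, 0, 50]]
    print(spmv_csr([10, 20, 30, 40, 50], [0, 1, 2, 0, 2], [0, 1, 3, 5], [1, 2, 3]))
```

Each nonzero is read once and used in a single multiply-add, and the reads of `x` are indirect, which is why memory traffic rather than arithmetic tends to limit this kernel.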

Performance is compared on different platforms across the suite of matrices. The optimized implementations clearly deliver better performance, and bandwidth is observed to be the determining performance factor.

Accurate simulation of real-world phenomena in computational science is based on a mathematical model consisting of a set of partial differential equations. Finite element methods are considered among the most promising approaches for the numerical treatment of partial differential equations.

Graphics processing units are considered to work well in such cases. Achieving peak performance requires careful selection of data structures and parallelization techniques, especially when combining coarse-grained parallelism at the cluster level with medium- and fine-grained parallelism between CPU cores and within accelerators such as GPUs.

Fine-grained parallelization techniques are applied to robust multigrid solvers, that is, solvers that remain numerically strong on the sparse, ill-conditioned linear systems of equations arising from grid-based discretization techniques such as finite differences, finite volumes, and finite elements.

The parallelization techniques are implemented on graphics processors as representatives of throughput-oriented, wide-SIMD many-core architectures, since GPUs offer a tremendous amount of fine-grained parallelism. NVIDIA CUDA is used, which introduces the concepts of memory coalescing, warps, shared memory, and thread blocks.

An efficient parallel implementation of the fast Fourier transform (FFT) on the Cell BE is designed. The FFT is a fundamental kernel in computationally intensive scientific applications such as computed tomography, data filtering, fluid dynamics, spectral analysis of speech, sonar, radar, seismic and vibration detection, digital filtering, signal decomposition, and solving PDEs.

An iterative approach is used to solve the 1D FFT. It divides the work among the SPEs to parallelize the computation efficiently and requires synchronization among the SPEs after each stage of the FFT. The computation on each SPE is fully vectorized and uses further optimizations such as loop unrolling and double buffering.
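
The stage structure mentioned above can be seen in a standard iterative radix-2 FFT, sketched below. This is the generic textbook algorithm, not the Cell BE design: it runs on one thread, and the comment marking the stage boundary only indicates where a parallel version would need its workers to synchronize. The name `fft_iterative` is an assumption.

```python
import cmath


def fft_iterative(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [0j] * n
    # Bit-reversal permutation so every stage can work in place.
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[rev] = complex(x[i])
    size = 2
    while size <= n:                      # one pass of this loop is one FFT stage
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):   # independent butterfly groups
            w = 1 + 0j
            for k in range(half):
                even = out[start + k]
                odd = out[start + k + half] * w
                out[start + k] = even + odd
                out[start + k + half] = even - odd
                w *= step
        # A parallel version would synchronize its workers here, because
        # the next stage reads values written by other groups.
        size *= 2
    return out


if __name__ == "__main__":
    print(fft_iterative([1, 2, 3, 4]))
```

Within a stage the butterfly groups touch disjoint elements, so they can be split among workers; only the stage boundary requires coordination.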

The FFT can exploit the typical parallel resources of multicore platforms to achieve near-optimal performance, provided designers adopt a systematic approach that takes into account the attributes of both the application and the target system.

A successful implementation relies on a deep understanding of data access patterns, computation properties, and available hardware resources. It can then take advantage of generalized performance planning techniques to succeed across a wide variety of multicore architectures.

Combinatorial algorithms play an important role in scientific computing: in the efficient parallelization of linear algebra, computational physics, and numerical optimization computations; in massive data analysis routines; in systems biology; and in the study of natural phenomena involving networks and complex interactions.

A complexity model that simplifies the design of algorithms on the Cell BE multicore architecture is presented, together with a systematic procedure for evaluating performance. To estimate the execution time of an algorithm, the model considers computational complexity, memory access patterns, and the complexity of branching instructions.

Auto-tuning is applied to the 7-point and 27-point stencils on a wide range of multicore architectures. These chip multiprocessors lie at the extremes of a spectrum of design tradeoffs, ranging from replication of existing core technology to large numbers of simple cores and novel memory hierarchies.
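
A 7-point stencil updates each interior point of a 3D grid from itself and its six face neighbors. The sketch below shows one such sweep in plain Python; the coefficients `alpha` and `beta` and the function name `stencil7` are assumptions chosen for illustration, and boundary points are simply copied through.

```python
def stencil7(grid, alpha=-6.0, beta=1.0):
    """Apply one 7-point stencil sweep to a 3D grid indexed as grid[z][y][x].

    Each interior point becomes alpha * center + beta * (sum of the six
    face neighbors). Boundary points are left unchanged.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (
                    grid[z - 1][y][x] + grid[z + 1][y][x]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                )
                out[z][y][x] = alpha * grid[z][y][x] + beta * neighbors
    return out


if __name__ == "__main__":
    ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
    print(stencil7(ones)[1][1][1])
```

The 27-point variant also reads the edge and corner neighbors of each point. An auto-tuner would typically search over choices such as loop blocking sizes for a loop nest like this one.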

Important aspects are discovering parallelism, selecting among various forms of hardware parallelism, and enabling memory hierarchy optimizations. These are made more challenging by the separate address spaces, software-managed local stores, and NUMA features that appear in multicore systems.

Multicore, many-core, and heterogeneous microarchitectures are very important in the hardware landscape. Specialized processing units such as commodity graphics processing units have proved to be compute accelerators capable of solving specific scientific problems orders of magnitude faster than conventional CPUs.

Hyperthermia is a relatively new treatment modality used as a complementary therapy to radiotherapy or chemotherapy. Here we study the optimization of a computational kernel from a biomedical application, hyperthermia cancer treatment, on NVIDIA graphics processing units.

The implementation and results of two bioinformatics applications are presented: FASTA, for the Smith-Waterman kernel, and ClustalW. The results show that the Cell BE is an attractive avenue for bioinformatics applications. The Cell BE is considered a power-efficient platform provided that its total power consumption is lower than that of a superscalar processor.
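
The Smith-Waterman kernel computes the best local alignment score between two sequences by dynamic programming. The sketch below is a minimal score-only version with a linear gap penalty; the scoring values and the name `smith_waterman_score` are assumptions for illustration and are not taken from the FASTA implementation discussed in the presentation.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming table, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

Each cell depends on its left, upper, and upper-left neighbors, so cells along an anti-diagonal are independent of one another; that independence is what vectorized implementations usually exploit.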

An implementation of ClustalW on the Cell BE that uses software caches inside the SPEs for data movement is also described. Using software caches improves programmer productivity without a major decrease in performance.

Efficient and scalable strategies for orchestrating all-pairs computations on the Cell architecture are described, based on decomposition of the computations and of the input entries. The general case is to schedule the computations on the Cell processor; the strategies are then extended to cases where the number of input entries is large and where individual entries are too large to fit within the memory limits of the SPEs.
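
One simple way to picture such a decomposition is to split the input entries into tiles and process one pair of tiles at a time, so that only two tiles need to be resident in a small local memory at once. The sketch below assumes that scheme; it is not the scheduling strategy from the presentation, and the names `all_pairs_tiled`, `pair_fn`, and `tile` are hypothetical.

```python
def all_pairs_tiled(items, pair_fn, tile=2):
    """Apply pair_fn to every unordered pair (i < j) of items, tile by tile.

    Only two tiles are touched at a time, mimicking a worker whose local
    memory can hold just a few input entries.
    """
    n = len(items)
    results = {}
    for a0 in range(0, n, tile):
        block_a = items[a0:a0 + tile]        # would be fetched into local memory
        for b0 in range(a0, n, tile):
            block_b = items[b0:b0 + tile]    # second resident tile
            for i, x in enumerate(block_a, start=a0):
                for j, y in enumerate(block_b, start=b0):
                    if i < j:                # each unordered pair exactly once
                        results[(i, j)] = pair_fn(x, y)
    return results


if __name__ == "__main__":
    dist = all_pairs_tiled([1, 4, 9, 16, 25], lambda x, y: abs(x - y))
    print(len(dist), dist[(0, 4)])
```

Each pair of tiles is an independent unit of work, so tile pairs could be handed to different workers; handling entries that are individually too large for local memory would need a further split within each entry, which this sketch does not attempt.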

The performance results show that the Cell processor is a good platform for accelerating many kinds of applications that involve pairwise computations. The all-pairs strategies can be applied to applications from a wide range of areas that require such computations.

The main applications of drug design are outlined, and two practical case studies are discussed: FTDock, a docking application, and Moldy, a molecular dynamics application. The advantages of using the Cell BE in drug design are noted.

For FTDock, a 3x speedup is achieved over a parallel version running on a POWER5 multicore system with two 1.5 GHz POWER5 chips and 16 GB of RAM. Moldy on the Cell BE consumes less power than, and takes the same time as, an MPI parallelization on four Itanium Montecito processors of an SGI Altix 4700.

GPUs are parallel computing devices capable of accelerating a wide variety of data-parallel algorithms. Their tremendous computing capabilities help accelerate molecular modeling applications, enabling molecular dynamics simulations and their analyses to run much faster than before and allowing the use of scientific techniques that are impractical on conventional hardware platforms.

The most computationally expensive algorithms used in molecular modeling are presented, along with an explanation of how they may be reformulated as arithmetic-intensive, data-parallel algorithms capable of achieving high performance on GPUs. In the coming years, we expect GPU hardware architecture to continue to evolve rapidly and become increasingly sophisticated.

Biomedical applications are an important focus for high-performance computing (HPC) researchers. Accelerators, with their low cost and high performance, are a possible solution for delivering that performance.

It is clear that the dataflow programming model and its associated runtime systems can, at multiple application and hardware granularities, ease the implementation of challenging biomedical applications on these types of computational resources. The GPU is designed to deliver maximum performance through its SIMD architecture.

The Charm++ parallel programming model and runtime system are extended to support accelerators and heterogeneous clusters that include accelerators. Several extensions to the Charm++ programming model are presented, including a SIMD instruction abstraction, accelerated entry methods, and accelerated blocks.

Support for CUDA-based GPUs is also presented. All of these extensions continue to be developed and improved as support for heterogeneous clusters in Charm++ increases.

Modern many-core GPUs are massively parallel processors, and the CUDA programming model provides a straightforward way of writing scalable parallel programs to execute on them. Data-parallel techniques provide a convenient way of expressing such parallelism.

The design of efficient scan and segmented scan routines, which are essential primitives in a broad range of data-parallel algorithms, is presented. By tailoring existing algorithms to the natural granularities of the machine and minimizing synchronization, some of the fastest scan and segmented scan algorithms for GPUs are designed.
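
For reference, the two primitives can be sketched as follows. The scan uses the classic step-doubling (Hillis-Steele style) formulation, simulated sequentially, to show where a data-parallel version would synchronize; the segmented scan is a plain sequential reference that restarts the running sum at each segment head. Neither is the tuned GPU algorithm from the presentation, and the function names and the head-flag representation are assumptions.

```python
def inclusive_scan(values):
    """Inclusive prefix sum using log2(n) step-doubling passes."""
    out = list(values)
    offset = 1
    while offset < len(out):
        prev = out[:]   # snapshot: every lane reads before any lane writes
        for i in range(offset, len(out)):
            out[i] = prev[i - offset] + prev[i]
        # A parallel version needs a barrier here between passes.
        offset *= 2
    return out


def segmented_scan(values, head_flags):
    """Inclusive prefix sum that restarts wherever head_flags is truthy."""
    out = []
    running = 0
    for v, is_head in zip(values, head_flags):
        running = v if is_head else running + v
        out.append(running)
    return out


if __name__ == "__main__":
    print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
    print(segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 0]))
```

The number of passes, and therefore the number of synchronization points, grows with the logarithm of the input length, which is why reducing synchronization matters for scan performance.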

The performance of interprocess communication mechanisms on modern multicore CPUs is evaluated. Streaming instructions are expected to deliver good performance, but the current implementation generates a high number of resource stalls and hence performs poorly.

Intra-node communication performance is also found to depend heavily on memory and cache architecture. The way improvements in processor and interconnect technology have affected the balance of computation to communication performance is also presented.