Presentation on theme: "The first-generation Cell Broadband Engine (BE) processor is a multi-core chip comprised of a 64-bit Power Architecture processor core and eight synergistic."— Presentation transcript:
The first-generation Cell Broadband Engine (BE) processor is a multi-core chip comprised of a 64-bit Power Architecture processor core and eight synergistic processor cores, capable of massive floating point processing, optimized for compute- intensive workloads and broadband rich media applications
Dense matrix multiplication is one of the most common numerical operations and important algorithms. Cell B.E excels in its capabilities to process compute-intensive workloads like matrix multiplication in single precision through its powerful SIMD capabilities
Computational micro-kernels are architecture specific codes when used with systematic analysis of problem combined with exploitation of low-level features of synergistic processing unit of cell B.E leads of dense matrix multiplication kernels achieving peak performance.
Introducing highly optimized cell B.E implementations of two classic dense linear algebra computations, Cholesky factorization QR factorization
Work has been done to prove that a silicon chip can provide great performance for compute-intensive scientific workloads by combining short-vector single instruction multiple data with multicore architecture. SPEs allow implementation of complex synchronization mechanisms, task level parallelism.
Hybrid GPU based multicore platforms which has both homogeneous multicores and GPUs provide effective solution for challenges of appetite power and gap between compute and communication speeds and hence is the trend taken by GPUs and hybrid combinations of GPUs with homogeneous multicores is appreciated as it can freeze the frequency escalate the number of cores, provide data parallelism high bandwidth
The development of dense linear algebra algorithms for GPUs is done where the approach is based on development of hybrid algorithms where in general small, non-parallelizable tasks are executed on CPU and data parallel tasks are executed on GPU and it uses CUDA to develop low-level kernels and high- level libraries like LAPACK and BLAS
Approach to develop high performance BLAS for GPUs which is essential to enable GPU-based hybrid approaches in area of dense linear algebra. Important issues for design of kernels-blocking and coalesced memory access are discussed Three optimization techniques of implementations of BLAS-pointer redirecting, padding and auto-tuning are discussed
Sparse matrix vector multiplication (SpMV) is an interesting computation as it appears in scientific and engineering, financial, economic modeling and information retrieval applications. The level of performance is achieved through the diversity of architectural designs and input matrix characteristics i.e complex combination of architecture and matrix specific techniques
A comparison for better performance is done on different platforms across the suite of matrices and it is evident that the optimized implementations deliver better performance and it is also observed that bandwidth is the determining performance factor
The accurate simulation of real world phenomena in computational science is based on mathematical model that has a set of partial differential equations and finite element methods are considered to be the most promising approaches for numerical treatment of partial differential equations.
Graphics processing units are considered to be working well in such cases and in order to achieve peak performance, selection of proper data structures, parallelization techniques especially when combining coarse grained parallelism on cluster level and medium and fine grained parallelism between CPU cores and within accelerated drivers like GPUs
The way of applying fine grained parallelization techniques for robust multigrid solvers which are numerically strong like sparse ill-conditioned linear systems of equations that arise from grid-based discretization techniques like finite differences, volumes and elements
Parallelization techniques are implemented on graphics processors as representatives of throughput oriented wide SIMD many-core architectures as GPUs offer a tremendous amount of fine-grained parallelism. Here the NVIDIA CUDA is being used where the concepts of memory coalescing, wraps, shared memory and thread blocks are encountered
Design of efficient parallel implementation of Fast Fourier Transform(FFT) on cell/B.E and it is a fundamental kernel in computationally intensive scientific applications like computer tomography, data filtering, fluid dynamics, spectral analysis of speech, sonar, radar, seismic, vibration detection, digital filtering, signal decomposition, PDEs
An interactive approach is used to solve 1D FFT that divides the work among SPEs to efficiently parallelize FFT computation and it requires synchronization among SPEs after each stage of FFT computation where the computation of SPEs is fully vectorized with other optimization techniques such as loop unrolling and double buffering.
A way in which the FFT can exploit typical parallel resources on multicore architecture platforms to achieve near-optimal performance for which designers have to adopt a systematic approach that takes into account the attributes of both the application and target system.
A successful implementation lies on deep understanding of data access patterns, computation properties, available hardware resources where it can take advantage of generalized performance planning techniques to produce successful implementation across a wide variety of multicore architectures.
Combinatorial algorithms play important role in scientific computing for efficient parallelization of linear algebra, computational physics, numerical optimization computations, massive data analysis routines, systems biology, the study of natural phenomena involving networks and complex interactions
A complexity model to simplify design of algorithms on cell/B.E multicore architecture and a systematic procedure to evaluate performance is presented. In order to get the execution time of algorithm, the computational complexity, memory access patterns and complexity of branching instructions are considered.
The application of auto-tuning to the 7- and 27- point stencils on widest range of multicore architectures where the chip multiprocessors lie at extremes of spectrum of design tradeoffs that range from replication of existing core technology to employing large numbers of simple cores and novel memory hierarchies.
Important aspects are parallelism discovery, selecting from various forms of hardware parallelism and enabling memory hierarchy optimizations, made more challenging by separate address space, software managed memory local stores and NUMA features that appear in multicore systems.
Multi core and many core and heterogeneous micro architecture is very important in hardware landscape. Specialized processing units such as commodity graphics processing units are proved to compute accelerators that are capable of solving specific scientific problems orders of magnitude faster than conventional CPUs
Hyperthermia is a relatively new treatment modality which is used as complementary therapy to radio or chemo therapies. Here we study the optimizations of a computational kernel appearing within biomedical application hyperthemia cancer treatment on NVIDIAs graphic processing unit
The implementation and results of two bioinformatics applications, namely FASTA for the Smith-Watersman kernel and ClustalW. The results show that cell/B.E is an attractive avenue for bioinformatics applications. A cell/B.E is considered to be a power-efficient platform provided that the total power consumption of cell/B.E is less than super scalar processor.
Also the implementation of the CustalW running on cell/B.E that uses software caches inside SPEs for data movement is described. Using the software caches enhances the programmer productivity without major decrease in performance.
Efficient and scalable strategies to orchestrate all- pairs computations on cell architecture, based on decomposition of the computations and input entries is described. General case is to schedule computations on cell processor and to extend the strategies to incorporate cases when number of input entries is large and size of individual entries is too large to fit memory limitations of SPEs
The performance results showed that cell processor is a good platform to accelerate various kinds of applications dealing with pairwise computations. The all-pairs computations strategies can be applied to many applications from a wide range of areas which requires such computations to be performed.
The main applications of drug design are figured and two practical case studies, FTDock and Moldy, which are a docking and a molecular dynamics application are discussed. The advantages of using cell B.E in the drug design are noticed.
Regarding FTDock, a 3x speedup is achieved compared to a parallel version running on a POWER5 multicore with two 1.5GHz POWER5 chips with 16GB of RAM. Moldy on cell BE consumes less power and takes same time as an MPI parallelization on four Itanium Montecito processors of SGI Altix 4700
GPUs are parallel computing devices capable of accelerating a wide variety of data-parallel algorithms and their tremendous computing capabilities help accelerate molecular modeling applications, enabling molecular dynamics simulations and their analyses to run much faster than before and allowing use of scientific techniques that are impractical on conventional hardware platforms.
Most computationally expensive algorithms used in molecular modeling are presented and explained how these algorithms may be reformulated as arithmetic intensive, data parallel algorithms capable of achieving high performance on GPUs. In coming years, we expect GPU hardware architecture to continue to evolve rapidly and become increasingly sophisticated.
Biomedical applications are an important focus for high performance computing(HPC) researchers. The use of accelerators, with their low cost and high performance is possible solution for investigating methods to provide high performance.
It is clear that the data flow programming model and associated runtime systems can, at multiple application and hardware granularities, ease the implementation of challenging biomedical applications for these types of computational resources. GPU is designed to deliver maximum performance through its SIMD architecture.
The charm++ parallel programming model and runtime system to support accelerators and heterogeneous clusters that include accelerators is presented. Also several extensions to charm++ programming model, including SIMD instruction abstraction, accelerated entry methods and accelerated blocks are presented.
The important concept is that the support for CUDA based GPUs is presented where all these extensions are continuing to be developed and improved upon, as we increase support for heterogeneous clusters in charm++.
The modern many-core GPUs are massively parallel processors where the CUDA programming model provides a straightforward way of writing scalable parallel programs to execute on GPU. Data parallel techniques provide convenient way of expressing such parallelism.
The design of efficient scan and segmented scan routines which are essential primitives in a broadband range of data parallel algorithms is presented and thus by tailoring the existing algorithms to natural granularities of machine and by minimizing synchronization, one of the fastest scan and segmented scan algorithms are designed for GPU.
The performance evaluation of the interprocess communication mechanism for modern multicore CPUs is analyzed. It is observed that the streaming instructions are expected to deliver good performance where the current implementation generates a high number of resource stalls and hence low performance.
It is also found that intra-node communication performance is highly dependant on memory and cache architecture and also the way how the improvements in processor and interconnect technology have affected the balance of computation to communication performance is presented.