PARALLEL PROCESSING: The NAS Parallel Benchmarks
Daniel Gross, Chen Haiout


NASA (NAS Division)

NASA (NAS Division) Aims
The NASA Advanced Supercomputing (NAS) Division aims to develop, demonstrate, and deliver innovative computing capabilities that enable NASA projects and missions, and to demonstrate by the next millennium an operational computing system capable of simulating, in one to several hours, an entire aerospace vehicle system throughout its mission and life cycle.

NPB Introduction
The NAS Parallel Benchmarks (NPB) suite has been widely used to evaluate modern parallel systems. Its goal is to measure objectively the performance of highly parallel computers and to compare it with that of conventional supercomputers. The suite consists of eight benchmark problems derived from important classes of aerophysics applications. NPB is based on Fortran 77 and the MPI message-passing standard.

Benchmark Problems
EP - Embarrassingly Parallel
IS - Integer Sort
CG - Conjugate Gradient
MG - Multigrid method for the Poisson equation
FT - Spectral method (FFT) for the Laplace equation
BT - ADI; Block-Tridiagonal systems
SP - ADI; Scalar Pentadiagonal systems
LU - Lower-Upper symmetric Gauss-Seidel
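All eight benchmarks share the same overall structure: initialize MPI, run the kernel, time it, verify the result against a reference value, and report a rate in Mop/s. The fragment below is a minimal sketch of that harness in C (the official codes are written in Fortran 77); run_kernel and its returned operation count are hypothetical placeholders, not part of NPB.

    /* Minimal NPB-style timing harness (illustrative sketch, not the official code). */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical kernel: returns the number of operations it performed. */
    static double run_kernel(int rank, int nprocs) {
        /* benchmark-specific work would go here */
        return 0.0;
    }

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Barrier(MPI_COMM_WORLD);              /* start all processes together */
        double t0 = MPI_Wtime();
        double local_ops = run_kernel(rank, nprocs);
        MPI_Barrier(MPI_COMM_WORLD);              /* wait for the slowest process */
        double elapsed = MPI_Wtime() - t0;

        double total_ops = 0.0;
        MPI_Reduce(&local_ops, &total_ops, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d processes: %.2f s, %.2f Mop/s\n",
                   nprocs, elapsed, total_ops / elapsed / 1.0e6);

        MPI_Finalize();
        return 0;
    }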

The Embarrassingly Parallel (EP) Benchmark
In this benchmark, 2-dimensional statistics are accumulated from a large number of Gaussian pseudo-random numbers. The problem requires almost no communication, so in some sense this benchmark provides an estimate of the upper achievable limit for floating-point performance on a particular system.

The Scalar Pentadiagonal (SP) Benchmark
In this benchmark, multiple independent systems of non-diagonally dominant, scalar pentadiagonal equations are solved. A complete solution of SP requires 400 iterations.

The MultiGrid (MG) Benchmark
MG uses a multigrid method to compute the solution of the three-dimensional scalar Poisson equation. This code is a good test of both short- and long-distance, highly structured communication.
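As a concrete illustration of why EP needs almost no communication, the sketch below (a minimal C/MPI example, not the official benchmark) has every process independently draw uniform pairs, convert the accepted ones to Gaussian deviates with the polar method, and tally them into annulus counts; a single MPI_Reduce at the end is the only communication. The rand_r generator, the seed, and the problem size are stand-ins; the real EP specifies its own linear congruential generator and class-dependent sizes.

    /* EP-style sketch: independent Gaussian statistics, one reduction at the end. */
    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBINS 10

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long pairs_per_proc = 1L << 20;                 /* toy size, not an NPB class    */
        long counts[NBINS] = {0}, totals[NBINS] = {0};
        unsigned int seed = 12345u + (unsigned)rank;    /* per-process stream (stand-in) */

        for (long i = 0; i < pairs_per_proc; i++) {
            double x = 2.0 * rand_r(&seed) / RAND_MAX - 1.0;
            double y = 2.0 * rand_r(&seed) / RAND_MAX - 1.0;
            double t = x * x + y * y;
            if (t > 1.0 || t == 0.0) continue;          /* rejection step of the polar method */
            double f = sqrt(-2.0 * log(t) / t);
            double gx = x * f, gy = y * f;              /* two independent Gaussian deviates  */
            int bin = (int)fmax(fabs(gx), fabs(gy));    /* square annulus containing the pair */
            if (bin < NBINS) counts[bin]++;
        }

        /* The only communication: combine the per-process tallies. */
        MPI_Reduce(counts, totals, NBINS, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            for (int k = 0; k < NBINS; k++)
                printf("annulus %d: %ld pairs\n", k, totals[k]);

        MPI_Finalize();
        return 0;
    }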

The 3-D FFT PDE (FT) Benchmark
FT contains the computational kernel of a three-dimensional FFT-based spectral method.

The BT Simulated CFD Benchmark
BT solves systems of equations resulting from an approximately factored finite-difference discretization of the Navier-Stokes equations.
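To show the spectral-method idea behind FT at toy scale, the sketch below solves a 1-D periodic Poisson problem u'' = f by transforming f to Fourier space, dividing each mode by -k^2, and transforming back. It deliberately uses a naive O(N^2) discrete Fourier transform so it stays self-contained; the actual benchmark performs distributed 3-D FFTs, which is where its heavy all-to-all communication comes from.

    /* Spectral (Fourier) solve of u'' = f on a periodic 1-D grid: a toy FT-style example. */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    #define N 64

    /* Naive DFT: sign = -1 forward, +1 inverse (inverse is scaled by 1/N). */
    static void dft(const double complex *in, double complex *out, int sign) {
        for (int k = 0; k < N; k++) {
            double complex s = 0.0;
            for (int j = 0; j < N; j++)
                s += in[j] * cexp(sign * 2.0 * M_PI * I * k * j / N);
            out[k] = (sign > 0) ? s / N : s;
        }
    }

    int main(void) {
        double complex f[N], fhat[N], uhat[N], u[N];
        for (int j = 0; j < N; j++)                 /* f(x) = cos(x), so u(x) = -cos(x) */
            f[j] = cos(2.0 * M_PI * j / N);

        dft(f, fhat, -1);
        uhat[0] = 0.0;                              /* pick the zero-mean solution       */
        for (int k = 1; k < N; k++) {
            int kk = (k <= N / 2) ? k : k - N;      /* signed wavenumber                 */
            uhat[k] = -fhat[k] / (double)(kk * kk); /* divide by -k^2 in Fourier space   */
        }
        dft(uhat, u, +1);

        printf("u at x=0: %.4f (expect about -1)\n", creal(u[0]));
        return 0;
    }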

Benchmark Classes
Since the 1991 specification of NPB 1.0, computer speeds and memory sizes have grown, and representative problem sizes have grown correspondingly. NPB 1.0 specifies two problem sizes for each benchmark: class "A" and a larger class "B". The class A benchmarks can now be run on a moderately powerful workstation, and the class B benchmarks on high-end workstations or small parallel systems. To retain the focus on high-end supercomputing, a class "C" was added for all of the NAS benchmarks.

Weak Points
Implementations of the NAS Benchmarks are usually highly tuned by computer vendors. The largest problems (class B) no longer reflect the largest problems being run on present-day supercomputers.

Why 8 Different Benchmarks?

Comparing Clusters World-Wide: Loki and Hyglac
In September 1996 two medium-scale parallel systems called "Loki" and "Hyglac" were installed. Each consisted of sixteen Pentium Pro (200 MHz) PCs with 16 Mbytes of memory and 3.2 and 2.5 Gbytes of disk per node, respectively. Each system was integrated using two Fast Ethernet NICs per node. Both sites performed a complex N-body gravitational simulation of 2 million particles using an advanced tree-code algorithm; the two systems achieved sustained performances of 1.19 Gflops and 1.26 Gflops, respectively. When the systems were connected together, the same code was run again and achieved a sustained rate of over 2 Gflops without further optimization of the code for the new configuration.

Berkeley NOW
The hardware configuration of the Berkeley NOW (Network Of Workstations) system comprises 105 Sun Ultra 170 workstations connected by Myricom (Myrinet) networks. Each node includes a 167 MHz UltraSPARC microprocessor with 512 KB cache, 128 MB of RAM, and two 2.3 GB disks.

Cray T3E
The Cray T3E-1200 is a scalable shared-memory multiprocessor based on the DEC Alpha microprocessor. It provides a shared physical address space across up to 2048 processors connected by a 3-D torus interconnect. Each node of the system contains an Alpha processor capable of 1200 Mflops. The system logic runs at 75 MHz, and the processor runs at some multiple of this, such as 600 MHz (8 x 75 MHz) for the T3E-1200. The torus links provide a raw bandwidth of 650 MB/s in each direction to maintain system balance with the faster processors and memory.

NPB Graph Results

The Dwarves: Hardware
- Old Pentium II 300 MHz processors (will be removed soon)
- 8 Pentium III 450 MHz processors
- 4 Pentium III 733 MHz processors
- The new machines: dual AMD Athlon(tm) MP, 1,666 MHz, 1 GB memory

In the Next 2 Weeks
- Install NPB 2.2 on the Dwarves cluster
- Run the benchmark tests on the Dwarves cluster
- Run tests on several different configurations (different numbers of dwarves)
- Estimate network bandwidth and latency (see the ping-pong sketch below)
- Compare the Dwarves cluster performance to similar clusters around the world
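For the bandwidth and latency estimate, the usual tool is an MPI ping-pong test between two nodes; the sketch below is one minimal way to write it (an assumed approach, not an existing script). Rank 0 sends a buffer to rank 1 and receives it back, the round trip is repeated many times, and latency is read from the small-message timings while bandwidth is read from the large-message ones. Run it with exactly two processes, one per node, e.g. mpirun -np 2.

    /* MPI ping-pong between ranks 0 and 1: half the round-trip time gives latency
     * (small messages) and message_size / time gives bandwidth (large messages). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        const int reps = 1000;

        for (int size = 1; size <= 1 << 20; size *= 4) {
            char *buf = calloc(size, 1);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double one_way = (MPI_Wtime() - t0) / (2.0 * reps);  /* seconds per one-way trip */
            if (rank == 0)
                printf("%8d bytes: %8.2f us, %8.2f MB/s\n",
                       size, one_way * 1e6, size / one_way / 1e6);
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }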

Questions will not be answered!!! GOOD NIGHT