CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger
NPACI (National Partnership for Advanced Computational Infrastructure), Supercomputing '98, Mannheim

Presentation transcript:

CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger
Allan Snavely, San Diego Supercomputer Center
Supercomputing '98, Mannheim, June 19, 1998

Background
CRAY vector computers have been the workhorses of scientific computing for over two decades. CRAY PVPs have been 'effort/performance' leaders thanks to vector processors, flat shared memory, and great tools. Vector machines are still very popular in terms of number of users and available scientific applications software. NPACI currently offers a T916/14, a J98/5, and a J916/16. There is a lot of legacy vector code, much of which will never see an MPI_Send call. The T90 is the last in the line of CRAY PVP computers.
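To make the 'effort/performance' point concrete, here is a hypothetical loop (a sketch, not code from the talk) of the kind a CRAY PVP compiler vectorizes automatically; because memory is flat and shared, no explicit data distribution or MPI_Send is ever needed:

    /* Hypothetical sketch: a stride-1 loop that a Cray vectorizing
     * compiler turns into vector instructions with no source changes.
     * Flat shared memory means no message passing is required. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];   /* maps onto the machine's vector registers */
    }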

More Background
Tera has developed a revolutionary new architecture, the MTA, for parallel computing, with a programming model as simple as the PVP model. The MTA can exploit more levels of parallelism than the T90. The first Tera machine (MTA, for MultiThreaded Architecture) was delivered to SDSC in November 1997 with a single 145 MHz processor (less than half the final speed). Tera delivered a two-processor system to SDSC in early 1998 with two 255 MHz processors (still not final speed) and a network board (not final either), but no UNIX.
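A hypothetical loop nest (again a sketch, not code from the talk) illustrates what 'more levels of parallelism' means: a vector processor typically exploits only the innermost loop, while the MTA's hardware streams can also run outer-loop iterations concurrently:

    /* Hypothetical example of two levels of parallelism.  On the T90 the
     * compiler vectorizes the inner j loop, and the outer i loop runs
     * serially unless it is explicitly multitasked.  On the MTA the outer
     * iterations can be spread across the 128 hardware streams of each
     * CPU, which also helps hide memory latency. */
    void smooth(int n, double a[n][n], double b[n][n])
    {
        for (int i = 1; i < n - 1; i++)        /* outer level: threads/streams */
            for (int j = 1; j < n - 1; j++)    /* inner level: vectors */
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
    }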

Caveats, Disclaimers, and Excuses
MTA software is still being debugged. Processors are not running at full speed:
–theoretical peak is 765 Mflops/CPU at 255 MHz, and will rise toward roughly 1 Gflops at the final clock speed
Interconnect is not up to specification:
–memory-intensive codes cannot speed up by more than 1.75 until new network boards are installed
All of the above are improving daily and are production issues, not research issues. We have had two processors running and a stable OS (though not yet UNIX) for only a few weeks. Time on the machine is shared with Tera.


T90/MTA Hardware Comparison
CRAY T90:
–440 MHz clock frequency
–8 128-element vector registers/CPU
–Dual vector pipes into FUs
–Pipelined ADD and MULT units
–Can execute 4 flops/cycle (commonly 2)
–Flat shared memory: DRAM, high bandwidth, low latency
–Can issue 2 loads + 1 store per cycle
–Peak 1.76 Gflops/CPU; practical peak of 1 Gflops
–Currently observe Mflops in 'good' user codes
Tera MTA:
–255 MHz clock now (final clock will be faster)
–128 streams (hardware thread contexts)/CPU
–Effective pipeline depth of 21
–Additional FMA unit
–Can execute 3 flops/cycle (commonly 2)
–Flat shared memory: SRAM, moderate latency, moderate bandwidth
–Can issue 1 memory reference per cycle
–Peak 0.9+ Gflops/CPU; practical peak of 600 Mflops
–Tera expects sustained 30-60% of peak in 'good' user codes
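As a sanity check on the peaks in the comparison above, peak rate is just clock frequency times flops per cycle; the figures below follow from the slide's own numbers (the 440 MHz T90 clock and the rough final MTA clock are inferred from the quoted peaks rather than taken verbatim):

    CRAY T90:  440 MHz x 4 flops/cycle = 1.76 Gflops/CPU
    Tera MTA:  255 MHz x 3 flops/cycle = 765 Mflops/CPU (current clock)
               a 0.9+ Gflops/CPU peak therefore implies a final clock of roughly 300 MHz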

NAS 2.3-Serial Benchmarks
NAS Parallel Benchmarks version 2.3:
–Version 2 benchmarks are not 'pencil-and-paper'; they must be executed as is or with minimal tuning
–Written using MPI for distributed-memory, RISC-based machines
NAS 2.3-Serial:
–'Reverse-engineered' from NPB 2.3; the MPI versions were 'serialized'
–Not necessarily optimal for vector or multithreaded platforms 'as is'
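A minimal sketch (hypothetical code, not the actual NPB source) shows what 'serializing' an MPI benchmark amounts to: the rank-local loop bounds and halo exchanges are removed so a single process performs the full computation, leaving the parallelism to the vectorizing (T90) or multithreading (MTA) compiler:

    /* Hypothetical illustration of serialization.  An MPI version would
     * compute only a rank-local slice and exchange boundary values with
     * MPI_Send/MPI_Recv; the serialized version sweeps the whole array
     * in one process with no communication calls at all. */
    void sweep_serial(int n, const double *u, double *unew, const double *f)
    {
        for (int i = 1; i < n - 1; i++)
            unew[i] = 0.5 * (u[i-1] + u[i+1]) - f[i];
    }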


NAS 2.3-Serial Benchmarks Results

Applications Performance: Disclaimer
The MTA wasn't available long enough to port and tune many applications. Two processors weren't available long enough to obtain many multiprocessor results. Most of the tuning effort was performed by Tera staff. The applications were not selected for superior T90 performance:
–LCPFCT performs very well on the T90
–AMBER performs fairly well on the T90
–LS-DYNA3D performs less well on the T90 for many interesting cases

LCPFCT Performance Comparison

AMBER Performance Comparison

LS-DYNA3D Comparison

Conclusions
T90 multitasking doesn't give the user fine control over load balancing. Porting T90 codes to the MTA is easy. Tuning on both platforms is facilitated by excellent compilers and simple programming models. The MTA can exploit the same parallelism in a problem that the T90 can, and it can also exploit levels that the T90 cannot. The MTA is likely to give good performance and scalability on most T90 codes. The T90 is still the world's fastest vector machine, but the MTA may outperform it across a wider spectrum of problems: those that use vectors but also have more potential outer-loop and higher-level parallelism.
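One hypothetical example (not from the talk) of that 'wider spectrum': a loop whose iterations do unpredictable amounts of work vectorizes poorly and is hard to load-balance with static multitasking on the T90, whereas the MTA can keep its streams busy by handing out iterations dynamically:

    /* Hypothetical irregular loop (assumes tol[i] > 0): the inner while
     * loop runs a different, data-dependent number of times for each i,
     * so a static split of the i range across T90 CPUs balances poorly,
     * while MTA streams can pick up new iterations as soon as they
     * finish their current ones. */
    void relax_until_converged(int n, double *x, const double *tol)
    {
        for (int i = 0; i < n; i++) {
            while (x[i] > tol[i])      /* data-dependent trip count */
                x[i] *= 0.5;
        }
    }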

Future MTA Hardware Plans
A 4-processor network is to be delivered soon (July?), with 2 more processors shortly thereafter (August?). Each processor comes with one or two 1 GB memory modules (the memory is not associated directly with the processor; that is just how the network is built). UNIX will be completed by the end of summer (August-September?). Pending the results of evaluations, the system will grow to 8 processors (end of year?), then 16 (next year). Fortran 90, OpenMP, and other tools are on the way.

Future Work
SC98:
–updated NAS benchmarks ('final' processors and network)
–multiprocessor benchmarks
–applications as well as kernels
Applications Porting and Tuning:
–More work on AMBER, LS-DYNA3D
–Port GAMESS, MPIRE, OVERFLOW
–Port other vendor and research codes
–Suggestions?