Parallel Computers Today

Presentation transcript:

Parallel Computers Today
LANL / IBM Roadrunner > 1 PFLOPS
Two Nvidia 8800 GPUs > 1 TFLOPS
Intel 80-core chip > 1 TFLOPS
TFLOPS = 10^12 floating point ops/sec; PFLOPS = 10^15 floating point ops/sec (1,000,000,000,000,000 ops/sec)
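For scale (a back-of-the-envelope addition, not on the original slide): LU-factoring a dense n = 100,000 matrix costs about (2/3)n^3 ≈ 6.7 x 10^14 flops, so a 1 PFLOPS machine could in principle finish in under a second, while a 1 TFLOPS chip would need roughly 11 minutes, ignoring memory and communication costs.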

Columbia (10240-processor SGI Altix, 50 Teraflops, NASA Ames Research Center)

Beowulf (18-processor cluster, lab machine)

AMD Opteron quad-core die

The Nvidia G80 GPU: 128 streaming floating-point processors; 1.5 GB shared RAM with 86 GB/s bandwidth; roughly 500 Gflops on one chip (single precision)
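Where the ~500 Gflops figure comes from (a rough derivation, not on the original slide): 128 scalar processors x ~1.35 GHz shader clock x up to 3 flops per cycle (a multiply-add plus a co-issued multiply) ≈ 518 Gflops single-precision peak; counting only the multiply-add gives ≈ 346 Gflops.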

The Computer Architecture Challenge
Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.
Originally, because linear algebra is the middleware of scientific computing.
Nowadays, mostly for bragging rights.
(Figure: dense LU factorization, P·A = L·U.)
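As a concrete illustration (a minimal sketch of my own, not HPL or anything actually used for the Top500 list), the benchmark idea is simply: factor a dense n x n matrix and divide the nominal (2/3)n^3 flops by the elapsed time. Real runs add partial pivoting, blocked algorithms, tuned BLAS, and MPI across the whole machine.

/* Minimal Linpack-style sketch: factor a dense n x n matrix with
 * unpivoted Gaussian elimination and report Gflop/s using the
 * customary (2/3)n^3 flop count. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void lu_nopivot(double *A, int n) {
    for (int k = 0; k < n; k++)                    /* eliminate column k */
        for (int i = k + 1; i < n; i++) {
            double m = A[i*n + k] / A[k*n + k];    /* multiplier L(i,k) */
            A[i*n + k] = m;
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= m * A[k*n + j];      /* update trailing row */
        }
}

int main(void) {
    int n = 1000;                                  /* small demo size */
    double *A = malloc((size_t)n * n * sizeof *A);
    srand(1);
    for (int i = 0; i < n * n; i++)                /* random, diagonally dominant */
        A[i] = (double)rand() / RAND_MAX + (i % (n + 1) == 0 ? n : 0);

    clock_t t0 = clock();
    lu_nopivot(A, n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double flops = 2.0 / 3.0 * (double)n * n * n;
    printf("n = %d: %.3f s, %.2f Gflop/s\n", n, secs, flops / secs / 1e9);
    free(A);
    return 0;
}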

Top 500 List

Generic Parallel Machine Architecture
(Figure: several nodes, each with its own storage hierarchy Proc -> Cache -> L2 Cache -> L3 Cache -> Memory, joined by potential interconnects at several levels.)
Key architecture question: where is the interconnect, and how fast?
Key algorithm question: where is the data?
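A small sketch of my own (not from the slides) of why "where is the data?" matters even within one node: the same number of additions runs far slower once the access pattern stops reusing cache lines.

/* Sum a large array with different strides. Every run performs exactly n
 * additions; only the traversal order, and hence the cache behaviour,
 * changes. Expect stride 1 to be several times faster than stride 64. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double sum_strided(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

int main(void) {
    size_t n = (size_t)1 << 25;           /* 32M doubles = 256 MB */
    double *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++) a[i] = 1.0;

    for (size_t stride = 1; stride <= 64; stride *= 8) {
        clock_t t0 = clock();
        double s = sum_strided(a, n, stride);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2zu: %.3f s (sum = %.0f)\n", stride, secs, s);
    }
    free(a);
    return 0;
}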

Multicore SMP Systems (three dual-socket platforms)
Intel Clovertown: two quad-core Core2 sockets; each pair of Core2 cores shares a 4 MB L2; front-side buses (10.6 GB/s each) feed a chipset with 4x64b controllers to fully buffered DRAM (10.6 GB/s write).
Sun Niagara2: eight multithreaded UltraSparc cores, each with an 8K D$ and an FPU, connected by a crossbar switch (179 GB/s fill, 90 GB/s writethru) to a 4 MB shared L2 (16-way) and 4x128b FBDIMM memory controllers to fully buffered DRAM (42.7 GB/s read, 21.3 GB/s write).
AMD Opteron: two dual-core sockets; each core has a 1 MB victim cache; each socket has its own memory controller / HT link to DDR2 DRAM (10.6 GB/s), with 4 GB/s HyperTransport links in each direction between sockets.
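The DRAM bandwidths quoted above are what a streaming kernel can at best approach. A minimal OpenMP sketch of my own, in the spirit of the STREAM triad benchmark (compile with something like cc -O2 -fopenmp triad.c), for probing sustained bandwidth on such an SMP:

/* STREAM-triad-style bandwidth probe: a[i] = b[i] + 3*c[i] over large
 * arrays, timed across all OpenMP threads. Reported bytes count two
 * reads and one write per element (ignoring write-allocate traffic).
 * On a NUMA system such as the Opteron, a real measurement would also
 * parallelize the initialization so pages are first-touched locally. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    long n = 1L << 25;                       /* 32M elements per array */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
    double secs = omp_get_wtime() - t0;

    double bytes = 3.0 * n * sizeof(double);
    printf("%d threads: %.2f GB/s (a[0] = %.1f)\n",
           omp_get_max_threads(), bytes / secs / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}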

More Detail on GPU Architecture

Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts (1-arch/feeding_the_beast_perrone.pdf)

Cray XMT (highly multithreaded shared memory)