Parallel Computers Today Oak Ridge / Cray Jaguar > 1.75 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  TFLOPS = 10 12 floating.
Parallel Computers Today
- Oak Ridge / Cray Jaguar: > 1.75 PFLOPS
- Two Nvidia 8800 GPUs: > 1 TFLOPS
- Intel 80-core chip: > 1 TFLOPS
- TFLOPS = 1,000,000,000,000 floating point ops/sec (10^12)
- PFLOPS = 1,000,000,000,000,000 floating point ops/sec (10^15)

Supercomputers 1976: Cray-1, 133 MFLOPS (10^6)
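To put the quoted rates side by side, here is a small sketch that converts the slide figures to plain flops/sec and divides them. It uses only the numbers quoted on these slides and compares them as given, so it is a rough comparison of quoted rates, not of sustained application performance; the variable names are illustrative.

```python
# Rates quoted on the slides, converted to floating-point operations per second.
MFLOPS = 10**6
TFLOPS = 10**12
PFLOPS = 10**15

cray_1_1976 = 133 * MFLOPS     # Cray-1, 1976
jaguar = 1.75 * PFLOPS         # Oak Ridge / Cray Jaguar (slide gives "> 1.75 PFLOPS")
two_8800_gpus = 1 * TFLOPS     # two Nvidia 8800 GPUs (slide gives "> 1 TFLOPS")


def ratio(fast: float, slow: float) -> float:
    """How many times higher the rate `fast` is than the rate `slow`."""
    return fast / slow


print(f"Jaguar vs. Cray-1:        {ratio(jaguar, cray_1_1976):.2e}x")
print(f"Two 8800 GPUs vs. Cray-1: {ratio(two_8800_gpus, cray_1_1976):,.0f}x")
```

On these figures Jaguar comes out roughly 1.3 x 10^7 times the Cray-1 rate, and a pair of 8800 GPUs several thousand times it.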

Trends in processor clock speed

AMD Opteron 12-core chip

AMD Opteron 6-core layout detail

The Nvidia G80 GPU
- 128 streaming floating-point processors
- 1.5 GB shared RAM with 86 GB/s bandwidth
- 500 GFLOPS on one chip (single precision)
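One way to read the G80 numbers together is to ask how much arithmetic a kernel must do per byte fetched before the 500 GFLOPS peak, rather than the 86 GB/s memory bandwidth, becomes the limit. The sketch below computes that "machine balance" and a simple roofline-style bound, min(peak, intensity x bandwidth). The roofline framing and the SAXPY kernel are our illustrative additions, not part of the slide.

```python
# Figures from the G80 slide.
PEAK_FLOPS = 500e9        # single-precision peak, flops/sec
BANDWIDTH_BYTES = 86e9    # memory bandwidth, bytes/sec
BYTES_PER_FLOAT = 4       # single precision


def machine_balance(peak_flops: float, bandwidth: float) -> float:
    """Flops the chip can execute per byte moved to or from memory."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline-style bound: a kernel doing `intensity` flops per byte
    is limited either by bandwidth or by the arithmetic peak."""
    return min(peak_flops, intensity * bandwidth)


balance = machine_balance(PEAK_FLOPS, BANDWIDTH_BYTES)
print(f"balance: {balance:.1f} flops/byte "
      f"(about {balance * BYTES_PER_FLOAT:.0f} flops per float loaded)")

# Hypothetical kernel, SAXPY: y[i] = a * x[i] + y[i].
# 2 flops per element against 12 bytes of traffic (load x, load y, store y).
saxpy_intensity = 2 / 12
bound = attainable_flops(saxpy_intensity, PEAK_FLOPS, BANDWIDTH_BYTES)
print(f"SAXPY bound: {bound / 1e9:.1f} GFLOPS")
```

Under these assumptions a kernel needs about 6 flops per byte (roughly 23 per single-precision value) to reach peak; a streaming kernel like SAXPY stays bandwidth-bound at a small fraction of it.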

More Detail on GPU Architecture

Cray XMT (highly multithreaded shared memory)

Top 500 List

Graph 500 List

Generic Parallel Machine Architecture
- Key architecture question: Where is the interconnect, and how fast?
- Key algorithm question: Where is the data?
[Figure: storage hierarchy for three processors, each with Proc, Cache, L2 Cache, L3 Cache and Memory levels, with potential interconnects between them]
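"Where is the data?" matters because every level of the storage hierarchy adds latency. A standard way to quantify this is average memory access time, AMAT = hit time + miss rate x miss penalty, applied level by level so that each miss penalty is the AMAT of the next level out. The sketch below implements that recurrence; the latencies and miss rates are hypothetical values chosen for illustration, not measurements of any machine on these slides.

```python
def amat(levels, memory_latency):
    """Average memory access time for a multilevel cache hierarchy.

    levels: list of (hit_time, local_miss_rate) pairs, ordered from the
            cache closest to the processor outward (e.g. Cache, L2, L3).
    memory_latency: cost of going to main memory after missing every cache.
    """
    time = memory_latency
    # Start at the outermost cache and work inward: each level costs its
    # hit time plus, on a miss, the average cost of the level beyond it.
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


# Hypothetical latencies (cycles) and local miss rates, for illustration only.
hierarchy = [(4, 0.10), (12, 0.50), (40, 0.50)]   # Cache, L2, L3
print(f"AMAT = {amat(hierarchy, 200):.1f} cycles")
```

With these made-up numbers a 200-cycle memory shrinks to about 12 cycles on average, but only because 90% of accesses find their data in the first cache; an algorithm that keeps missing pays the far-level latencies instead.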

4-core Intel Nehalem chip (2 per Triton node):

Triton memory hierarchy
[Figure: one Triton node with two chips; each chip has four Proc units with their own Cache and L2 Cache plus one L3 Cache shared across the chip; both chips share the node Memory]
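The Triton diagram implies that how "close" two cores are depends on where they sit: on the same chip they can share data through that chip's L3 Cache, while cores on different chips meet only at node memory. The sketch below encodes that reading of the diagram. The core numbering (0-3 on chip 0, 4-7 on chip 1) and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical model of one Triton node as drawn on the slide:
# 2 Nehalem chips per node, 4 cores per chip, one shared L3 per chip.
# Core numbering (0-3 on chip 0, 4-7 on chip 1) is an assumption.
CORES_PER_CHIP = 4
CHIPS_PER_NODE = 2
CORES_PER_NODE = CORES_PER_CHIP * CHIPS_PER_NODE


def chip_of(core: int) -> int:
    """Index of the chip holding a given core."""
    return core // CORES_PER_CHIP


def nearest_shared_level(core_a: int, core_b: int) -> str:
    """Closest level of the hierarchy from which both cores can read data."""
    for core in (core_a, core_b):
        if not 0 <= core < CORES_PER_NODE:
            raise ValueError(f"core {core} is not on this node")
    if core_a == core_b:
        return "Cache"        # same core: its own first-level Cache
    if chip_of(core_a) == chip_of(core_b):
        return "L3 Cache"     # cores on one chip share that chip's L3
    return "node Memory"      # different chips meet only in node memory


for a, b in [(0, 0), (0, 3), (0, 4)]:
    print(f"cores {a} and {b}: {nearest_shared_level(a, b)}")
```

Placing threads that exchange data on the same chip lets them communicate through L3 instead of node memory, which is one concrete answer to the "where is the data?" question for this machine.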