Lecture 1: Introduction to High Performance Computing.
Grand challenge problem A grand challenge problem is one that cannot be solved in a reasonable amount of time with today’s computers.
Weather Forecasting
Cells of size 1 mile x 1 mile x 1 mile => whole global atmosphere is about 5 x 10^8 cells.
If each calculation requires 200 Flops => 10^11 Flops in one time step.
To forecast the weather over 10 days using 1-minute intervals (about 10^4 time steps, or 10^15 Flops in total), a computer operating at 100 Mflops (10^8 Flops/s) would take 10^7 seconds, or over 100 days.
To perform the calculation in 10 minutes would require a computer operating at 1.7 Tflops (1.7 x 10^12 Flops/s).
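The arithmetic on this slide can be checked with a short Python sketch. The step count of 10^4 is an assumption: it is the rounded figure that the slide's totals (10^7 seconds, 1.7 Tflops) imply. The function names are illustrative only.

```python
# Back-of-envelope cost of the global weather forecast, using the slide's round numbers.
CELLS = 5e8            # 1-mile cubes covering the global atmosphere
FLOPS_PER_CELL = 200   # floating-point operations per cell per time step
TIME_STEPS = 1e4       # assumption: rounded step count implied by the slide's totals


def total_flops(cells=CELLS, flops_per_cell=FLOPS_PER_CELL, steps=TIME_STEPS):
    """Total floating-point operations for the whole forecast."""
    return cells * flops_per_cell * steps


def runtime_seconds(machine_flops_per_s):
    """Wall-clock time on a machine sustaining the given rate."""
    return total_flops() / machine_flops_per_s


def required_rate(deadline_seconds):
    """Sustained Flops/s needed to finish within the deadline."""
    return total_flops() / deadline_seconds


if __name__ == "__main__":
    seconds = runtime_seconds(1e8)  # 100 Mflops machine
    print(f"Flops per time step: {CELLS * FLOPS_PER_CELL:.1e}")
    print(f"Runtime at 100 Mflops: {seconds:.1e} s ({seconds / 86400:.0f} days)")
    print(f"Rate for a 10-minute run: {required_rate(600) / 1e12:.2f} Tflops")
```

Changing TIME_STEPS shows how strongly the totals depend on the chosen time resolution.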
Some Grand Challenge Applications
Science: global climate modeling; astrophysical modeling; biology (genomics, protein folding, drug design); computational chemistry; computational material sciences and nanosciences
Engineering: crash simulation; semiconductor design; earthquake and structural modeling; computational fluid dynamics (airplane design); combustion (engine design)
Business: financial and economic modeling; transaction processing, web services and search engines
Defense: nuclear weapons (test by simulations); cryptography
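Moore’s Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the number of transistors on a chip would double roughly every year; in 1975 he revised this to every two years. The commonly quoted figure of 18 months is a later variant of the prediction.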
Moore’s Law holds also for performance and capacity

Year                                    1945         2002
Computer                                ENIAC        Laptop
Number of vacuum tubes / transistors    18 000       6 000 000 000
Weight (kg)                             27 200       0.9
Size (m^3)                              68           0.0028
Power (watts)                           20 000       60
Cost ($)                                4 630 000    1 000
Memory (bytes)                          200          1 073 741 824
Performance (Flops/s)                   800          5 000 000 000
Peak Performance: A contemporary RISC processor typically delivers only about 10% of its peak performance. There are two primary reasons behind this low efficiency: IPC inefficiency and memory inefficiency.
Instructions per cycle (IPC) inefficiency: Today the theoretical IPC is 4-6. Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2–1.4, so roughly 75% of the performance is not used.
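As a quick check of the "~75%" figure, the unused fraction is one minus the ratio of achieved to theoretical IPC. The sketch below uses the ranges quoted on the slide; the function name and the choice of midpoints are assumptions made for illustration.

```python
def unused_fraction(achieved_ipc, peak_ipc):
    """Fraction of the theoretical instruction throughput left idle."""
    if peak_ipc <= 0:
        raise ValueError("peak_ipc must be positive")
    return 1.0 - achieved_ipc / peak_ipc


# Ranges from the slide: theoretical IPC 4-6, measured average IPC 1.2-1.4.
best_case = unused_fraction(1.4, 4)    # highest achieved IPC, lowest peak
worst_case = unused_fraction(1.2, 6)   # lowest achieved IPC, highest peak
midpoint = unused_fraction(1.3, 5)     # midpoints of both ranges

if __name__ == "__main__":
    print(f"Unused: {best_case:.0%} to {worst_case:.0%}, midpoint {midpoint:.0%}")
```

The midpoints give a value close to the slide's 75%, with the extremes of the ranges bracketing it.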
Reasons for IPC inefficiency
Latency: waiting for access to memory or other parts of the system.
Overhead: extra work that has to be done to manage program concurrency and parallel resources, instead of the real work you want to perform.
Starvation: not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
Contention: delays due to fighting over what task gets to use a shared resource next. Network bandwidth is a major constraint.
Processor-Memory Problem: Processors issue instructions roughly every nanosecond, while DRAM can be accessed roughly every 100 nanoseconds. The gap is growing: processors are getting faster by 60% per year, DRAM by only 7% per year.
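To see how quickly the gap compounds, the two annual rates can be combined into a single growth factor. This is a minimal sketch that assumes both rates stay constant over the period, which the slide does not claim; the function names are illustrative.

```python
import math

CPU_GROWTH = 0.60    # processor speed-up per year (from the slide)
DRAM_GROWTH = 0.07   # DRAM speed-up per year (from the slide)


def gap_growth(years, cpu=CPU_GROWTH, dram=DRAM_GROWTH):
    """Factor by which the processor/DRAM speed ratio widens over `years`."""
    return ((1 + cpu) / (1 + dram)) ** years


def gap_doubling_time(cpu=CPU_GROWTH, dram=DRAM_GROWTH):
    """Years needed for the processor/DRAM speed ratio to double."""
    return math.log(2) / math.log((1 + cpu) / (1 + dram))


if __name__ == "__main__":
    print(f"Gap widens by a factor of {gap_growth(10):.0f} in 10 years")
    print(f"Gap doubles every {gap_doubling_time():.2f} years")
```

Under these assumptions the relative gap grows by about 50% each year, so it doubles in under two years.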
How fast can a serial computer be? Consider a 1 Tflop/s sequential machine. Data must travel some distance, r, to get from memory to the CPU. To get 1 data element per cycle, data must make this trip 10^12 times per second at the speed of light, c = 3 x 10^8 m/s, so r < c / 10^12 = 0.3 mm. If 1 TB of storage is packed into a 0.3 mm x 0.3 mm area, each word occupies about 3 Angstroms x 3 Angstroms, the size of a small atom.
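The bound can be reproduced numerically. The sketch below assumes the storage holds 10^12 words laid out on a square grid whose side equals r; both the layout and the function names are illustrative assumptions, not part of the slide.

```python
import math

C = 3e8  # speed of light in m/s, rounded as on the slide


def max_distance(transfers_per_second, speed=C):
    """Longest memory-to-CPU distance that allows one transfer per cycle."""
    return speed / transfers_per_second


def cell_side(square_side, words):
    """Side length of one word's cell when `words` cells tile a square."""
    return square_side / math.sqrt(words)


r = max_distance(1e12)      # 1 Tflop/s machine, one data element per cycle
side = cell_side(r, 1e12)   # assumption: 10^12 words in an r x r square

if __name__ == "__main__":
    print(f"r = {r * 1e3:.1f} mm")
    print(f"cell side = {side * 1e10:.1f} Angstroms")
```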
High Performance Computers
In the 1980s: 1 x 10^6 floating point ops/sec (Mflop/s); scalar based
In the 1990s: 1 x 10^9 floating point ops/sec (Gflop/s); vector and shared memory computing
Today: 1 x 10^12 floating point ops/sec (Tflop/s); highly parallel, distributed processing, message passing
What is a Supercomputer? A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved
Top500 Computers
Between 1993 and 2004, performance across the whole Top500 range grew faster than Moore’s Law predicts:
1993: #1 = 59.7 GFlop/s; #500 = 422 MFlop/s
2004: #1 = 70 TFlop/s; #500 = 850 GFlop/s
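One way to compare these figures with Moore's Law is to convert each pair into a doubling time. The sketch below assumes smooth exponential growth between the two list dates and takes the period as 2004 - 1993 = 11 years; the function name is illustrative.

```python
import math


def doubling_time_years(start, end, years):
    """Doubling time implied by exponential growth from `start` to `end`."""
    if start <= 0 or end <= start:
        raise ValueError("expected 0 < start < end")
    return years * math.log(2) / math.log(end / start)


YEARS = 2004 - 1993  # period covered by the two lists

top1 = doubling_time_years(59.7e9, 70e12, YEARS)     # #1 system, Flop/s
top500 = doubling_time_years(422e6, 850e9, YEARS)    # #500 system, Flop/s

if __name__ == "__main__":
    print(f"#1 doubles every {top1:.2f} years")
    print(f"#500 doubles every {top500:.2f} years")
```

Both doubling times come out at roughly one year, which is shorter than the 18-month period quoted for Moore's Law.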
Top500 List at June 2005

Rank  Manuf.  Computer    Install. Site              Country  Year  Rmax (Tflop/s)  #proc
1     IBM     BlueGene/L  LLNL                       USA      2005  136.8           65536
2     IBM     BlueGene/L  IBM Watson Res. Center     USA      2005  91.3            40960
3     SGI     Altix       NASA                       USA      2004  51.9            10160
4     NEC     Vector      Earth Simulator Center     Japan    2002  35.9            5120
5     IBM     Cluster     Barcelona Supercomp. C.    Spain    2005  27.9            4800
Increasing CPU Performance: Manycore chip composed of hybrid cores: some general purpose, some graphics, some floating point.
What is Next?
Board composed of multiple manycore chips sharing memory
Rack composed of multiple boards
A room full of these racks
Millions of cores
Exascale systems (10^18 Flop/s)
Moore’s Law Reinterpreted: The number of cores per chip doubles every 2 years, while clock speed decreases (not increases). The number of threads of execution also doubles every 2 years, so we need to deal with systems with millions of concurrent threads.
Directions
Move toward shared memory: SMPs and Distributed Shared Memory; shared address space with deep memory hierarchy
Clustering of shared memory machines for scalability
Efficiency of message passing and data parallel programming: MPI and HPF
Future of HPC Yesterday's HPC is today's mainframe is tomorrow's workstation