National Partnership for Advanced Computational Infrastructure
Advanced Architectures - CSE 190
Reagan W. Moore, San Diego Supercomputer Center

Course Organization
Professors / TA:
- Sid Karin - Director, San Diego Supercomputer Center
- Reagan Moore - Associate Director, SDSC
- Holly Dail - UCSD TA
Course plan:
- Seminars: state-of-the-art computer architectures
- Mid-term / SDSC tour
- Final exam

Seminars
4/3:  Reagan Moore - Performance evaluation heuristics & modeling
4/10: Sid Karin - Historical perspective
4/17: Richard Kaufmann, Compaq - Teraflops systems
4/24: IBM or Sun
5/1:  Mark Seager, LLNL - ASCI 10 Tflops computer
5/8:  Midterm / SDSC tour
5/15: John Feo, Tera - Multi-threaded architectures
5/22: Peter Beckman, LANL - Clusters
5/29: Holiday / no class
6/5:  Thomas Sterling, Caltech - Petaflops computers
6/12: Final exam

Distributed Archives
[Diagram: a distributed-archives environment linking an application to digital library, data mining, information discovery, and collection building services, with supercomputers for simulation and data mining.]

Heuristics for Characterizing Supercomputers
Generators of data - numerically intensive computing:
- Usage models for the rate at which supercomputers move data between memory, disk, and archives
- Usage models for the capacity of the data caches (memory size, local disk, and archival storage)
Analyzers of data - data intensive computing:
- Performance models for combining data analysis with data movement (between caches, disks, and archives)

Heuristics
Experience-based models of computer usage, dependent on the computer architecture (presence of data caches, memory-mapped I/O).
Architectures used at SDSC:
- CRAY vector computers: X-MP, Y-MP, C-90, T-90
- Parallel computers:
  - MPPs: iPSC/860, Paragon, T3D, T3E
  - Clusters: SP

Supercomputer Data Flow Model
[Diagram: data flows from the CPU to memory, to local disk, to archive disk, and finally to archive tape.]

Y-MP Heuristics
Utilization measured on the Cray Y-MP:
- Real memory architecture - the entire job context is in memory, with no paging of data
- Exceptional memory bandwidth: the I/O rate from CPU to memory was 28 bytes per cycle, while the maximum execution rate was 2 flops per cycle
Scaled memory on the C-90 to test the heuristics:
- Increasing memory from 1 GB to 2 GB decreased idle time from 10% to 2%
- The sustained execution rate was 1.8 GFlops
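
As a quick back-of-the-envelope check (not on the original slide), the hardware balance implied by these figures can be computed directly; the sketch below uses only the two numbers quoted above.

```python
# Memory-traffic-per-flop implied by the Y-MP figures on this slide.
bytes_per_cycle = 28    # CPU <-> memory I/O rate (bytes per clock cycle)
flops_per_cycle = 2     # maximum execution rate (flops per clock cycle)

bytes_per_flop = bytes_per_cycle / flops_per_cycle
print(f"Peak memory traffic per flop: {bytes_per_flop:.0f} bytes/flop")
# -> 14 bytes/flop of hardware headroom, compared with the 7 bytes/flop
#    usage heuristic that appears on the next slide.
```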

Data Generation Metrics
[Diagram: the CPU - memory - local disk - archive disk - archive tape flow model, annotated with usage heuristics.]
- CPU <-> memory: 7 bytes/flop
- Memory -> local disk: 1 byte per 60 flops
- 1 byte of storage per flops
- Local disk: hold data for 1 day; 1/7 of the data persists for a day
- Archive disk: hold data for 1 week; 1/7 of the data is sent to the archive
- Archive tape: all data sent to tape; hold data forever

Peak Teraflops System
[Diagram: projected data flow for a teraflops system, with the values to be filled in.]
- Compute engine: ? TB memory, sustaining ? GFlops
- Local disk (1-day cache): ? TB at ? GB/sec
- Archive disk (1-week cache): ? TB at ? MB/sec
- Archive tape: ? PB at ? MB/sec

Data Sizes on Disk
How much scratch space is used by each job?
- Disk space is about 40 times the memory size
- Data lasts for about one day
- Average execution time for long-running jobs: 30 minutes to 1 hour
- For jobs using all of memory, that is between 48 and 24 jobs per day
- Each job uses (disk space) / (number of jobs), or 40/48 of memory = about 80% of memory
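
A small worked check of this arithmetic (a sketch in Python; the 40x disk-to-memory ratio and the job lengths are the figures quoted above):

```python
# Scratch-space heuristic: local disk ~ 40x memory, long jobs run 0.5-1 hour.
disk_to_memory_ratio = 40       # local scratch disk, in units of the memory size
hours_per_day = 24

for job_hours in (0.5, 1.0):
    jobs_per_day = hours_per_day / job_hours                # 48 or 24 jobs per day
    scratch_per_job = disk_to_memory_ratio / jobs_per_day   # share of scratch per job
    print(f"{job_hours:.1f} h jobs: {jobs_per_day:.0f} jobs/day, "
          f"scratch per job ~ {scratch_per_job:.2f} x memory")
# 0.5 h jobs -> 48 jobs/day and ~0.83 x memory per job (the ~80% figure above);
# 1.0 h jobs -> 24 jobs/day and ~1.67 x memory per job.
```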

Peak Teraflops Data Flow Model
[Diagram: projected data flow for a teraflops system.]
- Compute engine: TB memory, sustaining 150 GFlops
- Local disk (1-day cache): 10 TB, fed at 1 GB/sec
- Archive disk (1-week cache): 5 TB, fed at 40 MB/sec
- Archive tape: PB, written at 40 MB/sec

HPSS Archival Storage System
[Diagram: the SDSC HPSS hardware configuration - a High Performance Gateway Node; High Node and Wide Node disk movers with HiPPI drivers attached to 54 GB and 108 GB SSA RAID; a Silver Node running the storage/purge, bitfile/migration, name service/PVL, and log daemon services; several Silver Nodes running tape/disk movers with DCE/FTP/HIS and log clients, attached to 160 GB SSA RAID and an 830 GB MaxStrat RAID; a 9490 robot with four 3490 tape drives driven by an RS6000 tape mover and PVR; 3494 robots with eight and seven Magstar 3590 tape drives; all interconnected by a HiPPI switch and a TrailBlazer3 switch.]

Equivalent of Ohm's Law for Computer Science
How does one relate application requirements to computation rates and I/O bandwidths?
Use a prototype data movement problem to derive physical parameters that characterize applications.

Data Distribution Comparison
[Diagram: a data handling platform and a supercomputer, with the data linked to them by bandwidths B and b.]
- Execution rates: r (data handling platform) < R (supercomputer)
- Bandwidths linking the systems: B and b
- Operations per bit for analysis: C
- Operations per bit for data transfer: c
- The data is reduced in size from S bytes to s bytes and then analyzed
Should the data reduction be done before transmission?

Distributing Services
Compare the times for analyzing the data, with size reduction from S to s, for the two placements of the reduction step:
Reduce at the data handling platform, then move the reduced data:
  Read Data S/B + Reduce Data CS/r + Transmit Data cs/r + Network s/b + Receive Data cs/R
Move all of the data to the supercomputer, then reduce there:
  Read Data S/B + Transmit Data cS/r + Network S/b + Receive Data cS/R + Reduce Data CS/R

Comparison of Time
Processing at the supercomputer (move all of the data, reduce at rate R):
  T(Super) = S/B + cS/r + S/b + cS/R + CS/R
Processing at the archive (reduce at rate r, move only the reduced data):
  T(Archive) = S/B + CS/r + cs/r + s/b + cs/R
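
To make the two cost models concrete, here is a minimal Python sketch of T(Super) and T(Archive); every parameter value is an assumption chosen only to exercise the formulas, not a figure from the slides.

```python
def t_super(S, s, C, c, r, R, B, b):
    """Move all S bytes to the supercomputer, then analyze there at rate R."""
    return S/B + c*S/r + S/b + c*S/R + C*S/R

def t_archive(S, s, C, c, r, R, B, b):
    """Reduce the data to s bytes at the archive (rate r), then move only s bytes."""
    return S/B + C*S/r + c*s/r + s/b + c*s/R

# Assumed example values, in consistent units (bytes, operations/byte, ops/sec, bytes/sec).
params = dict(S=1e12, s=1e10,   # 1 TB of data reduced to 10 GB
              C=100,  c=1,      # analysis vs transfer-protocol cost per byte
              r=1e10, R=1e12,   # archive vs supercomputer execution rate
              B=1e8,  b=2e8)    # storage read bandwidth, network bandwidth

print(f"T(Super)   = {t_super(**params):,.0f} s")    # ~15,200 s with these numbers
print(f"T(Archive) = {t_archive(**params):,.0f} s")  # ~20,050 s, so moving the data wins here
```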

Optimization Parameter Selection
We now have an algebraic inequality with eight independent variables:
  T(Super) < T(Archive)
  S/B + cS/r + S/b + cS/R + CS/R < S/B + CS/r + cs/r + s/b + cs/R
Which variable provides the simplest optimization criterion?

Scaling Parameters
- Data size reduction ratio: s/S
- Execution slow-down ratio: r/R
- Problem complexity ratio: c/C
- Communication/execution balance: r/(cb)
When r/(cb) = 1, the data processing rate is the same as the data transmission rate; optimal designs have r/(cb) = 1. Note that r/c is the number of bits per second that can be processed.
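
The same assumed numbers give the dimensionless groups directly (a sketch; the slides quote C and c per bit, while this uses bytes throughout, which leaves the ratios unchanged as long as the units are consistent):

```python
# Dimensionless scaling parameters for the assumed example values used above.
S, s = 1e12, 1e10      # data size before / after reduction (bytes)
C, c = 100, 1          # operations per byte for analysis / for data transfer
r, R = 1e10, 1e12      # archive / supercomputer execution rates (ops/sec)
b = 2e8                # network bandwidth (bytes/sec)

print("data size reduction  s/S     =", s / S)
print("execution slow-down  r/R     =", r / R)
print("problem complexity   c/C     =", c / C)
print("balance              r/(c*b) =", r / (c * b))  # 1.0 would be a balanced design
```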

Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the network is sufficiently fast:
  b > (r/C) (1 - s/S) / [1 - r/R - (c/C) (1 + r/R) (1 - s/S)]
The denominator changes sign when C < c (1 + r/R) (1 - s/S) / (1 - r/R): even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
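
A quick numerical check of this criterion (a sketch; the parameters are the same assumptions as above, not slide figures): at the break-even bandwidth the two times should coincide, and above it moving all of the data should win.

```python
def t_super(S, s, C, c, r, R, B, b):
    return S/B + c*S/r + S/b + c*S/R + C*S/R      # move everything, analyze at rate R

def t_archive(S, s, C, c, r, R, B, b):
    return S/B + C*S/r + c*s/r + s/b + c*s/R      # reduce at rate r, move s bytes

S, s, C, c, r, R, B = 1e12, 1e10, 100, 1, 1e10, 1e12, 1e8   # assumed values

# Break-even network bandwidth from the criterion above.
b_star = (r / C) * (1 - s/S) / (1 - r/R - (c/C) * (1 + r/R) * (1 - s/S))

for b in (0.5 * b_star, b_star, 2.0 * b_star):
    ts = t_super(S, s, C, c, r, R, B, b)
    ta = t_archive(S, s, C, c, r, R, B, b)
    print(f"b = {b:10.3e} bytes/s: T(Super) = {ts:8.1f} s, T(Archive) = {ta:8.1f} s")
# At b = b_star the two times are equal; for larger b, T(Super) < T(Archive).
```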

Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the supercomputer is sufficiently fast:
  R > r [1 + (c/C) (1 - s/S)] / [1 - (c/C) (1 - s/S) (1 + r/(cb))]
The denominator changes sign when C < c (1 - s/S) [1 + r/(cb)]: even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.

Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:
  s > S {1 - (C/c) (1 - r/R) / [1 + r/R + r/(cb)]}
The criterion changes sign when C > c [1 + r/R + r/(cb)] / (1 - r/R): when the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to a single bit.

Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive), when the analysis is sufficiently complex:
  C > c (1 - s/S) [1 + r/R + r/(cb)] / (1 - r/R)
As the execution ratio r/R approaches 1, the required complexity becomes infinite; as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero.
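
The complexity threshold can be checked the same way (again a sketch, using the assumed values from above):

```python
def t_super(S, s, C, c, r, R, B, b):
    return S/B + c*S/r + S/b + c*S/R + C*S/R

def t_archive(S, s, C, c, r, R, B, b):
    return S/B + C*S/r + c*s/r + s/b + c*s/R

S, s, c, r, R, B, b = 1e12, 1e10, 1, 1e10, 1e12, 1e8, 1e8   # assumed values

# Complexity above which it pays to ship everything to the supercomputer.
C_star = c * (1 - s/S) * (1 + r/R + r/(c*b)) / (1 - r/R)
print(f"threshold complexity C* = {C_star:.1f} operations/byte")

for C in (0.5 * C_star, 2.0 * C_star):
    faster = ("supercomputer" if t_super(S, s, C, c, r, R, B, b) < t_archive(S, s, C, c, r, R, B, b)
              else "archive")
    print(f"C = {C:6.1f}: processing at the {faster} is faster")
```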

Characterization of Supercomputer Systems
Sufficiently high complexity - move the data to the processing engine:
- Digital library execution of remote services
- Traditional supercomputer processing of applications
Sufficiently low complexity - move the process to the data source:
- Metacomputing execution of remote applications
- Traditional digital library service

Computer Architectures
Processor in memory:
- Do computations within memory
- Complexity of supported operations
Commodity processors:
- L2 caches
- L3 caches
Parallel computers:
- Memory bandwidth between nodes
- MPP - shared memory
- Cluster - distributed memory

Characterization Metric
Describe systems in terms of their balance: optimal designs have r/(cb) = 1, the equivalent of Ohm's law, R = C B.
Characterize applications in terms of their complexity, the number of operations per byte of data: C = R / B.
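
For example (a sketch; the machine figures below are hypothetical, not taken from the slides), the application complexity needed to keep a given machine balanced follows directly from C = R / B:

```python
# Balanced application complexity C = R / B for two hypothetical machines.
machines = {
    "hypothetical memory-bound node":  dict(R=2e9, B=1e10),  # 2 GFlops, 10 GB/s to memory
    "hypothetical network-bound node": dict(R=1e9, B=1e8),   # 1 GFlops, 100 MB/s to the network
}
for name, m in machines.items():
    print(f"{name}: balanced complexity C = {m['R'] / m['B']:.2f} operations/byte")
```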

Second Example
Include latency (the time for a process to start) and overhead (the time to execute the communication protocol).
Illustrate with a combined optimization of network and CPU use.

Optimizing Use of Resources
Compare the time needed to do calculations with the time needed to access data over a network.
Time spent using a CPU = execution time + protocol processing time
                       = Cc * Sc / Rc + Cp * St / Rp
where
- St = size of transmitted data (bytes)
- Sc = size of application data (bytes)
- Cc = number of operations per byte of application data
- Cp = number of operations per byte to process the protocol
- Rc = execution rate of the application
- Rp = execution rate of the protocol

Characterizing Latency
Time during which a network transmits data = latency for initiating the transfer + transmission time
                                           = L + St / B
where L is the round-trip latency at the speed of light (sec) and B is the bandwidth (bytes/sec).

Solve for Balanced System
Set CPU utilization time = network utilization time and solve for the transmission size as a function of Sc/St:
  St = L B / [B * Cp / Rp + (B * Cc / Rc) * (Sc / St) - 1]
A solution exists when Sc/St > [Rc / (B * Cc)] [1 - B * Cp / Rp] and B * Cp / Rp < 1.
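
For a fixed application size Sc, the balance condition Cc Sc / Rc + Cp St / Rp = L + St / B is linear in St, so it can be solved directly; the sketch below does this with assumed parameter values (none of them come from the slides).

```python
# Balanced-system transmission size: CPU time equals network time.
L = 1e-3              # round-trip latency (s) - assumed
B = 1e8               # network bandwidth (bytes/s) - assumed
Cc, Rc = 100, 1e9     # application: operations/byte, operations/s - assumed
Cp, Rp = 10, 1e10     # protocol: operations/byte, operations/s - assumed
Sc = 1e6              # application data size (bytes) - assumed

# Cc*Sc/Rc + Cp*St/Rp = L + St/B  ->  solve the linear equation for St.
St = (L - Cc * Sc / Rc) / (Cp / Rp - 1.0 / B)
print(f"balanced transmission size St = {St:.3e} bytes")

cpu_time = Cc * Sc / Rc + Cp * St / Rp
net_time = L + St / B
print(f"CPU time = {cpu_time:.4f} s, network time = {net_time:.4f} s")  # equal by construction
```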

Comparing Utilization of Resources
Network utilization:
  Un = transmission time / (transmission + latency) = 1 / [1 + L * B / St]
CPU utilization:
  Uc = execution time / (execution + protocol processing) = 1 / [1 + (Cp * Rc) / (Cc * Rp) * (St / Sc)]
Define h = Sc / St.
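
Continuing the previous sketch (same assumed parameters, with St taken from the balanced-system solution above), the two utilizations follow directly from these definitions:

```python
# Network and CPU utilization at the balanced transmission size (assumed parameters).
L, B = 1e-3, 1e8
Cc, Rc = 100, 1e9
Cp, Rp = 10, 1e10
Sc, St = 1e6, 1.1e7          # St from the balanced-system solution above

Un = 1.0 / (1.0 + L * B / St)                            # network utilization
Uc = 1.0 / (1.0 + (Cp * Rc) / (Cc * Rp) * (St / Sc))     # CPU utilization
h = Sc / St
print(f"h = Sc/St = {h:.3f}, U_network = {Un:.3f}, U_cpu = {Uc:.3f}")
```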

Comparing Efficiencies
[Plot: CPU utilization (U-cpu) and network utilization (U-network) as functions of h = S-compute / S-transmit.]

Crossover Point
When the utilization of bandwidth and execution resources is balanced:
  1 / [1 + L * B / St] = 1 / [1 + (Cp * Rc) / (Cc * Rp) / h]
For the optimal St, solve for h = Sc/St and find
  h = [Rc Cp / (2 Rp Cc)] [sqrt(1 + 4 Rp / (Cp B)) - 1]
For small B * Cp / Rp,
  h ~ Rc / (Cc B), i.e. St / B ~ Sc Cc / Rc,
and the transmission time is approximately equal to the execution time.

Application Summary
The optimal application for a given architecture has
  B * Cc / Rc ~ 1, i.e. (bytes/sec) (operations/byte) / (operations/sec) ~ 1, so Cc ~ Rc / B.
The cost of network utilization must also be small:
  B * Cp / Rp < 1
and the amount of data transmitted is proportional to the latency:
  St = L B / [B * Cp / Rp + (B * Cc / Rc) * (Sc / St) - 1]

Further Information