Empowering efficient HPC with Dell
Martin Hilgeman, HPC Consultant EMEA
CHPC conference 2013

Amdahl's Law

Gene Amdahl (1967): "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings (30): 483-485.

"The effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."

a = \frac{1}{(1 - p) + p/n}

where a is the speedup, n the number of processors, and p the parallel fraction of the code.
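Added for illustration (not part of the slides): a minimal sketch in Fortran, the same language as the deck's later code example, that tabulates this formula for an assumed parallel fraction p = 0.95, doubling the processor count each step.

! Tabulate the Amdahl speedup a = 1/((1-p) + p/n)
! for n = 1, 2, 4, ..., 1024, assuming p = 0.95.
program amdahl
  implicit none
  real :: p, a
  integer :: k, n
  p = 0.95
  do k = 0, 10
    n = 2**k
    a = 1.0 / ((1.0 - p) + p / real(n))
    print '(a, i5, a, f7.2)', 'n =', n, '   speedup =', a
  end do
end program amdahl

The printed speedups flatten quickly toward the 1/(1-p) = 20 ceiling shown on the next slides.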

Amdahl's Law limits the maximal speedup

[Figure: speedup a versus number of processors n, for several values of the parallel fraction p]
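Added for clarity (not on the original slide): the ceiling follows in one line from the formula, since the parallel term vanishes as n grows:

\lim_{n \to \infty} a(n) = \lim_{n \to \infty} \frac{1}{(1-p) + p/n} = \frac{1}{1-p}

With p = 0.95, the speedup can therefore never exceed 20, no matter how many processors are added.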

Amdahl's Law and Efficiency

Diminishing returns: there is a tension between the desire to use more processors and the associated "cost".
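Added for clarity (not on the original slide): that "cost" can be made precise as parallel efficiency, the speedup per processor:

E(n) = \frac{a(n)}{n} = \frac{1}{n(1-p) + p}

For p = 0.95 this gives E(16) ≈ 0.57 and E(64) ≈ 0.24: each added processor contributes less than the one before.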

The Real Moore's Law

[Figure: processor trends showing the clock speed plateau, the power ceiling, and the IPC limit]

Moore's Law vs. Amdahl's Law: "too many cooks in the kitchen"

Industry is applying Moore's Law by adding more cores. Meanwhile, Amdahl's Law says that you cannot use them all efficiently.

What levels do we have?

Challenge: sustain the performance trajectory without massive increases in cost, power, real estate, and unreliability.
Solution: there is no single answer; one must intelligently turn the "Architectural Knobs".
[Figure: hardware performance vs. what you really get]

Turning the knobs

1. Frequency is unlikely to change much: thermal, power, and leakage challenges.
2. Moore's Law still holds (130 nm -> 22 nm): LOTS of transistors.
3. The number of sockets per system is the easiest knob, but challenging for power, density, cooling, and networking.
4. IPC still grows: FMA3/FMA4, AVX, and FPGA implementations of algorithms. Challenging for the user/developer (a small example follows below).
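Added for illustration (not from the slides): a minimal Fortran kernel of the kind knob 4 rewards. The subroutine name and arguments are hypothetical; with vectorizing flags such as gfortran -O3 -mavx2 -mfma, a compiler can map the multiply-add onto vector FMA instructions.

! AXPY-style update: one fused multiply-add per element,
! a natural target for AVX/FMA units.
subroutine axpy(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i
  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do
end subroutine axpy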

Meanwhile… traditional IT is swimming in performance

- Traditional IT server utilization rates remain low.
- New µServers are emerging, both x86 and ARM.
- Further movement from 4 -> 2 -> 1 socket systems as their capabilities expand.
- What to do with all the capacity? Software-defined everything…

Scaling sockets, power, and density

- ARM/Atom: potential to disrupt the perf/$ and perf/Watt model.
- Shared infrastructure is evolving: highest efficiency for power and cooling, extending the design to the facility.
- Modularized compute/storage optimization: 2000 nodes, 30 PB of storage, and 600 kW in 22 m².

Which leaves knob 5: get your hands dirty!

Original version (92 seconds):

DO it = 1, noprec
  DO itSub = 1, subNoprec
    ix  = ir(1,it,itSub)
    iy  = ir(2,it,itSub)
    iz  = ir(3,it,itSub)
    idx = idr(1,it,itSub)
    idy = idr(2,it,itSub)
    idz = idr(3,it,itSub)
    sum   = 0.0
    testx = 0.0
    testy = 0.0
    testz = 0.0
    DO ilz = -lsz, lsz
      irez = iz + ilz
      IF (irez .ge. k0z .and. irez .le. klz) THEN
        DO ily = -lsy, lsy
          irey = iy + ily
          IF (irey .ge. k0y .and. irey .le. kly) THEN
            DO ilx = -lsx, lsx
              irex = ix + ilx
              IF (irex .ge. k0x .and. irex .le. klx) THEN
                sum = sum + field(irex,irey,irez) &
                          * diracsx(ilx,idx)      &
                          * diracsy(ily,idy)      &
                          * diracsz(ilz,idz) * (dx*dy*dz)
                testx = testx + diracsx(ilx,idx)
                testy = testy + diracsy(ily,idy)
                testz = testz + diracsz(ilz,idz)
              END IF
            END DO
          END IF
        END DO
      END IF
    END DO
    rec(it,itSub) = sum
  END DO
END DO

Optimized version (17 seconds):

DO itSub = 1, subNoprec
  DO it = 1, noprec
    ix  = ir(1,it,itSub)
    iy  = ir(2,it,itSub)
    iz  = ir(3,it,itSub)
    idx = idr(1,it,itSub)
    idy = idr(2,it,itSub)
    idz = idr(3,it,itSub)
    sum = 0.0
    ! Clip the loop bounds once with MAX/MIN instead of testing
    ! every iteration with an IF inside the inner loops
    startz = MAX(iz-lsz, k0z)
    starty = MAX(iy-lsy, k0y)
    startx = MAX(ix-lsx, k0x)
    stopz  = MIN(iz+lsz, klz)
    stopy  = MIN(iy+lsy, kly)
    stopx  = MIN(ix+lsx, klx)
    DO irez = startz, stopz
      ilz = irez - iz
      IF (diracsz(ilz,idz) .EQ. 0.d0) THEN
        CYCLE
      END IF
      ! Hoist the loop-invariant products out of the inner loops
      dirac_tmp1 = diracsz(ilz,idz) * (dx*dy*dz)
      DO irey = starty, stopy
        ily = irey - iy
        dirac_tmp2 = diracsy(ily,idy) * dirac_tmp1
        DO irex = startx, stopx
          ilx = irex - ix
          sum = sum + field(irex,irey,irez) &
                    * diracsx(ilx,idx)      &
                    * dirac_tmp2
        END DO
      END DO
    END DO
    rec(it,itSub) = sum
  END DO
END DO

The rewrite interchanges the two outer loops, precomputes the loop bounds, skips zero Dirac coefficients, hoists the invariant multiplications out of the inner loop, and drops the unused testx/testy/testz accumulators, cutting the runtime from 92 to 17 seconds.

Efficiency optimization also applies across nodes
