
Slide 1: Application Scalability and High Productivity Computing
Nicholas J. Wright, John Shalf, Harvey Wasserman
Advanced Technologies Group, NERSC/LBNL

Slide 2: NERSC - National Energy Research Scientific Computing Center
Mission: Accelerate the pace of scientific discovery by providing high performance computing, information, data, and communications services for all DOE Office of Science (SC) research.
– The production computing facility for DOE SC
– Part of the Berkeley Lab Computing Sciences Directorate, alongside the Computational Research Division (CRD) and ESnet

Slide 3: NERSC is the Primary Computing Center for DOE Office of Science
NERSC serves a large population: over 3,000 users, 400 projects, 500 codes.
NERSC serves the DOE SC mission:
– Allocated by DOE program managers
– Not limited to largest-scale jobs
– Not open to non-DOE applications
Strategy: Science First
– Requirements workshops by office
– Procurements based on science codes
– Partnerships with vendors to meet science requirements

Slide 4: NERSC Systems for Science
Large-Scale Computing Systems
– Franklin (NERSC-5): Cray XT4; 9,532 compute nodes; 38,128 cores; ~25 Tflop/s on applications; 356 Tflop/s peak
– Hopper (NERSC-6): Cray XE6; Phase 1: Cray XT5, 668 nodes, 5,344 cores; Phase 2: 1.25 Pflop/s peak (late 2010 delivery)
HPSS Archival Storage
– 40 PB capacity, 4 tape libraries, 150 TB disk cache
NERSC Global Filesystem (NGF)
– Uses IBM's GPFS; 1.5 PB capacity; 5.5 GB/s of bandwidth
Clusters (140 Tflop/s total)
– Carver: IBM iDataplex cluster
– PDSF (HEP/NP): ~1K-core throughput cluster
– Magellan: cloud testbed, IBM iDataplex cluster
– GenePool (JGI): ~5K-core throughput cluster
Analytics
– Euclid (512 GB shared memory)
– Dirac: GPU testbed (48 nodes)

Slide 5: NERSC Roadmap
[Chart: peak Teraflop/s (log scale, 10 to 10^7) vs. year, 2006-2020, against the Top500 trend. Milestones: Franklin (N5) 19 TF sustained / 101 TF peak; Franklin (N5) + quad-core upgrade 36 TF sustained / 352 TF peak; Hopper (N6) >1 PF peak; NERSC-7 10 PF peak; NERSC-8 100 PF peak; NERSC-9 1 EF peak.]
Users expect a 10x improvement in capability every 3-4 years.
How do we ensure that users' performance follows this trend and their productivity is unaffected?
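As a rough sanity check on that expectation (the start and end points below are read off the roadmap chart above, so the dates are approximate):

\[
\frac{10^{6}\,\text{TF (NERSC-9, 1 EF, ~2020)}}{10^{2}\,\text{TF (Franklin, 101 TF peak, ~2007)}} = 10^{4}
\quad\text{over about 13 years}
\;\Rightarrow\; \text{one factor of 10 every } 13/4 \approx 3.3 \text{ years.}
\]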

Slide 6: A Plan of Attack
1. Understand the technology trends
2. Understand the science needs
3. Influence the technology and the applications simultaneously?
→ Co-Design!

Slide 7: Hardware Trends: The Multicore Era
– Moore's Law continues unabated.
– Power constraints mean that core counts, not clock speeds, will double every 18 months.
– Memory capacity is not doubling at the same rate, so GB per core will decrease.
Power is the leading design constraint.
[Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith]
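A minimal worked example of the GB-per-core squeeze; the doubling periods here are illustrative assumptions, not figures from the slide:

\[
\text{cores: } 2^{\,6/1.5} = 16\times \text{ in 6 years,}\qquad
\text{memory: } 2^{\,6/3} = 4\times \text{ in 6 years}
\;\Rightarrow\;
\frac{\text{GB}}{\text{core}} \to \frac{4}{16} = \frac{1}{4}.
\]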

Slide 8: Current Technology Roadmaps Will Depart from Historical Gains
Power is the leading design constraint.
[Figure from Peter Kogge, DARPA Exascale Study]

Slide 9: … and the Power Costs Will Still Be Staggering
$1M per megawatt per year, and that is with cheap power!
[From Peter Kogge, DARPA Exascale Study]
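To put the rule of thumb in context, assume a hypothetical 20 MW exascale system (the machine size is an illustrative assumption, not a number from the talk):

\[
20\ \text{MW} \times \frac{\$1\text{M}}{\text{MW}\cdot\text{year}} = \$20\text{M per year}
\;\approx\; \$100\text{M over a five-year service life.}
\]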

Slide 10: Changing Notion of "System Balance"
– If you pay 5% more to double the FPUs and get a 10% performance improvement, it is a win (despite lowering your percentage of peak performance).
– If you pay 2x more for memory bandwidth (in power or cost) and get 35% more performance, it is a net loss (even though percent of peak looks better).
– Real example: we could give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system.
We have a fixed budget:
– Sustained-to-peak flop rate is the wrong metric if flops are cheap.
– Balance means balancing your checkbook and balancing your power budget.
– This requires application co-design to make the right trade-offs (a quick check of the arithmetic behind these trade-offs follows below).
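Both trade-offs reduce to comparing performance gained against resources spent; checking the numbers quoted above:

\[
\text{FPU case: } \frac{1.10}{1.05} \approx 1.05 > 1 \ (\text{win}),
\qquad
\text{memory-bandwidth case: } \frac{1.35}{2.0} = 0.675 < 1 \ (\text{loss}),
\]

even though the percent-of-peak metric moves in the opposite direction in each case.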

Slide 11: Summary: Technology Trends
– Number of cores: increasing (flops will be "free")
– Memory capacity per core: decreasing
– Memory bandwidth per core: decreasing
– Network bandwidth per core: decreasing
– I/O bandwidth: decreasing relative to compute

Slide 12: Navigating Technology Phase Transitions
[Chart: peak Teraflop/s (log scale, 10 to 10^7) vs. year, 2006-2020, with the same roadmap milestones as Slide 5 (Franklin 19 TF sustained / 101 TF peak; Franklin + quad-core 36 TF sustained / 352 TF peak; Hopper >1 PF peak; NERSC-7 10 PF; NERSC-8 100 PF; NERSC-9 1 EF), overlaid with programming-model eras: COTS/MPP + MPI → COTS/MPP + MPI (+ OpenMP) → GPU (CUDA/OpenCL) or manycore (BG/Q, R) → Exascale + ???]

Slide 13: Application Scalability
How can a user continue to be productive in the face of these disruptive technology trends?

Slide 14: Sources of Workload Information
Documents:
– 2005 DOE Greenbook
– 2006-2010 NERSC Plan
– LCF studies and reports
– Workshop reports
– 2008 NERSC assessment
Allocations analysis
User discussions

Slide 15: New Model for Collecting Requirements
Joint DOE Program Office / NERSC workshops, modeled after the ESnet method:
– Two workshops per year
– Describe science-based needs over 3-5 years
– Case study narratives
– First workshop is BER, May 7-8

Slide 16: Numerical Methods at NERSC
(Caveat: survey data from ERCAP requests.)

Slide 17: Application Trends
Weak scaling:
– Time to solution is often a non-linear function of problem size.
Strong scaling:
– Latency or the serial fraction will get you in the end.
Adding features to models: "new" weak scaling.
[Two schematic plots: performance vs. number of processors for the weak- and strong-scaling cases.]
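The strong- and weak-scaling remarks above are the standard Amdahl and Gustafson arguments; with serial fraction s and P processors:

\[
S_{\text{strong}}(P) = \frac{1}{s + \frac{1-s}{P}} \xrightarrow{\;P \to \infty\;} \frac{1}{s},
\qquad
S_{\text{weak}}(P) = s + (1-s)\,P,
\]

so in the strong-scaling case the serial fraction (or fixed latency) caps the achievable speedup, while weak scaling keeps the parallel work per processor constant as the machine grows.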

Slide 18: Develop Best Practices in Multicore Programming
The NERSC/Cray Programming Models "Center of Excellence" combines:
– LBNL strengths in languages, tuning, and performance analysis
– Cray strengths in languages, compilers, and benchmarking
Goals:
– Immediate: training material for Hopper users on hybrid OpenMP/MPI programming
– Long term: input into the exascale programming model
[Figure: hybrid layout; legend marks OpenMP thread parallelism.]

Slide 19: Develop Best Practices in Multicore Programming (continued)
Conclusions so far:
– Mixed OpenMP/MPI saves significant memory.
– The impact on running time varies with the application.
– One MPI process per socket is often a good choice.
Next, run on Hopper:
– 12 vs. 6 cores per socket
– Gemini vs. SeaStar interconnect
[Figure: hybrid layout; legend marks OpenMP thread parallelism.]
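A minimal sketch of the mixed MPI+OpenMP pattern described above; it is illustrative only: the work loop is a placeholder, and the mapping of one MPI process per socket with threads filling that socket's cores is chosen at launch time (e.g., via aprun's -N and -d options on Cray systems), not in the code.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Ask for an MPI library that tolerates threaded ranks. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;

        /* One MPI rank per socket; OpenMP threads span the cores of that socket. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i);          /* placeholder for real work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d result=%f\n",
                   nranks, omp_get_max_threads(), global);

        MPI_Finalize();
        return 0;
    }

The memory saving comes from replicating MPI buffers and per-process data once per socket rather than once per core, which is the effect the slide's conclusions refer to.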

Slide 20: Co-Design
Eating our own dogfood.

Slide 21: Inserting Scientific Apps into the Hardware Development Process
Research Accelerator for Multi-Processors (RAMP):
– Simulate hardware before it is built!
– Break the slow feedback loop for system designs.
– Enables tightly coupled hardware/software/science co-design (not possible using the conventional approach).

Slide 22: Summary
Disruptive technology changes are coming. By exploring:
– new programming models (and revisiting old ones)
– hardware/software co-design
we hope to ensure that scientists' productivity remains high!


Slide 24: Exascale Machine Wish List - Performance
Lightweight communication:
– Single-sided messaging
Performance feedback:
– Why is my code now slower than the last run?
– Autotuning
Fine-grained control of data movement:
– Cache bypass
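As a sketch of what "single-sided messaging" means in practice, here is a small MPI-2 RMA example in which each rank writes into a neighbour's exposed memory window without the target posting a matching receive; it illustrates today's interface, not whatever an exascale machine would provide.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each rank exposes one double that neighbours may write directly. */
        double window_buf = -1.0;
        MPI_Win win;
        MPI_Win_create(&window_buf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open an access epoch */
        if (nranks > 1) {
            double value = (double)rank;
            int right = (rank + 1) % nranks;
            /* Put our rank number into the right neighbour's window:
               no receive is posted on the target side. */
            MPI_Put(&value, 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);                 /* close the epoch; data now visible */

        printf("rank %d received %f\n", rank, window_buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }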

Slide 25: Exascale Machine Wish List - Productivity
Simplest possible execution model:
– Portable programming model
– Hide inhomogeneity
Debugging support:
– Race conditions and deadlocks
Reliability:
– No desire to add error detection to the application
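As a tiny illustration of why debugging support for race conditions matters (a hypothetical example, not taken from the talk): the first loop below has a data race on sum, and the reduction clause in the second loop is one correct fix.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* BUGGY: every thread updates sum without synchronisation, so the
           result typically varies from run to run -- a classic data race. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            sum += 1.0;

        printf("racy sum    = %f (expected %d)\n", sum, n);

        /* FIXED: the reduction clause gives each thread a private copy and
           combines them deterministically at the end of the loop. */
        double sum2 = 0.0;
        #pragma omp parallel for reduction(+:sum2)
        for (int i = 0; i < n; i++)
            sum2 += 1.0;

        printf("reduced sum = %f (expected %d)\n", sum2, n);
        return 0;
    }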

