Presentation transcript:

FHTE 4/26/11 1

FHTE 4/26/11 2 Two Key Challenges
Programmability
- Writing an efficient parallel program is hard
- Strong scaling required to achieve ExaScale
- Locality required for efficiency
Power
- 1-2 nJ/operation today; 20 pJ required for ExaScale
- Dominated by data movement and overhead
Other issues (reliability, memory bandwidth, etc.) are subsumed by these two or are less severe

FHTE 4/26/11 3 ExaScale Programming

FHTE 4/26/11 4 Fundamental and Incidental Obstacles to Programmability
Fundamental
- Expressing 10^9-way parallelism
- Expressing locality to deal with >100:1 global:local energy
- Balancing load across 10^9 cores
Incidental
- Dealing with multiple address spaces
- Partitioning data across nodes
- Aggregating data to amortize message overhead

FHTE 4/26/11 5 The fundamental problems are hard enough. We must eliminate the incidental ones.

FHTE 4/26/11 6 Very simple hardware can provide
- Shared global address space (PGAS): no need to manage multiple copies with different names
- Fast and efficient small (4-word) messages: no need to aggregate data to make KByte messages
- Efficient global block transfers (with gather/scatter): no need to partition data by node
Vertical locality is still important
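To make those three mechanisms concrete, here is a minimal single-address-space sketch. The API (gptr, pgas_load, pgas_store, block_gather) is invented for illustration and is not from the talk; the stubs just run locally so the sketch compiles, where a real PGAS runtime would back them with network operations.

    #include <cstddef>
    #include <cstdio>

    // A "global pointer": one name for a datum, wherever it lives.
    template <typename T> struct gptr { T *addr; int node; };

    // Hypothetical runtime calls, stubbed out to run on one node.
    template <typename T> T pgas_load(gptr<T> p) { return *p.addr; }
    template <typename T> void pgas_store(gptr<T> p, T v) { *p.addr = v; }

    // Gather n elements through an index vector in one bulk transfer,
    // so the programmer never partitions the table by node.
    template <typename T>
    void block_gather(T *dst, gptr<T> src, const std::size_t *idx, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) dst[i] = src.addr[idx[i]];
    }

    int main() {
        double x = 3.0, table[4] = {10, 20, 30, 40};
        gptr<double> gx{&x, 0}, gt{table, 0};
        pgas_store(gx, pgas_load(gx) * 2.0);   // one name, no managed copies
        std::size_t idx[2] = {3, 1};
        double buf[2];
        block_gather(buf, gt, idx, 2);         // no per-node data layout
        std::printf("%g %g %g\n", x, buf[0], buf[1]);  // 6 40 20
    }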

FHTE 4/26/11 7 A Layered Approach to Fundamental Programming Issues
- Hardware mechanisms for efficient communication, synchronization, and thread management: the programmer is limited only by fundamental machine capabilities
- A programming model that expresses all available parallelism and locality: hierarchical thread arrays and hierarchical storage
- Compilers and run-time auto-tuners that selectively exploit parallelism and locality

FHTE 4/26/11 8 Execution Model
[Figure: abstract execution model – thread objects A and B in a global address space over an abstract memory hierarchy, communicating via load/store, active messages, and bulk transfers]

FHTE 4/26/11 9 Thread array creation, messages, block transfers, collective operations – at the speed of light

FHTE 4/26/11 10 Language Describes all Parallelism and Locality – not mapping

    forall molecule in set {                    // launch a thread array
      forall neighbor in molecule.neighbors {   // nested
        forall force in forces {
          molecule.force = reduce_sum(force(molecule, neighbor))
        }
      }
    }
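For contrast, here is roughly what a hand-mapped CUDA version of that loop nest looks like today (my sketch, with an invented CSR-style neighbor list and a stand-in pair force); the point of the slide is that the programmer should not have to choose this mapping at all.

    #include <cstdio>

    // One block per molecule, threads strided over its neighbor list
    // (nbr_off is a CSR-style offset array); reduce_sum becomes atomicAdd.
    __global__ void forces_kernel(float *force, const float *pos,
                                  const int *neighbors, const int *nbr_off) {
        int m = blockIdx.x;
        for (int j = nbr_off[m] + threadIdx.x; j < nbr_off[m + 1]; j += blockDim.x) {
            float f = pos[m] - pos[neighbors[j]];  // stand-in pair force
            atomicAdd(&force[m], f);               // reduce_sum over neighbors
        }
    }

    int main() {
        float *force, *pos; int *nbrs, *off;
        cudaMallocManaged(&force, 2 * sizeof(float));
        cudaMallocManaged(&pos,   2 * sizeof(float));
        cudaMallocManaged(&nbrs,  2 * sizeof(int));
        cudaMallocManaged(&off,   3 * sizeof(int));
        force[0] = force[1] = 0.0f;
        pos[0] = 0.0f; pos[1] = 1.0f;   // two molecules...
        nbrs[0] = 1; nbrs[1] = 0;       // ...each the other's neighbor
        off[0] = 0; off[1] = 1; off[2] = 2;
        forces_kernel<<<2, 32>>>(force, pos, nbrs, off);
        cudaDeviceSynchronize();
        std::printf("forces: %g %g\n", force[0], force[1]);  // -1 1
    }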

FHTE 4/26/11 11 Language Describes all Parallelism and Locality – not mapping

    compute_forces::inner(molecules, forces) {
      tunable N;
      set part_molecules[N];
      part_molecules = subdivide(molecules, N);
      forall (i in 0:N-1) {
        compute_forces(part_molecules[i]);
      }
    }

FHTE 4/26/11 12 Autotuning Search Spaces
[Figure: execution time of matrix multiplication across the tile-size and unroll-factor search space, from T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle, "Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation," IEEE PACT]
Architecture enables simple and effective autotuning
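As a toy version of that search, the sketch below times a tiled matrix multiply at several tile sizes and keeps the fastest. All names are mine; a real autotuner, like the iterative compilation in the cited paper, searches tile sizes and unroll factors jointly and caches the result per machine.

    #include <chrono>
    #include <cstdio>
    #include <cstring>

    const int N = 256;
    static float A[N][N], B[N][N], C[N][N];

    // Tiled matrix multiply; the tile size T is the tuning knob.
    // (All candidate T values divide N, so no bounds clamping is needed.)
    void matmul_tiled(int T) {
        std::memset(C, 0, sizeof(C));
        for (int ii = 0; ii < N; ii += T)
            for (int kk = 0; kk < N; kk += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int i = ii; i < ii + T; ++i)
                        for (int k = kk; k < kk + T; ++k) {
                            float a = A[i][k];
                            for (int j = jj; j < jj + T; ++j)
                                C[i][j] += a * B[k][j];
                        }
    }

    int main() {
        int best = 0;
        double best_t = 1e30;
        for (int T : {8, 16, 32, 64, 128}) {   // the search space
            auto t0 = std::chrono::steady_clock::now();
            matmul_tiled(T);
            std::chrono::duration<double> dt =
                std::chrono::steady_clock::now() - t0;
            std::printf("tile %3d: %.4f s\n", T, dt.count());
            if (dt.count() < best_t) { best_t = dt.count(); best = T; }
        }
        std::printf("best tile size: %d\n", best);
    }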

FHTE 4/26/11 13 Performance of Auto-tuner
[Table: measured raw performance of benchmarks in GFLOPS, auto-tuner vs. hand-tuned, for Conv2D / SGEMM / FFT3D / SUmb on Cell, Cluster, and Cluster of PS3s; values not recoverable]
For FFT3D, performance is with fusion of leaf tasks. SUmb is too complicated to be hand-tuned.

FHTE 4/26/11 14 What about legacy codes?
They will continue to run – faster than they do now. But:
- They don't have enough parallelism to begin to fill the machine
- Their lack of locality will cause them to bottleneck on global bandwidth
As they are ported to the new model:
- The constituent equations will remain largely unchanged
- The solution methods will evolve to the new cost model

FHTE 4/26/11 15 The Power Challenge

FHTE 4/26/11 16 Addressing The Power Challenge (LOO)
Locality
- Bulk of data must be accessed from nearby memories (2 pJ), not across the chip (150 pJ), off chip (300 pJ), or across the system (1 nJ)
- Application, programming system, and architecture must work together to exploit locality
Overhead
- Bulk of execution energy must go to carrying out the operation, not scheduling instructions (100x today)
Optimization
- At all levels, to operate efficiently

FHTE 4/26/11 17 Locality

FHTE 4/26/11 18 The High Cost of Data Movement
Fetching operands costs more than computing on them
[Figure: energy per operation in a 28nm process on a 20mm die – 64-bit DP op: 20 pJ; 256-bit access to an 8 kB SRAM: 50 pJ; moving 256 bits over on-chip buses: 26 pJ to 256 pJ to 1 nJ with increasing distance; efficient off-chip link: 500 pJ; 256-bit DRAM Rd/Wr: 16 nJ]
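To put those numbers together: at 20 pJ per 64-bit DP op and 16 nJ per 256-bit DRAM access, one operand fetch from DRAM costs as much energy as 16 nJ / 20 pJ = 800 arithmetic operations; even a trip across the die (about 1 nJ) costs roughly 50 ops' worth.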

FHTE 4/26/11 19 Scaling makes locality even more important

FHTE 4/26/11 20 It's not about the FLOPS. It's about data movement.
Algorithms should be designed to perform more work per unit data movement. Programming systems should further optimize this data movement. Architectures should facilitate this by providing an exposed hierarchy and efficient communication.

FHTE 4/26/11 21 Locality at all Levels
Application
- Do more operations if it saves data movement, e.g., recompute values rather than fetching them (see the sketch below)
Programming system
- Optimize subdivision
- Choose when to exploit spatial locality with active messages
- Choose when to compute vs. fetch
Architecture
- Exposed storage hierarchy
- Efficient communication and bulk transfer
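A minimal sketch of the application-level point, assuming per-element weights w0 * r^i that can either be fetched from a DRAM table or rederived in registers (the kernels, names, and weight model are mine, for illustration). Both kernels give the same result; the second trades a few pJ-scale arithmetic ops for an nJ-scale memory access per element, per the energy figures above.

    #include <cstdio>

    // Fetch version: one DRAM access per element for the weight table.
    __global__ void scale_fetch(float *out, const float *weights, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] *= weights[i];
    }

    // Recompute version: derive the weight w0 * r^i in registers instead.
    __global__ void scale_recompute(float *out, float w0, float r, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] *= w0 * __powf(r, (float)i);
    }

    int main() {
        const int n = 1024;
        float *out;
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) out[i] = 1.0f;
        scale_recompute<<<(n + 255) / 256, 256>>>(out, 2.0f, 0.999f, n);
        cudaDeviceSynchronize();
        std::printf("out[0]=%g out[%d]=%g\n", out[0], n - 1, out[n - 1]);
        cudaFree(out);
    }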

FHTE 4/26/11 22 System Sketch

FHTE 4/26/11 23 Echelon Chip Floorplan
[Figure: floorplan – 17mm on a side, 10nm process, 290mm^2]

FHTE 4/26/11 24 Overhead

FHTE 4/26/11 25 An Out-of-Order Core Spends 2nJ to schedule a 50pJ FMA (or a 0.5pJ integer add) [slide credit: Milad Mohammadi]

FHTE 4/26/11 26 SM Lane Architecture

FHTE 4/26/11 27 Optimization

FHTE 4/26/11 28 Optimization needed at all levels
Guided by where most of the power goes
Circuits
- Optimize V_DD, V_T
- Communication circuits – on-chip and off
Architecture
- Grocery-list approach – know what each operation costs
- Example – temporal SIMT, an evolution of the classic vector architecture
Programming Systems
- Tuning for particular architectures
- Macro-optimization
Applications
- New methods driven by the new cost equation

FHTE 4/26/11 29 On-Chip Communication Circuits

FHTE 4/26/11 30 Temporal SIMT
Existing Single Instruction Multiple Thread (SIMT) architectures amortize instruction fetch across multiple threads, but:
- Perform poorly (and energy-inefficiently) when threads diverge
- Execute redundant instructions that are common across threads
Solution: Temporal SIMT
- Execute the threads of a thread group in sequence on a single lane – amortize fetch
- Shared registers for common values
- Scalarization – amortize execution (see the sketch below)
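Temporal SIMT would do this factoring in hardware via shared registers, but the scalarization idea can be sketched by hand in today's CUDA (my hypothetical kernel below): `scale` is identical across the warp, so one lane computes it and broadcasts it with a shuffle instead of all 32 lanes repeating the expf.

    #include <cstdio>

    __global__ void axpy_scalarized(float *y, const float *x, float beta, int n) {
        // `scale` is warp-uniform: compute it on lane 0 only, then
        // broadcast, rather than executing expf redundantly on every lane.
        float scale = 0.0f;
        if ((threadIdx.x & 31) == 0) scale = expf(beta);
        scale = __shfl_sync(0xffffffff, scale, 0);  // broadcast from lane 0
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += scale * x[i];
    }

    int main() {
        const int n = 256;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 0.0f; }
        axpy_scalarized<<<1, 256>>>(y, x, 0.0f, n);  // expf(0) == 1
        cudaDeviceSynchronize();
        std::printf("y[0]=%g y[%d]=%g\n", y[0], n - 1, y[n - 1]);  // 1 ... 1
    }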

FHTE 4/26/11 31 Solving the Power Challenge – 1, 2, 3

FHTE 4/26/11 32 Solving the ExaScale Power Problem

FHTE 4/26/11 33 [Figure, log scale – bars on top are larger than they appear]

FHTE 4/26/11 34 The Numbers (pJ)

FHTE 4/26/11 35 CUDA GPU Roadmap
[Figure: DP GFLOPS per Watt over time for Tesla, Fermi, Kepler, and Maxwell – from Jensen Huang's keynote at GTC 2010]

FHTE 4/26/11 36 Investment Strategy

FHTE 4/26/11 37 Do we need exotic technology? Semiconductor, optics, memory, etc…

FHTE 4/26/11 38 Do we need exotic technology? Semiconductor, optics, memory, etc… No, but we'll take what we can get … and that's the wrong question

FHTE 4/26/11 39 The right questions are:
- Can we make a difference in core technologies like semiconductor fab, optics, and memory?
- What investments will make the biggest difference (risk reduction) for ExaScale?

FHTE 4/26/11 40 Can we make a difference in core technologies like semiconductor fab, optics, and memory? No, there is a $100B+ industry already driving these technologies in the right direction. The little we can afford to invest (<$1B) won't move the needle (in speed or direction).

FHTE 4/26/11 41 What investments will make the biggest difference (risk reduction) for ExaScale? Look for long poles that aren't being addressed by the data center or mobile industries.

FHTE 4/26/11 42 What investments will make the biggest difference (risk reduction) for ExaScale?
- Programming systems – they are the long pole of the tent, and modest investments will make a huge difference.
- Scalable, fine-grain architecture – the communication, synchronization, and thread management mechanisms needed to achieve strong scaling; conventional machines will stick with weak scaling for now.

FHTE 4/26/11 43 Summary

FHTE 4/26/11 44 ExaScale Requires Change
Programming Systems
- Eliminate incidental obstacles to parallelism: provide a global address space, fast short messages, etc.
- Express all of the parallelism and locality – abstractly, not the way current codes are written
- Use tools to map these applications to different machines: performance portability
Power
- Locality: in the application, mapped by the programming system, supported by the architecture
- Overhead: from 100x to 2x by building throughput cores
- Optimization: at all levels
The largest challenge is admitting we need to make big changes. This requires investment in research, not just procurements.