
Part I. Parallel Computing




1 Part I. Parallel Computing
Based on T. Rauber and G. Rünger, Parallel Programming, Springer 2010 (main textbook for Part I). C. Ivan

2 Outline
Motivation
What is Parallel Computing
Why Parallel Computing
How to Increase Performance
Computational Model Attributes
Basics of Parallel Computing
Performance and Challenges
Parallel Computing Landscape

3 Parallel Computing
In the first issue of the CACM 1(1), 1958, Saul Gorn notes that: “We know that the so-called parallel computers are somewhat faster than the serial ones, and that no matter how fast we make access and arithmetic serially, we can do correspondingly better in parallel. However access and arithmetic speeds seem to be approaching a definite limit ... Does this mean that digital speeds are reaching a limit, that digital computation of multi-variate partial differential systems must accept it, and that predicting tomorrow’s weather must require more than one day? Not if we have truly-parallel machines, with say, a thousand instruction registers.”
Gorn also recognized the challenges that would be posed to the system designer by the new parallel computers: “But visualize what it would be like to program for such a machine! If a thousand subroutines were in progress simultaneously, one must program the recognition that they have all finished, if one is to use their results together. The programmer must not only synchronize his subroutines, but schedule all his machine units, unless he is willing to have most of them sitting idle most of the time. Not only would programming techniques be vastly different from the serial ones in the serial languages we now use, but they would be humanly impossible without machine intervention.”

4 Top 500 (June 2015) – No. 1 (Tianhe-2, NUDT, China)
3,120,000 cores (16,000 nodes x 195 cores), 33.86 Petaflops, 17.8 MW

5 What is Parallel Computing
Traditionally, software is written for serial computing – a problem is decomposed into one sequential task
But the real world is massively parallel
Parallel computing – simultaneous use of multiple processing units to solve a problem fast (performance)
What is the problem? A processing unit can be:
A single processor with multiple cores
A single computer with multiple processors
A number of computers connected by a network
Combinations of the above
Ideally, a problem (application) is partitioned into a sufficient number of independent parts for execution on parallel processing elements

6 Serial Computing
(Diagram: a problem is fed as a stream of instructions i1, i2, …, in to a single processing unit)
A problem is divided into a discrete series of instructions
Instructions are executed one after another
Only one instruction is executed at any moment in time

7 Parallel Computing
(Diagram: a problem is divided into tasks task0 … taskm; each task is a stream of instructions i1, i2, …, in executed on processing units unit0 … unitp)
1. A problem is divided into m discrete parts (tasks) that can be solved concurrently
2. Each part is further broken down into a series of instructions (i)
3. Instructions from each part execute in parallel on different processing units (p)

8 Multicore Processors
Intel Xeon processor with 6 cores and 6 L3 cache units
IBM Blue Gene/Q Compute Chip with 18 cores (PU) and 16 L2 cache units (L2)

9 IBM Blue Gene/Q System Overview

10 Top 500 (June 2015) – No. 3 (DOE, US)
IBM Blue Gene/Q: 1,572,864 cores (98,304 nodes x 16 cores), 17.1 Petaflops, 7.89 MW

11 Parallel Computing
(Diagram: a problem is divided into tasks task0 … taskm; each task is a stream of instructions i1, i2, …, in executed on processing units unit0 … unitp)
1. A problem is divided into m discrete parts (tasks) that can be solved concurrently
2. Each part is further broken down into a series of instructions (i)
3. Instructions from each part execute in parallel on different processing units (p)

12 Parallel Computing
(Diagram: same decomposition as before, now with open questions)
How many tasks (m)? How many processing units (p)?
What is the size of m? What is the size of p? m = p, m > p, m < p?
Where do we place data? …
1. A problem is divided into m discrete parts (tasks) that can be solved concurrently
2. Each part is further broken down into a series of instructions (i)
3. Instructions from each part execute in parallel on different processing units (p)

13 von Neumann Computation Model (1945)
(Diagram: a processor – containing a Control Unit with an instruction counter and an Arithmetic Logic Unit with registers – connected to a memory that holds instructions and data, plus input and output)

14 von Neumann Computation Model
A processor that performs instructions such as “add the contents of these two registers and put the result in that register”.
A memory that stores both the instructions and data of a program in cells having unique addresses.
A control scheme that fetches one instruction after another from the memory for execution by the processor, and shuttles data between memory and processor one word at a time.
Memory wall -> disparity between processor speed (< 1 nanosecond, i.e. 10^-9 s, per operation) and memory speed (100–1,000 nanoseconds per access).
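To see why the memory wall matters, here is a rough back-of-the-envelope illustration (the clock rate and latency figures are illustrative assumptions, not numbers from the slides): a 3 GHz core completes a simple instruction in about 0.33 ns, while a main-memory access of around 100 ns costs roughly 100 / 0.33 ≈ 300 processor cycles, so a single cache miss can stall work worth hundreds of instructions. Caches and parallelism are the standard ways to hide this gap.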

15 Why Parallel Computing
Primary reasons:
Overcome limits of serial computing (end of processor clock frequency scaling)
Save (wall-clock) time
Solve larger problems
Other reasons:
Take advantage of non-local resources
Cost saving – use multiple “cheaper (commoditized)” computing resources
Overcome memory constraints
Performance – exploits a large collection of processing units that can communicate and cooperate to solve large problems fast

16 Example
Find the sum for a string of n numbers.
Serial code:
sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(. . .);
    sum += x;
}

17 Example
Suppose we have p cores, p < n; then each core can perform a partial sum of n/p values:
my_sum = 0;
my_first_i = . . .;
my_last_i = . . .;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(. . .);
    my_sum += my_x;
}
Each core uses its own private variables and executes this block of code independently of the other cores (a shared-memory sketch follows below).
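To make “each core uses its own private variables” concrete, here is a minimal shared-memory sketch in C with OpenMP. It is an illustration under our own assumptions, not code from the textbook or the deck; compute_next_value is a hypothetical stand-in for the slides' Compute_next_value(. . .).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Hypothetical stand-in for the slides' Compute_next_value(. . .). */
static double compute_next_value(long i) { return (double)(i % 10); }

int main(void) {
    const long n = 24;
    int p = omp_get_max_threads();
    double *partial = calloc((size_t)p, sizeof *partial); /* one slot per core */

    #pragma omp parallel num_threads(p)
    {
        int my_rank = omp_get_thread_num();
        long my_first_i = my_rank * n / p;        /* private loop bounds */
        long my_last_i  = (my_rank + 1) * n / p;
        double my_sum = 0.0;                      /* private partial sum */
        for (long my_i = my_first_i; my_i < my_last_i; my_i++)
            my_sum += compute_next_value(my_i);
        partial[my_rank] = my_sum;                /* publish for the master */
    }

    /* The sequential part after the parallel region plays the "master". */
    double sum = 0.0;
    for (int r = 0; r < p; r++) sum += partial[r];
    printf("sum = %f\n", sum);
    free(partial);
    return 0;
}

Each thread touches only its private my_sum and loop bounds; the shared array partial is written in disjoint slots, so no synchronization is needed inside the loop.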

18 Example
After each core completes execution of the code, the private variable my_sum in each core contains the sum of the values computed by its calls to Compute_next_value.
If n = 24, the values are 1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9, and p = 8 cores, then the values of my_sum in the cores are 8, 19, 7, 15, 7, 13, 12, 14.

19 Example
Next, a core designated as the master core adds up all the values of my_sum to form the global sum (= 95).
Parallel code:
if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from the other core;
        sum += value;
    }
} else {  // not the master core
    send my_sum to the master;
}

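The master-core pseudocode above maps naturally onto message passing. Below is a minimal MPI sketch, again only an illustration: the MPI calls (MPI_Send, MPI_Recv, etc.) are the standard API, but the surrounding program and the dummy value used for my_sum are our own assumptions, not code from the deck.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int my_rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Illustrative partial sum; in practice each process would compute
       its own my_sum exactly as in the earlier per-core loop. */
    double my_sum = (double)(my_rank + 1);

    if (my_rank == 0) {                      /* the master core */
        double sum = my_sum, value;
        for (int src = 1; src < p; src++) {  /* receive from every other core */
            MPI_Recv(&value, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += value;
        }
        printf("global sum = %f\n", sum);
    } else {                                 /* not the master core */
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}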

21 Better Parallel Algorithm
Don't make the master core do all the work – share it among the other cores (get help!)
(Diagram: tree-structured sum over cores 0–7 holding partial sums 8, 19, 7, 15, 7, 13, 12, 14; pairwise additions give 27, 22, 20, 26, then 49 and 46, and finally the global sum 95)

22 Better Parallel Algorithm
Don't make the master core do all the work – share it among the other cores (get help!)
Number of additions performed by the master core:
8 cores: sequential – 7 serial additions; tree-structured parallel – 3 additions (factor of 2 improvement!)
1,000 cores: sequential – 999 serial additions; tree-structured parallel – 10 additions (factor of 100 improvement!)
A sketch of the tree-structured reduction follows below.

23 Classical Science
(Diagram: Nature studied through Observation, Theory, and Physical Experimentation)

24 Modern Scientific Method
(Diagram: Nature studied through Observation, Theory, Physical Experimentation, and now also Numerical Simulation)
Science in the 21st century relies on Parallel & Distributed Computing

25 Uses of Parallel Computing
Data mining, web search engines, web-based business and economic services, pharmaceutical design, financial modeling, etc.

26 TOP500 HPC Application Areas
source: top500.org

27 TOP 500 Performance over Time
(Chart: aggregate TOP500 performance over time; scale prefixes: kilo K 10^3, mega M 10^6, giga G 10^9, tera T 10^12, peta P 10^15, exa E 10^18)
source: top500.org

28 How to Increase Performance
Work hard – higher clock frequency, but limited by the speed of light
Work smart – pipelining, superscalar processors
Get help – replication: multicore, clusters, ..

29 Computational Model Attributes
The primitive units of computation or basic actions of the computer (data types and operations defined by the instruction set).
The definition of address spaces available to the computation (how data are accessed and stored) [data mechanism].
The schedulable units of computation (rules for partitioning and scheduling the problem for computation using the primitive units) [control mechanism].
The modes and patterns of communication among processing elements working in parallel, so that they can exchange needed information.
Synchronization mechanisms to ensure that this information arrives at the right time.

30 Basics of Parallel Computing
(Diagram: Application/Problem –decompose→ Tasks –compute→ Physical Cores & Processors)

31 Decomposition
One problem -> many possible decompositions
Complicated and laborious process
The potential parallelism of an application dictates how it should be split into tasks
The size of tasks is called granularity – we can choose different task sizes
Defining the tasks is challenging and difficult to automate (why?)

32 Tasks
Tasks are coded in a parallel programming language
Scheduling – assignment of tasks to processes or threads
Dictates the task execution order
Done by hand in source code, at compile time, or dynamically at runtime? (see the sketch after this slide)
Mapping – assignment of processes/threads to physical cores/processors for execution
Tasks may depend on each other, resulting in data or control dependences -> these impose an execution order on the parallel tasks
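As one concrete illustration of the “at compile time or dynamically at runtime” scheduling choice, here is a minimal OpenMP sketch in C. The schedule(static) and schedule(dynamic) clauses are standard OpenMP; the loop body and do_task are hypothetical placeholders chosen only so that task sizes are uneven.

#include <stdio.h>
#include <omp.h>

/* Illustrative per-task work whose cost grows with i,
   so static and dynamic scheduling behave differently. */
static double do_task(int i) {
    double x = 0.0;
    for (int k = 0; k < (i + 1) * 1000; k++) x += 1.0 / (k + 1);
    return x;
}

int main(void) {
    const int n_tasks = 64;
    double total = 0.0;

    /* schedule(static): tasks are assigned to threads in fixed blocks,
       decided before the loop runs (cheap, but can load-imbalance). */
    #pragma omp parallel for schedule(static) reduction(+: total)
    for (int i = 0; i < n_tasks; i++) total += do_task(i);

    /* schedule(dynamic, 4): threads grab chunks of 4 tasks at runtime,
       which balances uneven task sizes at the cost of some overhead. */
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (int i = 0; i < n_tasks; i++) total += do_task(i);

    printf("total = %f\n", total);
    return 0;
}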

33 Performance
Time versus throughput
Parallel execution time = computation time + data exchange or synchronization time
Time depends on the distribution of work (tasks) to processors, information exchange or synchronization, idle time where processors cannot do any useful work, …
Challenge – finding a scheduling and mapping strategy that leads to good load balance and a small overhead (a small worked example follows)
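A small worked example with made-up numbers (not from the slides): suppose a job needs 100 s of computation split over p = 4 processors. With perfect load balance each processor computes for 25 s; if data exchange and synchronization add 5 s, the parallel execution time is 25 + 5 = 30 s, a speedup of 100 / 30 ≈ 3.3. If the work is badly distributed and one processor gets 40 s of computation, the others sit idle while it finishes, and the time becomes 40 + 5 = 45 s, a speedup of only about 2.2 – which is why load balance and low overhead are the central scheduling and mapping goals.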

34 Challenges
Elastic computation: how to decompose a problem into tasks and coordinate execution to keep resources fully utilized?
Shared-memory model: how to share, and how to share correctly?
(Automatic) extraction of parallelism – compilers, languages, ..
Parallel programming and verification of program correctness
Debugging of parallel programs
Performance monitoring
…

35 Types of Computer Systems
Single system (centralised control and management): PC/Workstation, SMP/NUMA, Vector, Mainframe
Distributed systems (multiple systems, decentralised control and management): Clusters, Client-Server, Grids, Peer-to-Peer, Cloud

36 Parallel & Distributed Systems
(Diagram: supercomputers and clusters are application oriented; Grids, Clouds, and Web 2.0/3.0 are service oriented; systems are arranged along a scale axis)

37 Evolution of Prominent Computers
Year, location, name – first of its kind:
1834, Cambridge, Difference engine – programmable computer
1943, Bletchley, Colossus – electronic computer
1948, Manchester, SSEM (Baby) – stored-program computer
1951, MIT, Whirlwind 1 – real-time I/O computer
1953, Transistor computer – transistorized computer
1971, California, Intel 4004 – mass-market CPU & IC
1979, Sinclair ZX-79 – mass-market home computer
1981, New York, IBM PC – personal computer
1987, Acorn A400 series – high-street RISC PC sales
1990, IBM RS6000 – superscalar RISC processor
1998, Sun picoJava – computer based on a language

38 Progression of Computer Calculating Speeds
Year – floating point operations per second (FLOPS):
1941 – 1
1945 – 100
1949 – 1,000 (1 KiloFLOPS, kFLOPS)
1951 – 10,000
1961 – 100,000
1964 – 1,000,000 (1 MegaFLOPS, MFLOPS)
1968 – 10,000,000
1975 – 100,000,000
1987 – 1,000,000,000 (1 GigaFLOPS, GFLOPS)
1992 – 10,000,000,000
1993 – 100,000,000,000
1997 – 1,000,000,000,000 (1 TeraFLOPS, TFLOPS)
2000 – 10,000,000,000,000
2007 – 478,000,000,000,000 (478 TFLOPS)
2009 – 1,100,000,000,000,000 (1.1 PetaFLOPS)

39 Computing Units
International System of Units (SI) prefixes: exa E 10^18, peta P 10^15, tera T 10^12, giga G 10^9, mega M 10^6, kilo K 10^3, milli m 10^-3, micro µ 10^-6, nano n 10^-9, pico p 10^-12
International Electrotechnical Commission (IEC) binary prefixes: exbi Ei 2^60, pebi Pi 2^50, tebi Ti 2^40, gibi Gi 2^30, mebi Mi 2^20, kibi Ki 2^10
Example of disk capacity: 20 Mibytes = 20 MiB = 20 mebibytes = 20 x 2^20 = 20,971,520 bytes

40 TOP500 – June 2013

41 TOP500 – June 2013

42 TOP500 – June 2013

43 Parallel Computing Platforms
Processor Architecture & Memory Organization
Memory Hierarchy & Interconnection Networks

44 Parallel Computing
(Diagram: Application/Problem –decompose (application parallelism)→ Tasks –compute (hardware parallelism)→ Physical Cores & Processors)

45 Processor Architecture & Memory Organization
Top500 Supercomputer – Tianhe-2, 6/2015 (China) – 3,120,000 cores (16,000 nodes x 195 cores), 33.86 Petaflops, 17.8 MW

46 Processor Architecture & Memory Organization (2012/13)
Top500 Supercomputer – 6/2012 (US) – 1,572,864 cores (98,304 nodes x 16 cores), 16.32 Petaflops, 7.9 MW

47 Processor Architecture & Memory Organization (2011/12)
Top500 Supercomputer – 6/2011 (Japan) – 548,352 cores (68,544 processors x 8 cores), 8.162 Petaflops (8.162 x 10^15 flops), … racks, 12.6 MW

48 TOP500 – June 2015

