Download presentation
Presentation is loading. Please wait.
Published byMaude Robbins Modified over 9 years ago
1
Lecture 2 Books: “Hadoop in Action” by Chuck Lam, “An Introduction to Parallel Programming” by Peter Pacheco
2
Content Parallel Computing History Types of Parallel Computing Architecture Understanding Hadoop and MapReduce Writing and running a basic MapReduce program
3
Why we need to write parallel programs Most programs that have been written for conventional, single-core systems cannot exploit the presence of multiple cores. In order to do this, we need to either rewrite our serial programs so that they’re parallel, so that they can make use of multiple cores, or write translation programs, that is, programs that will automatically convert serial programs into parallel programs. The bad news is that researchers have had very limited success writing programs that convert serial programs in languages such as C and C++ into parallel programs.
4
Example An efficient parallel implementation of a serial program may not be obtained by finding efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by stepping back and devising an entirely new algorithm. As an example, suppose that we need to compute n values and add them together:
5
Now suppose we also have p cores and p is much smaller than n. Then each core can form a partial sum of approximately n/p values:
6
For example, if there are eight cores, n=24, and the 24 calls to Compute_next_value return the values: 1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9, then the values stored in my_sum might be:
7
When the cores are done computing their values of my_sum, they can form a global sum by sending their results to a designated “master” core, which can add their results: if (I’m the master core) { sum = my_x; for each core other than myself { receive value from core; sum += value; } } else { send my_x to the master }
8
In our example, if the master core is core 0, it would add the values 8+19+7+15+7+13+12+14=95. But you can probably see a better way to do this—especially if the number of cores is large. Instead of making the master core do all the work of computing the final sum, we can pair the cores so that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result of core 5 and so on.
9
Multiple cores forming a global sum
10
How do we write parallel programs? There are a number of possible answers to this question, but most of them depend on the basic idea of partitioning the work to be done among the cores. There are two widely used approaches: task- parallelism and data-parallelism. In task-parallelism, we partition the various tasks carried out in solving the problem among the cores. In data-parallelism, we partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data.
11
There are two main types of parallel systems that we’ll be focusing on: shared-memory systems and distributed-memory systems. In a shared-memory system, the cores can share access to the computer’s memory. In a distributed-memory system, each core has its own, private memory, and the cores must communicate explicitly by doing something like sending messages across a network.
12
(a) A shared-memory system and (b) a distributed-memory system
13
Processes, multitasking, and threads Recall that the operating system, or OS, is a major piece of software whose purpose is to manage hardware and software resources on a computer. It determines which programs can run and when they can run. It also controls the allocation of memory to running programs and access to peripheral devices such as hard disks and network interface cards. When a user runs a program, the operating system creates a process — an instance of a computer program that is being executed.
14
Most modern operating systems are multitasking. This means that the operating system provides support for the apparent simultaneous execution of multiple programs. This is possible even on a system with a single core, since each process runs for a small interval of time (typically a few milliseconds), often called a time slice. After one running program has executed for a time slice, the operating system can run a different program. A multitasking OS may change the running process many times a minute, even though changing the running process can take a long time.
15
Instruction-level parallelism Instruction-level parallelism, or ILP, attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions. There are two main approaches to ILP: pipelining, in which functional units are arranged in stages, and multiple issue, in which multiple instructions can be simultaneously initiated. Both approaches are used in virtually all modern CPUs.
16
Pipelining The principle of pipelining is similar to a factory assembly line: while one team is bolting a car’s engine to the chassis, another team can connect the transmission to the engine and the driveshaft of a car that’s already been processed by the first team, and a third team can bolt the body to the chassis in a car that’s been processed by the first two teams.
17
Example As an example involving computation, suppose we want to add the floating point numbers 9.87x10 4 and 6.54x10 3. Then we can use the following steps:
18
Example Now if each of the operations takes one nanosecond (10 -9 seconds), the addition operation will take seven nanoseconds. So if we execute the code: float x[1000], y[1000], z[1000];... for (i = 0; i < 1000; i++) z[i] = x[i] + y[i]; the for loop will take something like 7000 nanoseconds.
19
Pipelining As an alternative, suppose we divide our floating point adder into seven separate pieces of hardware or functional units. The first unit will fetch two operands, the second will compare exponents, and so on. Also suppose that the output of one functional unit is the input to the next. Then a single floating point addition will still take seven nanoseconds. However, when we execute the for loop, we can fetch x[1] and y[1] while we’re comparing the exponents of x[0] and y[0]. Simultaneously execute seven different stages in seven different additions. Reduced from 7000 nanoseconds to 1006 nanoseconds
20
Multiple issue Pipelines improve performance by taking individual pieces of hardware or functional units and connecting them in sequence. Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program. For example, if we have two complete floating point adders, we can approximately halve the time it takes to execute the loop for (i = 0; i < 1000; i++) z[i] = x[i] + y[i];
21
Hardware multithreading ILP can be very difficult to exploit: it is a program with a long sequence of dependent statements offers few opportunities. For example, in a direct calculation of the Fibonacci numbers f[0] = f[1] = 1; for (i = 2; i <= n; i++) f[i] = f[i-1] + f[i-2]; there’s essentially no opportunity for simultaneous execution of instructions.
22
PARALLEL HARDWARE Multiple issue and pipelining can clearly be considered to be parallel hardware, since functional units are replicated. However, since this form of parallelism isn’t usually visible to the programmer, we’re treating both of them as extensions to the basic von Neumann model, and for our purposes, parallel hardware will be limited to hardware that’s visible to the programmer. In other words, if source code must be modified to exploit it, then we’ll consider the hardware to be parallel.
23
SISD,SIMD,MIMD systems In parallel computing, Flynn’s taxonomy is frequently used to classify computer architectures. It classifies a system according to the number of instruction streams and the number of data streams it can simultaneously manage. A classical von Neumann system is therefore a single instruction stream, single data stream, or SISD system, since it executes a single instruction at a time and it can fetch or store one item of data at a time.
24
SIMD systems Single instruction, multiple data, or SIMD, systems are parallel systems. As the name suggests, SIMD systems operate on multiple data streams by applying the same instruction to multiple data items. So an abstract SIMD system can be thought of as having a single control unit and multiple ALUs. An instruction is broadcast from the control unit to the ALUs, and each ALU either applies the instruction to the current data item, or it is idle.
25
Example As an example, suppose we want to carry out a “vector addition.” That is, suppose we have two arrays x and y, each with n elements, and we want to add the elements of y to the elements of x: for (i = 0; i < n; i++) x[i] += y[i]; Suppose further that our SIMD system has n ALUs. Then we could load x[i] and y[i] into the ith ALU, have the ith ALU add y[i] to x[i], and store the result in x[i]. What if the system has m ALUs and m < n?
26
Vector processors Although what constitutes a vector processor has changed over the years, their key characteristic is that they can operate on arrays or vectors of data, while conventional CPUs operate on individual data elements or scalars. Typical recent systems have the following characteristics: Vector registers. Vectorized and pipelined functional units Vector instructions Interleaved memory Strided memory access and hardware scatter/gather
27
Graphics processing units Real-time graphics application programming interfaces, or APIs, use points, lines, and triangles to internally represent the surface of an object. They use a graphics processing pipeline to convert the internal representation into an array of pixels that can be sent to a computer screen. GPUs can optimize performance by using SIMD parallelism, and in the current generation all GPUs use SIMD parallelism. This is obtained by including a large number of ALUs (e.g., 80) on each GPU processing core.
28
MIMD systems Multiple instruction, multiple data, or MIMD, systems support multiple simultaneous instruction streams operating on multiple data streams. Thus, MIMD systems typically consist of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU. Furthermore, unlike SIMD systems, MIMD systems are usually asynchronous, that is, the processors can operate at their own pace. In fact, unless the programmer imposes some synchronization, at any given instant they may be executing different statements of same code.
29
Distributed-memory systems The most widely available distributed-memory systems are called clusters. They are composed of a collection of commodity systems—for example, PCs—connected by a commodity interconnection network—for example, Ethernet. In fact, the nodes of these systems, the individual computational units joined together by the communication network, are usually shared-memory systems with one or more multicore processors. To distinguish such systems from pure distributed- memory systems, they are sometimes called hybrid systems.
30
Latency and bandwidth Any time data is transmitted, we’re interested in how long it will take for the data to reach its destination. There are two terms that are often used to describe the performance of an interconnect: the latency and the bandwidth. The latency is the time that elapses between the source’s beginning to transmit the data and the destination’s starting to receive the first byte. (L = sec) The bandwidth is the rate at which the destination receives data after it has started to receive the first byte. (B = bytes per second) message transmission time = L+ n / B.
31
THANK YOU THE END
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.