Lecture 2 Books: “Hadoop in Action” by Chuck Lam, “An Introduction to Parallel Programming” by Peter Pacheco

Content
- Parallel Computing History
- Types of Parallel Computing Architecture
- Understanding Hadoop and MapReduce
- Writing and running a basic MapReduce program

Why we need to write parallel programs Most programs that have been written for conventional, single-core systems cannot exploit the presence of multiple cores. To take advantage of multiple cores, we need either to rewrite our serial programs so that they're parallel and can make use of the cores, or to write translation programs, that is, programs that automatically convert serial programs into parallel programs. The bad news is that researchers have had very limited success writing programs that convert serial programs in languages such as C and C++ into parallel programs.

Example An efficient parallel implementation of a serial program may not be obtained by finding efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by stepping back and devising an entirely new algorithm. As an example, suppose that we need to compute n values and add them together:
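The serial code this refers to is omitted from the transcript. A minimal C sketch in the spirit of Pacheco's example, where Compute_next_value is a hypothetical stand-in for whatever computation produces each value:

#include <stdio.h>

/* Hypothetical stand-in for the (unspecified) computation of the i-th value. */
double Compute_next_value(int i) {
    return (double)(i % 10);
}

int main(void) {
    int n = 24;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double x = Compute_next_value(i);   /* compute the next value */
        sum += x;                           /* accumulate it serially */
    }
    printf("sum = %f\n", sum);
    return 0;
}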

Now suppose we also have p cores and p is much smaller than n. Then each core can form a partial sum of approximately n/p values:
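The per-core partial-sum loop referred to here is also omitted. A sketch under the assumption that core my_rank (0 through p-1) owns a contiguous block of indices and that p divides n evenly; the function name partial_sum and the block partitioning are illustrative, and Compute_next_value is the hypothetical helper from the previous sketch:

/* Partial sum computed by one core. */
double partial_sum(int my_rank, int p, int n) {
    int my_first_i = my_rank * (n / p);      /* first index owned by this core */
    int my_last_i  = my_first_i + (n / p);   /* one past the last owned index  */
    double my_sum = 0.0;
    for (int my_i = my_first_i; my_i < my_last_i; my_i++)
        my_sum += Compute_next_value(my_i);
    return my_sum;
}

For instance, core 2 of 8 with n = 24 would sum the values for indices 6 through 8.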

For example, if there are eight cores, n=24, and the 24 calls to Compute_next_value return the values 1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9, then the values stored in my_sum might be:

Core     0    1    2    3    4    5    6    7
my_sum   8   19    7   15    7   13   12   14

When the cores are done computing their values of my_sum, they can form a global sum by sending their results to a designated “master” core, which adds them:

if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}

In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95. But you can probably see a better way to do this, especially if the number of cores is large. Instead of making the master core do all the work of computing the final sum, we can pair the cores so that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result of core 5, and so on. Repeating this pairing halves the number of cores still holding a partial result at each step, so the global sum is finished after about log2(p) additions on the busiest core instead of p - 1.

Multiple cores forming a global sum
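A minimal C sketch of this tree-structured combination, simulating the eight cores' partial sums from the example in an array (in a real program, each addition on the lower-ranked core of a pair would be preceded by a receive from its partner; p is assumed to be a power of two here):

#include <stdio.h>

int main(void) {
    double my_sum[8] = {8, 19, 7, 15, 7, 13, 12, 14};  /* partial sums from the example */
    int p = 8;
    for (int step = 1; step < p; step *= 2) {          /* about log2(p) rounds */
        for (int rank = 0; rank < p; rank += 2 * step)
            my_sum[rank] += my_sum[rank + step];       /* core `rank` adds its partner's value */
    }
    printf("global sum = %.0f\n", my_sum[0]);          /* prints 95 */
    return 0;
}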

How do we write parallel programs? There are a number of possible answers to this question, but most of them depend on the basic idea of partitioning the work to be done among the cores. There are two widely used approaches: task-parallelism and data-parallelism. In task-parallelism, we partition the various tasks carried out in solving the problem among the cores. In data-parallelism, we partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data. In the global-sum example, computing the partial sums is essentially data-parallelism (each core runs the same loop on its own share of the values), while the final phase, in which the master core does different work from the other cores, is closer to task-parallelism.

There are two main types of parallel systems that we’ll be focusing on: shared-memory systems and distributed-memory systems. In a shared-memory system, the cores can share access to the computer’s memory. In a distributed-memory system, each core has its own, private memory, and the cores must communicate explicitly by doing something like sending messages across a network.

(a) A shared-memory system and (b) a distributed-memory system

Processes, multitasking, and threads Recall that the operating system, or OS, is a major piece of software whose purpose is to manage hardware and software resources on a computer. It determines which programs can run and when they can run. It also controls the allocation of memory to running programs and access to peripheral devices such as hard disks and network interface cards. When a user runs a program, the operating system creates a process — an instance of a computer program that is being executed.

Most modern operating systems are multitasking. This means that the operating system provides support for the apparent simultaneous execution of multiple programs. This is possible even on a system with a single core, since each process runs for a small interval of time (typically a few milliseconds), often called a time slice. After one running program has executed for a time slice, the operating system can run a different program. A multitasking OS may change the running process many times a minute, even though each switch of the running process takes some time.

Instruction-level parallelism Instruction-level parallelism, or ILP, attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions. There are two main approaches to ILP: pipelining, in which functional units are arranged in stages, and multiple issue, in which multiple instructions can be simultaneously initiated. Both approaches are used in virtually all modern CPUs.

Pipelining The principle of pipelining is similar to a factory assembly line: while one team is bolting a car’s engine to the chassis, another team can connect the transmission to the engine and the driveshaft of a car that’s already been processed by the first team, and a third team can bolt the body to the chassis in a car that’s been processed by the first two teams.

Example As an example involving computation, suppose we want to add the floating point numbers 9.87 x 10^4 and 6.54 x 10^3. Then we can use the following steps:
1. Fetch the operands.
2. Compare the exponents.
3. Shift one operand.
4. Add.
5. Normalize the result.
6. Round the result.
7. Store the result.
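Worked through on the two numbers (rounding here to three significant digits purely for illustration): shift the operand with the smaller exponent, 6.54 x 10^3 = 0.654 x 10^4; add the mantissas, 9.87 + 0.654 = 10.524, giving 10.524 x 10^4; normalize to 1.0524 x 10^5; round to 1.05 x 10^5; store the result.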

Example Now if each of the operations takes one nanosecond (10^-9 seconds), the addition operation will take seven nanoseconds. So if we execute the code

float x[1000], y[1000], z[1000];
...
for (i = 0; i < 1000; i++)
    z[i] = x[i] + y[i];

the for loop will take something like 7000 nanoseconds.

Pipelining As an alternative, suppose we divide our floating point adder into seven separate pieces of hardware or functional units. The first unit will fetch two operands, the second will compare exponents, and so on. Also suppose that the output of one functional unit is the input to the next. Then a single floating point addition will still take seven nanoseconds. However, when we execute the for loop, we can fetch x[1] and y[1] while we're comparing the exponents of x[0] and y[0]. Once the pipeline is full, we simultaneously execute seven different stages of seven different additions, and the total time for the loop is reduced from 7000 nanoseconds to about 1006 nanoseconds.
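The 1006 figure follows from the pipeline schedule: the first result emerges after 7 ns, and a new result is completed every nanosecond thereafter, so the total is 7 + (1000 - 1) x 1 = 1006 ns, versus 1000 x 7 = 7000 ns without pipelining.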

Multiple issue Pipelines improve performance by taking individual pieces of hardware or functional units and connecting them in sequence. Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program. For example, if we have two complete floating point adders, we can approximately halve the time it takes to execute the loop

for (i = 0; i < 1000; i++)
    z[i] = x[i] + y[i];

While the first adder is computing z[0], the second can compute z[1]; while the first is computing z[2], the second can compute z[3]; and so on.

Hardware multithreading ILP can be very difficult to exploit: a program with a long sequence of dependent statements offers few opportunities. For example, in a direct calculation of the Fibonacci numbers

f[0] = f[1] = 1;
for (i = 2; i <= n; i++)
    f[i] = f[i-1] + f[i-2];

there's essentially no opportunity for simultaneous execution of instructions, since each f[i] depends on the values computed immediately before it. Hardware multithreading provides a means for the system to continue doing useful work when the task currently being executed has stalled, for example by switching to instructions from another thread.

PARALLEL HARDWARE Multiple issue and pipelining can clearly be considered to be parallel hardware, since functional units are replicated. However, since this form of parallelism isn’t usually visible to the programmer, we’re treating both of them as extensions to the basic von Neumann model, and for our purposes, parallel hardware will be limited to hardware that’s visible to the programmer. In other words, if source code must be modified to exploit it, then we’ll consider the hardware to be parallel.

SISD, SIMD, MIMD systems In parallel computing, Flynn’s taxonomy is frequently used to classify computer architectures. It classifies a system according to the number of instruction streams and the number of data streams it can simultaneously manage. A classical von Neumann system is therefore a single instruction stream, single data stream, or SISD system, since it executes a single instruction at a time and it can fetch or store one item of data at a time.

SIMD systems Single instruction, multiple data, or SIMD, systems are parallel systems. As the name suggests, SIMD systems operate on multiple data streams by applying the same instruction to multiple data items. So an abstract SIMD system can be thought of as having a single control unit and multiple ALUs. An instruction is broadcast from the control unit to the ALUs, and each ALU either applies the instruction to the current data item, or it is idle.

Example As an example, suppose we want to carry out a “vector addition.” That is, suppose we have two arrays x and y, each with n elements, and we want to add the elements of y to the elements of x:

for (i = 0; i < n; i++)
    x[i] += y[i];

Suppose further that our SIMD system has n ALUs. Then we could load x[i] and y[i] into the ith ALU, have the ith ALU add y[i] to x[i], and store the result in x[i]. What if the system has m ALUs and m < n?
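One common answer to that question is that the elements are processed in blocks of m at a time, with some ALUs idle in the final block. A minimal C sketch that simulates this schedule, using illustrative values n = 15 and m = 4:

#include <stdio.h>

int main(void) {
    enum { n = 15, m = 4 };                       /* illustrative sizes */
    float x[n], y[n];
    for (int i = 0; i < n; i++) { x[i] = i; y[i] = 2 * i; }

    for (int first = 0; first < n; first += m) {  /* one "round" per block of m elements */
        for (int alu = 0; alu < m; alu++) {       /* every ALU applies the same instruction */
            int i = first + alu;
            if (i < n)                            /* ALUs past the end of the array stay idle */
                x[i] += y[i];
        }
    }
    printf("x[n-1] = %.0f\n", x[n - 1]);          /* 14 + 28 = 42 */
    return 0;
}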

Vector processors Although what constitutes a vector processor has changed over the years, their key characteristic is that they can operate on arrays or vectors of data, while conventional CPUs operate on individual data elements or scalars. Typical recent systems have the following characteristics:
- Vector registers
- Vectorized and pipelined functional units
- Vector instructions
- Interleaved memory
- Strided memory access and hardware scatter/gather

Graphics processing units Real-time graphics application programming interfaces, or APIs, use points, lines, and triangles to internally represent the surface of an object. They use a graphics processing pipeline to convert the internal representation into an array of pixels that can be sent to a computer screen. GPUs can optimize performance by using SIMD parallelism, and in the current generation all GPUs use SIMD parallelism. This is obtained by including a large number of ALUs (e.g., 80) on each GPU processing core.

MIMD systems Multiple instruction, multiple data, or MIMD, systems support multiple simultaneous instruction streams operating on multiple data streams. Thus, MIMD systems typically consist of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU. Furthermore, unlike SIMD systems, MIMD systems are usually asynchronous, that is, the processors can operate at their own pace. In fact, unless the programmer imposes some synchronization, at any given instant they may be executing different statements of the same program.

Distributed-memory systems The most widely available distributed-memory systems are called clusters. They are composed of a collection of commodity systems—for example, PCs—connected by a commodity interconnection network—for example, Ethernet. In fact, the nodes of these systems, the individual computational units joined together by the communication network, are usually shared-memory systems with one or more multicore processors. To distinguish such systems from pure distributed-memory systems, they are sometimes called hybrid systems.

Latency and bandwidth Any time data is transmitted, we're interested in how long it will take for the data to reach its destination. Two terms are often used to describe the performance of an interconnect: the latency and the bandwidth. The latency L is the time that elapses between the source's beginning to transmit the data and the destination's starting to receive the first byte (measured in seconds). The bandwidth B is the rate at which the destination receives data after it has started to receive the first byte (measured in bytes per second). The time to transmit a message of n bytes is then

message transmission time = L + n / B.
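As an illustration with made-up but plausible numbers: if L = 2 microseconds and B = 1 GB/s, transmitting a 1 MB message takes roughly 2 x 10^-6 s + 10^6 bytes / 10^9 bytes/s = 2 x 10^-6 + 10^-3, or about 1.002 milliseconds. For large messages the bandwidth term dominates, while for very short messages the latency dominates.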

THANK YOU THE END