Presentation transcript:

What is parallel processing? (Spring 2014 Wrap-up)
When can we execute things in parallel?
- Parallelism: use extra resources to solve a problem faster.
- Concurrency: correctly and efficiently manage access to shared resources.
(Thanks to Dan Grossman for the succinct definitions.)

What is parallel processing?
A brief introduction to the key ideas of parallel processing:
- instruction-level parallelism
- data-level parallelism
- thread-level parallelism

Exploiting Parallelism
Of the computing problems for which performance is important, many have inherent parallelism.

Computer games:
- Graphics, physics, sound, AI, etc. can be done separately.
- Furthermore, there is often parallelism within each of these:
  - The color of each pixel on the screen can be computed independently.
  - Non-contacting objects can be updated/simulated independently.
  - The artificial intelligence of non-human entities can be computed independently.

Search engine queries:
- Every query is independent.
- Searches are (pretty much) read-only!

Instruction-Level Parallelism

    add  %r2 <- %r3, %r4
    or   %r2 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    addi %r7 <- %r6, 0x5
    sub  %r8 <- %r8, %r4

Dependencies?
- RAW: read after write
- WAW: write after write
- WAR: write after read

When can we reorder instructions? For example, with the or's result renamed to %r5 and the independent sub moved up:

    add  %r2 <- %r3, %r4
    or   %r5 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    sub  %r8 <- %r8, %r4
    addi %r7 <- %r6, 0x5

When should we reorder instructions?
Superscalar processors: multiple instructions executing in parallel at the *same* stage.
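To connect this to C: below is a minimal sketch, not from the slides (the function and variable names are made up), of the same dependence reasoning. Statements with no data dependences can be reordered or issued in the same cycle by a superscalar processor, while RAW dependences force an order.

    /* Illustrative sketch of the dependence discussion above. */
    int ilp_example(int r3, int r4, const int *mem, int r8_in) {
        int r2 = r3 + r4;     /* writes r2                                   */
        int r5 = r2 | r4;     /* RAW on r2: must follow the statement above  */
        int r6 = mem[0];      /* independent of r2/r5: can issue in parallel */
        int r7 = r6 + 0x5;    /* RAW on r6: must wait for the load           */
        int r8 = r8_in - r4;  /* independent of everything above             */
        return r2 + r5 + r6 + r7 + r8;
    }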

Data Parallelism
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

This operates on one element at a time.

Data Parallelism with SIMD
Consider adding together two arrays (the same array_add loop shown above).

With Single Instruction, Multiple Data (SIMD), one instruction operates on MULTIPLE elements at a time.
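As a concrete sketch, not from the slides (it assumes an x86 machine with SSE2 and is only one of many ways to do this), the same loop can be written with compiler intrinsics so that each add instruction processes four ints at once:

    #include <emmintrin.h>   /* SSE2 intrinsics: 128-bit integer vectors */

    /* Sketch: add four ints per instruction, then finish any leftover
       elements one at a time. */
    void array_add_simd(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i + 4 <= length; i += 4) {
            __m128i a = _mm_loadu_si128((const __m128i *)&A[i]);
            __m128i b = _mm_loadu_si128((const __m128i *)&B[i]);
            _mm_storeu_si128((__m128i *)&C[i], _mm_add_epi32(a, b));
        }
        for (; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }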

Is it always that easy?
Not always… a more challenging example:

    unsigned sum_array(unsigned *array, int length) {
        unsigned total = 0;
        for (int i = 0; i < length; ++i) {
            total += array[i];
        }
        return total;
    }

Is there parallelism here? Each loop iteration uses data (the running total) from the previous iteration, so the iterations are not independent.

Restructure the code for SIMD…

    // one option...
    unsigned sum_array2(unsigned *array, int length) {
        unsigned total, i;
        unsigned temp[4] = {0, 0, 0, 0};

        // chunks of 4 at a time
        for (i = 0; i < (length & ~0x3); i += 4) {
            temp[0] += array[i];
            temp[1] += array[i+1];
            temp[2] += array[i+2];
            temp[3] += array[i+3];
        }

        // add the 4 sub-totals
        total = temp[0] + temp[1] + temp[2] + temp[3];

        // add the non-4-aligned parts
        for ( ; i < length; ++i) {
            total += array[i];
        }
        return total;
    }
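Alternatively, as a sketch not shown in the slides and assuming a compiler with OpenMP 4.0+ support (compile with something like -fopenmp), the reduction can be vectorized without the manual temp[] bookkeeping:

    /* Sketch: the pragma asks the compiler to vectorize the loop, keeping
       per-lane partial sums and combining them at the end. */
    unsigned sum_array_omp_simd(unsigned *array, int length) {
        unsigned total = 0;
        #pragma omp simd reduction(+:total)
        for (int i = 0; i < length; ++i) {
            total += array[i];
        }
        return total;
    }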

What are threads?
An independent “thread of control” within a process. Like multiple processes within one process, but sharing the same virtual address space.
- logical control flow
- program counter
- stack
- shared virtual address space: all threads in a process use the same virtual address space

Lighter-weight than processes:
- faster context switching
- the system can support more threads
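As an illustration, not from the slides (the variable names are made up; compile with -pthread), here is a minimal POSIX threads program in which a second thread runs in the same address space and can read a variable set by main:

    #include <pthread.h>
    #include <stdio.h>

    /* Both threads share the process's address space, so the worker
       sees the global set by main. */
    int shared_value = 0;

    void *worker(void *arg) {
        (void)arg;
        printf("worker sees shared_value = %d\n", shared_value);
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        shared_value = 42;
        pthread_create(&tid, NULL, worker, NULL);   /* start a second thread */
        pthread_join(tid, NULL);                    /* wait for it to finish */
        return 0;
    }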

Thread-level parallelism: multi-core processors
Two (or more) complete processors, fabricated on the same silicon chip, execute instructions from two (or more) programs/threads at the same time.
(Figure: IBM Power5, cores #1 and #2.)

Multi-cores are everywhere (circa 2013)
Laptops, desktops, servers:
- Most any machine from the past few years has at least 2 cores.
Game consoles:
- Xbox 360: 3 PowerPC cores; Xbox One: 8 AMD cores
- PS3: 9 Cell cores (1 master; 8 special SIMD cores); PS4: 8 custom AMD x86-64 cores
- Wii U: 2 Power cores
Smartphones:
- iPhone 4S, 5: dual-core ARM CPUs
- Galaxy S II, III, IV: dual-core ARM or Snapdragon
- …

Why Multicores Now?
The number of transistors we can put on a chip keeps growing exponentially, but performance is no longer growing along with transistor count. So let's use those transistors to add more cores and do more at once.

What happens if we run this program on a multicore?

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

As programmers, do we care?

What if we want one program to run on multiple processors (cores)?
- We have to explicitly tell the machine exactly how to do this. This is called parallel programming or concurrent programming.
- There are many parallel/concurrent programming models. We will look at a relatively simple one: fork-join parallelism (see the sketch below).
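As a sketch of fork-join parallelism (the slides do not commit to a particular library, so the use of OpenMP here is an assumption; compile with something like -fopenmp), the array_add loop can be parallelized by forking a team of threads at the loop, splitting the iterations among them, and joining when the loop ends:

    /* Fork-join sketch: threads are forked at the parallel region, each
       handles a chunk of the iterations, and all join before returning. */
    void array_add_parallel(int A[], int B[], int C[], int length) {
        #pragma omp parallel for
        for (int i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];   /* iterations are independent, so this is safe */
        }
    }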

How does this help performance?
Parallel speedup measures the improvement from parallelization:

    speedup(p) = (time for best serial version) / (time for version with p processors)

What can we realistically expect?

Reason #1: Amdahl’s Law
In general, the whole computation is not (easily) parallelizable: serial regions limit the potential parallel speedup.
(Figure: a program timeline with serial regions interleaved among the parallelizable work.)

Reason #1: Amdahl’s Law
Suppose a program takes 1 unit of time to execute serially, and a fraction of the program, s, is inherently serial (unparallelizable). Then on p processors:

    New Execution Time = (1 - s)/p + s

For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New Execution Time = 0.9T/3 + 0.1T = 0.4T,  so  Speedup = T / 0.4T = 2.5

What is the maximum speedup from parallelization? (As p grows, the speedup approaches 1/s, i.e. 10x for this example.)
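For completeness, here is a tiny helper, not from the slides (the names are made up), that evaluates Amdahl’s Law numerically and reproduces the example above:

    #include <stdio.h>

    /* Amdahl's Law: speedup on p processors when a fraction s of the work is serial. */
    double amdahl_speedup(double s, double p) {
        return 1.0 / ((1.0 - s) / p + s);
    }

    int main(void) {
        printf("s = 0.1, p = 3:    speedup = %.2f\n", amdahl_speedup(0.1, 3));    /* 2.50    */
        printf("s = 0.1, p = 1e6:  speedup = %.2f\n", amdahl_speedup(0.1, 1e6));  /* ~10.00  */
        return 0;
    }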

Reason #2: Overhead
Forking and joining is not instantaneous:
- It involves communicating between processors.
- It may involve calls into the operating system, depending on the implementation.

    New Execution Time = (1 - s)/p + s + overhead(p)

Multicore: what should worry us?
Concurrency: what if we’re sharing resources, memory, etc.?

Cache Coherence
- What if two cores have the same data in their own caches? How do we keep those copies in sync?

Memory Consistency, Ordering, Interleaving, Synchronization…
- With multiple cores, we can have truly concurrent execution of threads. In what order do their memory accesses appear to happen? Do the orders seen by different cores/threads agree?

Concurrency Bugs (see the example below)
- When it all goes wrong…
- Hard to reproduce, hard to debug
- …about-shared-variables-or-memory-models/fulltext
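To make “when it all goes wrong” concrete, here is a minimal data race, not from the slides (compile with -pthread): two threads increment a shared counter with no synchronization, and the final value is usually less than the expected 2,000,000 because counter++ is a non-atomic load/add/store.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;   /* shared, unsynchronized */

    void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; ++i) {
            counter++;   /* racy read-modify-write on shared data */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }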

Summary
Multicore: more than one processor on the same chip.
- Almost all devices now have multicore processors.
- Results from Moore’s law and the power constraint.

Exploiting multicore requires parallel programming.
- Automatically extracting parallelism is too hard for the compiler, in general.
- But we can have the compiler do much of the bookkeeping for us.

Fork-Join model of parallelism:
- At a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread.
- Expect a speedup of less than P on P processors.
  - Amdahl’s Law: speedup is limited by the serial portion of the program.
  - Overhead: forking and joining are not free.

Take 332, 352, 451 to learn more!