CSE-700 Parallel Programming Introduction POSTECH Sep 6, 2007 박성우

2 Common Features?

3... runs faster on

4 Multi-core CPUs
– IBM Power4, dual-core, 2000
– Intel reaches the thermal wall, 2004 ⇒ no more free lunch!
– Intel Xeon, quad-core, 2006
– Sony PlayStation 3 Cell, eight cores enabled, 2006
– Intel, 80 cores, 2011 (prototype finished)
source: Herb Sutter, "Software and the concurrency revolution"

5 Parallel Programming Models
– POSIX threads (API)
– OpenMP (API)
– HPF (High Performance Fortran)
– Cray's Chapel
– NESL
– Sun's Fortress
– IBM's X10
... and a lot more.

6 Parallelism
– Data parallelism: the ability to apply a function in parallel to each element of a collection of data
– Thread parallelism: the ability to run multiple threads concurrently; each thread uses its own local state
– Shared memory parallelism

Data Parallelism Thread Parallelism Shared Memory Parallelism

8 Data Parallelism = Data Separation
[Figure: elements a_1, ..., a_n go to hardware thread #1; a_{n+1}, ..., a_{n+m} go to hardware thread #2; a_{n+m+1}, ..., a_{n+m+l} go to hardware thread #3]
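As a concrete illustration of this separation (not part of the original slides), the OpenMP sketch below splits the iterations of a loop over an array across the available hardware threads; the array name a and size N are invented for the example, and the code assumes a compiler with OpenMP support (e.g. gcc -fopenmp):

    #include <stdio.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        int i;

        #pragma omp parallel for      /* iterations are divided among the threads */
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;           /* the same function applied to each element */

        printf("a[42] = %f\n", a[42]);
        return 0;
    }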

9 Data Parallelism in Hardware
– GeForce 8800: 128 stream processors, 1.35 GHz, 500+ GFLOPS

10 Data Parallelism in Programming Languages
Fortress – parallelism is the default:
    for i ← 1:m, j ← 1:n do   // 1:n is a generator
        a[i, j] := b[i] c[j]
    end
NESL (1990s) – supports nested data parallelism: the function being applied can itself be parallel.
    {sum(a) : a in [[2, 3], [8, 3, 9], [7]]};

11 Data Parallel Haskell (DAMP '07)
Haskell + nested data parallelism
– flattening (vectorization): transforms a nested parallel program so that it manipulates only flat arrays
– fusion: eliminates many intermediate arrays
Example: 10,000 x 10,000 sparse matrix multiplication with 1 million elements

Data Parallelism Thread Parallelism Shared Memory Parallelism

13 Thread Parallelism
[Figure: hardware threads #1 and #2, each with its own local state, exchanging messages over synchronous communication]
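The picture can be approximated in C with POSIX threads. The sketch below is only an illustration (not from the slides): each worker keeps its own local state and communicates solely through the message it receives at creation and the reply collected at join; struct msg and worker are invented names.

    #include <pthread.h>
    #include <stdio.h>

    struct msg { int start; int count; long result; };  /* message to/from a worker */

    static void *worker(void *arg)
    {
        struct msg *m = arg;
        long local_sum = 0;          /* local state, never shared between threads */
        int i;
        for (i = m->start; i < m->start + m->count; i++)
            local_sum += i;
        m->result = local_sum;       /* reply delivered back through the message */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        struct msg m1 = { 0, 500, 0 }, m2 = { 500, 500, 0 };
        pthread_create(&t1, NULL, worker, &m1);
        pthread_create(&t2, NULL, worker, &m2);
        pthread_join(t1, NULL);      /* join synchronizes with the worker */
        pthread_join(t2, NULL);
        printf("sum = %ld\n", m1.result + m2.result);
        return 0;
    }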

14 Pure Functional Threads
Purely functional threads can run concurrently.
– Effect-free computations can be executed in parallel with any other effect-free computations.
Example: collision detection
[Figure: objects A, B and their next states A', B']

15 Manticore (DAMP '07)
Three layers:
– sequential base language: a functional language drawn from SML, with no mutable references and arrays!
– data-parallel programming: implicit, i.e. the compiler and runtime system manage thread creation. E.g. parallel arrays of parallel arrays:
    [: 2 * n | n in nums where n > 0 :]
    fun mapP f xs = [: f x | x in xs :]
– concurrent programming

16 Concurrent Programming in Manticore (DAMP '07)
Based on Concurrent ML:
– threads and synchronous message passing
– threads do not share mutable state (in fact, there are no mutable references and arrays)
– explicit: the programmer manages thread creation

Data Parallelism Thread Parallelism Shared Memory Parallelism (Shared State Concurrency)

18 Shared Memory Parallelism
[Figure: hardware threads #1, #2, and #3 all accessing one shared memory]

19 World War II

20 Company of Heroes
Interaction of a LOT of objects:
– thousands of objects
– each object has its own mutable state
– each object update affects several other objects
– all objects are updated 30+ times per second
Problem: how do we handle simultaneous updates to the same memory location?
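The problem can be reproduced in a few lines of C with POSIX threads. In this sketch (not from the slides) two threads update the same counter with no synchronization; increments are routinely lost, so the final value is usually well below the expected 2,000,000.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;             /* shared mutable state */

    static void *bump(void *arg)
    {
        int i;
        (void)arg;
        for (i = 0; i < 1000000; i++)
            counter = counter + 1;       /* read-modify-write, not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }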

21 Manual Lock-based Synchronization
    pthread_mutex_lock(mutex);
    mutate_variable();
    pthread_mutex_unlock(mutex);
Locks and condition variables ⇒ fundamentally flawed!
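For reference, a self-contained version of the lock/unlock pattern above might look as follows; shared_variable is a stand-in name, mutate_variable is the slide's placeholder, and here the mutex is a global object passed by address rather than a pointer.

    #include <pthread.h>

    static long shared_variable = 0;
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    static void mutate_variable(void)
    {
        shared_variable++;              /* safe only while the lock is held */
    }

    void update(void)
    {
        pthread_mutex_lock(&mutex);     /* enter the critical section */
        mutate_variable();
        pthread_mutex_unlock(&mutex);   /* leave the critical section */
    }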

22 Bank Accounts (Beautiful Concurrency, Peyton Jones, 2007)
[Figure: threads #1, #2, ..., #n send transfer requests to accounts A and B in shared memory]
Invariant: atomicity – no thread observes a state in which the money has left one account but has not arrived in the other.

23 Bank Accounts using Locks
In an object-oriented language:
    class Account {
        Int balance;
        synchronized void deposit (Int n) {
            balance = balance + n;
        }
    }
Code for transfer:
    void transfer (Account from, Account to, Int amount) {
        from.withdraw (amount);
        // here another thread can observe an intermediate state!
        to.deposit (amount);
    }

24 A Quick Fix: Explicit Locking
    void transfer (Account from, Account to, Int amount) {
        from.lock(); to.lock();
        from.withdraw (amount);
        to.deposit (amount);
        from.unlock(); to.unlock();
    }
Now the program is prone to deadlock.
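In a C setting the same hazard looks like the sketch below (an illustration, not from the slides): if one thread runs transfer(&a, &b, 10) while another runs transfer(&b, &a, 20), each can end up holding one lock while waiting for the other. One common workaround is to acquire the locks in a fixed global order, as in transfer_ordered.

    #include <pthread.h>
    #include <stdint.h>

    struct account {
        long balance;
        pthread_mutex_t lock;
    };

    /* deadlock-prone: the lock order depends on the argument order */
    void transfer(struct account *from, struct account *to, long amount)
    {
        pthread_mutex_lock(&from->lock);
        pthread_mutex_lock(&to->lock);
        from->balance -= amount;
        to->balance += amount;
        pthread_mutex_unlock(&to->lock);
        pthread_mutex_unlock(&from->lock);
    }

    /* workaround: always acquire the locks in a fixed global order (by address) */
    void transfer_ordered(struct account *from, struct account *to, long amount)
    {
        struct account *first = ((uintptr_t)from < (uintptr_t)to) ? from : to;
        struct account *second = (first == from) ? to : from;
        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amount;
        to->balance += amount;
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }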

25 Locks are Bad
– Taking too few locks ⇒ simultaneous updates
– Taking too many locks ⇒ no concurrency, or deadlock
– Taking the wrong locks ⇒ error-prone programming
– Taking locks in the wrong order ⇒ error-prone programming
– ...
Fundamental problem: no modular programming – correct implementations of withdraw and deposit do not give a correct implementation of transfer.

26 Transactional Memory
An alternative to lock-based synchronization that eliminates many of its problems:
– no deadlock
– read sharing
– safe, modular programming
A hot research area:
– hardware transactional memory
– software transactional memory (C, Java, functional languages, ...)

27 Transactions in Haskell
    transfer :: Account -> Account -> Int -> IO ()
    -- transfer 'amount' from account 'from' to account 'to'
    transfer from to amount =
        atomically (do { deposit to amount
                       ; withdraw from amount })
atomically act:
– atomicity: the effects of act become visible to other threads all at once
– isolation: the action act does not see any effects from other threads

Conclusion: We need parallelism!

29 Tim Sweeney's POPL '06 Invited Talk - Last Slide

CSE-700 Parallel Programming Fall 2007

31 CSE-700 in a Nutshell
Scope:
– parallel computing from the viewpoint of programmers and language designers
– we will not talk about hardware for parallel computing
Audience:
– anyone interested in learning parallel programming
Prerequisites:
– C programming
– the desire to learn new programming languages

32 Material
Books:
– Introduction to Parallel Computing (2nd edition). Ananth Grama et al.
– Parallel Programming with MPI. Peter Pacheco.
– Parallel Programming in OpenMP. Rohit Chandra et al.
– Any textbook on MPI and OpenMP is fine.
Papers

33 Teaching Staff
Instructors:
– Gla
– Myson
– ...
– and YOU! We will lead this course TOGETHER.

34 Resources
Plquad:
– quad-core Linux machine
– OpenMP and MPI already installed
Ask for an account if you need one.
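A quick way to check that the OpenMP installation works might be the hello-world sketch below (assuming gcc with the -fopenmp flag; an analogous MPI test would use mpicc and mpirun):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel             /* this block runs once on every thread */
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }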

35 Basic Plan - First Half
Goal: learn the basics of parallel programming through 5+ assignments on OpenMP and MPI.
Each lecture consists of:
– discussion of the previous assignment (each of you is expected to give a presentation)
– a presentation on OpenMP and MPI by the instructors
– discussion of the next assignment

36 Basic Plan - Second Half
Recent parallel languages:
– learn a recent parallel language
– write a cool program in your parallel language
– give a presentation on your experience
Topics in parallel language research:
– choose a topic
– give a presentation on it

37 What Matters Most?
– Spirit of adventure
– Proactivity
– Desire to provoke
– Happy Chaos: I want you to develop this course into a total, complete, yet happy chaos. A truly inspirational course borders almost on chaos.

Impact of Memory and Cache on Performance

39 Impact of Memory Bandwidth [1]
Consider the following code fragment:
    for (i = 0; i < 1000; i++) {
        column_sum[i] = 0.0;
        for (j = 0; j < 1000; j++)
            column_sum[i] += b[j][i];
    }
The code fragment sums the columns of the matrix b into the vector column_sum.

40 Impact of Memory Bandwidth [2]
– The vector column_sum is small and easily fits into the cache.
– The matrix b is accessed in column order, and this strided access results in very poor performance.
[Figure: multiplying a matrix with a vector: (a) multiplying column by column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector]

41 Impact of Memory Bandwidth [3]
We can fix the above code as follows:
    for (i = 0; i < 1000; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        for (i = 0; i < 1000; i++)
            column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.

42 Lesson
Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.
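One way to see this lesson in practice is a small timing harness like the sketch below (not part of the slides); the 1000 x 1000 size and the use of clock() are illustrative, and on most machines the row-order traversal is clearly faster.

    #include <stdio.h>
    #include <time.h>

    #define N 1000
    static double b[N][N], column_sum[N];

    int main(void)
    {
        int i, j;
        clock_t t0, t1;

        t0 = clock();
        for (i = 0; i < N; i++) {            /* column order: strided access */
            column_sum[i] = 0.0;
            for (j = 0; j < N; j++)
                column_sum[i] += b[j][i];
        }
        t1 = clock();
        printf("column order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (i = 0; i < N; i++)              /* row order: contiguous access */
            column_sum[i] = 0.0;
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                column_sum[i] += b[j][i];
        t1 = clock();
        printf("row order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }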

Assignment 1 Cache & Matrix Multiplication

44 Typical Sequential Implementation
A : n x n, B : n x n, C = A * B : n x n
    for i = 1 to n
        for j = 1 to n
            C[i, j] = 0;
            for k = 1 to n
                C[i, j] += A[i, k] * B[k, j];

45 Using Submatrices
Improves data locality significantly.
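A blocked ("submatrix") version of the multiplication above might look like the following C sketch (an illustration, not the assignment's required solution). BS is a tuning parameter, and for brevity the sketch assumes n is a multiple of BS; each BS x BS tile of A, B, and C is reused while it is still in cache.

    #define BS 32   /* block size: tune to the cache of the target machine */

    void matmul_blocked(int n, double A[n][n], double B[n][n], double C[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                C[i][j] = 0.0;

        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }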

46 Experimental Results

47 Assignment 1
Machine:
– the older, the better
– Myson offers his ancient notebook for you: Pentium II, 600 MHz, no L1 cache, 64 KB L2 cache, running Linux
Prepare a presentation on your experimental results.