
1 Ken Domino, Domem Technologies May 2, 2011 IEEE Boston Continuing Education Program

2  Time and Location: 6:00 - 8:00 PM, Mondays, May 2, 9, 16, 23  Course Website: http://domemtech.com/ieee-pp  Instructor: Ken Domino, kenneth.domino@domemtech.com

3  Recommended Textbooks: CUDA by Example: An Introduction to General-Purpose GPU Programming, by J. Sanders and E. Kandrot, © 2010, ISBN 9780131387683; Programming Massively Parallel Processors: A Hands-on Approach, by D. Kirk and Wen-mei W. Hwu, © 2010, ISBN 9780123814722

4  Recommended Textbooks: Principles of Parallel Programming, by Calvin Lin and Larry Snyder, © 2008, ISBN 9780321487902; Introduction to Parallel Algorithms, by C. Xavier and S. Iyengar, © 1998, ISBN 9780471251828; Patterns for Parallel Programming, by Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill, © 2004, ISBN 9780321228116

5  Other material  Original research papers (see reference list): Uzi Vishkin, http://www.umiacs.umd.edu/~vishkin/index.shtml  Class notes on Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 2010, http://www.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf

6  CPUs have been getting faster… but that stopped in the mid-2000s. Why?

7 Pollack FJ. New microarchitecture challenges in the coming generations of CMOS process technologies (keynote address)(abstract only). Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture. Haifa, Israel: IEEE Computer Society; 1999:2.

8  Problems can be solved in much less time. Predictive protein binding: “Meet Tanuki, a 10,000-core Supercomputer in the Cloud.” Based on the Amazon EC2 cloud service; the client was Genentech. Compute time was reduced from about a month to 8 hours. http://www.bio-itworld.com/news/04/25/2011/Meet-Tanuki-ten-thousand-core-supercomputer-in-cloud.html

9  Computer vision. OpenVIDIA: Parallel GPU Computer Vision. Solves problems for segmentation, stereo vision, optical flow, and feature tracking. http://openvidia.sourceforge.net/index.php/OpenVIDIA http://psychology.wikia.com/wiki/Computer_vision

10  Army: “Want computers to work like the human brain” http://www.wired.com/dangerroom/2011/04/army-wants-a-computer-that-acts-like-a-brain/

11  Where are we going? o NVIDIA has the Fermi-based GeForce GTX 590, a 1024-core GPU (2011), programmable using CUDA, ~2500 GFLOPS. o In 2005, Intel started manufacturing dual-core CPUs. o In 2010, Intel and AMD were manufacturing six-core CPUs, ~11 GFLOPS (non-SSE). o In 2012, Intel will introduce Knights Corner, a 50-core processor. [Figure: CPU vs GPU]

12  The largest supercomputer is the Tianhe-1A (Nov 2010, http://www.top500.org/ )  7168 Xeon X5670 six-core processors  7168 NVIDIA M2050 GPUs, each with 448 CUDA cores. [Image: Wikipedia.org, Tianhe-1A, 2010]

13  One of the “seven” up-and-coming languages [Wayner 2010]  Brings parallel computing to the common man.  For one GPU, speedups of 100 times or more over a serial CPU solution are common.  Used in many different applications.  Coming to mobile devices.

14  A task is a sequence of instructions that executes as a group.  A task continues until it halts, exits, or returns.

15  Computers do not directly execute tasks.  Computers execute instructions, which are used to model a task.

16  Execution of tasks is neither concurrent nor simultaneous.  A sequence of tasks is called a thread. [Diagram: Step 1 → Step 2 → Step 3]

17  Execution of tasks from multiple threads is concurrent but not necessarily simultaneous. [Diagram: two threads, each with Step 1 → Step 2 → Step 3]

18  Execution of tasks from multiple threads is concurrent and simultaneous, running on multiple machines.  The goal is to minimize time and work. [Diagram: two threads, each with Step 1 → Step 2 → Step 3]

19  Nowadays, many people use the terms interchangeably [Lin and Snyder 2009]. Why?  Because the tasks of the threads can occur in any order, behavior is unpredictable. Example (three threads sharing x and y): Thread 1: Read x; Set x = y; Set y = x + 1. Thread 2: Read x; Set x = y; Set y = x + 2. Thread 3: Read x; Set x = y; Set y = x + 3.
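To make the unpredictability concrete, here is a minimal CUDA sketch (an illustration assumed for these notes, not taken from the slides): many GPU threads perform the same unsynchronized read-modify-write on one shared variable, and which thread’s write survives is unspecified, so the final value can differ from run to run.

// Assumed illustration: an unsynchronized read-modify-write race on one variable.
#include <cstdio>

__global__ void racy_update(int *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int old = *x;          // Read x
    *x = old + tid + 1;    // Set x = old value plus a thread-dependent amount
}

int main()
{
    int h_x = 0;
    int *d_x;
    cudaMalloc((void **)&d_x, sizeof(int));
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);

    racy_update<<<4, 32>>>(d_x);   // 128 concurrent threads, no synchronization

    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    printf("x = %d\n", h_x);       // value depends on which thread wrote last
    cudaFree(d_x);
    return 0;
}

Making the result well defined requires synchronization, for example replacing the read-modify-write with CUDA’s atomicAdd.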

20 Example: the task “store 1 into the memory location mem,” modeled as three instructions:
Ld %r1, 1       ; load the constant 1 into register r1
Ld %r2, mem     ; load the address of mem into register r2
St [%r2], %r1   ; store r1 into the memory location addressed by r2

21 Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.

22 An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). Unfortunately, not all instructions are independent!

23 Initial values: f = 1, e = 2, a = b = c = d = 3
s1. e = a + b
s2. f = c + d
s3. g = e * f
If each statement completes before the next begins: e = 6, f = 6, and g = e * f.
Result: g = 36

24 Initial values: f = 1, e = 2, a = b = c = d = 3
s1. e = a + b
s2. f = c + d
s3. g = e * f
If the pipelined instructions overlap so that s3 reads f before s2 has written its result back, then e = 6 but f is still 1, and g = e * f = 6.
Result: g = 6

25 Initial values: f = 1, e = 2, a = b = c = d = 3
s1. e = a + b
s2. f = c + d
s3. g = e * f
“s3 is flow dependent on s1.” There are other types of dependencies.

26  Thread-level parallelism = task parallelism  Example: recalculation of a spreadsheet

27  Process-level parallelism  Example: two independent programs (Freecell and email)  Granularity is the size of the unit of work (e.g., instruction vs. thread vs. process)

28  What is speedup?  Speedup = T(1) / T(p), where T(1) is the execution time on one processor, T(p) is the execution time on p processors, and p = number of processors. For example, if T(1) = 100 s and T(4) = 30 s, the speedup on 4 processors is 100/30 ≈ 3.3.

29 What is the time of computation if b, c, d, e are tasks that can be run in parallel on four processors? Serial computation: Ta … Te. [Diagram: serial vs. parallel schedule of tasks Ta–Te]

30  Amdahl’s law  Let f = the fraction of the computation that must be executed serially. Then the speedup on p processors is at most Speedup = 1 / (f + (1 − f)/p). [Diagram: serial vs. parallel schedule of tasks Ta–Te]
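As a quick numerical illustration (a sketch added for these notes; the values of f and p are assumptions, not from the slides), the small host-side program below evaluates Amdahl’s bound for a serial fraction f = 0.1. Even with 1024 processors the speedup stays below 1/f = 10, which motivates the question on the next slide.

// Evaluate Amdahl's law, Speedup(p) = 1 / (f + (1 - f)/p), for an assumed f = 0.1.
#include <cstdio>

double amdahl_speedup(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main()
{
    const double f = 0.1;                          // 10% of the work must run serially
    const double procs[] = {1, 4, 16, 256, 1024};
    for (double p : procs)
        printf("p = %6.0f   speedup = %5.2f\n", p, amdahl_speedup(f, p));
    // However large p gets, the speedup is bounded above by 1/f = 10.
    return 0;
}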

31  Amdahl’s law  Question: If even a small fraction of a problem is not parallelizable, throwing more processors at it will not help speed it up. So why try for a parallel solution?  Paradox?  Gustafson? (Gustafson’s law is equivalent to Amdahl’s…)

32  Question: If even a small fraction of a problem is not parallelizable, throwing more processors at it will not help speed it up. So why try for a parallel solution?  Answer: A prerequisite for applying Amdahl’s or Gustafson’s formulation is that the serial and parallel programs take the same total number of calculation steps for the same input.

33  Use of a resource-constrained serial execution as the base for the speedup calculation; and  Use of a parallel implementation that can bypass a large number of calculation steps while yielding the same output as the corresponding serial algorithm.  This applies to any algorithm in which verifying a solution is faster (lower complexity) than computing it [Shi 1995] => most algorithms!

34  CPU = “Central Processing Unit”  GPU = “Graphics Processing Unit”  What’s the difference?

35  Why do we classify hardware?  In order to program a parallel computer, you have to understand the hardware very well.  The basic classification is Flynn’s taxonomy (1966): SISD, SIMD, MIMD, MISD

36  Single Instruction Single Data. Examples: MOS Technology 6502, Motorola 68000, Intel 8086

37  Single Instruction Multiple Data. Examples: ILLIAC IV, CM-1, CM-2, Intel Core, Atom, NVIDIA GPUs

38  Multiple Instruction Single Data Examples: Space shuttle computer

39  Multiple Instruction Multiple Data Examples: BBN Butterfly, Cedar, CM-5, IBM RP3, Intel Cube, Ncube, NYU Ultracomputer

40  Parallel Random Access Machine (PRAM).  An idealized SIMD parallel computing model.  An unlimited number of RAMs, called Processing Units (PUs).  The RAMs operate synchronously, all executing the same instructions.  Shared Memory is unlimited and is accessed in one unit of time.  Shared Memory access is one of CREW, CRCW, or EREW (concurrent/exclusive read, concurrent/exclusive write).  Communication between RAMs is only through Shared Memory.

41  PRAM is used for specifying an algorithm and analyzing its complexity.  PRAM-based algorithms can be adapted to SIMD architectures.  PRAM algorithms can be converted into CUDA implementations relatively easily.

42  Parallel for loop  for Pi, 1 ≤ i ≤ n in parallel do … end
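As a sketch of how this PRAM-style parallel for maps onto CUDA (the loop body used here, C[i] = A[i] + B[i], and the variable names are assumptions for illustration, not content from the slides), each processor Pi becomes one GPU thread and the body runs once per index i:

// PRAM-style "for Pi, 1 <= i <= n in parallel do ... end" expressed in CUDA.
// One GPU thread plays the role of each processor Pi; the body is an assumed example.
#include <cstdio>

__global__ void parallel_for_body(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index i
    if (i < n)                                      // run the body only for valid i
        C[i] = A[i] + B[i];                         // the "do ... end" body
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged((void **)&A, bytes);          // unified memory keeps the sketch short
    cudaMallocManaged((void **)&B, bytes);
    cudaMallocManaged((void **)&C, bytes);
    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads; // enough threads to cover all n indices
    parallel_for_body<<<blocks, threads>>>(A, B, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);                    // expect 3.000000
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

Each thread handles one index, so the kernel launch supplies at least n threads; the if (i < n) guard discards the extras, mirroring the bound 1 ≤ i ≤ n in the pseudocode.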

