2  Instructor ◦ Dan Stevenson  Office: P 136   Course Web Site: ◦

3  Regarding HELP with course materials and assignments ◦ Come to office hours – Phillips 136  TIME TBD (Check website)  OR by appointment (just e-mail or call my office) ◦ Send me an e-mail:

4  Required: ◦ Michael J. Quinn  Parallel Programming in C with OpenMP and MPI  Suggested: ◦ Web tutorials during the semester

5 Final Grade: Exams (3): 40%; Project: 20%; Assignments (approx. every two weeks): 40%


7  What does it mean to you? ◦ Coordinating Threads ◦ Supercomputing ◦ Multi-core Processors ◦ Beowulf Clusters ◦ Cloud Computing ◦ Grid Computing ◦ Client-Server ◦ Scientific Computing  All contexts for “splitting up work” in an explicit way CS 491 – Parallel and Distributed Computing7

8  In this course, we will draw mostly from the context of “Supercomputing” ◦ This is the field with the longest record of parallel computing expertise. ◦ It also has a long record of being a source for “trickle-down” technology.

9  Supercomputing is the biggest, fastest computing right this minute.  Likewise, a supercomputer is one of the biggest, fastest computers right this minute. ◦ The definition of supercomputing is, therefore, constantly changing.  A Rule of Thumb: A supercomputer is typically at least 100 times as powerful as a PC.  Jargon: Supercomputing is also known as High Performance Computing (HPC) or High End Computing (HEC) or Cyberinfrastructure (CI).

10 GFLOPs: billions of calculations per second. Over recent years, supercomputers have benefited directly from microprocessor performance gains, and have also gotten better at coordinating their efforts.

11  Jaguar – Oak Ridge National Laboratory (TN) ◦ 224,162 processor cores – 1.76 petaflops

12  2008 IBM Roadrunner: 1.1 petaflops  2009 Cray Jaguar: 1.6 petaflops  2010 Tianhe-1A (China): 2.6 petaflops  2011 Fujitsu K (Japan): 10.5 petaflops ◦ 88,128 8-core processors -> 705,024 cores ◦ Needs power equivalent to 10,000 homes  Linpack numbers ◦ Core i7 – 2.3 Gflops ◦ Galaxy Nexus – 97 Mflops
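To put the Linpack figures above side by side, here is a small sketch of the arithmetic. It uses only the numbers quoted on the slide; the constant and function names are made up for illustration.

```python
# Figures quoted on the slide (Linpack results), expressed in FLOPS.
K_PROCESSORS = 88_128
CORES_PER_PROCESSOR = 8
K_FLOPS = 10.5e15           # Fujitsu K: 10.5 petaflops
CORE_I7_FLOPS = 2.3e9       # Core i7: 2.3 gigaflops
GALAXY_NEXUS_FLOPS = 97e6   # Galaxy Nexus: 97 megaflops


def total_cores(processors, cores_per_processor):
    """Total core count of a machine built from identical processors."""
    return processors * cores_per_processor


def speed_ratio(fast_flops, slow_flops):
    """How many times faster the first machine is than the second."""
    return fast_flops / slow_flops


k_cores = total_cores(K_PROCESSORS, CORES_PER_PROCESSOR)
k_vs_i7 = speed_ratio(K_FLOPS, CORE_I7_FLOPS)
k_vs_phone = speed_ratio(K_FLOPS, GALAXY_NEXUS_FLOPS)

print(f"K cores: {k_cores:,}")
print(f"K vs. Core i7: about {k_vs_i7:,.0f}x")
print(f"K vs. Galaxy Nexus: about {k_vs_phone:,.0f}x")
```

By these figures, the K computer is roughly 4.6 million times faster than the desktop processor and roughly 100 million times faster than the phone.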

13  Why should we care?  What useful thing actually takes a long time to run anymore? (especially long enough to warrant investing 7, 8, or 9 figures in a supercomputer)  Important: It’s usually not about getting something done faster, but about getting a harder thing done in the same amount of time ◦ This is often referred to as capability computing

14  Simulation of physical phenomena, such as: ◦ Weather forecasting ◦ Galaxy formation ◦ Oil reservoir management  Data mining: finding needles of information in a haystack of data, such as: ◦ Gene sequencing ◦ Signal processing ◦ Detecting storms that might produce tornadoes (want forecasting, not retrocasting…)  Visualization: turning a vast sea of data into pictures that a scientist can understand ◦ Oak Ridge National Lab has a 512-core cluster devoted entirely to visualization runs [Figure: Tornadic Storm]


16 [Figure: Size and Speed (Laptop)]

17  Size: Many problems that are interesting™ can’t fit on a PC, usually because they need more than a few GB of RAM, or more than a few hundred GB of disk.  Speed: Many problems that are interesting™ would take a very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.

18  Parallelism: doing multiple things at the same time ◦ finding and coordinating this can be challenging  The tyranny of the storage hierarchy ◦ The hardware you’re running on matters ◦ Moving data around is often more expensive than actually computing something


20  The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors ◦ Discounts the idea of simply running separate tasks on separate processors – a common thing to do to get high throughput, but not really parallel processing Key questions in hardware design: 1. How do parallel processors share data and communicate? ◦ shared memory vs distributed memory 2. How are the processors connected? ◦ single bus vs network  The number of processors is determined by a combination of #1 and #2

21  Shared Memory Systems ◦ All processors share one memory address space and can access it ◦ Information sharing is often implicit  Distributed Memory Systems (AKA “Message Passing Systems”) ◦ Each processor has its own memory space ◦ All data sharing is done via programming primitives to pass messages  e.g. “Send data value to processor 3” ◦ Information sharing is always explicit
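The explicit style of message passing can be sketched with Python's standard library. This is only a model: the "processors" are threads and the "messages" travel through `queue.Queue` objects, whereas a real distributed memory system would use something like MPI over a network. The names `worker` and `parallel_sum` are hypothetical.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive one chunk of data, compute on it, send back a partial sum."""
    chunk = inbox.get()       # explicit receive
    outbox.put(sum(chunk))    # explicit send


def parallel_sum(data, n_workers=4):
    """Sum `data` by sending one slice to each worker and collecting results."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, results))
        for inbox in inboxes
    ]
    for t in threads:
        t.start()
    # "Send data to processor <rank>": every transfer is spelled out.
    for rank, inbox in enumerate(inboxes):
        inbox.put(data[rank::n_workers])
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))


print(parallel_sum(list(range(100))))
```

No worker ever touches the caller's list directly; it sees only what was sent to it, which is the sense in which information sharing is always explicit.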

22  Processors communicate via messages that they send to each other: send and receive  This form is required for multiprocessors that have separate private memories for each processor ◦ Cray T3E ◦ “Beowulf Cluster” ◦ SETI@HOME  Note: shared memory multiprocessors can also have separate memories – they just aren’t “private” to each processor

23  Processors all operate independently, but operate out of the same logical memory.  Data structures can be read by any of the processors  To properly maintain ordering in our programs, synchronization primitives are needed! (locks/semaphores)
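A minimal sketch of why a synchronization primitive is needed, using Python threads as stand-ins for processors that share one logical memory. The function name `count_with_lock` is made up for illustration.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Have several threads increment one shared counter under a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            # counter += 1 is a read-modify-write; the lock makes sure
            # only one thread performs it at a time.
            with lock:
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(count_with_lock())
```

Without the lock, two threads could read the same old value and both write back old + 1, so an update could be lost. Whether that actually happens on a given run depends on the interpreter and on timing.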


25  Connect several processors via a single shared bus ◦ bus bandwidth limits the number of processors ◦ local cache lowers bus traffic ◦ single memory module attached to the bus  Limited to very small systems!  Intel processors support this mode by default


27  Two most common variations: ◦ “snoopy” schemes  rely on broadcast to observe all coherence traffic  well suited for buses and small-scale systems  example: SGI Challenge or Intel x86 ◦ directory schemes  use centralized information to avoid broadcast  scale well to large numbers of processors  example: SGI Origin/Altix

28  Basic Idea: ◦ all coherence-related activity is broadcast to all processors  e.g., on a global bus ◦ each processor monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache ◦ examples:  if another processor wishes to write to a line, you may need to “invalidate” (i.e. discard) the copy in your own cache  if another processor wishes to read a line for which you have a dirty copy, you may need to supply it  Most common approach in commercial shared-memory multiprocessors.  Protocol is a distributed algorithm: cooperating state machines ◦ Set of states, state transition diagram, actions
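The "cooperating state machines" idea can be sketched as a toy simulation. The slides do not name a specific protocol, so this assumes a simplified three-state scheme (Modified, Shared, Invalid) and ignores data values, write-backs, and timing. The class names `Bus` and `SnoopyCache` are hypothetical.

```python
class Bus:
    """A shared bus that broadcasts every coherence action to all caches."""

    def __init__(self):
        self.caches = []
        self.messages = 0  # number of snoop deliveries

    def broadcast(self, op, addr, sender):
        for cache in self.caches:
            if cache is not sender:
                self.messages += 1
                cache.snoop(op, addr)


class SnoopyCache:
    """One cache; each line is in state "M", "S", or "I" (assumed states)."""

    def __init__(self, bus):
        self.bus = bus
        self.state = {}  # address -> state; a missing entry means Invalid
        bus.caches.append(self)

    def state_of(self, addr):
        return self.state.get(addr, "I")

    def read(self, addr):
        if self.state_of(addr) == "I":
            self.bus.broadcast("read", addr, self)
            self.state[addr] = "S"

    def write(self, addr):
        if self.state_of(addr) != "M":
            self.bus.broadcast("write", addr, self)
            self.state[addr] = "M"

    def snoop(self, op, addr):
        mine = self.state_of(addr)
        if mine == "I":
            return                    # not relevant to this cache
        if op == "write":
            self.state[addr] = "I"    # another writer: invalidate our copy
        elif op == "read" and mine == "M":
            self.state[addr] = "S"    # supply the dirty line, then share it


bus = Bus()
a, b = SnoopyCache(bus), SnoopyCache(bus)
a.read(0x40)
b.read(0x40)
a.write(0x40)
print(a.state_of(0x40), b.state_of(0x40))
```

Every action reaches every other cache, relevant or not, which is why the scheme suits small bus-based systems.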

29  In the single bus case, the bus is used for every main memory access  In the network connected model, the network is used only for inter-process communication  There are multiple “memories” BUT that doesn’t mean that there are separate memory spaces

30  Network-based machines do not want to use a snooping coherence protocol! ◦ Means that every memory transaction would need to be sent everywhere!  Directory-based systems use a global “Directory” to arbitrate who owns data ◦ Point-to-point communication with the directory instead of bus broadcasts ◦ The directory keeps a list of what caches have the data in question  When a write to that data occurs, all of the affected caches can be notified directly
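A toy sketch of the directory idea, under heavy simplifying assumptions: a single directory object, no data transfer, and one counted message per request or invalidation. The `Directory` class and its method names are hypothetical.

```python
class Directory:
    """Tracks which caches hold each address, so writes notify only those."""

    def __init__(self):
        self.sharers = {}   # address -> set of cache ids holding a copy
        self.messages = 0   # point-to-point messages exchanged

    def read(self, cache_id, addr):
        """A cache asks the directory for a readable copy."""
        self.messages += 1
        self.sharers.setdefault(addr, set()).add(cache_id)

    def write(self, cache_id, addr):
        """A cache asks to write; return the caches that were invalidated."""
        self.messages += 1
        others = self.sharers.get(addr, set()) - {cache_id}
        self.messages += len(others)   # one invalidation per affected cache
        self.sharers[addr] = {cache_id}
        return sorted(others)


directory = Directory()
directory.read(0, 100)
directory.read(3, 100)
directory.read(5, 200)
print(directory.write(0, 100))  # only cache 3 needs to be notified
```

Cache 5 holds a different address, so it receives nothing. With broadcast snooping it would have had to observe the write anyway.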

31  Each node (processor) contains its own local memory  Each node is connected to the network via a switch  Messages hop along the ring from node to node until they reach the proper destination

32  2D grid, or mesh, of nodes  Each “inside” node has 4 neighbors ◦ “outside” nodes have fewer: 3 along an edge, 2 at a corner  If all nodes have four neighbors (the edges wrap around), then this is a 2D torus
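The difference between a mesh and a torus shows up in a few lines of code. This sketch assumes nodes are addressed by (row, column) and that the torus wraps around in both dimensions; the function names are made up for illustration. In a plain mesh, a corner node ends up with 2 neighbors and any other edge node with 3.

```python
def mesh_neighbors(row, col, rows, cols):
    """Neighbors of a node in a 2D mesh (no wraparound at the edges)."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def torus_neighbors(row, col, rows, cols):
    """Neighbors of a node in a 2D torus (edges wrap around)."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


print(mesh_neighbors(1, 1, 4, 4))   # inside node
print(mesh_neighbors(0, 0, 4, 4))   # corner node
print(torus_neighbors(0, 0, 4, 4))  # same corner, with wraparound
```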

33  Also called an n-cube  For n=2 -> 2D cube (4 nodes -> square)  For n=3 -> 3D cube (8 nodes)  For n=4 -> 4D cube (16 nodes)  In an n-cube, all nodes have n neighbors [Figures: 3-cube, 4-cube]
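If the nodes of an n-cube are numbered 0 to 2^n - 1 (a common convention, assumed here), two nodes are neighbors exactly when their binary labels differ in one bit. A small sketch with hypothetical function names:

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]


def hypercube_hops(src, dst):
    """Minimum hops between two nodes: the number of differing bits."""
    return bin(src ^ dst).count("1")


print(hypercube_neighbors(0, 3))        # neighbors of node 000 in a 3-cube
print(hypercube_hops(0b000, 0b111))     # opposite corners of a 3-cube
```

The farthest any two nodes can be apart is n hops, while the node count grows as 2^n.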

34  Every node can communicate directly with every other node in only one pass -> fully connected network  n nodes -> n^2 switches  Therefore, extremely expensive to implement!

35  Fully connected, but requires passes through multiple switch boxes  Less hardware required than crossbar, but contention can occur [Figure labels: Omega network, switch box]
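A sketch of how a message could find its way through such a multistage network, and how two messages can contend. It assumes the textbook Omega network layout (n = 2^k ports, k stages, a perfect shuffle before each stage of 2x2 switch boxes) with destination-tag routing; the slide's own figure may differ in detail, and the function names are hypothetical.

```python
def omega_route(src, dst, k):
    """Return the wire position of a message after each of the k stages."""
    n = 1 << k
    pos = src
    path = []
    for stage in range(k):
        # Perfect shuffle: rotate the k-bit position left by one.
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)
        # The switch box picks its upper (0) or lower (1) output from the
        # next destination bit, most significant bit first.
        bit = (dst >> (k - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def contend(route_a, route_b):
    """Two messages contend if they need the same wire at the same stage."""
    return any(a == b for a, b in zip(route_a, route_b))


def switch_counts(k):
    """Crossbar switches (n^2) vs. Omega 2x2 switch boxes ((n/2) * k)."""
    n = 1 << k
    return n * n, (n // 2) * k


print(omega_route(5, 2, 3))
print(contend(omega_route(0, 0, 3), omega_route(4, 1, 3)))
print(switch_counts(3))
```

For 8 nodes this gives 12 switch boxes against 64 crossbar switches, at the price that some pairs of messages cannot be delivered at the same time.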

36 A simple model for categorizing computers: 4 categories: 1. SISD – Single Instruction Single Data ◦ the standard uniprocessor model 2. SIMD – Single Instruction Multiple Data ◦ Full systems that are “true” SIMD are no longer in use ◦ Many of the concepts exist in vector processing and to some extent graphics cards 3. MISD – Multiple Instruction Single Data ◦ doesn’t really make sense 4. MIMD – Multiple Instruction Multiple Data ◦ the most common model in use

37  A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time  The best-known examples are: ◦ Thinking Machines CM-1 and CM-2 ◦ MasPar MP-1 and MP-2 ◦ others  All are now defunct  SIMD requires massive data parallelism  Usually have LOTS of very simple processors (e.g. 8-bit CPUs)

38  Closely related to SIMD ◦ Cray J90, Cray T90, Cray SV1, NEC SX-6 ◦ Starting to “merge” with MIMD systems  Cray X1E and upcoming systems (“Cascade”)  Use a single instruction to operate on an entire vector of data ◦ Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline ◦ Uses “vector registers” to feed a pipeline for the vector operation ◦ Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data  (Because of this, didn’t typically have caches until late 90s)

39  Multiple instructions are applied to multiple data  The multiple instructions can come from the same program, or from different programs ◦ Generally “parallel processing” implies the first  Most modern multiprocessors are of this form ◦ IBM Blue Gene, Cray T3D/T3E/XT3/4/5, SGI Origin/Altix ◦ Clusters

40 “Supercomputer Edition”

41  A parallel computer built out of commodity hardware components ◦ PCs or server racks ◦ Commodity network (like ethernet) ◦ Often running a free-software OS like Linux with a low-level software library to facilitate multiprocessing  Use software to send messages between machines ◦ Standard is to use MPI (message passing interface)

42 “… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is... is freedom.” – Captain Jack Sparrow “Pirates of the Caribbean”

43  A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network  It also needs software that allows the nodes to communicate over the interconnect.  But what a cluster is … is all of these components working together as if they’re one big computer (a supercomputer)

44  nodes ◦ PCs ◦ Server rack nodes  interconnection network ◦ Ethernet (“GigE”) ◦ Myrinet (“10GigE”) ◦ Infiniband (low latency) ◦ The Internet (not really – typically called “Grid”)  software ◦ OS  Generally Linux  Redhat / CentOS / SuSE  Windows HPC Server ◦ Libraries (MPICH, PBLAS, MKL, NAG) ◦ Tools (Torque/Maui, Ganglia, GridEngine)

45 [Figure labels: Interconnect, Nodes]


47  At the high end, many supercomputers are made with custom parts ◦ Custom backplane/network ◦ Custom/Reconfigurable processors ◦ Extreme Custom cooling ◦ Custom memory system  Examples: ◦ IBM Blue Gene ◦ Cray XT4/5/6 ◦ SGI Altix


49  In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.  He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months.  It turns out that computer speed was roughly proportional to the number of transistors per unit area.  Moore wrote a paper about this concept, which became known as “Moore’s Law.”
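To get a feel for what "doubling about every 18 months" implies, here is the compounding arithmetic as a short sketch. The 18-month period is the figure from the slide; the function name is made up for illustration.

```python
def growth_factor(years, doubling_months=18):
    """Growth multiple after `years` if the quantity doubles every period."""
    return 2 ** (years * 12 / doubling_months)


for years in (1.5, 3, 15):
    print(f"after {years} years: x{growth_factor(years):,.0f}")
```

At that rate, 15 years is ten doublings, or roughly a thousandfold increase.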

50 GFLOPs: billions of calculations per second

54 [Plot, built up over slides 51–54: log(Speed) vs. Year, with curves for CPU, Network Bandwidth, RAM, and 1/Network Latency] Patterson: “In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 or 1.4”


56 Dell Latitude D620 [4]  Intel Core Duo T2400 1.83 GHz w/2 MB L2 Cache (“Yonah”)  2 GB (2048 MB) 667 MHz DDR2 SDRAM  100 GB 7200 RPM SATA Hard Drive  DVD+RW/CD-RW Drive (8x)  1 Gbps Ethernet Adapter  56 Kbps Phone Modem

57  Registers  Cache memory  Main memory (RAM)  Hard disk  Removable media (CD, DVD etc)  Internet [Figure labels: top of the list = fast, expensive, few; bottom = slow, cheap, a lot]

58  We want to have lots of memory for our processor: ◦ LC2K needs 2^16 words of memory (~256 KB) ◦ MIPS needs 2^32 bytes of memory (~4 GB) ◦ x86-64 needs 2^64 bytes of memory (~16 exabytes)  What are our choices? ◦ SRAM, DRAM, Magnetic Disk, paper?

59  On-chip memory ◦ Fabricated in the same technology as the processor  About 2-10 ns access (depending on size) ◦ Decoders are big ◦ Arrays are big  It will cost LOTS of money ◦ SRAM costs $10 per megabyte  $2.50 for LC2K  $40,960 for MIPS  $175 trillion for x86-64

60  About 50 ns access ◦ Why build a fast processor that stalls for dozens of cycles on each memory load?  Still costs lots of money for new machines ◦ DRAM costs $0.10 per megabyte  < $0.01 for LC2K  $400 for MIPS  $2 trillion for x86-64
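The SRAM and DRAM dollar figures on these slides follow from multiplying the full address space by the quoted price per megabyte. A sketch of that arithmetic, assuming 1 MB = 2^20 bytes and 4-byte LC2K words (which matches the ~256 KB quoted for LC2K); the names below are made up for illustration.

```python
MB = 2 ** 20  # bytes per megabyte (assumed binary megabytes)

# Full address space of each machine, in bytes.
ADDRESS_SPACE_BYTES = {
    "LC2K": 2 ** 16 * 4,   # 2^16 words, assuming 4-byte words
    "MIPS": 2 ** 32,
    "x86-64": 2 ** 64,
}

# Prices per megabyte as quoted on the slides.
PRICE_PER_MB = {"SRAM": 10.00, "DRAM": 0.10}


def cost(machine, technology):
    """Dollar cost of filling a machine's whole address space."""
    megabytes = ADDRESS_SPACE_BYTES[machine] / MB
    return megabytes * PRICE_PER_MB[technology]


for tech in PRICE_PER_MB:
    for machine in ADDRESS_SPACE_BYTES:
        print(f"{tech:4} for {machine:6}: ${cost(machine, tech):,.2f}")
```

The results land close to the rounded figures on the slides, for example $40,960 of SRAM or about $410 of DRAM for a full MIPS address space.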

61  About 10,000,000 ns access (snore!) ◦ We could have stopped with the Intel 4004  Costs are pretty reasonable ◦ Disk storage costs ~ $0.0003 per megabyte  Basically free for LC2K  $1.50 for MIPS  $66 billion for x86-64

62  About 100,000,000,000 ns access ◦ Time to load tape and wind it to the right position ◦ Faster than chiseling it on a stone tablet  Costs are pretty reasonable ◦ Tape storage costs $0.00017 per megabyte (about ½ the cost of disk)  Basically free for LC2K  $0.80 for MIPS  $35 billion for x86-64

63  About 50,000,000 ns access (about 5-10x Hard disks) ◦ Depends mostly on seeking out the data. ◦ Writing to this media is much slower.  Costs are pretty reasonable ◦ Disk storage costs $0.00002 per megabyte  Basically free for LC2K  $0.08 for MIPS  $400 million for x86-64

64  Use a small array of SRAM ◦ Big enough to hold whatever you use most often ◦ Small means fast! ◦ Small means cheap!  Use a larger amount of DRAM ◦ And hope that you rarely have to use it  Use a really big amount of Disk storage ◦ Disks are getting cheaper at a faster rate than we fill them up with data (for most people)  Don’t try to buy 2^64 bytes of anything ◦ It would take decades to format it anyway!

65  Use a small array of SRAM ◦ For the CACHE (hopefully for most accesses)  Use a bigger amount of DRAM ◦ For the Main memory  Use a really big amount of Disk storage ◦ For the Virtual memory (i.e. everything else)

66 [Figure: CPU, Cache, Main Memory, Disk Storage; axes: Cost, Latency, Access Freq.]

67  Hungry! must eat! ◦ Option 1: go to refrigerator  Found -> eat!  Latency = 1 minute ◦ Option 2: go to store  Found -> purchase, take home, eat!  Latency = 20-30 minutes ◦ Option 3: grow food!  Plant, wait … wait … wait …, harvest, eat!  Latency = ~250,000 minutes (~ 6 months)  Crazy fact: ◦ ratio of growing food : going to the store = 10,000 ◦ ratio of disk access : DRAM access = 200,000

68  The Architectural view of memory is: ◦ What the machine language sees ◦ Memory is just a big array of storage  Breaking up the memory system into different pieces – cache, main memory (made up of DRAM) and Disk storage – is not architectural. ◦ The machine language doesn’t know about it ◦ The processor may not know about it ◦ A new implementation may not break it up into the same pieces (or break it up at all).

69 [Figure: CPU at 351 GB/sec [6]; Main Memory at 3.4 GB/sec [7]; the link between them is labeled “Bottleneck”] The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.

70 Cache is much closer to the speed of the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second! [Figure: CPU; cache at 14.2 GB/sec (4x RAM) [7]; Main Memory at 3.4 GB/sec [7]]

71 Better

72  Many scientific codes use a lot more data than can fit in cache all at once.  Therefore, you need to ensure a high cache hit rate even though you’ve got much more data than cache.  So, how can you improve your cache hit rate?
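One way to see what "hit rate" means in practice is to replay an access pattern through a toy cache model. The sketch below assumes a direct-mapped cache with 64 lines of 8 array elements each and a 256 x 256 array stored row by row; these sizes and the function names are illustrative choices, not figures from the slides.

```python
def hit_rate(addresses, n_lines=64, line_size=8):
    """Fraction of accesses that hit in a toy direct-mapped cache."""
    tags = [None] * n_lines          # which memory block each line holds
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size    # memory block containing this element
        index = block % n_lines      # the one cache line it may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block      # miss: load the block, evict the old one
        total += 1
    return hits / total if total else 0.0


def row_major(rows, cols):
    """Visit a row-by-row stored array one row at a time (stride 1)."""
    return [r * cols + c for r in range(rows) for c in range(cols)]


def column_major(rows, cols):
    """Visit the same array one column at a time (stride = cols)."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


print(hit_rate(row_major(256, 256)))
print(hit_rate(column_major(256, 256)))
```

Both traversals touch exactly the same data, but in this model the stride-1 order reuses each loaded line seven more times while the column order evicts every line before it can be reused. Visiting data in the order it sits in memory is one common way to raise the hit rate.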
