Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder.

Chapter 5: Scalable Algorithmic Techniques

Scalable program construction
– Scalability can often be improved by using a larger problem size
– Focus on data parallelism

Ideal parallel computation
– Large blocks of independent computation
– Examples: BOINC projects at Berkeley, such as SETI@home
– These kinds of projects are atypical

Important principle
Parallel programs are more scalable when they emphasize blocks of computation (typically, the larger the block the better) that minimize inter-thread dependencies.

Schwartz's algorithm
– Use the tree to connect processes rather than individual items
– Given P < n:
  – Encode as in 1.3
  – Each process adds n/P items locally; the P intermediate sums are then combined with a P-leaf tree that connects the processes
– All processes work directly on the problem
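The two phases above can be made concrete with a small sketch. It simulates P processes sequentially in Python (the function name `schwartz_sum` and the block split are our own assumptions, not the book's Peril-L code): each process sums its block of about n/P values, then partial sums are combined pairwise with a doubling stride, so process 0 participates at every level of the tree, as in Figure 5.1.

```python
def schwartz_sum(values, P):
    """Sequential simulation of Schwartz's algorithm for summation (hypothetical helper).

    Phase 1: each of P processes sums its own block of about n/P values.
    Phase 2: the P partial sums are combined pairwise in a P-leaf tree; the
    lower-numbered process of each pair absorbs its partner's sum, so process 0
    takes part at every level.
    """
    n = len(values)
    block = -(-n // P)  # ceiling of n / P
    # Local phase: process p sums values[p*block : (p+1)*block].
    tally = [sum(values[p * block:(p + 1) * block]) for p in range(P)]
    # Tree phase: the stride doubles at each level of the combining tree.
    stride = 1
    while stride < P:
        for p in range(0, P, 2 * stride):
            if p + stride < P:  # partner exists (also handles P not a power of two)
                tally[p] += tally[p + stride]
        stride *= 2
    return tally[0]


print(schwartz_sum(list(range(1, 101)), 8))  # 5050
```

The tree has only P leaves, so its depth is about log2 P rather than log2 n; most of the work is the independent local summation.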

Figure 5.1 Schwartz's algorithm: process-induced tree. Each process computes locally on a sequence of values (heavy lines) and then combines the results pairwise, inducing a tree; notice that process 0 participates at each level of the tree.

Figure 5.2 Schwartz's algorithm inducing the tree of Figure 5.1. Line 8 loads the locally computed value into the tree; line 14 performs the summation when both operands are available. Threads exit when they have nothing left to do.

Advocate the use of reduce and scan
– Even when the programming language does not provide them, code them as functions
– They are high level
– They convey information about the program logic

Reduce and scan are common and important
Reduce
– Combines a set of values into a single result, either to compare them (e.g., min, max) or to combine them (e.g., sum)
Scan
– Parallel prefix
– Performs a sequential operation in parts
– Carries intermediate results along

Kinds of scans, given A = {2, 4, 6}
Inclusive: +\A = {2, 6, 12}
– Used by Peril-L
Exclusive: +\A = {0, 2, 6}
– The first item is the identity element of the operator (0 for +)
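For a quick sequential check of the definitions above, Python's standard library already expresses reduce and both kinds of scan. This only illustrates the semantics, not a parallel implementation; the `initial` argument of `itertools.accumulate` requires Python 3.8 or later.

```python
import operator
from functools import reduce
from itertools import accumulate

A = [2, 4, 6]

total = reduce(operator.add, A)                  # +/A  -> 12
inclusive = list(accumulate(A, operator.add))    # inclusive +\A -> [2, 6, 12]
# Exclusive scan: start from the identity of + (0) and drop the final running total.
exclusive = list(accumulate(A, operator.add, initial=0))[:-1]   # -> [0, 2, 6]

print(total, inclusive, exclusive)
```

The exclusive result is the inclusive result shifted right by one position, with the identity element filling the first slot.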

Examples of reduces
Second-smallest array element
– Keep two values, smallest and secondSmallest
– If an array value is smaller than either, update them accordingly
Histogram with k intervals
– Use min and max reduces to find the smallest and largest values
– Initialize a k-element array, hist, to 0s
– Iterate through the data, counting the interval each value belongs to
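The histogram recipe can be sketched as a reduce. We assume equal-width intervals and a split of the data into P blocks (both our choices): each hypothetical process counts its own block, and the partial histograms are combined by elementwise addition. The names `local_hist`, `combine`, and `histogram` are illustrative only.

```python
from functools import reduce


def local_hist(block, lo, width, k):
    """Count one block's values into k equal-width intervals starting at lo."""
    hist = [0] * k                       # initialize the k-element array to 0s
    for x in block:
        i = int((x - lo) / width) if width > 0 else 0
        hist[min(i, k - 1)] += 1         # the maximum value lands in the last interval
    return hist


def combine(h1, h2):
    """Combine two partial histograms by elementwise addition."""
    return [a + b for a, b in zip(h1, h2)]


def histogram(data, k, P=4):
    lo, hi = min(data), max(data)        # min and max reduces find the range
    width = (hi - lo) / k
    size = -(-len(data) // P)            # block size per (simulated) process
    blocks = [data[p * size:(p + 1) * size] for p in range(P)]
    return reduce(combine, (local_hist(b, lo, width, k) for b in blocks))


print(histogram(list(range(10)), 5))     # [2, 2, 2, 2, 2]
```

Because elementwise addition is associative, the partial histograms could be combined in a tree just like the sums in Schwartz's algorithm.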

Examples (cont.)
Length of the longest run of consecutive 1s
– Initialize current = 0, longest = 0
– current is the length of the current run of 1s; longest is the longest run seen so far
– The answer is max(current, longest)
Index of the first x
– Create a 2-element temp array
– temp[0] = x, temp[1] = +infinity
– Iterate looking for x, keeping the smaller of the saved index and the found index
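The current/longest recipe above is sequential. To combine results from several processes, each block's tally needs more information, because a run can cross a block boundary. In the sketch below we assume a tally of (length, leading 1s, trailing 1s, longest run); this tally design is our own, not the book's. The first-x reduce follows the slide, with `math.inf` standing for "not found yet".

```python
import math
from functools import reduce


def run_tally(block):
    """Tally for a block of 0s/1s: (length, leading 1s, trailing 1s, longest run)."""
    longest = current = 0
    for b in block:
        current = current + 1 if b == 1 else 0
        longest = max(longest, current)
    lead = 0
    while lead < len(block) and block[lead] == 1:
        lead += 1
    return (len(block), lead, current, longest)   # current is now the trailing run


def run_combine(left, right):
    ln, lpre, lsuf, lbest = left
    rn, rpre, rsuf, rbest = right
    pre = lpre if lpre < ln else ln + rpre        # an all-1s left block extends into right
    suf = rsuf if rsuf < rn else rn + lsuf
    best = max(lbest, rbest, lsuf + rpre)         # a run may span the boundary
    return (ln + rn, pre, suf, best)


def longest_run(bits, P=4):
    size = -(-len(bits) // P)
    blocks = [bits[p * size:(p + 1) * size] for p in range(P)]
    return reduce(run_combine, map(run_tally, blocks))[3]


def first_index(values, x, P=4):
    """Index of the first x, or math.inf if x does not occur."""
    size = -(-len(values) // P)

    def local(p):
        temp = [x, math.inf]                      # temp[1] = +infinity: not found yet
        for i in range(p * size, min((p + 1) * size, len(values))):
            if values[i] == x:
                temp[1] = min(temp[1], i)
        return temp

    partials = [local(p) for p in range(P)]
    return reduce(lambda a, b: [x, min(a[1], b[1])], partials)[1]


print(longest_run([1, 1, 0, 1, 1, 1, 0, 1]))   # 3
print(first_index([5, 3, 7, 3, 9], 3))         # 1
```

Carrying the leading and trailing runs is what lets `run_combine` be applied in any tree shape while still giving the same answer as the sequential loop.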

Basic structure of reduce and scan
A local variable, tally, stores intermediate results.
Functions:
– init() initializes the tally
– accum() performs the local accumulation
– combine() composes intermediate tally results and passes them to the parent
– x-gen() takes the global result and generates the final answer
The details vary between scan and reduce.

Example: +/A (reduce)
– init(): tally = 0
– accum(tally, val): tally = tally + val
– combine(left, right): adds the left and right tally values and passes the sum to the parent
– reduce-gen(root): has nothing to do; returns its argument as the global result
The logic is shown in Figure 5.3.
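A sketch of how the four component functions fit together follows, with P processes simulated sequentially in Python rather than Peril-L. One assumption to note: here `accum` returns the new tally instead of updating it in place.

```python
def generalized_reduce(data, P, init, accum, combine, reduce_gen):
    """Sequential simulation of the generalized reduce on P hypothetical processes."""
    size = -(-len(data) // P)
    tallies = []
    for p in range(P):
        tally = init()                          # init(): per-process tally
        for val in data[p * size:(p + 1) * size]:
            tally = accum(tally, val)           # accum(): local accumulation
        tallies.append(tally)
    # Combining tree: pair up neighbours level by level until one tally is left.
    while len(tallies) > 1:
        nxt = [combine(tallies[i], tallies[i + 1]) for i in range(0, len(tallies) - 1, 2)]
        if len(tallies) % 2:                    # an unpaired tally moves up unchanged
            nxt.append(tallies[-1])
        tallies = nxt
    return reduce_gen(tallies[0])               # x-gen(): produce the final answer


# +/A: tally = 0, accum adds a value, combine adds two tallies, reduce-gen returns its argument.
plus_reduce = dict(
    init=lambda: 0,
    accum=lambda tally, val: tally + val,
    combine=lambda left, right: left + right,
    reduce_gen=lambda root: root,
)

print(generalized_reduce([3, 1, 4, 1, 5, 9, 2, 6], 4, **plus_reduce))  # 31
```

Only the four small functions change from one reduce to another; the driver, like the generalized logic of Figure 5.4, stays the same.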

Figure 5.3

Figure 5.4 Peril-L code for the generalized reduce logic. Notice the sites for the four component functions. The tree combining relies on full/empty memory, which drives the tree accumulation. As threads complete their roles in the combining tree, they terminate.

Figure 5.5 The four generalized reduce functions implementing the secondMin reduce. The tally is a two-element struct.
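A Python analogue of the four secondMin functions is sketched below. It is not the book's Figure 5.5 code. The tally is a (smallest, secondSmallest) pair. Duplicates count separately, so [5, 5, 2] gives 5, which is our assumption. To keep the block short, the per-process partial tallies are folded left to right with `functools.reduce` instead of a tree.

```python
import math
from functools import reduce


def init():
    return (math.inf, math.inf)            # (smallest, secondSmallest)


def accum(tally, val):
    first, second = tally
    if val < first:
        return (val, first)                # new smallest; the old smallest becomes second
    if val < second:
        return (first, val)
    return tally


def combine(left, right):
    # Fold the right tally's two values into the left tally.
    return accum(accum(left, right[0]), right[1])


def reduce_gen(root):
    return root[1]                         # the answer is the second-smallest value


def second_min(data, P=4):
    size = -(-len(data) // P)
    partials = [reduce(accum, data[p * size:(p + 1) * size], init()) for p in range(P)]
    return reduce_gen(reduce(combine, partials))


print(second_min([7, 3, 9, 1, 8, 4]))      # 3
```

`combine` is correct because the two smallest values of a union must come from the two smallest values of each part.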

Generalized scan
Like reduce, except that after combining, the intermediate results are passed back down the combining tree. The value that each node receives from its parent is the tally of all values to the left of its own leftmost leaf.
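The up sweep and down sweep can be sketched as a sequential Python simulation. It is not Figure 5.7. The balanced recursive split of the P leaves and the `scan_gen(ptally, block)` signature are our assumptions. During the down sweep, a left child receives its parent's value unchanged. A right child receives that value combined with its left sibling's tally. In both cases the value is the tally of everything left of that child's leftmost leaf.

```python
from itertools import accumulate


def generalized_scan(values, P, init, accum, combine, scan_gen):
    """Sequential simulation of a generalized scan on P hypothetical processes."""
    size = -(-len(values) // P)
    blocks = [values[p * size:(p + 1) * size] for p in range(P)]

    # Up sweep, local part: each process accumulates its own block.
    tallies = []
    for block in blocks:
        tally = init()
        for v in block:
            tally = accum(tally, v)
        tallies.append(tally)

    # Up sweep, tree part: remember the tally of every subtree [lo, hi).
    subtree = {}

    def up(lo, hi):
        if hi - lo == 1:
            subtree[(lo, hi)] = tallies[lo]
        else:
            mid = (lo + hi) // 2
            subtree[(lo, hi)] = combine(up(lo, mid), up(mid, hi))
        return subtree[(lo, hi)]

    # Down sweep: `left` is the tally of all values left of this subtree's leftmost leaf.
    ptally = [None] * P

    def down(lo, hi, left):
        if hi - lo == 1:
            ptally[lo] = left
            return
        mid = (lo + hi) // 2
        down(lo, mid, left)                               # left child: same value
        down(mid, hi, combine(left, subtree[(lo, mid)]))  # right child: add left sibling

    up(0, P)
    down(0, P, init())
    # scan-gen: each process rescans its block, starting from its ptally.
    result = []
    for block, pt in zip(blocks, ptally):
        result.extend(scan_gen(pt, block))
    return result


# Inclusive +\ instantiation.
plus_scan = dict(
    init=lambda: 0,
    accum=lambda t, v: t + v,
    combine=lambda a, b: a + b,
    scan_gen=lambda pt, block: list(accumulate(block, initial=pt))[1:],
)

print(generalized_scan([2, 4, 6, 1, 3, 5, 7], 3, **plus_scan))
# [2, 6, 12, 13, 16, 21, 28]
```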

Examples of scan
– Team standings
– Longest sequence of 1s
– Index of last occurrence

Figure 5.6

Figure 5.7 Generalized scan program. The down sweep of the tally values, beginning on line 35, distributes intermediate results to all threads to compute the final result (line 44).

Figure 5.7 (cont.)

Example of scan: index of last occurrence
Given an array A of integers in the range 1, …, k, lastOccurrence\A returns, at each position i, the index of the most recent previous occurrence of the value A[i].
– accum stores i in tally[j], where j = A[i], recording the last occurrence of j
– combine takes the elementwise max of the two tallies
– The scan generator reprocesses its block of data, using ptally as its initial value

Figure 5.8 Customized scan functions that return the index of the last occurrence of the element in the ith operand position; the tally is a globally allocated array of k elements.
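The customized functions can be sketched as follows. This is a hypothetical Python version, not the book's code. It assumes two things: values are 1..k, and the output is -1 when a value has no earlier occurrence. To stay short, a left-to-right prefix over the P block tallies stands in for the tree's down sweep. It computes the same ptally values.

```python
def last_occurrence(A, k, P=4):
    """lastOccurrence\\A sketch: out[i] = index of the previous A[i], or -1 (assumed sentinel)."""
    n = len(A)
    size = -(-n // P)
    # accum: tally[j - 1] holds the last index at which value j appears in the block.
    tallies = []
    for p in range(P):
        tally = [-1] * k
        for i in range(p * size, min((p + 1) * size, n)):
            tally[A[i] - 1] = i
        tallies.append(tally)
    # combine: elementwise max. ptally[p] summarizes everything left of block p.
    ptally, running = [], [-1] * k
    for tally in tallies:
        ptally.append(running)
        running = [max(a, b) for a, b in zip(running, tally)]
    # scan-gen: reprocess each block, using ptally as the initial value.
    out = []
    for p in range(P):
        tally = list(ptally[p])
        for i in range(p * size, min((p + 1) * size, n)):
            out.append(tally[A[i] - 1])   # most recent earlier occurrence of A[i]
            tally[A[i] - 1] = i
    return out


print(last_occurrence([1, 2, 1, 3, 2, 1], 3, P=2))   # [-1, -1, 0, -1, 1, 2]
```

Max works as the combine operator because later indices are always larger, so the larger index is the more recent occurrence.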

Assigning work to processes statically
Block allocations
– Exploit locality
– Are better than allocating complete rows
– Yield less communication: a 4 × 4 block has 16 edge elements, whereas a 16-element row has 2 × 16 = 32 edge elements
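The 16-versus-32 arithmetic can be checked with a small helper. We read "edge elements" as the neighbouring values a process must obtain from other processes under a four-point stencil on a 16 × 16 array; this interpretation and the helper `edge_elements` are our assumptions. Sides lying on the array boundary need no data.

```python
def edge_elements(r0, r1, c0, c1, n):
    """Neighbouring values needed by a 4-point stencil for a process that owns
    rows r0..r1-1 and columns c0..c1-1 of an n x n array."""
    height, width = r1 - r0, c1 - c0
    count = 0
    count += width if r0 > 0 else 0     # row above the block
    count += width if r1 < n else 0     # row below the block
    count += height if c0 > 0 else 0    # column to the left
    count += height if c1 < n else 0    # column to the right
    return count


n = 16
print(edge_elements(4, 8, 4, 8, n))     # interior 4 x 4 block -> 16
print(edge_elements(5, 6, 0, 16, n))    # one complete 16-element row -> 32
```

Both allocations give each process 16 elements, but the square block touches half as many foreign values.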

Figure 5.9

Overlap regions
– Stencil computations reference neighboring elements
– Allocate extra space around each local block to hold copies of the neighbors' values
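Overlap regions can be illustrated with a local array that carries a one-element border. The helpers `with_halo` and `stencil_step` are hypothetical, and we assume that border cells outside the global array hold 0.0. The border is filled once before the sweep, and the stencil then reads only local memory.

```python
def with_halo(grid, r0, r1, c0, c1):
    """Copy block rows r0..r1-1, cols c0..c1-1 of `grid` into a local array with a
    one-element overlap region on every side. Cells outside the global array are
    filled with 0.0 (an assumed boundary condition)."""
    n_rows, n_cols = len(grid), len(grid[0])
    local = []
    for i in range(r0 - 1, r1 + 1):
        row = []
        for j in range(c0 - 1, c1 + 1):
            inside = 0 <= i < n_rows and 0 <= j < n_cols
            row.append(grid[i][j] if inside else 0.0)
        local.append(row)
    return local


def stencil_step(local):
    """Four-point average over the interior of a haloed local array; the overlap
    region is read but not written, so the sweep needs no communication."""
    h, w = len(local) - 2, len(local[0]) - 2
    return [[(local[i - 1][j] + local[i + 1][j] + local[i][j - 1] + local[i][j + 1]) / 4.0
             for j in range(1, w + 1)]
            for i in range(1, h + 1)]


grid = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(stencil_step(with_halo(grid, 1, 3, 1, 3)))   # [[5.0, 6.0], [9.0, 10.0]]
```

In a real distributed program the overlap cells would be refreshed from neighbouring processes between sweeps; here they are simply copied from the global array.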

Figure 5.10

Cyclic and block-cyclic allocations
– Block allocations may result in poor load balance when the work is not proportional to the amount of data
– Processes that own the black and white portions have less work to do
– After 25% of the work is done, 7 processes have nothing to do
– The last 25% is done just by P …

Figure 5.11

Solution: use a cyclic distribution
– Allocate elements to processes in a round-robin fashion
– Cyclic allocation balances hot spots
– Small blocks incur overhead from communication with neighbors
– Small blocks do not exploit locality
– The block size must be chosen carefully
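The three allocation schemes can be written as owner functions that map array indices to a process number. These are sketches under stated assumptions, not the book's definitions. Cyclic allocation deals elements round-robin in row-major order. Block-cyclic allocation deals blocks over a 2-D grid of processes, and we assume a 2 × 2 grid for the four processes of Figure 5.13.

```python
from collections import Counter


def block_owner(i, j, n, P_rows, P_cols):
    """2-D block allocation: an n x n array split into P_rows x P_cols blocks."""
    return (i * P_rows // n) * P_cols + (j * P_cols // n)


def cyclic_owner(i, j, n_cols, P):
    """Cyclic allocation: elements dealt round-robin in row-major order (assumed order)."""
    return (i * n_cols + j) % P


def block_cyclic_owner(i, j, block_rows, block_cols, P_rows, P_cols):
    """Block-cyclic allocation: block_rows x block_cols blocks dealt round-robin
    over a P_rows x P_cols grid of processes."""
    return ((i // block_rows) % P_rows) * P_cols + (j // block_cols) % P_cols


# 8 x 8 array, cyclic over five processes: 64 = 5 * 12 + 4 elements.
print(Counter(cyclic_owner(i, j, 8, 5) for i in range(8) for j in range(8)))
# 3 x 2 blocks of a 14 x 14 array over a (hypothetical) 2 x 2 process grid.
print(Counter(block_cyclic_owner(i, j, 3, 2, 2, 2) for i in range(14) for j in range(14)))
```

Setting the block size to 1 × 1 turns block-cyclic into a cyclic allocation. Setting it to the full n/P_rows × n/P_cols turns it into a block allocation. The block size is the tuning knob between load balance and locality.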

Figure 5.12 Illustration of a cyclic distribution of an 8 × 8 array onto five processes.

Figure 5.13 Block-cyclic allocation of 3 × 2 blocks to a 14 × 14 array distributed to four processes (colors).

Figure 5.14 The block-cyclic allocation of Figure 5.13 midway through the computation; the blocks to the right summarize the active values for each process.

Julia sets need load balancing
– $z_{n+1} = z_n^2 + c$
– c is a complex coefficient that determines the shape of the set
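Julia sets need load balancing because the number of iterations of $z_{n+1} = z_n^2 + c$ varies from point to point. The sketch below measures that variation per image row and sums it per process under block and cyclic row assignments. Several details are illustrative choices, not values from the text: the escape radius of 2, the 64 × 64 grid over [-1.5, 1.5]², the iteration cap, and the value of c.

```python
def escape_iterations(z, c, max_iter=100):
    """Iterate z_{n+1} = z_n^2 + c from z_0 = z; return the number of steps before
    |z| exceeds 2, or max_iter if it never does (the point is treated as in the set)."""
    for k in range(max_iter):
        if abs(z) > 2.0:
            return k
        z = z * z + c
    return max_iter


def row_work(c, width=64, height=64, max_iter=100):
    """Total iterations (a proxy for work) for each image row over [-1.5, 1.5]^2."""
    work = []
    for r in range(height):
        y = -1.5 + 3.0 * r / (height - 1)
        work.append(sum(escape_iterations(complex(-1.5 + 3.0 * col / (width - 1), y), c, max_iter)
                        for col in range(width)))
    return work


def load_per_process(work, P, cyclic):
    """Sum the work per process for a block or a cyclic assignment of rows."""
    loads = [0] * P
    size = -(-len(work) // P)
    for r, w in enumerate(work):
        loads[r % P if cyclic else r // size] += w
    return loads


work = row_work(complex(-0.8, 0.156))      # c chosen only for illustration
print("block :", load_per_process(work, 4, cyclic=False))
print("cyclic:", load_per_process(work, 4, cyclic=True))
```

Comparing the two printed lists shows how evenly each assignment spreads the same total work; the maximum entry bounds the parallel running time.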

Figure 5.15 Julia set generated from the site

Figure 5.16 Example of an unstructured grid representing the pressure distribution on two airfoils. Image from

Assigning work dynamically
Work queue
– A data structure for dynamically assigning work to threads or processes
– Tasks are added at one end and removed from the other
– An example of producer/consumer
– The text's example is the Collatz Conjecture
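A minimal producer/consumer work queue is sketched below using Python's `queue.Queue` and `threading`. Each task is a range of Collatz starting values. As a stand-in for the book's expansion factor, whose exact definition is in Figure 5.17 and may differ, we record the largest value each trajectory reaches. Because of CPython's global interpreter lock, these threads illustrate the structure of a work queue rather than a real speedup.

```python
import queue
import threading


def collatz_peak(n):
    """Largest value reached on the Collatz trajectory starting at n (n >= 1)."""
    peak = n
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        peak = max(peak, n)
    return peak


def run_work_queue(limit, num_workers=4, chunk=100):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            task = tasks.get()              # consumer: remove a task from the front
            if task is None:                # sentinel: no more work
                break
            lo, hi = task
            local = {n: collatz_peak(n) for n in range(lo, hi)}
            with lock:
                results.update(local)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # Producer: add ranges of starting values at the back of the queue.
    for lo in range(1, limit + 1, chunk):
        tasks.put((lo, min(lo + chunk, limit + 1)))
    for _ in threads:
        tasks.put(None)                     # one sentinel per worker
    for t in threads:
        t.join()
    return results


peaks = run_work_queue(1000)
print(peaks[27])                            # 9232
```

Workers that draw cheap ranges simply come back to the queue sooner, so the load balances itself without knowing each task's cost in advance.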

Figure 5.17 Code for computing the expansion factor for the Collatz Conjecture.

Figure 5.17 (cont.)

Hoard memory allocator
– Solves the problem of memory that is allocated in one process and freed in another
– Principles:
  – Limit local memory usage
  – Manage memory in large blocks …

Trees
Challenges
– Pointers are local to one process's address space
– Trees are dynamic, which may cause communication issues
– Their irregular structure makes it hard to reason about communication and load balancing

Figure 5.18 Cap allocation for a binary tree on P = 8 processes. Each process is allocated one of the leaf subtrees, along with a copy of the cap (shaded).
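A cap allocation can be expressed as an owner function on a complete binary tree. We assume heap indexing (root = 1, children of i are 2i and 2i + 1) and P a power of two; neither assumption comes from the text. The top log2 P levels form the cap, which every process holds a copy of. Every deeper node belongs to the process whose subtree root is its ancestor.

```python
def cap_owner(node, P):
    """Owner of a heap-indexed node under a cap allocation on P processes
    (P a power of two). Returns None for cap nodes, which are replicated."""
    d = P.bit_length() - 1                 # depth of the P leaf-subtree roots
    depth = node.bit_length() - 1
    if depth < d:
        return None                        # cap node: every process keeps a copy
    root = node >> (depth - d)             # ancestor at depth d, index in [P, 2P)
    return root - P


print([cap_owner(i, 8) for i in range(1, 8)])    # cap: all None
print([cap_owner(i, 8) for i in range(8, 16)])   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Any operation that stays below the cap needs no communication. Only updates that reach the cap must be propagated to the replicated copies.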

Figure 5.19 Logical tree representations: (a) a binary tree where P = 8; (b) a binary tree where P = 6.

Figure 5.20 Enumerating the Tic-Tac-Toe game tree; a process is assigned to search the games beginning with each of the four initial move sequences.
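The prefix-based division of the game tree can be sketched as follows. Each subtree is searched independently given only its starting move sequence. The assignment of the nine opening moves round-robin to P processes is a hypothetical scheme and differs from the figure's four sequences. Summed over all openings, the count gives the well-known total of 255,168 possible Tic-Tac-Toe games.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def has_won(board, player):
    return any(board[a] == board[b] == board[c] == player for a, b, c in LINES)


def count_games(board, player):
    """Number of complete games reachable from `board` with `player` to move.
    A game ends when someone completes a line or the board is full."""
    other = "O" if player == "X" else "X"
    total = 0
    for sq in range(9):
        if board[sq] is None:
            board[sq] = player
            if has_won(board, player) or None not in board:
                total += 1
            else:
                total += count_games(board, other)
            board[sq] = None
    return total


def games_for_prefix(prefix):
    """Search the subtree of games that begin with the given move sequence."""
    board = [None] * 9
    player = "X"
    for sq in prefix:
        board[sq] = player
        player = "O" if player == "X" else "X"
    return count_games(board, player)


def assign_prefixes(P):
    """Deal the nine opening moves round-robin to P processes (hypothetical scheme)."""
    return {p: [(m,) for m in range(9) if m % P == p] for p in range(P)}


work = assign_prefixes(4)
print({p: sum(games_for_prefix(pre) for pre in prefixes) for p, prefixes in work.items()})
```

With four processes, process 0 receives three openings and the others two, and subtree sizes differ by opening, so the per-process totals are uneven. Splitting on longer prefixes gives more, smaller tasks that are easier to balance.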