Parallel Prefix and Data Parallel Operations


Motivation: basic parallel operations that occur repeatedly.
Let $\oplus$ be an associative operation: $(a_1 \oplus a_2) \oplus a_3 = a_1 \oplus (a_2 \oplus a_3)$.
How can we compute $a_1 \oplus a_2 \oplus \cdots \oplus a_n$ in parallel in $O(\log n)$ time?

Approach 1
Assume that n = 2^k. Step i combines elements that are d = 2^i positions apart:
    for i = 0 to k-1
        for j = 0 to n-1-2^i do in parallel
            x[j + 2^i] = x[j] + x[j + 2^i]
Trace for n = 8, where [l:r] denotes a_l + ... + a_r:
    start:  a0     a1     a2     a3     a4     a5     a6     a7
    d=1:    [0:0]  [0:1]  [1:2]  [2:3]  [3:4]  [4:5]  [5:6]  [6:7]
    d=2:    [0:0]  [0:1]  [0:2]  [0:3]  [1:4]  [2:5]  [3:6]  [4:7]
    d=4:    [0:0]  [0:1]  [0:2]  [0:3]  [0:4]  [0:5]  [0:6]  [0:7]
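The doubling loop of Approach 1 can be simulated sequentially. The sketch below is only an illustration: the name `prefix_scan` and the `op` parameter are not from the slides, and the snapshot `old` stands in for the assumption that all processors read before any of them writes.

```python
def prefix_scan(values, op=lambda a, b: a + b):
    """Inclusive prefix of `values` under an associative `op` (Approach 1)."""
    x = list(values)
    n = len(x)
    d = 1                      # d = 2^i, the distance combined in step i
    while d < n:
        old = x[:]             # snapshot: parallel reads happen before writes
        for j in range(n - d): # "do in parallel" on a real machine
            x[j + d] = op(old[j], old[j + d])
        d *= 2
    return x


if __name__ == "__main__":
    print(prefix_scan([1, 2, 3, 4, 5, 6, 7, 8]))
```

The outer loop runs about log2(n) times, which is where the O(log n) parallel time comes from when each inner loop is one parallel step.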

How to do on a Tree Architecture?
For each node:
    if there is a signal from the left (S_l) and from the right (S_r), S_t <- S_l + S_r
    if there is a signal R, send R to both its children
    if the node is a leaf and there is a signal R, X <- X + R
(Figure labels: S_l, S_r, S_t, R.)
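A sequential sketch of the tree rules follows. The transcript does not say where the signal R originates, so this sketch assumes that each internal node, after forming S_t, sends S_l into its right subtree as R. That assumption, and the name `tree_prefix`, are not from the slides.

```python
def tree_prefix(leaves):
    """Inclusive prefix sums over the leaves of a binary tree.

    Returns (prefix values at the leaves, total at the root).
    """
    x = list(leaves)

    def up(lo, hi):
        # Handles the subtree covering leaves lo .. hi-1 and returns its S_t.
        if hi - lo == 1:
            return leaves[lo]
        mid = (lo + hi) // 2
        s_l = up(lo, mid)
        s_r = up(mid, hi)
        # Assumed step: S_l travels down the right subtree as signal R.
        # Every node forwards R to both children; each leaf does X <- X + R.
        for leaf in range(mid, hi):
            x[leaf] += s_l
        return s_l + s_r       # S_t <- S_l + S_r, sent to the parent

    total = up(0, len(x)) if x else 0
    return x, total


if __name__ == "__main__":
    print(tree_prefix([1, 2, 3, 4]))
```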

How to do on a Hypercube
A complete binary tree can be embedded into a hypercube.
Simpler solution: each node computes both its prefix and the total sum.
    for i = 0 to k-1
        for j = 0 to n-1 do in parallel
            if the i-th bit of j is 1 then x[j] = x[j] + sum[j^(i)]
            sum[j] = sum[j] + sum[j^(i)]
Here j^(i) is the node whose binary representation equals that of j except in the i-th bit, which is complemented.
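The hypercube rule can be checked with a small simulation. This is a sketch under the assumption that all exchanges along dimension i happen simultaneously (modelled by the snapshot `old`); the names `hypercube_prefix` and `total` are illustrative, with `total` playing the role of `sum` on the slide.

```python
def hypercube_prefix(values):
    """Prefix and total sum on a simulated hypercube with n = 2^k nodes."""
    n = len(values)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    x = list(values)       # x[j] becomes a_0 + ... + a_j
    total = list(values)   # total[j] becomes the sum over j's current subcube
    for i in range(k):
        old = total[:]     # values exchanged along dimension i
        for j in range(n):
            partner = j ^ (1 << i)   # j with its i-th bit complemented
            if j & (1 << i):
                x[j] += old[partner]
            total[j] = old[j] + old[partner]
    return x, total


if __name__ == "__main__":
    print(hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
```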

Prefix on Hypercube
    for i = 0 to k-1
        for j = 0 to n-1 do in parallel
            if the i-th bit of j is 1 then x[j] = x[j] + sum[j^(i)]
            sum[j] = sum[j] + sum[j^(i)]
Trace for n = 8 with inputs a0 ... a7, where [l:r] denotes a_l + ... + a_r:
    node j:    0      1      2      3      4      5      6      7
    d=1  X:    [0:0]  [0:1]  [2:2]  [2:3]  [4:4]  [4:5]  [6:6]  [6:7]
         SUM:  [0:1]  [0:1]  [2:3]  [2:3]  [4:5]  [4:5]  [6:7]  [6:7]
    d=2  X:    [0:0]  [0:1]  [0:2]  [0:3]  [4:4]  [4:5]  [4:6]  [4:7]
         SUM:  [0:3]  [0:3]  [0:3]  [0:3]  [4:7]  [4:7]  [4:7]  [4:7]
    d=4  X:    [0:0]  [0:1]  [0:2]  [0:3]  [0:4]  [0:5]  [0:6]  [0:7]
         SUM:  [0:7]  [0:7]  [0:7]  [0:7]  [0:7]  [0:7]  [0:7]  [0:7]

Applications of Data Parallel Operations
Any associative operation. Examples:
– min, max, add
– adding two binary numbers
– finite state automata
– radix sort
– segmented prefix sum
– routing: packing, unpacking, broadcast (copy-scan)
– solving recurrence equations
– straight-line computation (parallel arithmetic evaluation)

Adding two n-bit numbers as parallel prefix
$a = a_{n-1} \ldots a_0$, $b = b_{n-1} \ldots b_0$, $s = a + b$.
Note that $s_i = a_i \oplus b_i \oplus c_{i-1}$, where $\oplus$ is exclusive or and $c_{i-1}$ is the carry into bit $i$.
To compute $c_i$, define $g_i = a_i \wedge b_i$ (generate) and $p_i = a_i \oplus b_i$ (propagate).
Define $\otimes$ as: $(g, p) \otimes (g', p') = (g \vee (p \wedge g'),\; p \wedge p')$.
Then the carry bit $c_i$ can be computed by $(G_i, P_i) = (g_i, p_i) \otimes (g_{i-1}, p_{i-1}) \otimes \cdots \otimes (g_0, p_0)$, and $G_i = c_i$.
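The carry computation can be sketched as a prefix scan over (g, p) pairs. The sketch assumes the propagate bit is the exclusive or of the operand bits and that bit 0 is the least significant; `carry_op` and `add_by_prefix` are illustrative names, not from the slides.

```python
def carry_op(high, low):
    """(g, p) (x) (g', p') = (g or (p and g'), p and p')."""
    g, p = high
    g2, p2 = low
    return (g | (p & g2), p & p2)


def add_by_prefix(a, b, n):
    """Add two n-bit non-negative integers using a prefix scan for carries."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]  # (g_i, p_i)
    pref = list(gp)
    d = 1
    while d < n:
        old = pref[:]
        for j in range(n - d):
            # The higher-order block goes on the left of the operator.
            pref[j + d] = carry_op(old[j + d], old[j])
        d *= 2
    carry = [g for g, _ in pref]           # G_i = c_i
    total = 0
    for i in range(n):
        carry_in = carry[i - 1] if i > 0 else 0
        total |= (gp[i][1] ^ carry_in) << i
    return total | (carry[n - 1] << n)     # carry out becomes bit n


if __name__ == "__main__":
    print(add_by_prefix(5, 3, 4))
```

The operator is associative but not commutative, so the scan keeps the higher-order pair as the left argument throughout.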

Hardware circuit of recursive look-ahead adder

Parsing a regular language
Transition function: δ(q0,b) = q2, δ(q0,c) = q1, δ(q1,b) = q0, δ(q1,c) = qr, δ(q2,b) = qr, δ(q2,c) = q0; qr is the reject state.
Each input symbol acts as a map on the states; for example, b maps q0->q2, q1->q0, q2->qr.
(Figure: state diagram of the automaton and the per-symbol state maps combined along an input string; the layout is not recoverable from the transcript.)
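Because composing state maps is associative, the state after every input position can be obtained with a prefix scan over the per-symbol maps. The sketch below uses the transition function from the slide and assumes that qr is absorbing and that q0 is the start state; the function names and the sample input are illustrative only.

```python
STATES = ["q0", "q1", "q2", "qr"]
DELTA = {
    ("q0", "b"): "q2", ("q0", "c"): "q1",
    ("q1", "b"): "q0", ("q1", "c"): "qr",
    ("q2", "b"): "qr", ("q2", "c"): "q0",
}


def symbol_map(ch):
    """State map of one symbol; missing transitions go to qr (assumption)."""
    return {s: DELTA.get((s, ch), "qr") for s in STATES}


def compose(first, second):
    """Map that applies `first`, then `second`."""
    return {s: second[first[s]] for s in STATES}


def states_by_prefix(text, start="q0"):
    """State reached after each prefix of `text`, via a doubling scan."""
    maps = [symbol_map(ch) for ch in text]
    n = len(maps)
    d = 1
    while d < n:
        old = maps[:]
        for j in range(n - d):
            maps[j + d] = compose(old[j], old[j + d])
        d *= 2
    return [m[start] for m in maps]


if __name__ == "__main__":
    print(states_by_prefix("bccbc"))
```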

Segmented Prefix operation
(Figure: an array divided by segment boundaries, shown before and after the operation.)

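Segmented Prefix computation
Let $\oplus$ be any associative operation. For the segmented version of $\oplus$, define $\oplus'$ as follows, where $|x$ marks an element that starts a segment (rows give the left operand, columns the right operand):
    $\oplus'$     $b$               $|b$
    $a$           $a \oplus b$      $|b$
    $|a$          $|(a \oplus b)$   $|b$
Then $\oplus'$ is associative, and we can compute the segmented operation in $O(\log n)$ time.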
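The operator table can be turned into a small sketch in which each element is a pair (starts_segment, value). The pair encoding and the names `seg_op` and `segmented_scan` are assumptions made for illustration.

```python
def seg_op(left, right, op=lambda a, b: a + b):
    """Segmented version of `op` on (starts_segment, value) pairs."""
    left_flag, left_val = left
    right_flag, right_val = right
    if right_flag:                       # right operand starts a segment: |b
        return (True, right_val)
    return (left_flag, op(left_val, right_val))  # a+b, or |(a+b) if left is |a


def segmented_scan(values, starts, op=lambda a, b: a + b):
    """Inclusive prefix of `values` restarted wherever `starts` is true."""
    items = [(bool(s), v) for s, v in zip(starts, values)]
    n = len(items)
    d = 1
    while d < n:
        old = items[:]                   # simultaneous reads
        for j in range(n - d):
            items[j + d] = seg_op(old[j], old[j + d], op)
        d *= 2
    return [v for _, v in items]


if __name__ == "__main__":
    print(segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 0]))
```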

Enumerating
data = [ ]
active procs = [ ]
enumerated = [0 x 1 2 x x 3 x 4 x]
(x marks an inactive processor.)
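Enumeration can be read as an exclusive prefix sum over the active flags. In the sketch below the flag vector is an assumption chosen to match the pattern of x entries in the enumerated row; the slide's own data and flag values did not survive in the transcript.

```python
from itertools import accumulate


def enumerate_active(active):
    """Rank of each active processor among the active ones (None if inactive)."""
    # Exclusive prefix sum: number of active processors strictly before i.
    counts = [0] + list(accumulate(active))[:-1]
    return [c if flag else None for c, flag in zip(counts, active)]


if __name__ == "__main__":
    flags = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # assumed, not from the slide
    print(enumerate_active(flags))
```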

Packing
data = [ ]
active procs = [ ]
enumerated = [0 x 1 2 x x 3 x 4 x]
packed data = [ x x x x x]
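Packing moves each active element to the position given by its enumeration. The sketch below uses made-up data values, since the slide's values are missing from the transcript; `pack` is an illustrative name.

```python
from itertools import accumulate


def pack(data, active):
    """Move the active elements to the front, keeping their order."""
    ranks = [0] + list(accumulate(active))[:-1]   # enumerate step
    packed = [None] * len(data)                   # None plays the role of x
    for i, flag in enumerate(active):             # "do in parallel"
        if flag:
            packed[ranks[i]] = data[i]
    return packed


if __name__ == "__main__":
    data = [6, 1, 2, 3, 8, 7, 5, 4, 9, 0]         # hypothetical values
    flags = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]        # hypothetical flags
    print(pack(data, flags))
```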

Packing and Unpacking on Hypercube
Packing: adjust bit 0, adjust bit 1, ..., adjust bit k-1
Unpacking: adjust bit k-1, adjust bit k-2, ..., adjust bit 1, adjust bit 0
How about in the order of adjust bit 0, 1, ..., k-1 for packing?
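The bit-by-bit routing for packing can be simulated as follows. This is a sketch that assumes "adjust bit i" means: an item crosses dimension i when bit i of its current node differs from bit i of its destination rank. The function name is illustrative, and the check covers only two items meeting at one node, not link contention.

```python
def pack_on_hypercube(data, active):
    """Route active items to their ranks, fixing address bits 0, 1, ..., k-1."""
    n = len(data)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    items = []                               # [current node, destination, value]
    for src in range(n):
        if active[src]:
            items.append([src, len(items), data[src]])
    for i in range(k):                       # adjust bit i
        bit = 1 << i
        for item in items:
            if (item[0] ^ item[1]) & bit:
                item[0] ^= bit               # move across dimension i
        positions = [item[0] for item in items]
        assert len(set(positions)) == len(positions), "two items met at one node"
    packed = [None] * n
    for pos, _, value in items:
        packed[pos] = value
    return packed


if __name__ == "__main__":
    print(pack_on_hypercube(list(range(8)), [1, 0, 1, 1, 0, 0, 1, 0]))
```

In this order no two items share a node after any step: two items whose current nodes agree in the high bits started fewer than 2^(i+1) positions apart, so their ranks are also fewer than 2^(i+1) apart and must differ in the low bits.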

Unpacking
Address data = [ x x x x x]
active procs = [ ]
enumerated = [0 x 1 2 x x 3 x 4 x]
destination = [ x x x x x]
unpacked data = [6 x 2 3 x x 5 x 9 x]
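Unpacking reverses the packing step: each active processor fetches the packed element whose index equals its enumeration. In this sketch the packed values and the flags are assumptions inferred from the unpacked row shown on the slide; `unpack` is an illustrative name.

```python
from itertools import accumulate


def unpack(packed, active):
    """Scatter packed[0], packed[1], ... to the active positions, in order."""
    ranks = [0] + list(accumulate(active))[:-1]   # enumerate step
    return [packed[ranks[i]] if flag else None    # None plays the role of x
            for i, flag in enumerate(active)]


if __name__ == "__main__":
    packed = [6, 2, 3, 5, 9]                      # assumed from the unpacked row
    flags = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]        # assumed from the x pattern
    print(unpack(packed, flags))
```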

Copy Scan (broadcast)
address data = [ ]
segmented bit = [ ]
result = [ ]
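Copy-scan can be sketched as a segmented scan whose underlying operation keeps the left operand, so the first value of each segment is broadcast across it. The input values below are made up because the slide's arrays are empty in the transcript, and the sketch assumes the first element starts a segment.

```python
def copy_scan(data, segment_start):
    """Broadcast the first value of every segment to the whole segment."""
    items = [(bool(s), v) for s, v in zip(segment_start, data)]
    n = len(items)
    d = 1
    while d < n:
        old = items[:]                    # simultaneous reads
        for j in range(n - d):
            left, right = old[j], old[j + d]
            if not right[0]:
                # Same segment so far: copy the value (and flag) from the left.
                items[j + d] = left
            # Otherwise the right element already holds its segment's head.
        d *= 2
    return [v for _, v in items]


if __name__ == "__main__":
    print(copy_scan([5, 0, 0, 7, 0, 0, 0, 3], [1, 0, 0, 1, 0, 0, 0, 1]))
```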

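Radix Sort
    for j = 0 to k-1    // x has k bits; bit 0 is the least significant
        for all i in [0 .. n-1] do in parallel {
            if j-th bit of x[i] is 0 { y[i] = enumerate; c = count }
            if j-th bit of x[i] is 1 { y[i] = enumerate + c }
            x[y[i]] = x[i]
        }
Each pass keeps the relative order inside the 0 group and inside the 1 group, so the bits must be processed from least to most significant.
Radix sort, another code:
    for j = 0 to k-1    // x has k bits
        for all i in [0 .. n-1] do in parallel {
            pack left x[i] if j-th bit of x[i] is 0
            pack right x[i] if j-th bit of x[i] is 1
        }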
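One way to read the radix sort pseudocode is as a stable split per bit, built from two enumerations. The sketch below processes bits from least to most significant, which is an assumption about the intended bit numbering; the helper names are illustrative.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Number of set flags strictly before each position."""
    return [0] + list(accumulate(flags))[:-1]


def radix_sort(values, k):
    """Sort non-negative integers that fit in k bits."""
    x = list(values)
    n = len(x)
    for j in range(k):                       # least significant bit first
        ones = [(v >> j) & 1 for v in x]
        zeros = [1 - b for b in ones]
        rank0 = exclusive_scan(zeros)        # enumerate among the 0 group
        rank1 = exclusive_scan(ones)         # enumerate among the 1 group
        c = sum(zeros)                       # count of elements with bit 0
        out = [None] * n
        for i in range(n):                   # "do in parallel"
            y = rank0[i] if zeros[i] else c + rank1[i]
            out[y] = x[i]
        x = out
    return x


if __name__ == "__main__":
    print(radix_sort([5, 3, 7, 0, 2, 6, 1, 4], 3))
```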

Quick Sort
1. Pick a pivot p.
2. Broadcast p.
3. For all PE i, compare A[i] with p:
   if A[i] < p, pack left A[i] in the segment;
   if A[i] >= p, pack right A[i] in the segment.
4. Mark the segment boundary.
5. For each segment, quick sort recursively.
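The five steps can be sketched as rounds over a list of segments, where every segment is split around its own pivot in the same round. Two details are assumptions of this sketch rather than part of the slide: the pivot is the first element of the segment, and when no element is smaller than the pivot, the pivot is split off alone so the round still makes progress.

```python
def segmented_quicksort(a):
    """Quicksort expressed as rounds of per-segment pack-left / pack-right."""
    segments = [list(a)]
    changed = True
    while changed:
        changed = False
        next_segments = []
        for seg in segments:          # segments are independent ("in parallel")
            if len(seg) <= 1 or all(v == seg[0] for v in seg):
                next_segments.append(seg)
                continue
            changed = True
            p = seg[0]                                # steps 1 and 2
            left = [v for v in seg if v < p]          # pack left
            right = [v for v in seg if v >= p]        # pack right
            if not left:
                # Assumed tie-break: p is a minimum, so split it off by itself.
                left, right = [p], seg[1:]
            next_segments.extend([left, right])       # step 4: new boundary
        segments = next_segments
    return [v for seg in segments for v in seg]


if __name__ == "__main__":
    print(segmented_quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
```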

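Solving Linear Recurrence Equations
$f_n = a_{n-1} f_{n-1} + a_{n-2} f_{n-2}$
In matrix form:
$\begin{pmatrix} f_n \\ f_{n-1} \end{pmatrix} = \begin{pmatrix} a_{n-1} & a_{n-2} \\ 1 & 0 \end{pmatrix} \begin{pmatrix} f_{n-1} \\ f_{n-2} \end{pmatrix}$
Matrix multiplication is associative, so all $f_n$ follow from a prefix computation over these $2 \times 2$ matrices.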
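The recurrence can be sketched as a prefix product of 2x2 matrices. The indexing is an assumption of this sketch: coefficients are given as a list a[0], a[1], ..., the recurrence starts at n = 2, and f_0 and f_1 are supplied by the caller. The function names are illustrative.

```python
def mat_mul(A, B):
    """Product of two 2x2 matrices stored as nested tuples."""
    return (
        (A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]),
        (A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]),
    )


def recurrence_by_prefix(a, f0, f1):
    """Return [f_0, ..., f_N] for f_n = a[n-1] f_{n-1} + a[n-2] f_{n-2}, N = len(a)."""
    mats = [((a[n - 1], a[n - 2]), (1, 0)) for n in range(2, len(a) + 1)]
    m = len(mats)
    d = 1
    while d < m:
        old = mats[:]
        for j in range(m - d):
            # Later matrices multiply from the left: M_n ... M_2.
            mats[j + d] = mat_mul(old[j + d], old[j])
        d *= 2
    # (f_n, f_{n-1}) = (M_n ... M_2) (f_1, f_0); the top row gives f_n.
    return [f0, f1] + [P[0][0] * f1 + P[0][1] * f0 for P in mats]


if __name__ == "__main__":
    print(recurrence_by_prefix([1] * 10, 0, 1))   # all coefficients 1: Fibonacci
```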

Pointer Jumping and Tree Computation
How to compute a prefix on a linked list? For every element i in parallel, repeat until all NEXT pointers are NIL:
    if NEXT[i] != NIL then
        X[i] <- X[i] + X[NEXT[i]]
        NEXT[i] <- NEXT[NEXT[i]]
How to make order?
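Pointer jumping can be simulated with arrays for X and NEXT. In this sketch `None` stands for NIL and each round reads from snapshots, which models the assumption that all processors read before any of them writes. With this update rule every element ends up with the sum from itself to the end of the list.

```python
def pointer_jump(x, nxt):
    """Sum from each list element to the end of the list.

    nxt[i] is the index of the successor of element i, or None (NIL).
    """
    x = list(x)
    nxt = list(nxt)
    while any(p is not None for p in nxt):
        old_x, old_next = x[:], nxt[:]        # simultaneous reads
        for i in range(len(x)):               # "do in parallel"
            if old_next[i] is not None:
                x[i] = old_x[i] + old_x[old_next[i]]
                nxt[i] = old_next[old_next[i]]
    return x


if __name__ == "__main__":
    # List order 2 -> 0 -> 3 -> 1; with all weights 1 the result is each
    # element's position counted from the end of the list.
    print(pointer_jump([1, 1, 1, 1], [3, None, 0, 1]))
```

Each round doubles the distance covered by every pointer, so a list of n elements needs about log2(n) rounds.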

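Application: Tree computation
Pre-order numbering.
(Figure: a tree with the labels "Each node" and "Leaf node" and a weight of 1 on each; the layout is not recoverable from the transcript.)
Can also be applied to in-order and post-order numbering, number of children, depth, etc., and to bi-components.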

Recurrence Equation Example: LU decomposition on a triangular matrix