Data Parallel Algorithms. Presented by: M. Mohsin Butt (201103010).


Data Parallel Algorithms

Presented By: M.Mohsin Butt

Data Parallelism A single thread of control operates on a large set of data, typically on a SIMD (single instruction, multiple data) architecture. With O(N) processors, a problem of size N can be solved in O(log N) time, optimizing both running time and throughput.

Test Machine The Connection Machine system has an array of processors, each with 4096 bits of memory. The Connection Machine processors are connected to a front end, a VAX or Symbolics 3600 processor. The processor array is attached to the memory bus of the front end, so that the local processor memories can be randomly accessed by the front end. The front end can issue commands that cause many parts of that memory to be operated on simultaneously; in effect, the array extends the instruction set of the front-end processor.

Test Machine The control part of the program executes on the front end; the processor array executes commands in SIMD fashion. Each processor has state bits called context flags, which the front end uses for conditional instruction execution (e.g., operating only on even- or odd-indexed processors during an addition). Some unconditional instructions are executed by every array processor regardless of its state (e.g., saving and restoring the context, or AND/OR/NOT of the context set). Any processor can communicate with any other processor in unit time.

Pointer-Based Communication and Virtual Processors The Connection Machine allows pointer-based communication, implemented via the SEND instruction. SEND is like an indirect store: it lets each processor store into any location in memory. The programming model is abstracted, and programs are described in terms of virtual processors; the front end also sees only virtual processors, which makes programs portable. In the actual implementation, the hardware processors are multiplexed by the controller. Processor-cons can be used to allocate memory along with a processor.

Data Parallel Algorithms The following algorithms are implemented in parallel to get useful results from this SIMD architecture: Sum of an Array of Numbers. All Partial Sums of an Array. Radix Sort. Parsing a Regular Language. Finding the End of a Linked List. All Partial Sums of a Linked List. Matching Up Elements of Two Linked Lists.

Sum of an Array of Numbers The sum of N numbers is computed in O(log N) time by organizing the addends into a binary tree: each step adds adjacent pairs in parallel, halving the number of addends.
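The tree reduction above can be sketched as follows. This is a sequential stand-in for the parallel rounds: each pass of the loop corresponds to one SIMD step in which all pairwise additions happen simultaneously. The function name and the zero-padding for odd lengths are illustrative choices, not from the original.

```python
def tree_sum(xs):
    # Reduce by a binary tree: each round adds adjacent pairs.
    # All pairs within a round are independent, so on a SIMD machine
    # they would execute in the same step; log N rounds total.
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:           # pad odd-length arrays with the identity
            xs.append(0)
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36, in 3 rounds instead of 7 serial adds
```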

All Partial Sums of an Array Computes the sums over all prefixes of an array in O(log N) time, using every processor efficiently at each step.
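A sketch of the doubling scheme behind this slide (the scan of Hillis and Steele): at distance d, every element adds the element d positions to its left, and d doubles each round. The loop here is a sequential stand-in for what the machine does in parallel.

```python
def prefix_sums(xs):
    # After round with distance d, element i holds the sum of the
    # last 2*d (or fewer) elements ending at i; after log N rounds
    # it holds the full prefix sum.
    ys = list(xs)
    d = 1
    while d < len(ys):
        # One SIMD step: all positions i >= d update simultaneously.
        ys = [ys[i] + ys[i - d] if i >= d else ys[i] for i in range(len(ys))]
        d *= 2
    return ys

print(prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```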

Radix Sort Two primitives are needed: count and enumerate active processors. Count determines how many processors are active; enumerate assigns a distinct integer to each active processor. To count, every processor unconditionally examines its context flag and computes the integer 1 if the flag is set (0 otherwise), and the values are summed. To enumerate, every processor unconditionally computes 1 or 0 in the same manner, but then performs an unconditional sum-prefix calculation, which numbers the active processors. Each of these operations takes about 200 µs on the test machine.
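The count/enumerate pair can be illustrated with an exclusive prefix sum over the context-flag bits; the sequential loop stands in for the parallel scan, and the function name is hypothetical.

```python
def enumerate_active(flags):
    # Each processor contributes 1 if its context flag is set, else 0.
    # An exclusive prefix sum over those bits gives each active
    # processor a distinct index 0..count-1; the total is the count.
    bits = [1 if f else 0 for f in flags]
    ranks, running = [], 0
    for b in bits:                # sequential stand-in for the scan
        ranks.append(running)
        running += b
    count = running
    # Inactive processors receive no number (None here).
    return count, [r if f else None for f, r in zip(flags, ranks)]

print(enumerate_active([True, False, True, True]))  # (3, [0, None, 1, 2])
```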

Radix Sort This implementation of radix sort requires a logarithmic number of passes; each pass examines one bit of each key. All keys whose LSB is 0 are counted (call the count c) and then enumerated, assigning them distinct integers yk ranging from 0 to c-1. All keys whose LSB is 1 are then enumerated, with c added to the result. The values yk are used to permute the keys so that all keys with LSB 0 precede all keys with LSB 1. The process is repeated for the remaining log N bits.
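The passes above amount to a stable partition per bit, LSB first. In this sketch the two list comprehensions play the role of the count/enumerate ranks yk (zeros receive ranks 0..c-1, ones receive c..n-1); on the real machine the permutation would be done with SEND.

```python
def radix_sort(keys, nbits):
    # One pass per bit: stably move keys with the bit clear before
    # keys with the bit set. Stability is what makes later (higher)
    # bits dominate earlier ones.
    for b in range(nbits):
        zeros = [k for k in keys if not (k >> b) & 1]
        ones  = [k for k in keys if (k >> b) & 1]
        keys = zeros + ones
    return keys

print(radix_sort([5, 3, 7, 0, 2, 6], 3))  # [0, 2, 3, 5, 6, 7]
```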

Radix Sort

Parsing a Regular Language Uses a parallel-prefix computation. A string of characters, such as a fragment of program text, can be broken into tokens; this is called lexing the string. Any language of this type can be parsed by a finite-state automaton that begins in a certain state and moves to a new state on each character.

Parsing a Regular Language. Here: N is Initial State. A is Start of an Alphabet Token. Z is Continuation of an alphabetic token. * is single special character token(e.g +,-,*,=) = is an = that follows. Q is the double quotes that start a string S is a character within a string. E is the double quote that ends a string. E.g: applying string Y”+= to state Z gives. Z(Y”+=) = ((ZY)”+=) =(Z”+=) = (Q+=) = (S=) =S Here: N is Initial State. A is Start of an Alphabet Token. Z is Continuation of an alphabetic token. * is single special character token(e.g +,-,*,=) = is an = that follows. Q is the double quotes that start a string S is a character within a string. E is the double quote that ends a string. E.g: applying string Y”+= to state Z gives. Z(Y”+=) = ((ZY)”+=) =(Z”+=) = (Q+=) = (S=) =S

Parsing a Regular Language A state-to-state function can be represented as a one-dimensional array, indexed by states, whose elements are states. The parallel algorithm is as follows. Replace every character in the string with the array representation of its state-to-state function. Perform a parallel-prefix operation whose combining function is the composition of arrays, as described above. The net effect is that every character c of the original string has been replaced by an array representing the state-to-state function for the prefix of the string that ends at (and includes) c. Use the initial automaton state (N in our example) to index into all these arrays. Now every character has been replaced by the state the automaton would be in after reading that character.
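The three steps above can be sketched with a deliberately tiny automaton. The two-state table here (letters enter or stay in A, anything else resets to N) is a made-up illustration, not the paper's lexer; the point is that table composition is associative, so the prefix scan applies.

```python
STATES = ["N", "A"]  # hypothetical: N = initial, A = inside an alphabetic token

def char_table(c):
    # Step 1: each character becomes a state-to-state array (a dict here).
    return {s: ("A" if c.isalpha() else "N") for s in STATES}

def compose(f, g):
    # Combining function for the scan: "f then g".
    return {s: g[f[s]] for s in STATES}

def parse(string, start="N"):
    tables = [char_table(c) for c in string]
    # Step 2: inclusive prefix scan of table composition
    # (sequential stand-in for the parallel-prefix operation).
    out, acc = [], {s: s for s in STATES}
    for t in tables:
        acc = compose(acc, t)
        # Step 3: index every prefix table by the initial state.
        out.append(acc[start])
    return out

print(parse("ab a"))  # ['A', 'A', 'N', 'A']
```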

Finding the End of a Serially Linked List Assume each cell has an extra pointer called chum. Each processor sets its chum to its next pointer. Then each processor repeatedly replaces its chum by its chum's chum, until the chum's chum is NULL.
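The pointer-jumping rule above can be sketched with the list held as an array of successor indices. Each loop pass is a synchronous round (all cells read the old chums before any update), so the distance each chum spans doubles per round and every cell reaches the tail in O(log N) rounds.

```python
def find_end(next_ptr):
    # next_ptr[i] is the index of cell i's successor, or None at the tail.
    chum = list(next_ptr)
    while True:
        # One synchronous round: replace chum by chum's chum, unless
        # the chum's chum is None (i.e., the chum is already the tail).
        new = [chum[chum[i]]
               if chum[i] is not None and chum[chum[i]] is not None
               else chum[i]
               for i in range(len(chum))]
        if new == chum:
            return chum
        chum = new

# list 0 -> 1 -> 2 -> 3: every cell's chum ends up at the tail (index 3)
print(find_end([1, 2, 3, None]))  # [3, 3, 3, None]
```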

All Partial Sums of a Linked List The partial sums of a linked list are computed by the same chum-doubling technique. Both algorithms run in O(log N) time.
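A sketch of the doubling technique applied to list sums, shown here computing the sum from each cell to the tail (summing from the head is symmetric with predecessor pointers). In each synchronous round every cell adds its chum's value and then jumps its chum, so after O(log N) rounds each cell holds its full partial sum.

```python
def list_suffix_sums(values, next_ptr):
    # values[i] is cell i's value; next_ptr[i] its successor index or None.
    val, chum = list(values), list(next_ptr)
    while any(c is not None for c in chum):
        # Synchronous round: all cells read old val/chum before updating.
        new_val = [val[i] + val[chum[i]] if chum[i] is not None else val[i]
                   for i in range(len(val))]
        new_chum = [chum[chum[i]] if chum[i] is not None else None
                    for i in range(len(chum))]
        val, chum = new_val, new_chum
    return val

# list 0 -> 1 -> 2 -> 3 with values [1, 2, 3, 4]
print(list_suffix_sums([1, 2, 3, 4], [1, 2, 3, None]))  # [10, 9, 7, 4]
```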

Matching Up Elements of Two Linked Lists The second list is called the friends list. Each cell gets an extra pointer called the friend pointer, initialized to NULL. The first cells of the two lists are introduced, so they become friends. The rest is similar to the previous logarithmic chums game, except that at every iteration a cell that has both a chum and a friend causes its friend's chum to become its chum's friend. Any extra cells at the end of the longer list have no friends. Once matched up, component-wise addition and multiplication of two vectors represented as linked lists can be performed in logarithmic time.
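The chums-and-friends game above can be sketched with both lists in one array of successor indices. The function name and representation are illustrative; the sequential loop stands in for the synchronous SIMD rounds.

```python
def match_lists(next_ptr, head_a, head_b):
    # next_ptr[i] is cell i's successor index, or None at a tail.
    n = len(next_ptr)
    chum = list(next_ptr)
    friend = [None] * n
    friend[head_a], friend[head_b] = head_b, head_a  # introduce the heads
    while True:
        # Round: each cell with both a chum and a friend makes its
        # friend's chum the friend of its own chum.
        new_friend = list(friend)
        for i in range(n):
            if chum[i] is not None and friend[i] is not None \
                    and chum[friend[i]] is not None:
                new_friend[chum[i]] = chum[friend[i]]
        # Then every chum doubles, as in the pointer-jumping algorithms.
        new_chum = [chum[chum[i]] if chum[i] is not None else None
                    for i in range(n)]
        if new_friend == friend and new_chum == chum:
            return friend
        friend, chum = new_friend, new_chum

# cells 0-3 form list A, cells 4-7 form list B; matching pairs them up
print(match_lists([1, 2, 3, None, 5, 6, 7, None], 0, 4))
# [4, 5, 6, 7, 0, 1, 2, 3]
```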

Matching up Elements of Two Linked Lists

Other Uses Recursive Data Parallelism. Region Labeling.

Conclusion In problems involving large data sets, the parallelism gained by concurrently operating on multiple data elements is greater than the parallelism gained by concurrently executing lines of code. MIMD can still be effective when the cost of duplicating data is high compared to the cost of synchronization. In recent years this style of general-purpose computing has been supported on various graphics processing units (e.g., the NVIDIA CUDA architecture).