2 Data Parallel Algorithms

3 Presented By: M.Mohsin Butt 201103010

4 Data Parallelism A single thread of control operates on a large set of data. Typically a SIMD (single instruction, multiple data) style architecture. O(N) processors can solve a problem of size N in O(log N) time. Optimizes running time and throughput.

5 Test Machine The Connection Machine system has an array of 65,536 processors, each with 4,096 bits of memory. The Connection Machine processors are connected to a front-end VAX or Symbolics 3600 processor. The processor array is connected to the memory bus of the front end so that the local processor memories can be randomly accessed by the front end. The front end can issue commands that cause many parts of the memory to be operated upon simultaneously. The Connection Machine thus extends the instruction set of the front-end processor.

6 Test Machine The control part of the program is executed on the front end, while the processor array executes commands in SIMD fashion. Each processor has state bits called context flags, which the front end uses for conditional instruction execution (e.g., adding only the even- or odd-indexed elements). Some unconditional instructions are executed by every array processor regardless of its state (e.g., saving and restoring the context, and AND, OR, NOT of the context set). Any processor can communicate with any other processor in unit time.

7 Pointer-Based Communication and Virtual Processors The Connection Machine allows pointer-based communication, implemented via the SEND instruction. It is like an indirect store, allowing each processor to store anywhere in memory. The programming model is abstracted, and programs are described in terms of virtual processors; the front end also sees only virtual processors, which makes programs portable. In the actual implementation the hardware processors are multiplexed by the controller. Processor-cons can be used to allocate memory along with a processor.

8 Data Parallel Algorithms The following algorithms are implemented in parallel to get useful results from this SIMD architecture: Sum of an Array of Numbers. All Partial Sums of an Array. Radix Sort. Parsing a Regular Language. Finding the End of a Linked List. All Partial Sums of a Linked List. Matching Up Elements of Two Linked Lists.

9 Sum of an Array of Numbers The sum of n numbers is computed by organizing the addends in binary tree form, giving O(log n) time.
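A minimal sketch of the tree summation (Python used for illustration; the function name tree_sum and the in-place array layout are assumptions, not the paper's code). The inner loop stands in for one parallel step; on the SIMD array all additions within a step would happen simultaneously.

```python
# Sketch: simulate the O(log n) tree summation. At step k, the element at
# each multiple of 2^(k+1) adds in the partial sum 2^k positions away;
# all such additions would run in parallel on the SIMD machine.
def tree_sum(x):
    x = list(x)
    n = len(x)
    step = 1
    while step < n:
        # Each "processor" at index i combines with its partner at i + step.
        for i in range(0, n - step, 2 * step):
            x[i] += x[i + step]
        step *= 2
    return x[0]

print(tree_sum([5, 1, 3, 4, 3, 9, 2, 6]))  # 33
```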

10 All Partial Sums of an Array Computes the sums over all prefixes of an array, keeping every processor busy at each step, in O(log n) time.
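A minimal sketch of the sum-prefix (scan) step pattern, with the same illustrative assumptions as above. On pass k, every element at index i >= 2^k adds in the value 2^k positions to its left; the snapshot of old values mimics the synchronous parallel update.

```python
# Sketch: all partial sums of an array in O(log n) parallel passes.
def all_partial_sums(x):
    x = list(x)
    n = len(x)
    step = 1
    while step < n:
        # Read the old values first to mimic the simultaneous SIMD update.
        prev = list(x)
        for i in range(step, n):
            x[i] = prev[i] + prev[i - step]
        step *= 2
    return x

print(all_partial_sums([5, 1, 3, 4, 3, 9, 2, 6]))
# [5, 6, 9, 13, 16, 25, 27, 33]
```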

11 Radix Sort Count and enumerate the active processors. Count determines how many processors are active. Enumerate assigns a distinct integer to each active processor. To count the active processors, every processor unconditionally examines its context flag and computes the integer 1 if the flag is set and 0 otherwise; summing these values gives the count. To enumerate the active processors, every processor computes 1 or 0 in the same manner but performs an unconditional sum-prefix calculation, whose results enumerate the active processors. These operations take about 200 us each on the test machine.
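A small sketch of count and enumerate, assuming the context flags are simply a list of booleans; the exclusive sum-prefix gives each active processor a distinct rank starting at 0.

```python
# Sketch: count and enumerate active processors with an exclusive sum-prefix,
# so the k-th active processor receives the distinct integer k.
def count_and_enumerate(context_flags):
    bits = [1 if f else 0 for f in context_flags]   # unconditional 1/0 per processor
    count = sum(bits)                                # sum reduction
    ranks, running = [], 0
    for b in bits:                                   # exclusive prefix sum
        ranks.append(running)
        running += b
    # ranks[i] is meaningful only where context_flags[i] is set
    return count, ranks

flags = [True, False, True, True, False, True]
print(count_and_enumerate(flags))  # (4, [0, 1, 1, 2, 3, 3])
```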

12 Radix Sort This implementation of radix sort requires a logarithmic number of passes. Each pass examines one bit of each key. All the keys that have 0 in the current bit (initially the LSB) are counted (c) and then enumerated, assigning them distinct integers yk ranging from 0 to c-1. All the keys that have 1 in the current bit are then enumerated, and c is added to the result. The values of yk are used to permute the keys so that all keys with 0 in the current bit precede all keys with 1. The process is repeated for the remaining bits of the key.
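A sketch of the bit-serial radix sort built on count and enumerate (the helper name radix_sort and the sequential enumeration loop are stand-ins for the parallel count/sum-prefix operations described above).

```python
# Sketch: one stable pass per key bit. Keys with the current bit 0 are
# enumerated first (0..c-1); keys with the bit 1 get c added to their
# enumeration; the ranks y_k are then used to permute the keys.
def radix_sort(keys, num_bits):
    keys = list(keys)
    for bit in range(num_bits):
        zero_flags = [((k >> bit) & 1) == 0 for k in keys]
        c = sum(zero_flags)                       # count of keys with bit == 0
        out = [None] * len(keys)
        rank0 = rank1 = 0
        for i, k in enumerate(keys):              # enumeration via prefix sums
            if zero_flags[i]:
                y = rank0; rank0 += 1
            else:
                y = c + rank1; rank1 += 1
            out[y] = k                            # permute by rank y_k
        keys = out
    return keys

print(radix_sort([6, 3, 7, 0, 5, 2], 3))  # [0, 2, 3, 5, 6, 7]
```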

13 Radix Sort

14 Parsing a Regular Language. Uses a parallel prefix computation. A string of characters, such as a fragment of program text, can be broken into tokens; this is called lexing the string. Any language of this type can be parsed by a finite-state automaton that begins in a certain state and ends in a certain state.

15 Parsing a Regular Language. Here: N is the initial state. A is the start of an alphabetic token. Z is the continuation of an alphabetic token. * is a single special-character token (e.g., +, -, *, =). = is an = that follows. Q is the double quote that starts a string. S is a character within a string. E is the double quote that ends a string. E.g., applying the string Y"+= to state Z gives: Z(Y"+=) = ((ZY)"+=) = (Z"+=) = (Q+=) = (S=) = S.

16 Parsing a Regular Language. A function from states to states can be represented as a one-dimensional array indexed by states whose elements are states. The parallel algorithm is as follows. Replace every character in the string with the array representation of its state-to-state function. Perform a parallel-prefix operation whose combining function is the composition of arrays described above. The net effect is that, after this step, every character c of the original string has been replaced by an array representing the state-to-state function for the prefix of the original string that ends at (and includes) c. Use the initial automaton state (N in our example) to index into all these arrays; now every character has been replaced by the state the automaton would be in after reading that character.
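A sketch of the idea using a deliberately tiny two-state automaton (the states and the helper char_table below are assumptions for illustration, not the eight-state lexer above). Because table composition is associative, the sequential prefix loop could be replaced by the O(log n) parallel prefix.

```python
# Sketch: replace each character by its state-to-state table, compose the
# tables with a prefix operation, then index every prefix table with the
# initial state to get the automaton's state after each character.
STATES = ["N", "A"]            # N: between tokens, A: inside an alphabetic token
def char_table(ch):
    # table[s] = next state when reading ch in state s
    if ch.isalpha():
        return {"N": "A", "A": "A"}
    return {"N": "N", "A": "N"}

def compose(f, g):
    # apply f, then g
    return {s: g[f[s]] for s in STATES}

def lex_states(text, start="N"):
    tables = [char_table(c) for c in text]
    prefixes, acc = [], None
    for t in tables:           # sequential stand-in for the parallel prefix
        acc = t if acc is None else compose(acc, t)
        prefixes.append(acc)
    return [p[start] for p in prefixes]

print(lex_states("ab c"))  # ['A', 'A', 'N', 'A']
```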

17 Finding the End of a Serially Linked List Assume each cell has an extra pointer called chum. Each processor first sets its chum to its next pointer. Then each processor repeatedly replaces its chum with its chum's chum until it reaches NULL.
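A sketch of the pointer-jumping ("chum") step, with an assumed Cell class. Here the jump stops one step early so each chum lands on the last cell rather than running off the end; the rounds of the while loop mimic the synchronous parallel updates, and O(log n) rounds suffice.

```python
# Sketch: pointer jumping on a linked list. Every cell starts with
# chum = next and repeatedly takes its chum's chum; afterwards each cell's
# chum (the last cell aside) rests on the final cell of the list.
class Cell:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.chum = None

def find_end(cells):
    for c in cells:
        c.chum = c.next
    changed = True
    while changed:                      # O(log n) rounds, each parallel on a SIMD machine
        changed = False
        new_chums = []
        for c in cells:                 # collect first to simulate a synchronous step
            if c.chum is not None and c.chum.chum is not None:
                new_chums.append((c, c.chum.chum))
                changed = True
        for c, nc in new_chums:
            c.chum = nc
    return cells[0].chum                # last cell, as seen from the head

cells = [Cell(i) for i in range(6)]
for a, b in zip(cells, cells[1:]):
    a.next = b
print(find_end(cells).value)            # 5
```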

18 All Partial Sums of a Linked List The partial sums of a linked list are computed by the same pointer-jumping technique. Both of these algorithms run in O(log n) time.
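A sketch of the linked-list prefix sum, again with an assumed Cell layout. In each round a cell adds its value into its chum's value and then doubles its chum; the snapshots mimic the simultaneous SIMD reads.

```python
# Sketch: prefix sums over a linked list via pointer jumping.
class Cell:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.chum = None

def list_prefix_sums(cells):
    for c in cells:
        c.chum = c.next
    while any(c.chum is not None for c in cells):
        old_values = {c: c.value for c in cells}
        old_chums = {c: c.chum for c in cells}
        for c in cells:                       # one synchronous parallel round
            if old_chums[c] is not None:
                old_chums[c].value = old_values[old_chums[c]] + old_values[c]
                c.chum = old_chums[old_chums[c]]
    return [c.value for c in cells]

cells = [Cell(v) for v in [5, 1, 3, 4, 3, 9, 2, 6]]
for a, b in zip(cells, cells[1:]):
    a.next = b
print(list_prefix_sums(cells))  # [5, 6, 9, 13, 16, 25, 27, 33]
```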

19 Matching up Elements of Two Linked Lists The second list is called the friends list. Each cell has an extra pointer called the friend pointer, initialized to NULL. The first cells of the two lists are introduced, so they become friends. The rest is the same logarithmic chums game as before, but at every iteration a cell that has both a chum and a friend causes its friend's chum to become its chum's friend. The extra cell at the end of the longer list has no friend. Once elements are matched up, component-wise addition and multiplication of two vectors can be performed in logarithmic running time.
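A sketch of the matching step under the same assumed Cell layout; in each round a matched cell introduces its chum to its friend's chum, so the matched region doubles, and cells past the end of the shorter list are simply left without a friend.

```python
# Sketch: match up the k-th cells of two linked lists via pointer jumping.
class Cell:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.chum = None
        self.friend = None

def make_list(values):
    cells = [Cell(v) for v in values]
    for a, b in zip(cells, cells[1:]):
        a.next = b
    return cells

def match_lists(list_a, list_b):
    cells = list_a + list_b
    for c in cells:
        c.chum, c.friend = c.next, None
    list_a[0].friend, list_b[0].friend = list_b[0], list_a[0]   # introduce the heads
    while any(c.chum is not None for c in cells):
        # My friend's chum becomes my chum's friend (collected, then applied).
        updates = [(c.chum, c.friend.chum) for c in cells
                   if c.chum is not None and c.friend is not None]
        for target, new_friend in updates:
            target.friend = new_friend
        # Double the chums synchronously, via a snapshot of the old pointers.
        old = {c: c.chum for c in cells}
        for c in cells:
            if old[c] is not None:
                c.chum = old[old[c]]
    return [(a.value, a.friend.value if a.friend else None) for a in list_a]

a = make_list(["a0", "a1", "a2", "a3"])
b = make_list(["b0", "b1", "b2", "b3", "b4"])
print(match_lists(a, b))
# [('a0', 'b0'), ('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
```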

20 Matching up Elements of Two Linked Lists

21 Other Uses. Recursive Data Parallelism. Region Labeling.

22 Conclusion. In problems involving large data sets, the parallelism to be gained by concurrently operating on multiple data elements is greater than the parallelism to be gained by concurrently executing lines of code. MIMD can still be effective if the cost of duplicating data is high compared to the cost of synchronization. In recent years this style of general-purpose computing has been supported on various graphics processing units (e.g., the NVIDIA CUDA architecture).

