1
High Performance Computing
Data-Flow Model
Master Degree Program in Computer Science and Networking
Academic Year
Dr. Gabriele Mencagli, PhD
Department of Computer Science, University of Pisa
21/11/2018
2
Contents
Data-Flow model
Premise
Data dependencies
Bernstein conditions
Analyzing Data-Flow graphs
Exercises with parallelism forms
Taxonomy of parallelism forms and their characteristics
Parts of the book covered by these slides: Sections 1.5, 1.6, 6.4
3
Parallelism Paradigms: Data Flow
4
Data Flow: Premise
So far, we have studied some structured parallelism forms: Pipeline, Farm, Map.
An alternative approach is the data-flow model. Its idea is to exploit parallelism based on the data dependencies among generic "operations": computations are expressed as partially-ordered graphs of operations.
Depending on the level at which we apply the data-flow model, an operation can be:
a bulk of code (a function call, a sequence of function calls, a procedure, and so on): the macro data-flow model
a very small piece of code, down to the granularity of single instructions
This model can also be viewed as a low-level approach to parallelism exploitation.
Applications of this model have been developed both in parallel programming frameworks (e.g., Intel TBB, OpenMP 4) and as a (largely unsuccessful) architectural model (non-von Neumann machines).
5
Data Dependencies
Suppose we have a computation composed of two operations (function applications):
C1 = F1(D1) ; C2 = F2(D2)
The notation ";" denotes the sequential execution of the two operations. The sets D1 and D2 denote the domains of the two operations, whereas C1 and C2 are their codomains:
Domain: the set of "variables" read by the operation
Codomain: the set of "variables" modified by the operation
We need to state the conditions under which the above computation is equivalent to (produces the same result as) the following one:
C1 = F1(D1) || C2 = F2(D2)
where "||" denotes the parallel execution of the two operations.
There are different classes of data dependencies that may prevent the parallel execution of the two operations. Depending on the implementation level of our computation, some dependencies may have no effect (except true data dependencies, which are always binding!).
6
Bernstein Conditions
Given two operations O1: C1 = F1(D1) and O2: C2 = F2(D2), they can be executed in parallel if the following conditions hold (Bernstein conditions):
C1 ∩ D2 = ∅   (otherwise: true data dependency, read-after-write)
D1 ∩ C2 = ∅   (otherwise: anti-dependency, write-after-read)
C1 ∩ C2 = ∅   (otherwise: output dependency, write-after-write)
They are sufficient conditions that cannot be specified without assumptions on the implementation level (firmware, processes). Some conditions can be removed at some levels:
Anti-dependencies are eliminated naturally at the firmware level, owing to the concept of clocked registers (see Background Appendix)
Output dependencies can be eliminated in a purely functional language, where we do not have assignment of variables and side effects
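As a small illustration (not part of the original slides; the helper names are my own), the following self-contained C++ sketch checks whether two operations, described by their domain and codomain sets, satisfy the Bernstein conditions. It uses the example of the next slide (X = R + 1 ; R = Z).

```cpp
#include <set>
#include <string>
#include <iostream>

// True if the two sets have no element in common.
bool disjoint(const std::set<std::string>& a, const std::set<std::string>& b) {
    for (const auto& v : a)
        if (b.count(v)) return false;
    return true;
}

// Bernstein conditions: C1 ∩ D2 = ∅, D1 ∩ C2 = ∅, C1 ∩ C2 = ∅.
bool can_run_in_parallel(const std::set<std::string>& D1, const std::set<std::string>& C1,
                         const std::set<std::string>& D2, const std::set<std::string>& C2) {
    return disjoint(C1, D2) && disjoint(D1, C2) && disjoint(C1, C2);
}

int main() {
    // O1: X = R + 1 (reads R, writes X);  O2: R = Z (reads Z, writes R)
    std::set<std::string> D1{"R"}, C1{"X"}, D2{"Z"}, C2{"R"};
    // D1 ∩ C2 = {R} ≠ ∅: anti-dependency, so the test fails.
    std::cout << std::boolalpha << can_run_in_parallel(D1, C1, D2, C2) << "\n"; // prints: false
}
```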
7
Data Dependencies
Suppose we have two operations: O1: X = R + 1 ; O2: R = Z
This is an example of write-after-read dependency (anti-dependency), i.e. the intersection between the domain of the first operation and the codomain of the second one contains the variable R.
At the process level, with a shared-variable cooperation model, these two operations cannot be executed in parallel.
At the firmware level, this kind of anti-dependency does not generate problems, owing to clocked registers: the output of a register is updated only at the next clock impulse, i.e. it is stable during the whole clock cycle.
(Figure: an ALU computing +1, with registers X, R and Z.)
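A tiny simulation of the clocked-register behaviour (my own sketch for clarity, not from the slides): the next values are computed from the current, stable register outputs, and all registers are updated together at the clock edge, so X = R + 1 and R = Z effectively execute in parallel within one cycle.

```cpp
#include <iostream>

struct Registers { int X, R, Z; };

// One clock cycle: compute all next values from the stable current outputs,
// then commit them together at the clock edge.
Registers clock_cycle(const Registers& cur) {
    Registers next = cur;
    next.X = cur.R + 1;  // O1 reads the stable output of R
    next.R = cur.Z;      // O2 overwrites R only at the end of the cycle
    return next;
}

int main() {
    Registers r{0, 7, 42};
    r = clock_cycle(r);
    std::cout << "X=" << r.X << " R=" << r.R << "\n"; // X=8 R=42: the anti-dependency is harmless
}
```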
8
Data-Flow Graph
Given a sequential computation, a partial ordering of the operations is built through the computation-wide application of the Bernstein conditions (quadratic complexity in the number of operations).
The precedence relations express the true data dependencies. The partial ordering is represented by a data-flow graph, which acts as the computation graph in our application of this model in the course.
Example: the input of a program is a tuple of three integers (a, b, c). The computation on the tuple produces the output y as follows:
y0 = F0(a, b) ; y1 = F1(a, c) ; y = F2(y0, y1)
Domains and codomains:
D0 = {a, b}, C0 = {y0}
D1 = {a, c}, C1 = {y1}
D2 = {y0, y1}, C2 = {y}
By the Bernstein conditions, F0 and F1 can be executed in parallel, while F2 must be executed after both of them.
(Figure: data-flow graph with nodes F0, F1, F2; a and b feed F0, a and c feed F1, y0 and y1 feed F2, which produces y.)
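The slides cite OpenMP 4 as one framework supporting this (macro) data-flow model. Below is a minimal sketch, assuming a compiler with OpenMP 4.0 task dependencies (e.g. compiled with -fopenmp; without it the pragmas are ignored and the code runs sequentially). The function bodies are arbitrary placeholders; only the dependency structure matches the example above.

```cpp
#include <cstdio>

// Placeholder bodies: only the dependency structure matters here.
int F0(int a, int b)   { return a + b; }
int F1(int a, int c)   { return a * c; }
int F2(int y0, int y1) { return y0 - y1; }

int main() {
    int a = 1, b = 2, c = 3;
    int y0 = 0, y1 = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: y0)    // no input dependencies
        y0 = F0(a, b);
        #pragma omp task depend(out: y1)    // runs in parallel with F0
        y1 = F1(a, c);
        #pragma omp task depend(in: y0, y1) // fired only after both complete
        y = F2(y0, y1);
    } // implicit barrier: all tasks have completed here
    std::printf("y = %d\n", y);
}
```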
9
Exercise
This exercise applies different parallelism forms to the same problem, highlighting their properties and cost models.
Description: a process P receives a stream of integers x and, for each x, it produces an integer y by applying the following computation:
y1 = F1(x) ; y2 = F2(x) ; y3 = F3(y1, s) ; y4 = F4(y2, z) ; y = F5(y3, y4)
The process P encapsulates two integer variables s and z, initialized at the beginning of the execution of the process.
Assumptions:
T_F = t for all the functions
L_com = t/10
T_A = t/2
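To fix ideas, here is a sequential C++ sketch of P's per-element computation (my own illustration; the function bodies are arbitrary stand-ins, only the call structure matches the slide).

```cpp
#include <cstdio>

// Arbitrary stand-in bodies: only the dependency structure of the calls matters.
int F1(int x)          { return x + 1; }
int F2(int x)          { return x * 2; }
int F3(int y1, int s)  { return y1 + s; }
int F4(int y2, int z)  { return y2 - z; }
int F5(int y3, int y4) { return y3 * y4; }

int main() {
    const int s = 5, z = 3;            // internal state of P, read-only in this exercise
    for (int x = 0; x < 10; ++x) {     // the input stream (here simply 0..9)
        int y1 = F1(x);
        int y2 = F2(x);
        int y3 = F3(y1, s);
        int y4 = F4(y2, z);
        int y  = F5(y3, y4);
        std::printf("x=%d -> y=%d\n", x, y);
    }
}
```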
10
Bottleneck Analysis
As usual we proceed with our approach:
Calculation time: T_calc = 5 · T_F = 5t
Ideal service time of P: T_id-P = max(T_calc, L_com) = max(5t, 0.1t) = 5t
Latency (time spent from the arrival of x to the transmission of the corresponding output y): L_P = T_calc + L_com = 5.1t
We check the bottleneck condition by evaluating the utilization factor of the process:
ρ_P = T_id-P / T_A = 5t / (t/2) = 10 > 1
The process is a bottleneck, and the optimal parallelism degree is exactly its utilization factor, i.e. 10.
We need to find a parallelization of P such that:
it is feasible with the computation semantics
it exploits the optimal parallelism degree by removing the bottleneck
Suppose for simplicity that the communication latency is t/10 both for sending one integer and for sending two integers.
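Displayed in one place, the bottleneck check reads (same figures as above; in this case the optimal degree coincides with the utilization factor):

```latex
\[
\rho_P = \frac{T_{id\text{-}P}}{T_A} = \frac{5t}{t/2} = 10 > 1
\qquad\Rightarrow\qquad
n_{\mathrm{opt}} = \rho_P = 10
\]
```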
11
Farm Solution
The farm paradigm can be applied provided that the computation is a pure (stateless) function. The presence of internal variables used for computing the stream elements does not always mean that the farm is not applicable: in this case the state variables s and z are never modified by the code (read-only).
A farm solution can therefore be designed by replicating s and z in the 10 workers (emitter E, workers W each holding a copy of s and z, collector C).
T_id-E ≈ L_com = t/10 < T_A = t/2
T_Farm = max(T_A, T_id-E, T_id-W / 10) = t/2
L_Farm = T_calc + 3 · L_com = 5.3t
This solution removes the bottleneck with 10 workers. The latency is slightly higher than that of the sequential system (5.1t).
12
Pipeline Solution
For each x:
y1 = F1(x) ; y2 = F2(x) ; y3 = F3(y1, s) ; y4 = F4(y2, z) ; y = F5(y3, y4)
The idea of this solution is to design a pipeline of five stages, each executing one function of the sequential code (five because we recognize 5 functions in the code).
We can already say that this solution cannot remove the bottleneck: the number of functions (5) is smaller than the optimal parallelism degree (10).
(Figure: stages S1..S5 computing F1..F5; the channels carry x, then (x, y1), (y1, y2), (y2, y3), (y3, y4), and finally y; s resides in S3 and z in S4.)
Ideal service time of a stage: T_id-S = max(T_F, 2·L_com) = max(t, 0.2t) = t
Effective service time of the pipeline: T_pipe = max(T_A, T_id-S) = t = T_id-pipe
The utilization factor is still greater than one: ρ_pipe = T_id-pipe / T_A = 2 > 1
Scalability is ideal: s(5) = T_id-P / T_pipe = 5
Latency is higher than the farm's: L_pipe = T_calc + 5 · L_com = 5t + 0.5t = 5.5t
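Collecting the pipeline cost model in displayed form (same figures as above, assuming each stage adds one L_com to the latency):

```latex
\[
T_{pipe} = \max\{T_A,\, T_{id\text{-}S}\} = \max\{t/2,\, t\} = t,
\qquad
\rho_{pipe} = \frac{T_{id\text{-}pipe}}{T_A} = \frac{t}{t/2} = 2 > 1
\]
\[
s(5) = \frac{T_{id\text{-}P}}{T_{pipe}} = \frac{5t}{t} = 5,
\qquad
L_{pipe} = T_{calc} + 5\,L_{com} = 5t + 0.5t = 5.5t
\]
```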
13
Data-Flow Approach
We apply the Bernstein conditions to the sequential code. For each x:
y1 = F1(x) ; y2 = F2(x) ; y3 = F3(y1, s) ; y4 = F4(y2, z) ; y = F5(y3, y4)
Domains and codomains:
F1: {x} → {y1}
F2: {x} → {y2}   (F1 and F2 can be executed in parallel)
F3: {y1, s} → {y3}
F4: {y2, z} → {y4}   (F3 and F4 can be executed in parallel)
F5: {y3, y4} → {y}
By applying the Bernstein conditions we discover that:
F1 and F2 can be executed in any order
F3 must be executed after F1
F4 must be executed after F2
F5 must be executed after F3 and F4
According to these precedence relations (true data dependencies, i.e. read-after-write), we can build the data-flow graph. As usual, nodes are processes and arcs are LC communication channels.
14
Data-Flow Solution
The data-flow graph is the following: S1 (F1) and S2 (F2) both receive x; S1 sends y1 to S3 (F3, which holds s) and S2 sends y2 to S4 (F4, which holds z); S3 and S4 send y3 and y4 to S5 (F5), which produces y.
Parallelism is naturally exploited among operations on the same stream element, for example F1(x) and F2(x).
Parallelism is also exploited among operations on different stream elements, for example F1(x) of one element and F3(y1, s) of the previous one.
We already know that this solution cannot remove the bottleneck, because the parallelism degree is lower than the optimal one.
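A minimal per-element sketch of this dependency structure using std::async futures (my illustration, not the course's process/LC-channel implementation; function bodies reused from the earlier baseline):

```cpp
#include <future>
#include <cstdio>

int F1(int x)          { return x + 1; }
int F2(int x)          { return x * 2; }
int F3(int y1, int s)  { return y1 + s; }
int F4(int y2, int z)  { return y2 - z; }
int F5(int y3, int y4) { return y3 * y4; }

int main() {
    const int s = 5, z = 3;
    for (int x = 0; x < 10; ++x) {
        // F1 and F2 have no mutual dependency: launch them in parallel.
        auto fy1 = std::async(std::launch::async, F1, x);
        auto fy2 = std::async(std::launch::async, F2, x);
        // F3 waits only for y1, F4 waits only for y2 (true data dependencies).
        auto fy3 = std::async(std::launch::async, [&] { return F3(fy1.get(), s); });
        auto fy4 = std::async(std::launch::async, [&] { return F4(fy2.get(), z); });
        // F5 fires only when both y3 and y4 are available (AND semantics).
        int y = F5(fy3.get(), fy4.get());
        std::printf("x=%d -> y=%d\n", x, y);
    }
}
```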
15
Analysis of the Data-Flow Graph
The effective service time of the data-flow graph is (as usual) the inter-departure time from the last node of the graph.
T_D-S1 = max(T_A, T_id-S) = t
T_D-S2 = max(T_A, T_id-S) = t
T_D-S3 = max(T_D-S1, T_id-S) = t
T_D-S4 = max(T_D-S2, T_id-S) = t
The execution of process S5 is fired by the presence of values on both input channels (from S3 and S4), a deterministic situation. This is called AND semantics (as opposed to OR semantics). Therefore, the inter-arrival time to S5 is the maximum of the inter-departure times of its two input channels.
T_DF = t
Same bandwidth as the pipeline. However, the data-flow solution reduces the latency w.r.t. the sequential system:
L_DF = 3 · (T_F + L_com) = 3 · (t + 0.1t) = 3.3t
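Putting the latency figures of the four versions side by side (numbers taken from the previous slides; the data-flow graph is the only one that lowers the latency):

```latex
\[
L_{P} = 5.1t,\qquad
L_{Farm} = 5.3t,\qquad
L_{pipe} = 5.5t,\qquad
L_{DF} = 3\,(T_F + L_{com}) = 3.3t
\]
```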
16
Example with Modifiable State
Similarly to the pipeline paradigm, a data-flow parallelization can be applied to a stateful computation, provided that a proper analysis of the data dependencies among operations is carried out.
Example: P is a process receiving a stream of integers x and encapsulating a variable s initialized to zero. For each x, P executes the following computation:
y1 = F1(x) ; (y2, s) = F2(x, s) ; y = F3(y1, s)
Note that F2 also modifies the state variable s.
We can identify the domain and codomain of each operation:
F1: {x} → {y1}
F2: {x, s} → {y2, s}
F3: {y1, s} → {y}
Operations F1 and F2 can be executed in parallel, while F3 must be executed after them (true data dependency).
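A per-element sketch of this stateful case with std::async (again my own illustration with placeholder bodies, assuming a C++17 compiler): F1 runs in parallel with F2, and F3 starts only after both, reading the s value produced by F2.

```cpp
#include <future>
#include <utility>
#include <cstdio>

int F1(int x) { return x + 1; }
// Returns (y2, new s): placeholder body, only the signature matters.
std::pair<int, int> F2(int x, int s) { return {x * 2, s + x}; }
int F3(int y1, int s) { return y1 + s; }

int main() {
    int s = 0;                                             // state variable, initialized to zero
    for (int x = 0; x < 10; ++x) {
        auto fy1 = std::async(std::launch::async, F1, x);  // F1 || F2
        auto [y2, s_new] = F2(x, s);                       // F2 updates the state
        s = s_new;
        int y = F3(fy1.get(), s);                          // F3 waits for y1 and the updated s
        std::printf("x=%d -> y=%d (y2=%d, s=%d)\n", x, y, y2, s);
    }
}
```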
17
Data-Flow Graph
The computation F1 ; F2 ; F3 is equivalent to (F1 || F2) ; F3. The data-flow graph of this solution is: S1 (F1) and S2 (F2) both receive x; S2 holds s; S1 sends y1 and S2 sends (y2, s) to S3 (F3), which produces y.
In this example we are assuming that the communication latency can be neglected (i.e. L_com ≈ 0).
T_D-S1 = max(T_A, T_F1)
T_D-S2 = max(T_A, T_F2)
T_A-S3 = max(T_D-S1, T_D-S2)
T_D-S3 = max(T_A-S3, T_F3) = max(T_A, T_F1, T_F2, T_F3)
L_DF = max(T_F1, T_F2) + T_F3
18
Other Dependencies
Example: a process P receives a stream of integers (let x be the generic element) and encapsulates an integer variable s initialized to zero. The computation is the following. For each x:
y = F(x, s) ; s = G(x, s)
Note that G modifies the state variable s.
We can identify the domain and codomain of the operations:
F: {x, s} → {y}
G: {x, s} → {s}
According to the Bernstein conditions, operations F and G cannot be executed in parallel, i.e. we have an anti-dependency (write-after-read): the first operation reads a variable (s) that will be modified by the second operation.
Assumptions: L_com ≈ 0, T_F = 4t and T_G = 8t. The inter-arrival time is T_A = t.
19
Data-Flow Graph
The data-flow graph is trivial, since we have only two operations related by an anti-dependency (node F produces y; node G holds and updates s and provides its value to F).
However, since we process a stream of elements, we can overlap the calculation of the next output with the calculation of the next state value: for each x, F(x, s) and G(x, s) both read the current s, so F can work on a copy of s while G computes the new state for the next element.
T_D-G = max(T_A, T_G) = 8t, hence T_DF = 8t.
We have reduced the service time from T_F + T_G = 12t to max(T_F, T_G) = 8t, but the bottleneck has not been eliminated (T_DF = 8t > T_A = t).
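A compact way to obtain this overlap per stream element with std::async (an illustrative sketch with placeholder bodies, not the course's process-based implementation): F receives a copy of the current s, so it can run while G computes the next state.

```cpp
#include <future>
#include <cstdio>

int F(int x, int s) { return x + s; }      // placeholder body, "costs" T_F = 4t
int G(int x, int s) { return s + 2 * x; }  // placeholder body, "costs" T_G = 8t

int main() {
    int s = 0;                                            // state, initialized to zero
    for (int x = 0; x < 10; ++x) {
        // F gets a copy of the current s (arguments are copied at launch),
        // so it overlaps with G, which produces the state for the next element.
        auto fy = std::async(std::launch::async, F, x, s);
        s = G(x, s);                                      // anti-dependency resolved by the copy
        int y = fy.get();
        std::printf("x=%d -> y=%d (next s=%d)\n", x, y, s);
    }
}
```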
20
Characteristics of Parallelism Paradigms
21
Parallel Paradigms
Streams vs. Single Data Structures: some paradigms can be applied only when we recognize a large sequence of inputs to compute, notably:
Farm
Pipeline
(they improve bandwidth but not latency!)
Other paradigms can be applied both on streams and when we have a single input element to compute (or a few of them), notably:
Data Parallel (map, stencils)
Data Flow
(they improve bandwidth and latency!)
The input stream can be:
primitive (e.g., sensors, networks)
generated by program (e.g., the unpack-compute-pack scheme)
Memory capacity: it is the sum of the memory usage of the modules composing the computation. It is:
higher in Farm (n-times replication)
lower in Data Parallel (potential hyper-scalability results?)
22
Parallel Paradigms
Partitioning vs. Replication: this is a key characteristic of the methodology. More specifically:
Farm exploits replication only, applied to functions and to non-modifiable data; computations are stateless
Pipeline and Data-Flow exploit partitioning of functions and partitioning/replication of non-modifiable data; the case of partitioning of modifiable data can sometimes be handled
Data Parallel exploits partitioning of data and replication of functions; computations may have an internal state
Knowledge of the Sequential Computation: the parallelism paradigms impose different requirements on the sequential code in order to be applied:
Farm is a black-box parallelization that does not need any information about the sequential code, except that it is a pure (stateless) function
Pipeline and Data-Flow require that the sequential code can be expressed as a linear composition of functions or as a partially-ordered graph of operations/functions
Data Parallel requires proper knowledge of the sequential code in order to understand how to apply the data partitioning and replication that characterize this approach (it will be studied extensively in the second part of the course, after the first midterm)