PCT: Parallel Computing Toolbox
- Offload work from one MATLAB session (the client) to other MATLAB sessions (the workers).
- Run as many as eight MATLAB workers (R2010b) on your local machine in addition to your MATLAB client session.
- Recommendation: run no more than one worker per core.
MDCS: MATLAB Distributed Computing Server
- Run as many MATLAB workers on a remote cluster of computers as your licensing allows.
- Run workers on your client machine if you want to run more than eight local workers (R2010b).
- Scheduler/job manager: the component dedicated to assigning tasks to the workers.
Typical Use Cases
- Parallel for-loops: many iterations, or long iterations
- Batch jobs
- Large data sets

Batch Jobs
When working interactively in a MATLAB session, you can offload work to a MATLAB worker session to run as a batch job. The command to perform this job is asynchronous, which means that your client MATLAB session is not blocked, and you can continue your own interactive session while the MATLAB worker is busy evaluating your code. The MATLAB worker can run either on the same machine as the client or, if using MATLAB Distributed Computing Server, on a remote cluster machine.
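The asynchronous behavior of a batch job can be sketched as follows. This is a minimal illustration, not part of the original slides: it assumes a release in which `batch` accepts a function handle and `fetchOutputs` is available (older releases such as R2010b may use different calls for retrieving results), and the variable names are made up for the example.

```matlab
% Submit a function as a batch job; the call returns immediately.
job = batch(@sum, 1, {1:10});   % request 1 output; the single input is 1:10

% The client is not blocked and can do other work while the worker runs.
localWork = cumsum(1:5);

wait(job);                      % block only when the result is actually needed
out = fetchOutputs(job);        % cell array holding the requested outputs
batchResult = out{1};
delete(job);                    % remove the finished job from the scheduler
```

The only point where the client waits is the explicit `wait` call; everything between `batch` and `wait` overlaps with the worker's computation.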
Parfor: Parallel for-loop
- Same basic concept as an ordinary for-loop.
- The parfor body is executed on the MATLAB client and workers.
- The necessary data on which parfor operates is sent from the client to workers, and the results are sent back to the client and pieced together.
- MATLAB workers evaluate iterations in no particular order, and independently of each other.
Parfor
Serial version:
    A = zeros(1024, 1);
    for i = 1:1024
        A(i) = sin(i*2*pi/1024);
    end
    plot(A)
After parallelization:
    A = zeros(1024, 1);
    matlabpool open local 4
    parfor i = 1:1024
        A(i) = sin(i*2*pi/1024);
    end
    matlabpool close
    plot(A)
Timing
Serial loop:
    A = zeros(n, 1);
    tic
    for i = 1:n
        A(i) = sin(i);
    end
    toc
Parallel loop:
    A = zeros(n, 1);
    matlabpool open local 8
    tic
    parfor i = 1:n
        A(i) = sin(i);
    end
    toc
Results:
    n        for      parfor
    10000
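When to Use Parfor?
- Each iteration must be independent of all other iterations.
- Useful: lots of iterations of simple calculations, or long iterations.
- Not useful: a small number of simple calculations, where the cost of moving data to and from the workers outweighs the gain.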
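The independence requirement can be illustrated with two small loops. This is a sketch written for this note, not taken from the slides; the variable names are hypothetical.

```matlab
n = 8;

% Independent iterations: each result depends only on its own index,
% so this loop is a candidate for parfor.
squares = zeros(1, n);
parfor k = 1:n
    squares(k) = k^2;
end

% Dependent iterations: each value needs the two previous ones,
% so this recurrence has to remain an ordinary for-loop.
fib = ones(1, n);
for k = 3:n
    fib(k) = fib(k-1) + fib(k-2);
end
```

With only eight cheap iterations the parfor version would not be faster in practice; the loop is kept tiny so the difference in structure is easy to see.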
Classification of Variables
Variable classes in a parfor-loop: loop variable, sliced input variable, sliced output variable, broadcast variable, reduction variable, temporary variable.
- Temporary variable: its data is discarded when the parfor-loop ends.
- Loop variable: after the parfor-loop it still has the value it had before the loop (0 in the example on the next slide), not the last index.
- Sliced variable: an array whose elements can be operated on in parallel, one slice per iteration.
- Reduction variable: in a parfor-loop that accumulates z = z + i, the value of z is never transmitted from client to workers or from worker to worker. Rather, additions of i are done in each worker, with i ranging over the subset of 1:n being performed on that worker. The results are then transmitted back to the client, which adds the workers' partial sums into z. Thus, workers do some of the additions, and the client does the rest.
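The classes listed above can be seen together in one small loop. The slide's original annotated figure is not recoverable from the text, so the following is an illustrative sketch with assumed variable names, in the style of the annotated example in the MATLAB documentation.

```matlab
c = pi;               % read but never changed in the loop: broadcast variable
z = 0;                % accumulated across iterations: reduction variable
r = (1:10) / 10;      % one element read per iteration: sliced input variable
b = zeros(1, 10);     % one element written per iteration: sliced output variable

parfor i = 1:10       % i is the loop variable
    a = i;            % assigned before use in every iteration: temporary variable
    z = z + i;        % partial sums are formed on the workers, combined on the client
    b(i) = r(i);
    if i <= c
        d = 2 * a;    % temporary variable, discarded after the loop
    end
end
```

After the loop, `z` holds the full sum and `b` holds the copied slices, while the temporaries `a` and `d` are not passed back to the client.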
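More Notes
With for:
    d = 0; i = 0;
    for i = 1:4
        b = i;
        d = i*2;
        A(i) = d;
    end
With parfor (same initialization):
    d = 0; i = 0;
    parfor i = 1:4
        b = i;
        d = i*2;
        A(i) = d;
    end
Values after the loop:
    Variable   for          parfor
    A          [2,4,6,8]    [2,4,6,8]
    d          8            0
    i          4            0
    b          4            / (does not exist)
- A(i): sliced output variable; d, b: temporary variables; i: loop variable.
- A variable can be passed from the client to the workers and used there, but the loop cannot change the client's copy: after the loop its value is unchanged.
- Temporary variables defined inside the parfor-loop disappear when the loop ends. For example, if d = 0 is not defined before the parfor-loop, d does not exist afterwards.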
More Notes
How to parallelize?
    C = 0;
    for i = 1:m
        for j = i:n
            C = C + i * j;
        end
    end
C: reduction variable
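One possible answer to the question on this slide is to turn only the outer loop into a parfor-loop and let C act as a reduction variable. The sketch below is an assumption about the intended solution, with small hypothetical values for m and n so the result can be compared with the serial loop.

```matlab
m = 10;
n = 12;

% Serial reference.
Cserial = 0;
for i = 1:m
    for j = i:n
        Cserial = Cserial + i * j;
    end
end

% Parallel version: only the outer loop becomes a parfor-loop.
C = 0;
parfor i = 1:m
    for j = i:n
        C = C + i * j;   % C is only ever updated in this form, so it is a reduction
    end
end
```

Because the inner range `i:n` shrinks as `i` grows, the iterations do unequal amounts of work; the scheduler hands out iterations in chunks, so this still parallelizes, just not perfectly evenly.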
Parfor: Estimating an Integral
    function q = quad_fun(m, n, x1, x2, y1, y2)
    q = 0.0;
    u = (x2 - x1)/m;              % cell width in x
    v = (y2 - y1)/n;              % cell width in y
    for i = 1:m
        x = x1 + u * i;
        for j = 1:n
            y = y1 + v * j;
            fx = x^2 + y^2;       % integrand at (x, y)
            q = q + u * v * fx;   % add this cell's contribution
        end
    end
Parfor: Estimating an Integral
- Computational complexity: O(m*n).
- Each iteration is independent of other iterations.
- We can replace "for" with "parfor", for either loop index i or loop index j.
Parfor: Estimating an Integral (parfor over i)
    function q = quad_fun(m, n, x1, x2, y1, y2)
    q = 0.0;
    u = (x2 - x1)/m;
    v = (y2 - y1)/n;
    parfor i = 1:m
        x = x1 + u * i;
        for j = 1:n
            y = y1 + v * j;
            fx = x^2 + y^2;
            q = q + u * v * fx;
        end
    end
Timing call:
    tic
    A = quad_fun(m, n, 0, 3, 0, 3);
    toc
Elapsed time in seconds; the column labels read client + workers:
    (m, n)            1 + 0     1 + 1     1 + 2     1 + 3     1 + 4
    (100, 100)        0.005     0.255     0.087     0.101     0.114
    (1000, 1000)      0.035     0.066     0.046     0.045     0.053
    (10000, 10000)    3.123     1.626     1.143     0.883
    (100000, )        85.185
The last two rows are incomplete in the source, so the column alignment of their values is uncertain.
- Why does (1000, 1000) take less time than (100, 100)? It doesn't, really!
- How can "1+1" take longer than "1+0"? (It does, but it's probably not as bad as it looks!)
- Parallelism doesn't pay until your problem is big enough.
- Parallelism doesn't pay until you have a decent number of workers.
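Parfor: Estimating an Integral (parfor over j)
    function q = quad_fun(m, n, x1, x2, y1, y2)
    q = 0.0;
    u = (x2 - x1)/m;
    v = (y2 - y1)/n;
    for i = 1:m
        x = x1 + u * i;
        parfor j = 1:n
            y = y1 + v * j;
            fx = x^2 + y^2;
            q = q + u * v * fx;
        end
    end
Timing call:
    tic
    A = quad_fun(m, n, 0, 3, 0, 3);
    toc
Elapsed time in seconds; the column labels read client + workers:
    (m, n)            1 + 0     1 + 1     1 + 2     1 + 3     1 + 4
    (100, 100)        0.005     1.754     1.975     2.126     2.612
    (1000, 1000)      0.035     13.146    15.286    18.661    22.313
    (10000, 10000)    3.123
    (100000, )
Placing parfor on the inner loop starts a parallel loop m times, so the communication overhead of a parfor-loop is paid on every outer iteration.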
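Before comparing timings, it is worth confirming that the serial loop and both parfor placements produce the same estimate. The script below is a sketch written for this note: it inlines the body of quad_fun rather than calling it, and uses a small assumed grid (m = n = 100) so the inner-parfor variant finishes quickly. The exact value of the integral of x^2 + y^2 over [0,3] x [0,3] is 54.

```matlab
m = 100; n = 100;
x1 = 0; x2 = 3; y1 = 0; y2 = 3;
u = (x2 - x1) / m;
v = (y2 - y1) / n;

% Serial reference.
qSerial = 0.0;
for i = 1:m
    x = x1 + u * i;
    for j = 1:n
        y = y1 + v * j;
        qSerial = qSerial + u * v * (x^2 + y^2);
    end
end

% parfor over the outer index i: one parallel loop in total.
qOuter = 0.0;
parfor i = 1:m
    x = x1 + u * i;
    for j = 1:n
        y = y1 + v * j;
        qOuter = qOuter + u * v * (x^2 + y^2);
    end
end

% parfor over the inner index j: a parallel loop is started m times.
qInner = 0.0;
for i = 1:m
    x = x1 + u * i;
    parfor j = 1:n
        y = y1 + v * j;
        qInner = qInner + u * v * (x^2 + y^2);
    end
end
```

The three results agree only up to floating-point rounding, because a reduction may add the terms in a different order than the serial loop does.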
SPMD
- SPMD: Single Program, Multiple Data.
- The spmd command is like a very simplified version of MPI.
- The spmd statement lets you define a block of code to run simultaneously on multiple labs; each lab can have different, unique data for that code.
- Labs can communicate directly via messages; they meet at synchronization points.
- The client program can examine or modify data on any lab.
SPMD
MATLAB sets up the requested number of labs, each with a copy of the program. Each lab "knows" it's a lab, and has access to two special functions:
- numlabs(), the number of labs;
- labindex(), a unique identifier between 1 and numlabs().
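A short sketch of how the two functions are typically used inside an spmd block follows. The variable names are hypothetical, and the code uses labindex and numlabs as on the slide; recent MATLAB releases also offer newer names for the same functions.

```matlab
spmd
    % Every lab runs this block with its own labindex.
    fprintf('Lab %d of %d reporting\n', labindex, numlabs);
    mySquare = labindex^2;      % a different value on each lab
end

% On the client, mySquare is a Composite with one entry per lab.
squares = zeros(1, length(mySquare));
for k = 1:length(mySquare)
    squares(k) = mySquare{k};   % brace indexing fetches the value from lab k
end
```

The loop over the Composite is how the client examines data that lives on each lab, which is the last point made on the previous slide.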
Distributed Arrays: distributed()
- You can create a distributed array in the MATLAB client, and its data is stored on the labs of the open MATLAB pool.
- The distributed() function takes a matrix defined on the client and spreads it across the labs.
- A distributed array is distributed in one dimension, along the last nonsingleton dimension (by columns for a matrix), and as evenly as possible along that dimension among the labs.
- As with parfor, you cannot control the details of distribution when creating a distributed array.
- W is logically still one complete matrix, but it is physically stored in blocks on different labs.
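The points above can be sketched in a few lines. This example is an illustration added here, with assumed variable names, and it presumes that a pool of workers is open or can be started automatically.

```matlab
W = magic(8);            % ordinary matrix on the client
Wd = distributed(W);     % same values, stored in pieces on the labs

spmd
    % Inside spmd the same array is seen as codistributed;
    % each lab can inspect the block it stores.
    localSize = size(getLocalPart(Wd));
end

% Operate on the array as a whole, then bring the small result back.
colSums = gather(sum(Wd, 1));
```

Since the split is along columns, every lab keeps all eight rows and only a subset of the columns, which is what `localSize` reports for each lab.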
Distributed Arrays: codistributed()
- You can create a codistributed array by executing on the labs themselves, either inside an spmd statement, in pmode, or inside a parallel job.
- When creating a codistributed array, you can control all aspects of distribution, including dimensions and partitions.
- The codistributed() function takes a matrix variable that is stored identically on every lab and distributes it across the labs, which saves memory.
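As a sketch of the control that codistributed() gives, the block below distributes a replicated matrix by rows instead of the default columns. The matrix and variable names are assumptions made for the illustration.

```matlab
spmd
    A = magic(4);                               % replicated: every lab holds a full copy
    D = codistributed(A, codistributor1d(1));   % distribute along dimension 1 (rows)
    localRows = size(getLocalPart(D), 1);       % number of rows stored on this lab
    totalRows = gplus(localRows);               % sum of the row counts over all labs
end

wholeMatrix = gather(D);                        % reassemble on the client
```

Each lab ends up holding only its share of the rows, and the row counts over all labs add up to the four rows of the original matrix.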
Example: Trapezoid
To simplify things, we assume the interval is [0, 1], and we let each lab define a and b to mean the ends of its subinterval. If we have 4 labs, then lab number 3 will be assigned [½, ¾].
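The subinterval assignment described above could be written as follows. The slide's own code is not in the text, so this is a sketch under assumptions: the integrand f is a hypothetical choice whose integral over [0, 1] is pi, and n is an assumed number of sample points per lab.

```matlab
f = @(x) 4 ./ (1 + x.^2);    % hypothetical integrand; its integral on [0, 1] is pi
n = 1000;                    % assumed number of sample points per lab

spmd
    a = (labindex - 1) / numlabs;   % left end of this lab's subinterval
    b = labindex / numlabs;         % right end of this lab's subinterval
    x = linspace(a, b, n);
    part = trapz(x, f(x));          % trapezoid rule on [a, b]
    total = gplus(part);            % sum of all labs' parts, available on every lab
end

approx = total{1};                  % read the combined estimate from lab 1
```

With 4 labs, lab 3 gets a = 2/4 and b = 3/4, matching the [½, ¾] assignment on the slide.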
Pmode
- pmode lets you work interactively with a parallel job running simultaneously on several labs.
- Commands you type at the pmode prompt in the Parallel Command Window are executed on all labs at the same time.
- Each lab executes the commands in its own workspace on its own variables.
- In pmode each lab has its own window: you can type commands, see the result on each lab, and access each lab's workspace.
- The labs remain synchronized because each lab becomes idle when it completes a command or statement, waiting until all the labs working on this job have completed the same statement. Only when all the labs are idle do they proceed together to the next pmode command.
- After an spmd block ends, its data and state still exist and can be used again on re-entry. After pmode exits, the job is destroyed and its data is gone; starting pmode again is a fresh start.
pmode vs. spmd:
    pmode                                               spmd
    Parallel computing synchronously                    Parallel computing synchronously
    Each lab has a desktop                              No desktop for labs
    Can't freely interleave serial and parallel work    Can freely interleave serial and parallel work
Pmode
- labindex() and numlabs() still work.
- Variables on different labs share only a name; their values are independent of each other.
Pmode
Aggregate the array segments into a coherent array.
    codist = codistributor1d(2, [ ], [3 8])
    whole = codistributed.build(segment, codist)
- codistributor1d: 1-D distribution scheme for a codistributed array.
- codistributed.build is the constructor function.
Pmode
Aggregate the array segments into a coherent array.
    whole = whole
    section = getLocalPart(whole)
- getLocalPart returns the small matrix that each lab stores as its part of the large distributed matrix.
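Pmode
Aggregate the array segments into a coherent array.
    combined = gather(whole)
- gather() assembles the pieces of the distributed array stored on the labs into a single complete array (on every lab when called inside pmode or spmd, on the client when called there on a distributed array).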
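The build, getLocalPart, and gather steps from the last three slides can be tried outside pmode in an spmd block. The following is a sketch under assumptions: each lab creates a hypothetical 3-by-2 segment, and the partition is computed from numlabs so the code does not depend on a fixed number of labs.

```matlab
spmd
    segment = labindex * ones(3, 2);   % each lab creates its own 3-by-2 block
    % Assumed layout: blocks side by side along dimension 2, two columns per lab.
    codist = codistributor1d(2, 2 * ones(1, numlabs), [3, 2 * numlabs]);
    whole = codistributed.build(segment, codist);
    section = getLocalPart(whole);     % the block this lab stores
    sameBlock = isequal(section, segment);
end

combined = gather(whole);              % full 3-by-(2*numlabs) array on the client
```

Here `codistributed.build` does not move any data: it only declares that the existing local segments are the parts of one larger array, which is why `section` equals `segment` on every lab.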
Pmode
How to change distribution?
    distobj = codistributor1d()
    I = eye(6, distobj)
    getLocalPart(I)
    distobj = codistributor1d(1);
    I = redistribute(I, distobj)
    getLocalPart(I)
GPU Computing
Capabilities
- Transferring data between the MATLAB workspace and the GPU
- Evaluating built-in functions on the GPU
- Running MATLAB code on the GPU
- Creating kernels from PTX files for execution on the GPU
- Choosing one of multiple GPU cards to use
Requirements
- NVIDIA CUDA-enabled device with compute capability of 1.3 or greater
- NVIDIA CUDA device driver 3.1 or greater
- NVIDIA CUDA Toolkit 3.1 (recommended) for compiling PTX files
GPU Computing
Transferring data between workspace and GPU
Creating GPU data:
    N = 6;
    M = magic(N);
    G = gpuArray(M);    % copy M to the GPU
    M2 = gather(G);     % copy the data back to the MATLAB workspace
GPU Computing
Executing code on the GPU
- You can transfer or create data on the GPU, and use the resulting GPUArray as input to enhanced built-in functions that support it.
- You can run your own MATLAB function file on a GPU:
    result = arrayfun(@myFunction, arg1, arg2);
- If either arg1 or arg2 is a GPUArray, the function executes on the GPU and returns a GPUArray.
- If none of the input arguments is a GPUArray, arrayfun executes on the CPU.
- Only element-wise operations are supported.
- arrayfun applies a function to each element of an array; it is a general MATLAB function, not specific to the GPU.
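The dispatch rule described above can be sketched as follows. The function handle and data are hypothetical, and an anonymous function stands in for the slide's myFunction file. The GPU branch is guarded so that the example also runs on a machine without a supported GPU.

```matlab
addScaled = @(x, y) 2 * x + y;      % hypothetical scalar, element-wise function

a = 1:5;
b = 10 * (1:5);

if gpuDeviceCount > 0
    % At least one input is a gpuArray, so arrayfun runs on the GPU.
    g = arrayfun(addScaled, gpuArray(a), b);
    result = gather(g);             % copy the result back to the MATLAB workspace
else
    % No gpuArray inputs: arrayfun runs on the CPU.
    result = arrayfun(addScaled, a, b);
end
```

For arrays this small the transfer cost dominates; the GPU path becomes worthwhile only when the arrays are large and the per-element work is substantial.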
Review
- What are the typical use cases of parallel MATLAB?
- When should you use parfor?
- What is the difference between a worker (parfor) and a lab (spmd)?
- What is the difference between spmd and pmode?
- How do you build a distributed array?
- How do you use the GPU for MATLAB parallel computing?