Presentation on theme: "Parallelization with the Matlab® Distributed Computing Server (MDCS) @ CBI cluster, December 3, 2013" - Presentation transcript:

1 Parallelization with the Matlab® Distributed Computing Server (MDCS) @ CBI cluster
December 3, 2013: Matlab Parallelization with the Matlab Distributed Computing Server at the CBI cluster.
CBI workshop announcement, Tuesday Dec. 3rd, 2013: "Matlab Parallelization with the Matlab Distributed Computing Server at the CBI cluster". The CBI has a 64-worker Matlab Distributed Computing Server (MDCS) installed on the cluster. The MDCS can not only speed up your computations, but can also eliminate standard Matlab license usage. This workshop will teach you how to write parallel Matlab code and how to run that code on the MDCS from your desktops and laptops.
Agenda:
- Basic parallelization concepts presentation
- Utilizing the Distributed Computing Server on the CBI Cheetah cluster with Matlab
- Hands-on demos
When: 10:00am - 12:00pm, Tuesday Dec. 3rd, 2013
Speakers: Zhiwei Wang, director of the CBI; Nelson Ramirez, senior software developer at the CBI
Where: CBI, UTSA, BSE 3.114
Acknowledgements: David Noriega, CBI system administrator

2 Overview
- Parallelization with Matlab using the Parallel Computing Toolbox (PCT)
- Matlab Distributed Computing Server (MDCS) introduction
- Benefits of using the MDCS
- CBI MDCS usage scenarios
- Hands-on training

3 Parallelization with Matlab PCT
The Matlab Parallel Computing Toolbox (PCT) provides access to multi-core, multi-system (MDCS), and GPU parallelism. Many built-in Matlab functions support parallelism transparently (e.g., FFT). Parallel constructs such as converting for loops to parfor loops handle many different types of parallel software development challenges, and the MDCS allows locally developed, parallel-enabled Matlab applications to scale out.
- Up to 12 workers locally on a multicore desktop
- for loops -> parfor
- Interactive (pmode) and batch modes
- Distributed arrays (spmd)
A minimal for-to-parfor conversion is sketched below.
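As a minimal sketch of the for-to-parfor conversion (assuming a Parallel Computing Toolbox license and a local pool; the matlabpool syntax matches the R2012a-era release used in this deck, newer releases use parpool instead):

% Serial version: the iterations are independent of each other
results = zeros(1, 100);
for i = 1:100
    results(i) = sum(svd(rand(200)));   % stand-in for real per-iteration work
end

% Parallel version: open a pool of local workers, then swap for -> parfor
matlabpool open 4
parfor i = 1:100
    results(i) = sum(svd(rand(200)));   % results is a sliced output variable
end
matlabpool close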

4 Parallelization with Matlab PCT
Distributed / parallel algorithm characteristics:
- Memory usage and CPU usage: e.g., load a 4 gigabyte file into memory, then calculate averages.
- Communication / data I/O patterns: e.g., read file 1 (10 gigabytes), then run a function; or worker B sends data to worker A, which runs a function and returns data to worker B.
- Dependencies: Function 1 -> Function 2 -> Function 3.
- Hardware resource contention: e.g., 16 cores each trying to read/write a set of files, or RAM bandwidth limitations. Managing large numbers of small files leads to filesystem contention.

Key: focus on dependency analysis.
- How much of your program is independent determines the potential parallelism at a fixed data size (Amdahl's Law): S(N) = 1 / ((1 - P) + P/N); as N -> inf, speedup -> 1/(1 - P).
- Gustafson's Law: you might not be able to get rid of certain serial dependencies, but you may be able to tackle larger problems with more workers in the parallel sections: S(P) = P - alpha*(P - 1), where P = number of processors, S = speedup, and alpha = sequential fraction of the parallel runtime.
- Car average-mph analogy: under Amdahl's law the total trip distance is fixed; under Gustafson's law the total trip distance can expand as you have more "fuel" (parallel compute power).

Data transfer vs. compute (arithmetic intensity):
- The cost of moving data from CPU to GPU needs to be taken into account; a GPU may provide a large benefit when compute >> data I/O.
- Going to the store to get 100 items with 10 workers: ideally you make one trip for all 100 items. Even if all 10 workers fetch their items in parallel, there is little benefit if you make 10 round trips.

Other considerations:
- Resource contention: data transfer bandwidth (memory bandwidth, network bandwidth) and resource limits (memory, disk).
- Hardware limits: memory cache line sizes, memory alignment issues, disk block sizes, cache sizes, number of queues, etc.
- Physical data organization (e.g., row-major vs. column-major).
- Conditional (if-else) minimization: ideally a hot function would have zero if statements, but that is not always feasible for algorithm correctness.
- Synchronization: algorithm correctness often requires some type of synchronization.

Many more variables affect function-, program-, and system-level parallelism. A function may be highly parallelizable, but good overall system parallelism may require looking at several levels of parallelism. A worked example of the two scaling laws is sketched below.
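As a worked example of the two laws above (a small sketch; the parallel fraction and worker counts are illustrative values, not measurements from the CBI cluster):

% Amdahl's law: fixed problem size, P = parallelizable fraction of the program
P = 0.95;                          % assume 95% of the runtime parallelizes
N = [2 4 8 16 32 64];              % number of workers
amdahl = 1 ./ ((1 - P) + P ./ N);  % S(N) = 1 / ((1-P) + P/N)
% As N -> inf, the speedup approaches 1/(1-P) = 20x, no matter how many workers

% Gustafson's law: the problem size grows with the number of workers
alpha = 0.05;                      % sequential fraction of the parallel runtime
gustafson = N - alpha .* (N - 1);  % S(N) = N - alpha*(N-1)

disp([N' amdahl' gustafson'])      % compare the two scaled speedups side by side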

5 Parallelization with Matlab PCT
Applications have layers of parallelism: clusters; CPUs and multi-cores; GPU cards and external accelerator cards. For an optimal solution you must look at the application as a whole, and scalability means using as many workers as possible in an efficient manner. The Matlab PCT + MDCS framework addresses all of these layers and automates much of the complexity of developing parallel and distributed applications.
MDCS worker processes (a.k.a. "Labs") never request regular Matlab or toolbox licenses; the only license an MDCS worker ever uses is an MDCS worker license. Toolboxes are unlocked to an MDCS worker based on the licenses owned by the client during the job submission process.

6 Parallelization with Matlab PCT & MDCS
CPU’s, Multi-Cores MDCS Cluster Distributed loops: parfor Interactive development mode (matlabpool/pmode) Distributed Arrays(spmd) Scale out with the MDCS Cluster in Batch Job Submission Mode Develop algorithms on your local system, at the multi-core level; then seamlessly scale on the MDCS CBI.

7 MDCS Benefits
The Matlab Distributed Computing Server is a cluster software infrastructure, built over the Message Passing Interface, that allows Parallel Computing Toolbox enabled codes to scale. Code development can be done on a user's workstation; when ready, the MDCS scales that code in both the memory and compute dimensions, going from development on a laptop directly to running on up to 64 MDCS Labs. It is also a wonderful parallel algorithm development environment, with the superior visualization and profiling capabilities of Matlab. Some simulations can go from years of runtime to days of runtime on 64 MDCS Labs.

Performance: scaling in compute and memory
- Running PCT code on the MDCS profile instead of the local profile: the local profile is limited to 12 labs (R2012a) plus single-node memory and IPC limits, while up to 64 labs are available on the CBI MDCS cluster (limited by MDCS worker licenses).
- Memory scaling with co-distributed arrays minimizes single-node memory utilization and can enable processing larger datasets in a distributed manner; distributed arrays allow development of data-parallel algorithms.
- Many built-in algorithms and toolboxes have some Parallel Computing Toolbox enablement, e.g. fft, lu, svd, and many more with spmd co-distributed arrays.
- Multiple levels of parallelism can be implemented with the MDCS: independent jobs (distributed jobs, task computing) and complex fully parallel algorithms that require inter-process communication and synchronization (parallel jobs: parfor, spmd, labSend, labReceive, spmd co-distributed arrays).
- PCT constructs (parfor ~ OpenMP, spmd ~ MPI, spmd co-distributed arrays ~ MPI) scale seamlessly from a local system to the Distributed Server, allowing rapid prototyping of parallel algorithms with Matlab + PCT + MDCS instead of C/C++/Fortran + OpenMP + MPI directly.

Licensing: minimize regular Matlab + toolbox license usage
- MDCS worker processes (a.k.a. "Labs") never request regular Matlab or toolbox licenses; each worker uses only a single MDCS worker license (of which we have up to 64). Toolboxes are unlocked to an MDCS worker based on the licenses owned by the client during the job submission process.
- A regular Matlab license + Parallel Computing Toolbox license is only needed during job submission, leaving regular Matlab and toolbox licenses (e.g. Statistics Toolbox) available for others to use.
- Code requiring non-compilable toolboxes (SimBiology, others) can be run without using up licenses, including codes that cannot be compiled with the Matlab Compiler Toolbox.

Scalability of jobs
- Job queues allow scaling to a large number of jobs, for example running many jobs for a parameter scan of a time-consuming, parallel-enabled simulation: submit the jobs and the MDCS scheduler manages the rest.

Built-in functions such as fft work directly on distributed arrays; see the sketch below.
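For instance, a built-in function such as fft can be applied directly to a distributed array, so both the data and the compute are spread across the workers rather than held on the client. A minimal sketch, assuming the default profile points at an available set of MDCS workers:

matlabpool open 8
D = distributed.rand(8192, 64);    % columns of one logical array spread over the 8 workers
F = fft(D);                        % built-in fft runs where the data lives
colMeans = gather(mean(abs(F)));   % bring only the small 1x64 result back to the client
matlabpool close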

8 MDCS Structure The MDCS cluster is accessible via the Cheetah cluster at the CBI Laboratory. Log in with ssh -Y, then run qlogin; this takes you to an interactive development node, where you can set up your connection to the MDCS cluster. (An allocation needs to be created on a per-project basis, during a consulting meeting.)

9 Hardware/Software/Utilization @ CBI
MDCS worker processes run on 4 physical servers (the physical server implementation of the CBI Lab):
- Dell PowerEdge M910: four 16-core systems with 64 GB RAM each, each system having 2 Intel Xeon 2.26 GHz processors with 8 cores per processor
- Total of 64 cores, with 256 GB total RAM distributed among the systems
- A maximum of 64 MDCS worker licenses is available
- Subsets of MDCS workers can be created based on project needs

10 Usage scenarios
Local system: interactive use (matlabpool / spmd / pmode / mpiprofile) on a local system (e.g., a CBI workstation) as part of initial algorithm development.
MDCS: non-interactive use, job and task based, with 2 main types: independent vs. communicating jobs. Both types can be used with either the local profile (on a non-cluster workstation) or the MDCS profile.
- Local mode should be used for design and development; for performance testing, the MDCS should be used. In local mode, each worker ("lab") is mapped to an OS process running a Matlab worker, and starting up workers locally incurs overhead.
- Up to 12 local workers can be used (e.g., on a local workstation or laptop with Matlab + PCT).
The key point is that exactly the same code developed locally can be run on the MDCS, as the sketch below illustrates.
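A sketch of that idea using the batch command (the profile name 'CBI_MDCS' and the function myLongSimulation are placeholders, not actual CBI names):

paramSet = 1:10;                   % hypothetical input to the simulation

% Development run with the local profile (up to 12 workers on a workstation)
jLocal = batch(@myLongSimulation, 1, {paramSet}, 'Profile', 'local');

% Performance run: the same call, only the profile changes
jMDCS  = batch(@myLongSimulation, 1, {paramSet}, 'Profile', 'CBI_MDCS');

wait(jMDCS);                       % block until the batch job finishes
out = fetchOutputs(jMDCS);         % collect the single output argument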

11 MDCS Workloads
Two main types of workloads can be implemented with the MDCS. A job is logically decomposed into a set of tasks; a job may have one or more tasks, and each task may or may not have additional parallelism within it.

CASE 1: Independent. Within the job the parallelism is fully independent, so MDCS workers can be used to offload the independent work units. The code does not use parallel language features such as parfor or spmd. Note: in many cases, parfor can be transformed into a set of tasks: createJob() + createTask(), createTask(), ... createTask(). This works just like a grid scheduler: each task is completely independent, which suits parameter-scan types of workloads well.

CASE 2: Communicating. Within a single job the parallelism is more complex, requiring the workers to communicate, or the code uses Parallel Computing Toolbox language features such as parfor, spmd, or codistributed arrays: createCommunicatingJob(), createTask(). For example, parfor requires communication between the workers involved, since the work of a single loop must be partitioned, data must be transferred to workers, and results must be gathered on the main node; all of this is handled automatically by the Parallel Computing Toolbox. Only one task can be created within a communicating job.

Note: interactive use of the matlabpool/spmd commands is not recommended on the MDCS, since it locks up workers and bypasses the Matlab scheduler. Using them with the local configuration on a workstation, however, is a useful way to develop and test your algorithm. The matlabpool command should never appear in source code submitted to the MDCS. Both cases are sketched below.
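A minimal sketch of both cases against the R2012a-era job API (the profile name 'CBI_MDCS' is a placeholder, and myParallelKernel stands in for a user function that uses labindex/labSend internally):

c = parcluster('CBI_MDCS');                 % placeholder cluster profile name

% CASE 1: independent job, one task per work unit, no communication between tasks
job1 = createJob(c);
for k = 1:8
    createTask(job1, @rand, 1, {1000});     % each task returns one 1000x1000 matrix
end
submit(job1);
wait(job1);
out = fetchOutputs(job1);                   % 8x1 cell array, one entry per task

% CASE 2: communicating job, exactly one task, workers cooperate via spmd constructs
job2 = createCommunicatingJob(c, 'Type', 'SPMD');
job2.NumWorkersRange = [4 8];               % let the scheduler assign 4 to 8 workers
createTask(job2, @myParallelKernel, 1, {}); % hypothetical function built on labindex/labSend
submit(job2);
wait(job2);
res = fetchOutputs(job2);                   % collect the outputs when the job finishes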

12 MDCS Working Environment
Click on the “Parallel” Button

13 MDCS Working Environment
The Cluster Profile Manager allows you to run tests to ensure your connection to the MDCS is working properly.

14 Interactive Mode Sample (parfor)
For workloads that map well to it, parfor can yield exceptional performance improvements, from years to days or days to hours for certain workloads; the ideal cases are long-running iterations with little or no inter-iteration communication. In interactive mode (matlabpool), parfor automatically distributes the loop iterations among the MDCS workers. Only run in interactive mode if you have exclusive access reserved for a set of MDCS workers, as interactive sessions impede other users of the MDCS cluster.

% Standard for loop: Matlab single-threaded
tic;
for i = 1:64
    for j = 1:100
        % Some very long-running process
        dataaverage(i) = mean(mean(fft2(rand(1000,1000))));
    end
end
toc;
[ seconds running on compute-5-1, single-threaded ]

% Matlab implicit parallelism: the same loop with Matlab's default implicit
% multi-threading (Matlab tries to use as many cores as are available)
[ seconds running on compute-5-1, 16 cores, Matlab implicit multi-threading ]

% parfor enabled on the MDCS
matlabpool open 2
tic;
parfor i = 1:64
    for j = 1:100
        dataaverage(i) = mean(mean(fft2(rand(1000,1000))));
    end
end
toc;
matlabpool close
[ seconds running on MDCS, 2 workers ]

% Repeated with matlabpool open 4, 8, 16, 20, 24, 28, 32, 48, and 64
[ seconds running on MDCS at each pool size ]

15 MDCS Scaling (Batch Mode)
Processing many images in batch mode, with one job and an independent task for each image to be processed. Often, better scaling is achieved by moving up a level on the parallelism ladder: instead of assigning more and more workers to process a single image, process more and more images with one worker per image.
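One way to express that pattern is a single independent job with one task per image (a sketch; the file list, the processOneImage function, and the 'CBI_MDCS' profile name are placeholders):

c = parcluster('CBI_MDCS');                         % placeholder cluster profile name
files = {'img001.tif', 'img002.tif', 'img003.tif'}; % placeholder image list

job = createJob(c);
for k = 1:numel(files)
    % one independent task per image, so one MDCS worker handles one image at a time
    createTask(job, @processOneImage, 1, {files{k}});   % hypothetical per-image function
end
submit(job);
wait(job);
results = fetchOutputs(job);                        % one result per image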

16 MDCS Scaling (Batch Mode)
MDCS job display panel.

17 MDCS Scaling (Batch Mode)

Results of a batch parameter scan using MDCS workers, with 1 image per worker.

18 Summary Applied examples of using the MDCS in batch mode are available as part of the hands-on section, or via a consulting appointment for more in-depth MDCS usage information. We can allocate a subset of MDCS workers on a per-project basis.

19 Summary Wonderful parallel algorithm design & development environment
- Scale out codes to up to 64 Matlab MDCS workers, in both distributed compute and memory
- Minimize standard Matlab + toolbox license usage
- Many options for approaching the parallelization of computational workloads: parfor, spmd, distributed arrays, communicating jobs, batch independent jobs, ...

20 Acknowledgements This project received computational, research & development, and software design/development support from the Computational System Biology Core/Computational Biology Initiative, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) from the National Institutes of Health.

21 Contact Us

22 Appendix A See the parforFFTdemo.m file for full source code.

23 Local Mode: Matlab Worker Process/Thread Structure
Parallel Computing Toolbox constructs can be tested in local mode; the "lab" abstraction allows the actual process used for a lab to reside either locally or on a distributed server node. MPI is used for inter-process communication between "Labs" (Matlab worker processes). Note: by default, Matlab uses as many computational threads as there are physical cores on a system. The -singleCompThread option can be added to the command line when starting Matlab to force a single computational thread, and the maxNumCompThreads function returns the current maximum number of computational threads in use.
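For example, the current thread limit can be inspected from the Matlab prompt, and a single-threaded session can be requested at startup (a small sketch):

nthreads = maxNumCompThreads    % maximum number of computational threads; defaults to the number of physical cores
% To force a single computational thread, e.g. for clean benchmarking, start Matlab as:
%   matlab -singleCompThread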

24 Local Mode Scaling Sample (parfor)

Using more workers ("labs") than available physical CPU cores will not improve performance.

25 Interactive Mode Sample (pmode/spmd)
In interactive mode (pmode) you have direct command-line access to multiple labs, where each lab is identified by the variable labindex. This allows you to create distributed arrays and have each worker process a different section of the matrix. Each lab handles a piece of the data, results are gathered on lab 1, and the client session can request the complete data set to be sent to it using lab2client. The same pattern written with spmd is sketched below.
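A minimal sketch of that pattern using spmd instead of the pmode command line (assuming an open pool; gplus combines the per-lab partial sums, and the codistributed array is visible to the client afterwards as a distributed array):

matlabpool open 4
spmd
    % one logical 1000x4000 array whose columns are spread across the 4 labs
    D = codistributed.rand(1000, 4000);
    localPiece = getLocalPart(D);        % this lab's own columns
    fprintf('lab %d holds %d columns\n', labindex, size(localPiece, 2));
    partialSum = sum(localPiece(:));     % each lab reduces only its own piece
    totalSum   = gplus(partialSum);      % add the partial sums across all labs
end
result   = totalSum{1};                  % totalSum is a Composite; read lab 1's copy
colMeans = gather(mean(D, 1));           % the client gathers a small result from the distributed D
matlabpool close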

26 Local vs. MDCS Mode Compare (parfor)
This shows adding more workers to process a single image. There are many options when choosing a parallelization strategy; it might be better to use a single worker per image and run the images through a set of batch jobs with one job per image.

27 Appendix B: MDCS Access
Access to the MDCS is provided via the Cheetah cluster, and is available from both Windows and Linux systems. On Linux: ssh -Y to the cluster, then qlogin, then matlab &.

28 Appendix B: MDCS Access
Access to the MDCS is provided via the Cheetah cluster, and is available from both Windows and Linux systems. On Windows: use PuTTY + Xming with X11 forwarding, then qlogin, then matlab &. Refer to the CBI X-forwarding guide.


