Presentation is loading. Please wait.

Presentation is loading. Please wait.

Spring 20111 MATLAB Tutorial Series Tuning MATLAB for Better Performance Kadin Tseng Scientific Computing and Visualization, IS&T Boston University.

Similar presentations


Presentation on theme: "Spring 20111 MATLAB Tutorial Series Tuning MATLAB for Better Performance Kadin Tseng Scientific Computing and Visualization, IS&T Boston University."— Presentation transcript:

1 Spring 20111 MATLAB Tutorial Series Tuning MATLAB for Better Performance Kadin Tseng Scientific Computing and Visualization, IS&T Boston University

2 Spring 20112 Topics Covered 1. Performance Issues 1.1 Memory Allocations 1.1 Memory Allocations 1.2 Vector Representations 1.2 Vector Representations 1.3 Compiler 1.3 Compiler 1.4 Other Considerations 1.4 Other Considerations 2. Multiprocessing with MATLAB

3 Spring 20113 1.1 Memory Access 1.2 Vector Representations 1.3 Compiler 1.4 Other Considerations 1. Performance Issues

4 Spring 20114 Memory access patterns often affect computational performance. Some effective ways to enhance performance in MATLAB :  Allocate array memory before using it  For-loops Ordering  Compute and save array in-place where applicable. 1.1 Memory Access

5 Spring 20115  MATLAB arrays are allocated in contiguous address space. How Does MATLAB Allocate Arrays ? Without pre-allocation x = 1; for i=2:4 x(i) = i; end Memory Address Array element 1x(1) …... 2000x(1) 2001x(2) 2002x(1) 2003x(2) 2004x(3)... 10004x(1) 10005x(2) 10006x(3) 10007x(4)

6 Spring 20116  MATLAB arrays are allocated in contiguous address space.  Pre-allocate arrays enhance performance significantly. How … Arrays ? Examples n=5000; tic for i=1:n x(i) = i^2; end toc Wallclock time = 0.00046 seconds n=5000; x = zeros(n,1); tic for i=1:n x(i) = i^2; end toc Wallclock time = 0.00004 seconds The timing data are recorded on Katana. The actual times may vary depending on the processor.

7 Spring 20117 MATLAB uses pass-by-reference if passed array is used without changes; a copy will be made if the array is modified. MATLAB calls it “lazy copy.” Consider the following example: function y = lazyCopy(A, x, b, change) If change, A(2,3) = 23; end % change forces a local copy of a y = A*x + b; % use x and b directly from calling program pause(2) % keep memory longer to see it in Task Manager On Windows, can use Task Manager to monitor memory allocation history. >> n = 5000; A = rand(n); x = rand(n,1); b = rand(n,1); >> y = lazyCopy(A, x, b, 0); >> y = lazyCopy(A, x, b, 1); Passing Arrays Into A Function

8 Spring 20118  Best if inner-most loop is for array left-most index, etc. (column- major)  For a multi-dimensional array, x(i,j), the 1D representation of the same array, x(k), follows column-wise order and inherently possesses the contiguous property For-loop Ordering n=5000; x = zeros(n); for i=1:n % rows for j=1:n % columns x(i,j) = i+(j-1)*n; end Wallclock time = 0.88 seconds n=5000; x = zeros(n); for j=1:n % columns for i=1:n % rows x(i,j) = i+(j-1)*n; end Wallclock time = 0.48 seconds for i=1:n*n x(i) = i; end x = 1:n*n;

9 Spring 20119  Compute and save array in-place improves performance (and reduce memory usage) Compute In-place x = rand(5000); tic y = x.^2; toc Wallclock time = 0.30 seconds x = rand(5000); tic x = x.^2; toc Wallclock time = 0.11 seconds Caveat: May not be worthwhile if it involves data type or size change …

10 Spring 201110  Generally, better to use function instead of script  m-file  Script m-file is loaded into memory and evaluate one line at a time. Subsequent uses require reloading.  Function m-file is compiled into a pseudo-code and is loaded on first application. Subsequent uses of the function will be faster without reloading.  Function is modular; self cleaning; reusable.  Global variables are expensive; difficult to track.  Physical memory is much faster than virtual mem.  Avoid passing large matrices to a function and modifying only a handful of elements. Other Considerations

11 Other Considerations (cont’d)  load and save are efficient to handle whole data file; textscan is more memory-efficient to extract text meeting specific criteria.  Don’t reassign array that results in change of data type or shape.  Limit m-files size and complexity.  Computationally intensive jobs often require large memory …  Structure of array more memory-efficient than array of structures. Spring 201111

12 Spring 201112  Maximize memory availability.  32-bit systems < 2 or 3 GB  64-bit systems running 32-bit MATLAB < 4GB  64-bit systems running 64-bit MATLAB < 8TB  (93 GB on some Katana nodes)  Minimize memory usage. (Details to follow …) Memory Management

13 Spring 201113  Use clear, pack or other memory saving means when possible. If double precision (default) is not required, the use of ‘single’ data type could save substantial amount of memory. For example, >> x=ones(10,'single'); y=x+1; % y inherits single from x >> x=ones(10,'single'); y=x+1; % y inherits single from x  Use sparse to reduce memory footprint on sparse matrices >> n=5000; A = zeros(n); A(3,2) = 1; B = ones(n); >> n=5000; A = zeros(n); A(3,2) = 1; B = ones(n); >> C = A*B; >> C = A*B; >> As = sparse(A); >> As = sparse(A); >> Cs = As*B; % it can save time for low density >> Cs = As*B; % it can save time for low density >> A2 = sparse(n,n); A2(3,2) = 1; >> A2 = sparse(n,n); A2(3,2) = 1; Minimize Memory Usage

14 Spring 201114  Use “matlab –nojvm …” saves lots of memory – if warranted  Memory usage query For Unix: For Unix: Katana% top Katana% top For Windows: For Windows: >> m = feature('memstats'); % largest contiguous free block >> m = feature('memstats'); % largest contiguous free block Use MS Windows Task Manager to monitor memory allocation. Use MS Windows Task Manager to monitor memory allocation.  Distribute memory among multiprocessors via MATLAB Parallel Computing Toolbox. Parallel Computing Toolbox. Minimize Memory Usage (Cont’d)

15 Spring 201115 MATLAB provides a few functions for processing real, noncomplex, data specifically. These functions are more efficient than their generic versions:  realpow – power for real numbers  realsqrt – square root for real numbers  reallog – logarithm for real numbers  realmin/realmax – min/max for real numbers Special Functions for Real Numbers n = 1000; x = 1:n; x = x.^2; tic x = sqrt(x); toc Wallclock time = 0.00022 seconds n = 1000; x = 1:n; x = x.^2; tic x = realsqrt(x); toc Wallclock time = 0.00004 seconds isreal reports whether the array is real single/double converts data to single-/double-precision

16 Spring 201116  MATLAB is designed for vector and matrix operations. The use of for-loop, in general, can be expensive, especially if the loop count is large or nested.  Without array pre-allocation, its size extension in a for-loop is costly as shown before.  From a performance standpoint, in general, vector representation should be used in place of for-loops. Vector Operations i = 0; for t = 0:.01:100 i = i + 1; y(i) = sin(t); end Wallclock time = 0.1069 seconds t = 0:.01:100; y = sin(t); Wallclock time = 0.0007 seconds

17 Spring 201117 Vector Operations of Arrays >> A = magic(3) % define a 3x3 matrix A A = 8 1 6 8 1 6 3 5 7 3 5 7 4 9 2 4 9 2 >> B = A^2; % B = A * A; >> C = A + B; >> b = 1:3 % define b as a 1x3 row vector b = 1 2 3 1 2 3 >> [A, b'] % add b transpose as a 4th column to A ans = 8 1 6 1 8 1 6 1 3 5 7 2 3 5 7 2 4 9 2 3 4 9 2 3

18 Spring 201118 Vector Operations >> [A; b] % add b as a 4th row to A ans = 8 1 6 8 1 6 3 5 7 3 5 7 4 9 2 4 9 2 1 2 3 1 2 3 >> A = zeros(3) % zeros generates 3*3 array of 0’s A = 0 0 0 0 0 0 >> B = 2*ones(2,3) % ones generates 2 * 3 array of 1’s B = 2 2 2 2 2 2 Alternatively, >> B = repmat(2,2,3) % matrix replication

19 Spring 201119 Vector Operations >> y = (1:5)’; >> n = 3; >> B = y(:, ones(1,n)) % B = y(:, [1 1 1]) or B=[y y y] B = 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 Again, B can be generated via repmat as >> B = repmat(y, 1, 3);

20 Spring 201120 Vector Operations >> A = magic(3) A = 8 1 6 8 1 6 3 5 7 3 5 7 4 9 2 4 9 2 >> B = A(:, [1 3 2]) % switch 2nd and third columns of A B = 8 6 1 8 6 1 3 7 5 3 7 5 4 2 9 4 2 9 >> A(:, 2) = [ ] % delete second column of A A = 8 6 8 6 3 7 3 7 4 2 4 2

21 Spring 201121 Vector Operation Example n = 3000; x = zeros(n); for j=1:n for i=1:n x(i,j) = i+(j-1)*n; x(i,j) = x(i,j)^2; end Wallclock time = 0.128 seconds n = 3000; i = (1:n) '; I = repmat(i,1,n); % replicate along j x = I + (I'-1)*n; x = x.^2; Wallclock time = 0.127 seconds Notes on the vector form of the computations : To eliminate the for-loops, all values of i and j must be made available at once. The ones or repmat utilities can be used to replicate rows or columns. In this special case, J = I' is used to save computations and memory. Often, there are trade-offs between efficiency and memory using the vector form. Here, the creation of I adds to the memory and compute time. However, as more works can be leveraged against I, efficiency improves.

22 Spring 201122 Functions Useful in Vectorizing FunctionDescription allTest to see if all elements are of a prescribed value anyTest to see if any element is of a prescribed value zerosCreate array of zeroes onesCreate array of ones repmatReplicate and tile an array findFind indices and values of nonzero elements diffFind differences and approximate derivatives squeezeRemove singleton dimensions from an array prodFind product of array elements sumFind the sum of array elements cumsumFind cumulative sum shiftdimShift array dimensions logicalConvert numeric values to logical sortSort array elements in ascending /descending order

23 Spring 201123 Laplace Equation Laplace Equation: Boundary Conditions: (1) (2) Analytical solution: (3)

24 Spring 201124 Discrete Laplace Equation Discretize Equation (1) by centered-difference yields: where n and n+1 denote the current and the next time step, respectively, while For simplicity, we take (4) (5)

25 Spring 201125 Computational Domain x, i y, j

26 Spring 201126 Five-point Finite-Difference Stencil x Interior cells. Where solution of the Laplace equation is sought. (i, j) Exterior cells. Green cells denote cells where homogeneous boundary conditions are imposed while non-homogeneous boundary conditions are colored in blue. x x x x o x x x x o

27 Spring 201127 SOR Update Function How to vectorize it ? 1.Remove the for loops 2.Define i = ib:2:ie; 3.Define j = jb:2:je; 4.Use sum for del % equivalent vector code fragment jb = 2; je = n+1; ib = 3; ie = m+1; i = ib:2:ie; j = jb:2:je; up = ( u(i,j+1) + u(i+1,j ) +... u(i-1,j ) + u(i,j-1) )*0.25; u(i,j) = (1.0 - omega)*u(i,j) + omega*up; del = del + sum(sum(abs(up-u(i,j)))); % original code fragment jb = 2; je = n+1; ib = 3; ie = m+1; for i=ib:2:ie for j=jb:2:je up = ( u(i,j+1) + u(i+1,j ) +... u(i-1,j ) + u(i,j-1) )*0.25; u(i,j) = (1.0 - omega)*u(i,j) +omega*up; del = del + abs(up-u(i,j)); end More efficient way ?

28 Spring 201128 Solution Contour Plot Solution Contour Plot

29 Spring 201129 m Matrix size Wallclock ssor2Dij for loops Wallclock ssor2Dji reverse loops Wallclock ssor2Dv vector 1281.010.980.26 2568.077.641.60 51265.8160.4911.27 1024594.91495.92189.05 SOR Update Function

30 Spring 201130 Integration Example An integration of the cosine function between 0 and π /2 Integration scheme is mid-point rule for simplicity. Several parallel methods will be demonstrated. cos(x) a = 0; b = pi/2; % range m = 8; % # of increments h = (b-a)/m; % increment p = numlabs; n = m/p; % inc. / worker ai = a + (i-1)*n*h; aij = ai + (j-1)*h; h x=bx=a mid-point of increment Worker 1 Worker 2 Worker 3 Worker 4

31 Spring 201131 Integration Example — the Kernel function intOut = Integral(fct, a, b, n) %function intOut = Integral(fct, a, b, n) % performs mid-point rule integration of "fct" % fct -- integrand (cos, sin, etc.) % a -- starting point of the range of integration % b –- end point of the range of integration % n -- number of increments % Usage example: >> Integral(@cos, 0, pi/2, 500) % 0 to pi/2 h = (b – a)/n; % increment length intOut = 0.0; % initialize integral for j=1:n % sum integrals aij = a +(j-1)*h; % function is evaluated at mid-interval intOut = intOut + fct(aij+h/2)*h; end Vector form of the function: function intOut = Integral(fct, a, b, n) h = (b – a)/n; aij = a + (0:n-1)*h; intOut = sum(fct(aij+h/2))*h;

32 Spring 201132 Integration Example — Serial Integration % serial integration tic m = 10000; a = 0; b = pi*0.5; intSerial = Integral(@cos, a, b, m); toc

33 Spring 201133 Integration Example Benchmarks  Timings (seconds) obtained on a quad-core Xeon X5570  Computation linearly proportional to # of increments.  FORTRAN and C timings are an order of magnitude faster. increment For loopvector 10000 0.011 0.0002 20000 0.022 0.0004 40000 0.044 0.0008 800000.0870.0016 1600000.17340.0033 3200000.34880.0069

34 Spring 201134 mcc is a MATLAB compiler:  It compiles m-files into C codes, object libraries, or stand-alone executables.  A stand-alone executable generated with mcc can run on compatible platforms without an installed MATLAB or a MATLAB license.  Many MATLAB general and toolbox licenses are available. Infrequently, MATLAB access may be denied if all licenses are checked out. Running a stand-alone requires NO licenses and no waiting. Compiler

35 Compiler (Cont’d)  Some compiled codes may run more efficiently than m-files because they are not run in interpretive mode.  A stand-alone enables you to share it without revealing the source. www.bu.edu/tech/research/training/tutorials/matlab/vector/miscs/compiler/ Spring 201135

36 Compiler (Cont’d) How to build a standalone executable >> mcc –o ssor2Dijc –m ssor2Dij How to run ssor2Dijc on Katana Katana% run_ssor2Dijc.sh /usr/local/apps/matlab_2009b 256 256 Details: The m-file is ssor2Dij.mThe m-file is ssor2Dij.m Input arguments to code are processed as strings by mcc. ConvertInput arguments to code are processed as strings by mcc. Convert with str2num if need be. ssor2Dij.m requires 2 inputs; m, n with str2num if need be. ssor2Dij.m requires 2 inputs; m, n if isdeployed, m=str2num(m); end if isdeployed, m=str2num(m); end Output cannot be returned; either save to file or display on screen.Output cannot be returned; either save to file or display on screen. The executable is ssor2DijcThe executable is ssor2Dijc run_ssor2Dijc.sh is the run script generated by mcc.run_ssor2Dijc.sh is the run script generated by mcc. None of SOR codes benefits, in runtime, from mcc.None of SOR codes benefits, in runtime, from mcc. Spring 201136 MATLAB root Input arguments

37 Spring 201137  profile - profiler to identify “hot spots” for performance enhancement.  mlint - for inconsistencies and suspicious constructs in M-files.  debug - MATLAB debugger.  guide - Graphical User Interface design tool. MATLAB Programming Tools

38 Spring 201138 To use profile viewer, DONOT start MATLAB with –nojvm option >> profile on –detail 'builtin' –timer 'real' >> % run code to be >> % profiled here >> % >> profile viewer % view profiling data >> profile off % turn off profiler Profiling example. >> profile on >> ssor2Dij % profiling the SOR Laplace solver >> profile viewer >> profile off MATLAB profiler '' '' Turns on profiler. –detail 'builtin' enables MATLAB builtin functions; -timer 'real' reports wallclock time.

39 Spring 201139 Two ways to save profiling data: 1.Save into a directory of HTML files Viewing is static, i.e., the profiling data displayed correspond to a Viewing is static, i.e., the profiling data displayed correspond to a prescribed set of options. View with a browser. prescribed set of options. View with a browser. 2. Saved as a MAT file Viewing is dynamic; you can change the options. Must be viewed Viewing is dynamic; you can change the options. Must be viewed in the MATLAB environment. in the MATLAB environment. How to Save Profiling Data

40 Spring 201140 Viewing is static, i.e., the profiling data displayed correspond to a prescribed set of options. View with a browser. >> profile on >> plot(magic(20)) >> profile viewer >> p = profile('info'); >> profsave(p, ‘my_profile') % html files in my_profile dir Profiling – save as HTML files

41 Spring 201141 Viewing is dynamic; you can change the options. Must be viewed in the MATLAB environment. >> profile on >> plot(magic(20)) >> profile viewer >> p = profile('info'); >> save myprofiledata p >> clear p >> load myprofiledata >> profview(0,p) Profiling – save as MAT file

42 Spring 201142  mlint is used to identify coding inconsistencies and make coding performance improvement recommendations.  mlint is a standalone utility; it is an option in profile.  MATLAB editor provides this feature.  Debug mode can also be invoked through editor. MATLAB “grammar checker”

43 Spring 201143 Running MATLAB in Command Line Mode and Batch Katana% matlab -nodisplay –nosplash –r “n=4, myfile(n); exit” Add –nojvm to save memory if Java is not required For batch jobs on Katana, use the above command in the batch script. Visit http://www.bu.edu/tech/research/computation/linux- cluster/katana-cluster/runningjobs/ for instructions on running batch jobs.http://www.bu.edu/tech/research/computation/linux- cluster/katana-cluster/runningjobs/

44 Spring 2011 Comment Out Block Of Statements On occasions, one wants to comment out an entire block of lines. If you use the MATLAB editor: Select statement block with mouse, thenSelect statement block with mouse, then –press Ctrl R keys to insert % to column 1 of each line. –press Ctrl T keys to remove % on column 1 of each line. If you use another editor: %{%{ n = 3000; n = 3000; x = rand(n); x = rand(n); %} %} if 0if 0 n = 3000; n = 3000; x = rand(n); x = rand(n); end end

45 Spring 201145 Explicit parallel operationsExplicit parallel operations MATLAB Parallel Computing Toolbox Tutorial MATLAB Parallel Computing Toolbox Tutorial www.bu.edu/tech/research/training/tutorials/matlab-pct/ www.bu.edu/tech/research/training/tutorials/matlab-pct/www.bu.edu/tech/research/training/tutorials/matlab-pct/ Implicit parallel operationsImplicit parallel operations –Require shared-memory computer architecture (i.e., multicore). –Feature on by default. Turn it off with katana% matlab –singleCompThread katana% matlab –singleCompThread –Specify number of threads with maxNumCompThreads To be deprecated. Still supported on Version 2010b To be deprecated. Still supported on Version 2010b –Activated by vector operation of applications such as hyperbolic or trigonometric functions, some LaPACK routines, Level-3 BLAS. –See “Implicit Parallelism” section of the above link. Multiprocessing With MATLAB

46 Spring 201146 Useful SCV Info  SCV home page (http://scv.bu.edu/) SCV home page SCV home page  Resource Applications (https://acct.bu.edu/SCF) https://acct.bu.edu/SCF  Help  Web-based tutorials (http://scv.bu.edu/)  (MPI, OpenMP, MATLAB, IDL, Graphics tools)  HPC consultations by appointment  Kadin Tseng (kadin@bu.edu)  Doug Sondak (sondak@bu.edu)  help@twister.bu.edu, help@katana.bu.edu Please help us do better in the future by participating in a quick survey: http://scv.bu.edu/survey/spring11tut_survey.html


Download ppt "Spring 20111 MATLAB Tutorial Series Tuning MATLAB for Better Performance Kadin Tseng Scientific Computing and Visualization, IS&T Boston University."

Similar presentations


Ads by Google