© 2005 The MathWorks, Inc. Handling Large Data Sets Efficiently in MATLAB ® Stuart McGarrity The MathWorks, Inc.

Similar presentations


Presentation on theme: "© 2005 The MathWorks, Inc. Handling Large Data Sets Efficiently in MATLAB ® Stuart McGarrity The MathWorks, Inc."— Presentation transcript:

1 © 2005 The MathWorks, Inc. Handling Large Data Sets Efficiently in MATLAB ® Stuart McGarrity The MathWorks, Inc. stuart.mcgarrity@mathworks.com

2 2 Handling Large Data Sets is Like…

3 3 Agenda  Problems in handling large data sets  Strategies for handling large data set  Maximizing available memory on your system  Minimizing required memory in MATLAB ®

4 4 Problems in Handling Large Data Sets

5 5 What are Large Data Sets?  Data in MATLAB ® represent physical quantities  Large Data – Lots of quantities – Varying by time, space – High resolution – Example: 5-10 TB data per flight test  Trends: Devices, Computers, RAM, Hard drives

6 6 What are Large Data Handling Problems?  Running out of memory – Large data sets need lots of memory to store and process – Computers have finite memory – Data set size > available memory: “Out of memory” errors  Slowness – Large data sets need lots of operations to process, and access – Today's CPUs have limited speed – Slowness due to page file use

7 7 Causes of Out of Memory Error  Required memory > available memory – Memory constraints on (32-bit) computer system – Memory usage characteristics of MATLAB  Lack of understanding of memory constraints or requirements – >>A=rand(6e3,6e3);B=svd(A); ” Why out of memory?” – “I have 1 GB file but I have 3 GB of RAM. Why out of memory?”  Mistakes – >>a=rand(10000); % need 800MB storage

8 8 Example from CSSM: MATLAB Memory Limitation Problem!!!!!!!!!!! "erican" wrote in message news:...news: It seems Matlab has a bad memory management. I got a new PC with 4G physical memory and hoped to free my worry about memory usage. However, I found that I can only use about 1G memory for several big matrices. If I continue to create small matrix, I can use about 1.5G memory. Although I tried to change the system performance so that the swap space was set to the maximum, it still doesn't work. I want to maximize the usage of my 4G memory, at least to 2G. Any suggestions? Thanks!

9 9 MATLAB Users’ Data Set Sizes  25% of MATLAB users have data set sizes > 100 MB

10 10 MATLAB Users’ File Sizes  30% of MATLAB users access files > 100 MB

11 11 Strategies for Handling Large Data Sets

12 12 Strategies  Ensure available memory > required memory  Maximizing available memory on your system – System configuration  Minimize required memory in MATLAB – During access, storage, processing, plotting

13 13 Two Tactics Not Focused on Today  Use 64-bit – Removes one key memory constraint allowing processes to address many tera bytes of data.  Distributed computing – N Machines ~= N x Memory – Subset of all applications (data parallel)

14 14 Maximizing Available Memory on Your System

15 15 What’s the biggest array in MATLAB under Windows XP? >>a=zeros(?,1); >>whos a) 600MB b) 1GB c) 1.5GB

16 16 Memory Constraints Causing Out of Memory Errors Datazeros(1e9,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryrequirement } RAM Page File } Total System Memory

17 17 Total System Memory Datarand(1e8,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryRequirement } RAM Page File } Total System Memory

18 18 Total System Memory Available  Storage for all processes = – Physical RAM (fast and expensive) + – Page file on disk (cheap and slow)  Memory Management Guide Tech note 1106Tech note 1106  Amount of RAM affects performance; not direct cause of “out of memory” errors

19 19 Viewing Total System Memory Available and Usage: Task Manager  Alt-Ctrl-Del  Right-click task bar  Physical, commit charge  Process Explorer (Google “process explorer”)

20 20 Maximizing Total System Memory  Size – Ensure non-zero or system managed page file – Max 4 GB  Performance – Add RAM – Max 4 GB

21 21 The MATLAB Process’s Virtual Memory Datarand(1e8,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryRequirement } RAM Page File } Total System Memory

22 22 It is Limited and OS-Dependent  32-bit platforms – Windows 2000 and XP (by default): 2 GB – Linux/UNIX/MAC system configurable: 3-4 GB – Windows XP with /3gb boot.ini switch: 3 GB  64-bit platforms – Linux: 8TB (not all 64 bits used)

23 23 Maximizing The MATLAB Process’s Virtual Memory  Choose OS with largest process memory (in order): – 64-bit Linux, (future Win64) – 32-bit UNIX/Linux/MAC – Windows XP with /3gb – Window 2000, Windows XP (by default)

24 24 Checking The Virtual Memory Limit >>system_dependent memstats UNDOCUMENTED

25 25 Increasing Process Limit on XP to 3G: 3GB Switch  Right-click Properties> Advanced > Startup and Recovery > Edit  Make copy of [Operating system line], change comment and add /3gb  Reboot, select new OS option, check memstats

26 26

27 27

28 28

29 29

30 30

31 31

32

33 33

34 34

35 35

36 36

37 37

38 38 The MATLAB Process’s Virtual Memory Limit with 3GB Switch >>system_dependent memstats UNDOCUMENTED

39 39 Workspace Size and Largest Block Datarand(1e8,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryRequirement } RAM Page File } Total System Memory

40 40 Workspace Size and Largest Block  Workspace Size = 2/3GB minus: – System DLLs – Java – MATLAB.exe and DLLs  Largest block (for numerical arrays) – Fragmentation (Mainly on Windows) – Third party DLLs (e.g., Google Desktop, Fineprint) – Windows security updates

41 41 Finding Size of Workspace and Largest Block > >system_dependent memstats  Workspace size – 1.7 or 2.7 GB  Largest block – Goal 1.5 GB – Diagnose if less Atlantis UNDOCUMENTED

42 42 Diagnosing Memory (Workspace) Fragmentation >>system_dependent dumpmem  Shows MATLAB memory map and DLLs  Shows where largest blocks starts  Reveals causing fragmentation UNDOCUMENTED

43 43 Example: Print Driver and New System DLLs  Third party DLLs (e.g. Fineprint)  Windows Security Updates and service pack DLLs – Use movedlls.exe – www.mathworks.com/support (Solution Number: 1-1HE4G5) www.mathworks.com/support

44 44 Maximizing Largest Contiguous Block  Uninstall third-party tools if issue  Use XP SP2 and movedlls fix utility  Other techniques – Pack (save and load): Useful when using lots of variables after working for a while – Restart MATLAB

45 45 Final Memory Available on XP  32-bit Windows XP default – 1.7 GB total, 1.5 GB contiguous  32-bit Windows XP with /3gb switch – 2.7 GB Total, 1.5 GB contiguous  Rough guide what is possible – Can process 100s MB arrays with simple operations – Can process 10s MB arrays with complex operations, many operations, lines of codes

46 46 What’s the biggest real array in MATLAB under 64-bit Linux? >>a=zeros(?,1); >>whos a) 2 GB b) 16 GB c) 8 TB

47 47 64-bit Linux MATLAB  Workspace size limit: TB’s  Single array number of elements size limit  2e9 elements (2^31-2), mxarray limitation

48 48 Summary: Maximizing Available Memory  Maximize total system memory and performance – Use non-zero page file and have lots RAM  Minimize the system memory requirements – Close other applications  Maximize MATLAB process’s virtual memory – Use best OS or configure  Maximize largest contiguous block – Diagnose fragmentation

49 49 Minimizing Required Memory in MATLAB

50 50 Minimizing Memory Requirements  Data access  Data storage  Processing  Plotting

51 51 Application Problem Description  Semiconductor wafer thickness test data – Six months of production, millions of wafers  Problem – What percentage of the wafers manufactured last month meet thickness specifications?  Thickness data – Large text waferdata.csv – Nine position plus other information – Try import – View contents waferdata_start.csv

52 52 Data Access

53 53 Take Only What You Need  Take only what you need for the calculation – Not usually problem with databases – Common problem with big flat files  Consider block processing – Independent blocks – State saved (e.g., filtering)  Tip: Clear variable first

54 54 For Text Files Use textscan  Take rows and columns you need  For example: data = textscan(fid, format, N, ‘delimiter’,’,’) – Select columns with format string – Select rows with N  Returns cells for each data type in format string. You need to convert to doubles.  Only read in nine columns and one month (1e6 rows) Exercise 5

55 55 Binary Files: Try memmapfile  Map a section of a binary file into memory  Benefits – Faster than fread and fwrite – Access files with MATLAB indexing operations  Other – Random access to sections – Can have multiple views – Take from MATLAB memory  Example – Simple homogenous file – Mixed data types and access as arrays

56 56 Memory Mapping Example % Default, map whole file as uint8 m=memmapfile('waferdata_uint8.bin') m.Data(1:20); % Specify format and name m=memmapfile('waferdata_uint8.bin','format',{ 'uint8' [20 100] 'x'},'repeat',20*1000) A=m.Data; % Change format on the fly m.format={'uint8' [1 4] 'headerbits';... 'uint8' [4 9] 'middle';... 'uint8' [7 1] 'tail'}; A=m.Data;

57 57 Data Storage

58 58 Use Smallest Data Type  Depends on intended actions  Complicated Math (e.g., Linear Algebra) – Doubles or singles, 8 or 4 bytes – For example: a=single(7)  Simple arithmetic and original data is integers – Integers, 1-4 bytes, for example e.g. a=int8(7) – Can be faster than doubles – Try with waferdata (Exercise 6)  Sparse – Just non-zero values and index stored >>a = sparse(2e9,1,pi); Exercise 6a cell execute

59 59 Use Smallest Data Type (cont.)  Categories, dates – Cell arrays of strings, 60 byte header for each element – Use sparingly  Contiguousness – Numeric arrays must be contiguous – Cell arrays and structures do not  Comparison with C – For numerical processing, similar choice of data types

60 60 Load and Store as uint8  Read in uint8 – Note: when preallocating need to specify data types Exercise 6

61 61 Processing

62 62 Memory for Processing  Calculate only the results you need  MATLAB operators and functions need extra memory storage – Passing data to functions by value and assignments – Copy on write  Makes MATLAB safer, easier to debug – More memory than in-place operations using pointers

63 63 Monitoring MATLAB Memory Usage  Process Explorer or Task Manager  MATLAB Monitoring Tool: www.MATLABcentral.com www.MATLABcentral.com  MATLAB Central > File Exchange > Utilities > Development Environment > MATLAB Monitoring Tool  Example: – x=rand(10e6,1); – y=x; y(2)=1.5; UNDOCUMENTED

64 64 Monitoring MATLAB Memory Usage (cont.)  Total memory usage: workspaces, graphics >>system_dependent(‘CheckMallocMemoryUsage’); ans = 6820440  Show results in bytes (not MBytes)  Starts with a few MB  To use it, set an environmental variable – MATLAB_MEM_MGR debug – In DOS window, C:\ setx MATLAB_MEM_MGR debug UNDOCUMENTED

65 65 How much temporary memory does data=data+1 require, where data is a 1 MB array take in an M-File? Run M-file containing: data=zeros(1e6,1,’int8’); data=data+1; a) 0 MB b) 1 MB c) 2 MB

66 66 JIT Vs Interpreter for Large Data Set Handling  Run loaddata  Offset in the thicknesses data needs to be corrected with data = data+5 – At command line – In M-File Run Loaddata, then exercise 7a UNDOCUMENTED

67 67 Minimizing Copies and Temporaries  Share data rather than pass to functions – Nested Functions – Global  Reduce size of array to scalars or smaller blocks – Memory copies are equal to size of array – Process with for loops, de-vectorize  Can be slower, so must trade off speed for less memory

68 68 What is the fastest way to process MATLAB matrices with for loops? a) Down the columns b) Along the rows c) Doesn't matter Exercise 7 UNDOCUMENTED

69 69 Example: De-Vectorizing  1D  2D Exercise 7

70 70 Find Percentage of Wafers Meeting Specification  What percentage of the wafers meet specification – Must be < Maximum thickness of 200 – Must be > Minimum thickness of 70 exercise 8a, 8c

71 71 Plotting

72 72 How much extra memory do you need to plot a 10 MB double array? >>x=rand(125e4,1); >>plot(x); a) 10 MB b) 20 MB c) 40 MB Exercise 9a, process explorer on MATLAB task

73 73 Memory Copies When Plotting  Need extra memory to plot – For example: >>plot(data(:,1));  plot(x,y) – Copy of x, y to xdata and ydata properties of line  plot(y) – Copy of y in ydata and indices in xdata, 1:size(y)  Results – Memory requirements triple at least (more temporarily) – Integers more (stored as doubles)

74 74 Minimize Copies: Plot Only What You Need  Limited resolution of screen/human eye  Sub-select, down-sample – Plot every Nth element – Plot min/max in each block

75 75 Minimize Memory Requirements: Review  Accessing data – Take only what you need – Try processing in blocks  Data Storage – Store in smallest data type  Processing – Reduce temporaries with loops, blocking, nested, globals  Plotting – Plot what you need

76 76 Summary: Strategies For Handling Large Data Sets  Ensure available memory > required memory  64-bit and distributed computing  32-bit single CPU – Maximize available memory – Minimize required memory

77 77 Appendix

78 78 Appendix Contents  Resources  1 MB is Not Equal to Million Bytes  Setting the Paging File Size  File Size vs. Data Size  Minimizing Other Applications’ Memory

79 79 Resources  Technical note 1106 – www.mathworks.com/support, enter 1106  “Large Data Set Handling in MATLAB 7” Digest Article November 2004: www.mathworks.com/company/newsletters/digest/nov0 4/newfeatures.html

80 80 1 MB is Not Equal to Million Bytes  1 KB= 2^10 bytes = 1024 bytes – Not 1e3 bytes  1 MB=2^20 bytes = 1048576 bytes – Not 1e6 bytes  1 GB=2^30 bytes = 1073741824 bytes – Not 1e9 bytes

81 81 Setting the Paging File Size  My Computer > Properties > Advanced > Performance > Advanced > Virtual memory, Change then press “Set”  Set to System Managed Size or Custom  Slow down with page file, swapping

82 82 File Size vs. Data Size  Binary files: File size = data size – Assuming you use same data type as in the binary files  Text files: File size ~= data size, for example: – Data element: double= 8 bytes – Text file element, format 1: 3.4, 4 chars = 4 bytes – Text file element, format 2: 3.47896e+001, 13 chars= 13 bytes

83 83 Minimizing Other Applications’ Memory  Shut them down  Restart Computer, if can’t get < 300 MB  Use windows msconfig to avoid starting up applications  Not too important if you have a page file


Download ppt "© 2005 The MathWorks, Inc. Handling Large Data Sets Efficiently in MATLAB ® Stuart McGarrity The MathWorks, Inc."

Similar presentations


Ads by Google