Download presentation
Presentation is loading. Please wait.
Published byJesse Lindsey Modified over 9 years ago
1
© 2005 The MathWorks, Inc. Handling Large Data Sets Efficiently in MATLAB ® Stuart McGarrity The MathWorks, Inc. stuart.mcgarrity@mathworks.com
2
2 Handling Large Data Sets is Like…
3
3 Agenda Problems in handling large data sets Strategies for handling large data set Maximizing available memory on your system Minimizing required memory in MATLAB ®
4
4 Problems in Handling Large Data Sets
5
5 What are Large Data Sets? Data in MATLAB ® represent physical quantities Large Data – Lots of quantities – Varying by time, space – High resolution – Example: 5-10 TB data per flight test Trends: Devices, Computers, RAM, Hard drives
6
6 What are Large Data Handling Problems? Running out of memory – Large data sets need lots of memory to store and process – Computers have finite memory – Data set size > available memory: “Out of memory” errors Slowness – Large data sets need lots of operations to process, and access – Today's CPUs have limited speed – Slowness due to page file use
7
7 Causes of Out of Memory Error Required memory > available memory – Memory constraints on (32-bit) computer system – Memory usage characteristics of MATLAB Lack of understanding of memory constraints or requirements – >>A=rand(6e3,6e3);B=svd(A); ” Why out of memory?” – “I have 1 GB file but I have 3 GB of RAM. Why out of memory?” Mistakes – >>a=rand(10000); % need 800MB storage
8
8 Example from CSSM: MATLAB Memory Limitation Problem!!!!!!!!!!! "erican" wrote in message news:...news: It seems Matlab has a bad memory management. I got a new PC with 4G physical memory and hoped to free my worry about memory usage. However, I found that I can only use about 1G memory for several big matrices. If I continue to create small matrix, I can use about 1.5G memory. Although I tried to change the system performance so that the swap space was set to the maximum, it still doesn't work. I want to maximize the usage of my 4G memory, at least to 2G. Any suggestions? Thanks!
9
9 MATLAB Users’ Data Set Sizes 25% of MATLAB users have data set sizes > 100 MB
10
10 MATLAB Users’ File Sizes 30% of MATLAB users access files > 100 MB
11
11 Strategies for Handling Large Data Sets
12
12 Strategies Ensure available memory > required memory Maximizing available memory on your system – System configuration Minimize required memory in MATLAB – During access, storage, processing, plotting
13
13 Two Tactics Not Focused on Today Use 64-bit – Removes one key memory constraint allowing processes to address many tera bytes of data. Distributed computing – N Machines ~= N x Memory – Subset of all applications (data parallel)
14
14 Maximizing Available Memory on Your System
15
15 What’s the biggest array in MATLAB under Windows XP? >>a=zeros(?,1); >>whos a) 600MB b) 1GB c) 1.5GB
16
16 Memory Constraints Causing Out of Memory Errors Datazeros(1e9,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryrequirement } RAM Page File } Total System Memory
17
17 Total System Memory Datarand(1e8,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryRequirement } RAM Page File } Total System Memory
18
18 Total System Memory Available Storage for all processes = – Physical RAM (fast and expensive) + – Page file on disk (cheap and slow) Memory Management Guide Tech note 1106Tech note 1106 Amount of RAM affects performance; not direct cause of “out of memory” errors
19
19 Viewing Total System Memory Available and Usage: Task Manager Alt-Ctrl-Del Right-click task bar Physical, commit charge Process Explorer (Google “process explorer”)
20
20 Maximizing Total System Memory Size – Ensure non-zero or system managed page file – Max 4 GB Performance – Add RAM – Max 4 GB
21
21 The MATLAB Process’s Virtual Memory Datarand(1e8,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryRequirement } RAM Page File } Total System Memory
22
22 It is Limited and OS-Dependent 32-bit platforms – Windows 2000 and XP (by default): 2 GB – Linux/UNIX/MAC system configurable: 3-4 GB – Windows XP with /3gb boot.ini switch: 3 GB 64-bit platforms – Linux: 8TB (not all 64 bits used)
23
23 Maximizing The MATLAB Process’s Virtual Memory Choose OS with largest process memory (in order): – 64-bit Linux, (future Win64) – 32-bit UNIX/Linux/MAC – Windows XP with /3gb – Window 2000, Windows XP (by default)
24
24 Checking The Virtual Memory Limit >>system_dependent memstats UNDOCUMENTED
25
25 Increasing Process Limit on XP to 3G: 3GB Switch Right-click Properties> Advanced > Startup and Recovery > Edit Make copy of [Operating system line], change comment and add /3gb Reboot, select new OS option, check memstats
26
26
27
27
28
28
29
29
30
30
31
31
33
33
34
34
35
35
36
36
37
37
38
38 The MATLAB Process’s Virtual Memory Limit with 3GB Switch >>system_dependent memstats UNDOCUMENTED
39
39 Workspace Size and Largest Block Datarand(1e8,1) New Data Requirements Contiguous Free Block } Other Fragments MATLABWorkspace Other ML Variables Workspace ML footprint, Win DLLs } MATLAB Process virtual memory Limit OS dependant e.g. 2GB in Win2k Other Apps MATLABProcessVirtualmemory All Applications memoryRequirement } RAM Page File } Total System Memory
40
40 Workspace Size and Largest Block Workspace Size = 2/3GB minus: – System DLLs – Java – MATLAB.exe and DLLs Largest block (for numerical arrays) – Fragmentation (Mainly on Windows) – Third party DLLs (e.g., Google Desktop, Fineprint) – Windows security updates
41
41 Finding Size of Workspace and Largest Block > >system_dependent memstats Workspace size – 1.7 or 2.7 GB Largest block – Goal 1.5 GB – Diagnose if less Atlantis UNDOCUMENTED
42
42 Diagnosing Memory (Workspace) Fragmentation >>system_dependent dumpmem Shows MATLAB memory map and DLLs Shows where largest blocks starts Reveals causing fragmentation UNDOCUMENTED
43
43 Example: Print Driver and New System DLLs Third party DLLs (e.g. Fineprint) Windows Security Updates and service pack DLLs – Use movedlls.exe – www.mathworks.com/support (Solution Number: 1-1HE4G5) www.mathworks.com/support
44
44 Maximizing Largest Contiguous Block Uninstall third-party tools if issue Use XP SP2 and movedlls fix utility Other techniques – Pack (save and load): Useful when using lots of variables after working for a while – Restart MATLAB
45
45 Final Memory Available on XP 32-bit Windows XP default – 1.7 GB total, 1.5 GB contiguous 32-bit Windows XP with /3gb switch – 2.7 GB Total, 1.5 GB contiguous Rough guide what is possible – Can process 100s MB arrays with simple operations – Can process 10s MB arrays with complex operations, many operations, lines of codes
46
46 What’s the biggest real array in MATLAB under 64-bit Linux? >>a=zeros(?,1); >>whos a) 2 GB b) 16 GB c) 8 TB
47
47 64-bit Linux MATLAB Workspace size limit: TB’s Single array number of elements size limit 2e9 elements (2^31-2), mxarray limitation
48
48 Summary: Maximizing Available Memory Maximize total system memory and performance – Use non-zero page file and have lots RAM Minimize the system memory requirements – Close other applications Maximize MATLAB process’s virtual memory – Use best OS or configure Maximize largest contiguous block – Diagnose fragmentation
49
49 Minimizing Required Memory in MATLAB
50
50 Minimizing Memory Requirements Data access Data storage Processing Plotting
51
51 Application Problem Description Semiconductor wafer thickness test data – Six months of production, millions of wafers Problem – What percentage of the wafers manufactured last month meet thickness specifications? Thickness data – Large text waferdata.csv – Nine position plus other information – Try import – View contents waferdata_start.csv
52
52 Data Access
53
53 Take Only What You Need Take only what you need for the calculation – Not usually problem with databases – Common problem with big flat files Consider block processing – Independent blocks – State saved (e.g., filtering) Tip: Clear variable first
54
54 For Text Files Use textscan Take rows and columns you need For example: data = textscan(fid, format, N, ‘delimiter’,’,’) – Select columns with format string – Select rows with N Returns cells for each data type in format string. You need to convert to doubles. Only read in nine columns and one month (1e6 rows) Exercise 5
55
55 Binary Files: Try memmapfile Map a section of a binary file into memory Benefits – Faster than fread and fwrite – Access files with MATLAB indexing operations Other – Random access to sections – Can have multiple views – Take from MATLAB memory Example – Simple homogenous file – Mixed data types and access as arrays
56
56 Memory Mapping Example % Default, map whole file as uint8 m=memmapfile('waferdata_uint8.bin') m.Data(1:20); % Specify format and name m=memmapfile('waferdata_uint8.bin','format',{ 'uint8' [20 100] 'x'},'repeat',20*1000) A=m.Data; % Change format on the fly m.format={'uint8' [1 4] 'headerbits';... 'uint8' [4 9] 'middle';... 'uint8' [7 1] 'tail'}; A=m.Data;
57
57 Data Storage
58
58 Use Smallest Data Type Depends on intended actions Complicated Math (e.g., Linear Algebra) – Doubles or singles, 8 or 4 bytes – For example: a=single(7) Simple arithmetic and original data is integers – Integers, 1-4 bytes, for example e.g. a=int8(7) – Can be faster than doubles – Try with waferdata (Exercise 6) Sparse – Just non-zero values and index stored >>a = sparse(2e9,1,pi); Exercise 6a cell execute
59
59 Use Smallest Data Type (cont.) Categories, dates – Cell arrays of strings, 60 byte header for each element – Use sparingly Contiguousness – Numeric arrays must be contiguous – Cell arrays and structures do not Comparison with C – For numerical processing, similar choice of data types
60
60 Load and Store as uint8 Read in uint8 – Note: when preallocating need to specify data types Exercise 6
61
61 Processing
62
62 Memory for Processing Calculate only the results you need MATLAB operators and functions need extra memory storage – Passing data to functions by value and assignments – Copy on write Makes MATLAB safer, easier to debug – More memory than in-place operations using pointers
63
63 Monitoring MATLAB Memory Usage Process Explorer or Task Manager MATLAB Monitoring Tool: www.MATLABcentral.com www.MATLABcentral.com MATLAB Central > File Exchange > Utilities > Development Environment > MATLAB Monitoring Tool Example: – x=rand(10e6,1); – y=x; y(2)=1.5; UNDOCUMENTED
64
64 Monitoring MATLAB Memory Usage (cont.) Total memory usage: workspaces, graphics >>system_dependent(‘CheckMallocMemoryUsage’); ans = 6820440 Show results in bytes (not MBytes) Starts with a few MB To use it, set an environmental variable – MATLAB_MEM_MGR debug – In DOS window, C:\ setx MATLAB_MEM_MGR debug UNDOCUMENTED
65
65 How much temporary memory does data=data+1 require, where data is a 1 MB array take in an M-File? Run M-file containing: data=zeros(1e6,1,’int8’); data=data+1; a) 0 MB b) 1 MB c) 2 MB
66
66 JIT Vs Interpreter for Large Data Set Handling Run loaddata Offset in the thicknesses data needs to be corrected with data = data+5 – At command line – In M-File Run Loaddata, then exercise 7a UNDOCUMENTED
67
67 Minimizing Copies and Temporaries Share data rather than pass to functions – Nested Functions – Global Reduce size of array to scalars or smaller blocks – Memory copies are equal to size of array – Process with for loops, de-vectorize Can be slower, so must trade off speed for less memory
68
68 What is the fastest way to process MATLAB matrices with for loops? a) Down the columns b) Along the rows c) Doesn't matter Exercise 7 UNDOCUMENTED
69
69 Example: De-Vectorizing 1D 2D Exercise 7
70
70 Find Percentage of Wafers Meeting Specification What percentage of the wafers meet specification – Must be < Maximum thickness of 200 – Must be > Minimum thickness of 70 exercise 8a, 8c
71
71 Plotting
72
72 How much extra memory do you need to plot a 10 MB double array? >>x=rand(125e4,1); >>plot(x); a) 10 MB b) 20 MB c) 40 MB Exercise 9a, process explorer on MATLAB task
73
73 Memory Copies When Plotting Need extra memory to plot – For example: >>plot(data(:,1)); plot(x,y) – Copy of x, y to xdata and ydata properties of line plot(y) – Copy of y in ydata and indices in xdata, 1:size(y) Results – Memory requirements triple at least (more temporarily) – Integers more (stored as doubles)
74
74 Minimize Copies: Plot Only What You Need Limited resolution of screen/human eye Sub-select, down-sample – Plot every Nth element – Plot min/max in each block
75
75 Minimize Memory Requirements: Review Accessing data – Take only what you need – Try processing in blocks Data Storage – Store in smallest data type Processing – Reduce temporaries with loops, blocking, nested, globals Plotting – Plot what you need
76
76 Summary: Strategies For Handling Large Data Sets Ensure available memory > required memory 64-bit and distributed computing 32-bit single CPU – Maximize available memory – Minimize required memory
77
77 Appendix
78
78 Appendix Contents Resources 1 MB is Not Equal to Million Bytes Setting the Paging File Size File Size vs. Data Size Minimizing Other Applications’ Memory
79
79 Resources Technical note 1106 – www.mathworks.com/support, enter 1106 “Large Data Set Handling in MATLAB 7” Digest Article November 2004: www.mathworks.com/company/newsletters/digest/nov0 4/newfeatures.html
80
80 1 MB is Not Equal to Million Bytes 1 KB= 2^10 bytes = 1024 bytes – Not 1e3 bytes 1 MB=2^20 bytes = 1048576 bytes – Not 1e6 bytes 1 GB=2^30 bytes = 1073741824 bytes – Not 1e9 bytes
81
81 Setting the Paging File Size My Computer > Properties > Advanced > Performance > Advanced > Virtual memory, Change then press “Set” Set to System Managed Size or Custom Slow down with page file, swapping
82
82 File Size vs. Data Size Binary files: File size = data size – Assuming you use same data type as in the binary files Text files: File size ~= data size, for example: – Data element: double= 8 bytes – Text file element, format 1: 3.4, 4 chars = 4 bytes – Text file element, format 2: 3.47896e+001, 13 chars= 13 bytes
83
83 Minimizing Other Applications’ Memory Shut them down Restart Computer, if can’t get < 300 MB Use windows msconfig to avoid starting up applications Not too important if you have a page file
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.