
New Features in MATLAB 7 for Handling Large Data Sets

By Stuart McGarrity, MathWorks


MATLAB 7 introduces a number of enhancements to support large data set handling. These include improvements to file access, data storage efficiency, and data processing speed, as well as support for new 64-bit platforms.

The material and examples in this article use features of products in MathWorks Release 14 with Service Pack 1.

Large Data Set Handling Issues

Solving technical computing problems that require processing and analyzing large amounts of data puts a high demand on your computer system. Large data sets take up significant memory during processing and can require many operations to compute a solution. It can also take a long time to access information from large data files.

Computer systems, however, have limited memory and finite CPU speed. Available resources vary by processor and operating system, the latter of which also consumes resources. For example:

  • 32-bit processors and operating systems can address up to 2^32 = 4,294,967,296 bytes (4 GB) of memory, also known as the virtual address space.
  • Windows XP and Windows 2000 allocate only 2 GB of this virtual memory to each process (such as MATLAB). On UNIX, the virtual memory allocated to a process is system-configurable and is typically around 3 GB.
  • The application carrying out the calculation, such as MATLAB, can require storage in addition to the user task.

The main problem when handling large amounts of data is that the memory requirements of the program can exceed the memory available on the platform. For example, MATLAB generates an “out of memory” error when data requirements exceed approximately 1.7 GB on Windows XP.
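As a rough check of the figures above, the limits can be worked out at the MATLAB prompt. This is only a sketch: the variable names are illustrative, and the 1.7 GB value is the approximate Windows XP limit quoted above, taken here as decimal gigabytes.

```matlab
% Address-space arithmetic for a 32-bit system (illustrative names)
addressSpaceBytes = 2^32;                    % total virtual address space
addressSpaceGB    = addressSpaceBytes/2^30   % 4 GB

% Approximate number of doubles that fit below the ~1.7 GB limit
bytesPerDouble = 8;
maxDoubles = floor(1.7e9/bytesPerDouble)
```

In practice the usable figure is lower, because MATLAB itself and any intermediate results share the same address space.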

The following sections describe a number of enhancements in MATLAB 7 that help address large data set handling, including increased available memory, improved file access, more efficient data storage, and increased processing performance.

Maximizing Available Memory

New 64-bit Platforms

A 64-bit version of MATLAB is now available for Linux platforms based on AMD64 and Intel EM64T processors. A 64-bit processor provides a very large address space, up to 2^64 = 18,446,744,073,709,551,616 bytes (16 exabytes), enabling you to store a very large amount of information. For example, the Google search engine currently uses 2 petabytes of disk space. With 16 exabytes, you could fit roughly 9,000 Googles into memory.

Platforms with 64-bit architecture solve the problem of memory limitation for handling today's large data sets, but do not address other issues such as execution and file I/O speed.
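The 64-bit figures can be reproduced with a few lines of arithmetic. The sketch below assumes 1 petabyte = 10^15 bytes for the disk-space comparison; with binary petabytes the ratio would be 8,192 instead.

```matlab
% Size of a 64-bit address space (2^64 is exactly representable as a double)
bytes64  = 2^64;
exabytes = bytes64/2^60            % 16 (binary exabytes)

% Ratio to 2 petabytes, taking 1 PB = 10^15 bytes (an assumption)
googles = bytes64/(2*1e15)         % roughly 9,200
```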

Note: In MATLAB on 64-bit platforms, the size of a single matrix is currently limited to 2^32 elements, such as a square matrix of approximately 65,000 x 65,000. Stored as doubles, a matrix of this size consumes 32 GB (2^32 elements x 8 bytes).

Memory Enhancements for Windows XP

MATLAB 7 increases the largest contiguous block of memory under Windows XP to approximately 1.5 GB, equivalent to 180 million double-precision values.

Also, on Windows XP, MATLAB now supports the 3-GB switch boot option, which allocates an additional 1 GB of addressable memory to each process. This increases the total amount of data you can store in the MATLAB workspace to approximately 2.7 GB, equivalent to 330 million double-precision values. This additional block of memory is not contiguous with the rest of the memory MATLAB uses, so you cannot create a single array to fill this space.

Viewing Available Memory

To see what memory is available in MATLAB 7 on a Windows system, use the following command.

 feature('memstats')

The example below shows the results for a 1.2-GB RAM Windows XP system with the 3-GB switch set. You can see two large memory blocks of more than 1 GB each with a total of 2.7 GB available.

Physical Memory (RAM):
In Use:             340 MB    (1549f000)
Free:               938 MB    (3aa4d000)
Total:              1278 MB   (4feec000)
Page File (Swap space):
In Use:             236 MB    (0ec78000)
Free:               986 MB    (3dad9000)
Total:              1223 MB   (4c751000)
Virtual Memory (Address Space):
In Use:             296 MB    (1283d000)
Free:               2775 MB   (ad7a3000)
Total:              3071 MB   (bffe0000)
Largest Contiguous Free Blocks:
1. [at 10007000]    1546 MB   (60a69000)
2. [at 7ffe1000]    1023 MB   (3ffbf000)
3. [at 7c41b000]    28 MB     (01c75000)
4. [at 74764000]    28 MB     (01c2c000)
5....
                    =======   ==========
                    2734 MB   (aae1a000)
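The hexadecimal values in parentheses are byte counts, so the listing can be cross-checked with hex2dec. The following sketch uses the two largest block sizes copied from the output above; the variable names are illustrative.

```matlab
% Byte counts copied from the 'Largest Contiguous Free Blocks' listing
largestBlockBytes = hex2dec('60a69000');
secondBlockBytes  = hex2dec('3ffbf000');

% Convert to megabytes (1 MB = 2^20 bytes), truncating as the listing does
largestBlockMB = floor(largestBlockBytes/2^20)   % 1546
secondBlockMB  = floor(secondBlockBytes/2^20)    % 1023

% Upper bound on the number of doubles a single array could hold
% if it filled the largest contiguous block completely
maxDoublesInBlock = floor(largestBlockBytes/8)
```

This is an upper bound only; a single array cannot span two blocks, and any temporary copies made during a calculation need contiguous space of their own.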

You must install sufficient physical memory (RAM) on the computer to cover your data storage needs. Doing so minimizes paging the data to disk, which can substantially degrade performance.

For more information on maximizing the available memory in MATLAB see the Memory Management Guide.

Data Access

Text File Reading

The new textscan function enables you to access very large text files that have arbitrary format. This function is similar to textread but adds the ability to specify a file identifier so that a file pointer can be tracked and traversed through the file. The file can therefore be read a block at a time, changing the format on each occasion.

For example, suppose we have a text file, test12_80211b.txt, which contains multiple different-sized blocks of data, each with the following format:

  • Two header lines of description
  • A parameter m
  • A p x m table of data

Here is how test12_80211b.txt looks:

*       Mobile1
*       SNR Vs test No
Num tests=19
,-5.00E+00,-4.00E+00,-3.00E+00,-2.00E+00,...
1.00E+00,6.19E-07,8.63E-07,6.43E-07,1.84E-07,...
2.00E+00,2.88E-07,4.71E-07,6.92E-07,1.43E-07,...
3.00E+00,2.52E-07,8.11E-07,4.74E-07,8.48E-07,...
4.00E+00,...
...
 
*       Mobile2
*       SNR Vs test No
Num tests=20
,-5.00E+00,-4.00E+00,-3.00E+00,-2.00E+00,-1.00E+00,0.00E+00,...
1.00E+00,6.19E-07,8.63E-07,6.43E-07,1.84E-07,6.86E-07,3.73E-,...
2.00E+00,...

You could use the following MATLAB commands to read it in:

fid = fopen('test12_80211b.txt', 'r'); % Open text file
InputText = textscan(fid, '%s', 2, 'delimiter', '\n'); % Read header lines
HeaderLines = InputText{1}

HeaderLines = 
         '* Mobile1'
         '* SNR Vs test No'

InputText = textscan(fid, 'Num tests=%f'); % Read parameter value
NumCols = InputText{1}

NumCols = 
         19

InputText = textscan(fid, '%f', 'delimiter', ','); % Read data block
Data = reshape(InputText{1},[],NumCols)';
format short g
Section = Data(1:5,1:5)

Section = 
       NaN    -5             -4             -3             -2
       1      6.19e-007      8.63e-007      6.43e-007      1.84e-007
       2      2.88e-007      4.71e-007      6.92e-007      1.43e-007
       3      2.52e-007      8.11e-007      4.74e-007      8.48e-007
       4      1.97e-007      1.64e-007      1.38e-007      6.17e-007
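To try the same staged-reading pattern without the original data file, the sketch below writes a small stand-in file to a temporary location and reads it back with three textscan calls, each using a different format. The file contents and variable names are hypothetical; only the calling pattern follows the example above.

```matlab
% Write a small stand-in file to a temporary location (hypothetical data)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '* Demo block\nNum tests=2\n1,2,3\n4,5,6\n');
fclose(fid);

% Read it back in stages, changing the format at each step
fid = fopen(fname, 'r');
header = textscan(fid, '%s', 1, 'delimiter', '\n');  % one header line
param  = textscan(fid, 'Num tests=%f');              % parameter value
values = textscan(fid, '%f', 'delimiter', ',');      % numeric block
fclose(fid);                                         % release the file
delete(fname);                                       % remove the temporary file

% Rearrange the column of values into one row per line of the file
numRows = param{1};
data = reshape(values{1}, [], numRows)'
```

Because the file identifier keeps its position between calls, each textscan call continues where the previous one stopped. Closing the file with fclose when reading is finished releases the file handle.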

In this release of MATLAB, reading comma-separated value (CSV) files is also an order of magnitude faster.

MAT File Compression

The save command in MATLAB 7 now compresses the data before writing the MAT file to disk. This results in smaller files for compressible (non-random) data sets and faster reading for very large data files over a network.
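One way to see the effect is to save the same variable in the default format and in the older uncompressed format and compare file sizes. This sketch assumes the default save format has not been changed in the preferences; the -v6 option writes a MATLAB 6 compatible file without compression. The file and variable names are illustrative.

```matlab
x = zeros(1000, 1000);                 % highly compressible data (8 MB as doubles)

compressedFile   = [tempname '.mat'];
uncompressedFile = [tempname '.mat'];

save(compressedFile, 'x');             % default MATLAB 7 format (compressed)
save(uncompressedFile, 'x', '-v6');    % MATLAB 6 format (no compression)

% Compare the sizes of the two files on disk
c = dir(compressedFile);
u = dir(uncompressedFile);
fprintf('compressed: %d bytes, uncompressed: %d bytes\n', c.bytes, u.bytes);

delete(compressedFile);
delete(uncompressedFile);
```

An all-zero matrix is a best case; random data compresses very little, so the saving depends heavily on the data set.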

Data Storage Efficiency

MATLAB 7 now provides integer and single-precision math. This new capability enables processing of integer and single-precision data in its native type, resulting in more efficient memory usage and the ability to process larger, nondouble data sets.

For example, you can process up to 8 times as many 8-bit integer values when they are stored natively than if they are cast and stored as doubles. So on Windows XP (without the 3-GB switch), you could read in a file of 8-bit integer values of up to 1.5 GB in size, compared to the previous limit of 180 MB when you were required to store the data as doubles. (This is a theoretical maximum, and there would be no space available to save the answers to any operations.) See the July 2004 MATLAB Digest article “Integer and Single-Precision Math in MATLAB 7” for more information.
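The factor of 8 can be confirmed with whos, which reports the number of bytes a variable occupies. The sketch below uses illustrative variable names and a small array so that it runs quickly.

```matlab
n = 1e6;
asDouble = zeros(n, 1);              % 8 bytes per element
asUint8  = zeros(n, 1, 'uint8');     % 1 byte per element

d = whos('asDouble');
u = whos('asUint8');
ratio = d.bytes/u.bytes              % 8

% Arithmetic stays in the native type; no conversion to double is needed
scaled = asUint8 + uint8(5);
resultClass = class(scaled)          % 'uint8'
```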

Data Processing Performance

Improved Execution Speed

MATLAB 7 introduces a number of processing speed enhancements for faster execution of large data set problems. These include optimized Basic Linear Algebra Subprograms (BLAS) libraries provided by the vendors of processors used in most of the platforms MATLAB supports, including the Intel® Math Kernel Library (MKL) and the BLAS library available through the Accelerate framework on the Macintosh. Also, the latest version of the FFTW (3.0) routines is used for maximum speed execution of FFT tasks.

The JIT Accelerator now covers all numeric data types, such as complex variables, and function calls (when called from a function), increasing the speed of more of the MATLAB language. It also generates MMX instructions for optimized execution of integer operations. In the case of 8-bit integers, this results in execution that is up to 8 times faster than the same operations on doubles.
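A simple way to compare integer and double-precision execution on your own system is to time the same operation on both types with tic and toc. This is a rough sketch with illustrative names, not a benchmark; timings vary with hardware, MATLAB version, and array size, so no particular speedup should be expected.

```matlab
n = 5e6;
a8 = uint8(floor(100*rand(n, 1)));   % values 0..99 stored as 8-bit integers
b8 = uint8(floor(100*rand(n, 1)));
aDbl = double(a8);                   % the same values stored as doubles
bDbl = double(b8);

tic; c8   = a8 + b8;     tInt = toc; % native 8-bit addition
tic; cDbl = aDbl + bDbl; tDbl = toc; % double-precision addition

fprintf('uint8: %.4f s, double: %.4f s\n', tInt, tDbl);

% The sums never exceed 198, so the uint8 result does not saturate
% and both computations give the same values
sameResult = isequal(double(c8), cDbl)
```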

New Large Data Set Handling Features

Other new features that support the processing of large data sets include:

  • The ability to view larger numerical arrays in the array editor (up to 500,000 elements) during interactive data analysis
  • Nested functions, which allow the inner function to see the workspace of the parent. This feature lets you share large data sets between functions, such as in a GUI, without having to use global variables or pass the data by value as function parameters. In the example below, the nested function process can see the variables, such as street1, in the workspace of the parent function percentNonzero.

function y = percentNonzero(filename, scalevalue, thresholdvalue)
%PERCENTNONZERO Calculate the percentage of non-zero elements
% P = PERCENTNONZERO('FILENAME', SCALEVALUE, THRESHOLDVALUE)
% returns the percentage of non-zero elements in an image read
% from the file FILENAME, scaled by the value SCALEVALUE and
% thresholded at a value of THRESHOLDVALUE.
%
% Example:
% p=percentNonzero('street1.jpg',1.5,140);

street1 = imread(filename); % Read image from file
process(scalevalue, thresholdvalue); % Scale and threshold image

% Find percentage of non-zero elements
y = 100 * sum(street1(:))/numel(street1);

    function process(scaleval, threshval)
        % Scale image (street1 belongs to the parent workspace)
        street1 = street1 * scaleval;

        % Threshold image to create logical array
        street1 = street1 > threshval;
    end
end

  • The new M-Lint Code Checker reports unused variables that you can remove to minimize your code's memory usage. For more about M-Lint, see the article “Clean Up Your Code!” in the December 2004 issue of MATLAB News and Notes.
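The nested-function feature listed above can also be tried without an image file. The following minimal sketch uses a hypothetical function, countAbove, which is not part of MATLAB; save it as countAbove.m. The nested function tally takes no arguments and works directly on the variables of its parent.

```matlab
function n = countAbove(data, threshold)
%COUNTABOVE Count the elements of DATA greater than THRESHOLD.
%   Hypothetical example; not part of MATLAB.
n = 0;
tally();                 % nested call; no arguments are passed

    function tally()
        % data, threshold, and n all belong to the parent workspace
        n = sum(data(:) > threshold);
    end
end
```

For example, countAbove(magic(4), 8) returns 8, because magic(4) contains the integers 1 through 16 and eight of them are greater than 8.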

Summary

A collection of new tools and capabilities in MATLAB 7 enables you to handle larger data sets, letting you take on larger and more complex engineering and science problems and solve them in less time.

Published 2004
