textscan with very large .dat files: MATLAB keeps crashing

Hey everyone,
I am using R2013b to read in 35 very large files, but I am running into memory problems and MATLAB usually crashes before the files finish loading. I am using textscan, but I was hoping someone could help me edit the code so that it loads the data in a block at a time, or at least makes it less memory-intensive. I need all the years in one large cell array.
Any ideas?
Many thanks!
tic;
HWFiles = {'midas_wind_197901-197912.txt', 'midas_wind_198001-198012.txt', ..........(up to 2013)};
HWData = cell(1, numel(HWFiles));
for i = 1:numel(HWFiles)
    fid = fopen(HWFiles{i}, 'r');
    tmp = textscan(fid, '%s %*s %*f %*f %s %*f %f %*f %f %f %f %f %f %*f %*f %*f %*f %*f %*s %*f %*f %*s %*f %*f', 'Delimiter', ',');
    HWData{i} = tmp;
    fclose(fid);
end
toc;

Accepted Answer

Walter Roberson
Walter Roberson on 24 Apr 2014

0 votes

Run through the files, reading them with textscan(), but instead of storing the results all in memory, use the matFile class to append the new data to the end of a variable in a .mat file.
Once that is done, you can start a new MATLAB session and load() the .mat file to get the combined cell array.
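In code, that pattern could look roughly like this (a sketch, not tested against the actual files; the file name, `fmt`, and `HWFiles` are placeholders for the names in the question, and note that matfile allows () indexing into a cell array but not {} indexing):

```matlab
% Sketch: append each file's textscan result to a cell array held in a
% v7.3 MAT-file, so only one file's data is in RAM at a time.
m = matfile('HWData_all.mat', 'Writable', true);
m.HWData = cell(1, numel(HWFiles));       % create the cell array on disk
for k = 1:numel(HWFiles)
    fid = fopen(HWFiles{k}, 'r');
    tmp = textscan(fid, fmt, 'Delimiter', ',');  % fmt = format string above
    fclose(fid);
    m.HWData(1, k) = {tmp};               % write one cell; the rest stays on disk
end
```

In a fresh session, `S = load('HWData_all.mat')` then retrieves the combined cell array as `S.HWData`.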

1 Comment

mashtine
mashtine on 24 Apr 2014
Hey Walter, that definitely sounds like it will do the trick. I am unfamiliar with the code for appending with matfile, though (just the basics of it). How would that look, roughly, in a loop?
Thanks!

Sign in to comment.

More Answers (3)

per isakson
per isakson on 24 Apr 2014
Edited: per isakson on 24 Apr 2014

1 vote

  • "load in the data a block at a time": is a block a part of a file?
  • Converting from double to single halves the memory use.
  • Your format specifier shows that you already skip many columns ("%*f").
  • "very large files": do these files contain hourly weather data? How large are they?
With textscan you can read N lines at a time
C = textscan( fileID, formatSpec, N )
But that will probably not help.
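For reference, such a block-reading loop would look something like this (a sketch; `fmt` stands for the format string from the question, the file name is one of the files listed there, and N is just a tuning parameter):

```matlab
% Sketch: read one file in blocks of N lines, processing each block
% before reading the next, so the whole file never sits in memory at once.
N = 100000;                                % lines per block; tune to RAM
fid = fopen('midas_wind_197901-197912.txt', 'r');
while ~feof(fid)
    C = textscan(fid, fmt, N, 'Delimiter', ',');
    % ... process or accumulate block C here ...
end
fclose(fid);
```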
I imagine there are many possibilities to decrease the requirement for memory. However,
  • what data do you need to have simultaneously in memory to do the calculations?
  • could the files be downloaded from the net? Or could you attach a file to the question?

8 Comments

mashtine
mashtine on 24 Apr 2014
Hey,
Yes, it is hourly data, and unfortunately I need all the years together so that I can then extract the individual station data. The files have about 1.2 million rows for each year.
per isakson
per isakson on 24 Apr 2014
Edited: per isakson on 24 Apr 2014
Are you doing a new BEST project? :-)
1.2e6 rows x 35 years x 6*8 bytes = 2 GB
That's not that bad. (I picked "6" from your format string; "35" is from 1979 to now.) And 1 GB with single. Am I right?
I assume you use 64-bit MATLAB.
I would use an HDF5 file rather than the matfile class, because reading is faster. But I'm not convinced it is needed.
mashtine
mashtine on 25 Apr 2014
Hey,
No, haha, no BEST project, just research. But it's 1.2e6 rows x 8 columns, and then for 35 years, so the saved file is actually 30 GB. Are you saying to load with textscan as single? And how would I go about using matfile? I'm not too sure on that one.
Thanks for the help, by the way!
per isakson
per isakson on 25 Apr 2014
Edited: per isakson on 25 Apr 2014
So I'm a factor of ten off, but I cannot see what's wrong with
>> ( 1.2e6 * 8 * 35 * 8 )/1e9
ans =
2.6880
This is the size of the data when stored in RAM or in a binary file. (A little must be added for overhead and timestamps.)
  • "the saved file": which file do you refer to?
  • "1.2e6 rows": is that one year of hourly data from 1.2e6/8760 ≈ 137 met stations?
It is definitely a good idea to read the ASCII files once and store the data in an appropriate type of binary file. Alternatives:
  • save and load with an ordinary MAT-file (version '-v6', '-v7', or '-v7.3'); v6 is fastest and not compressed
  • matfile with MAT-file version v7.3
  • h5write and h5read with an HDF5 file
  • memmapfile with plain binary files
If you want to read full years of data, I think the first alternative is good enough. The other three allow indexed reading, but how valuable is that?
Exactly how you want to store the data depends on how you will use it.
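As an illustration of the HDF5 alternative with the high-level functions, a sketch (the file name, dataset name, and the random stand-in data are all illustrative):

```matlab
% Sketch: one dataset per year, stored as single to halve the footprint.
oneYear = single(rand(1.2e6, 8));          % stand-in for one year's 8 columns
h5create('wind.h5', '/y1979', size(oneYear), 'Datatype', 'single');
h5write('wind.h5', '/y1979', oneYear);
% Indexed reading: pull only January (first 744 hourly rows), all 8 columns.
jan = h5read('wind.h5', '/y1979', [1 1], [744 8]);
```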
José-Luis
José-Luis on 25 Apr 2014
Or, if you're going to be consistently working with large amounts of data, save yourself some future frustration and use a database program. While almost everything is doable in MATLAB, maybe it is not a good idea to do everything with it.
per isakson
per isakson on 25 Apr 2014
Edited: per isakson on 25 Apr 2014
That depends on
  • how you want to use the data and
  • what you mean by database program
I'm positive that an HDF5 file is "better" than an SQL database for storage and retrieval of this type of time series, e.g. weather data. The typical queries are very simple and return full time series or large parts of them.
An HDF5 file is close to what is sometimes called a "tagged" database in the process industry.
IMO, the high-level support for HDF5 is good in the current MATLAB release.
José-Luis
José-Luis on 25 Apr 2014
Edited: José-Luis on 25 Apr 2014
Sure, HDF is fine, as long as you don't need relational capabilities in your database. That seems to be the case for the OP.
By "database program" I meant whatever is designed to handle large amounts of data. I have no idea how good the bindings between MATLAB and HDF are, since I have never tried them. Only NetCDF, a long time ago; I was not impressed.
What I meant by my comment is that you really shouldn't use MATLAB itself to store and handle large amounts of data. It will be slow and your computer will choke really fast.
per isakson
per isakson on 25 Apr 2014
Edited: per isakson on 27 Apr 2014
"designed to handle large amounts of data": HDF5 complies with that definition. HDF5 is a technology suite that makes possible the management of extremely large and complex data collections.
To discuss data storage, we need a better description of the use case than the one provided by the OP.
I did an evaluation regarding storing time series from building automation systems and settled on HDF5, and I'm happy with that choice.

Sign in to comment.

Justin
Justin on 24 Apr 2014
Edited: Justin on 24 Apr 2014

0 votes

One thing that might help is increasing the Java heap memory: go to the Home tab > Preferences > General > Java Heap Memory.
The default is 128 MB; try something conservative first, such as 256 MB. If you set it too high, you will have to edit some configuration files manually before MATLAB can start again, so increase it in small steps.
Another option is doing your analysis on each file separately, or pulling out only the needed data from each file one at a time, so the entire contents of all the files do not need to remain in memory.
EDIT:
For different file-reading options you could also use readtable, or, if you have the Statistics Toolbox, dataset.
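A minimal readtable sketch, assuming comma-delimited files with no header line; the kept column indices here are read off the non-skipped fields in the question's format string:

```matlab
% Sketch: read one file as a table, then keep only the columns of
% interest (positions of %s and %f, i.e. the non-%* fields, above).
T = readtable('midas_wind_197901-197912.txt', ...
              'Delimiter', ',', 'ReadVariableNames', false);
T = T(:, [1 5 7 9 10 11 12 13]);
```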
Jeremy Hughes
Jeremy Hughes on 13 Mar 2017
Edited: Jeremy Hughes on 13 Mar 2017

0 votes

Hi, if you can access R2014b or later, I'd recommend using datastore to manage your import. It automatically breaks the files up into blocks and manages multiple files.
ds = datastore(folder);
% Select the variables you want to import, e.g.:
% ds.SelectedVariableNames = ds.VariableNames([1 3 5]);
ds.SelectedVariableNames = ...;
while hasdata(ds)
    t = read(ds);   % returns a table with the data for the current block
    % do stuff
end
This should do what you need. https://www.mathworks.com/help/matlab/datastore.html


Asked: 24 Apr 2014
Edited: 13 Mar 2017
