Store different data types efficiently

Moritz Scherrmann on 29 Oct 2020
Edited: Moritz Scherrmann on 20 Jan 2021
Hi all,
I have a dataset of company announcements. After parsing, I end up with several variables for every announcement, such as the body text (string array), the company name (string), the announcement date (double), etc. Until now, I have stored these variables for all documents in one large struct array (I have 55000 documents, resulting in a huge struct array). Unfortunately, it takes very long to load this struct into the workspace. Additionally, Matlab gets very slow once it is loaded. Do you have a recommendation for how to solve this problem?
I would be very grateful for every hint.
Thank you!
  6 Comments
Mario Malic on 29 Oct 2020
Edited: Mario Malic on 29 Oct 2020
You can also take a look at the datastore function, which deals with large collections of data. This might be more efficient, especially when you only need some files in memory. Unfortunately, I can't give specific hints as I haven't worked with it yet.
Reading 55000 documents is probably what takes most of the time. Parallelisation would speed things up, if applicable.
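As a rough illustration of the datastore idea, here is a minimal sketch that reads one announcement file at a time with fileDatastore. The folder layout, the file names and the use of plain text files are assumptions made for the example, not something taken from the question.

```matlab
% Sketch: read one announcement file at a time with fileDatastore.
% The folder and the file names (ann001.txt, ...) are made up.
folder = tempname;
mkdir(folder);
for k = 1:3
    fid = fopen(fullfile(folder, sprintf('ann%03d.txt', k)), 'w');
    fprintf(fid, 'Announcement %d body text', k);
    fclose(fid);
end

% fileread returns the whole file as one character vector.
fds = fileDatastore(folder, 'ReadFcn', @fileread, 'FileExtensions', '.txt');

nRead = 0;
while hasdata(fds)
    txt = read(fds);      % only this one file is read on this pass
    nRead = nRead + 1;
    % ... parse or process txt here ...
end
```

Whether this is faster than one large MAT-file depends on how many announcements are needed per session, so it is worth timing on the real data.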
J. Alex Lee on 29 Oct 2020
I also vote for sqlite if you need fast/random access into the data and don't mind needing to do a slow one-time import plus update insertions every time you get a new announcement file.
I've played a bit with datastore, but I find it's not really for performance (slow). And if your data is mostly text and you don't need to do aggregation computations or otherwise operate on large virtual arrays/tables in a native matlab-like language, it's not really clear to me (out of my own ignorance) that datastore will be any more useful than just creating an sqlite database.
I've tried to study how a non-mat HDF file could help me in my own application; even if you did achieve some kind of better control over the data saving, I have not been able to figure out how to do a random access of only chunks of the HDF file.


Accepted Answer

per isakson on 31 Oct 2020
Edited: per isakson on 2 Nov 2020
Three solutions are mentioned in the comments to your question. They all have their pros and cons. I assume you have a collection of m-functions that operate on the structure. I'll try to summarize my view.
I use R2018b, Win10, Intel i7, 32GB ram and data on a spinning HD.
===================================================================================
Matlab structure
Splitting the structure into one structure per year might help. If nothing else, it avoids the 2 GB per-variable limit of the MAT-file formats older than v7.3.
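A minimal sketch of the per-year split could look like this. The struct array docs, its field names (company, date, body), the assumption that the date is a datenum, and the file name pattern docs_YYYY.mat are all made up for the example.

```matlab
% Sketch: save one MAT-file per year instead of one large file.
% Field names and the datenum date format are assumptions.
docs = struct( ...
    'company', {"Acme AG", "Beta SE", "Acme AG"}, ...
    'date',    {datenum(2019,5,2), datenum(2020,1,15), datenum(2020,7,30)}, ...
    'body',    {["line 1"; "line 2"], "single line", ["a"; "b"; "c"]});

% Year of every announcement (row vector, one entry per struct element).
yr = year(datetime([docs.date], 'ConvertFrom', 'datenum'));

outDir = tempname;
mkdir(outDir);
years = unique(yr);
for y = years
    part = docs(yr == y);                                  % announcements of year y
    save(fullfile(outDir, sprintf('docs_%d.mat', y)), 'part', '-v7');
end

% Later: load only the year that is actually needed.
S = load(fullfile(outDir, 'docs_2020.mat'));
docs2020 = S.part;
```

The existing m-functions keep working on docs2020 because it is still a struct array with the same fields.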
I made a little experiment on R2018b and might have to revise my old rule of thumb. v7.3 compares better than I anticipated.
%%
save( 'v60.mat' , 'nasa', '-v6' )
save( 'v70.mat' , 'nasa', '-v7' )
save( 'v73n.mat', 'nasa', '-v7.3', '-nocompression' )
save( 'v73c.mat', 'nasa', '-v7.3' )
%%
tic, S60 = load( 'v60.mat' ); toc
tic, S70 = load( 'v70.mat' ); toc
tic, S73n = load( 'v73n.mat' ); toc
tic, S73c = load( 'v73c.mat' ); toc
outputs
Elapsed time is 0.135304 seconds.
Elapsed time is 0.826981 seconds.
Elapsed time is 0.346619 seconds.
Elapsed time is 0.912298 seconds.
The loaded structs are all 0.236 GB in size. The struct, nasa, was created in an experiment with HDF5 (see below). v6 loads the fastest and v7.3 without compression comes second. The differences are significant. The struct, nasa, contains mainly numerical data. The results with your struct, which is mainly text, might differ. Caveat: it's tricky to measure the time needed to load files (especially smaller files), because on Windows one has no control over the content of the file cache.
Pros
  • You don't need to adapt your m-functions.
  • No new tools to learn how to use
  • No extra cost in time or money
Cons
  • there is still a loading time
Added later
I repeated the test with a structure, S, which better represents OP's structure.
%% Create a struct that is dominated by text. The number of "announcements" and
% the total size agree with OP's structure.
chr = ['0':'9',' ','a':'z',' ','A':'Z'];
txt = @() chr(randi([1,numel(chr)],1,1.6e4));
%%
for jj = 1:55000
    S.(sprintf('A%05d',jj)) = txt();
    S.(sprintf('N%05d',jj)) = randi(1000,1,200);
end
%%
save( 'S60.mat' , 'S', '-v6' )
save( 'S70.mat' , 'S', '-v7' )
save( 'S73n.mat', 'S', '-v7.3', '-nocompression' )
save( 'S73c.mat', 'S', '-v7.3' )
%%
tic, S60 = load( 'S60.mat' ); toc
tic, S70 = load( 'S70.mat' ); toc
tic, S73n = load( 'S73n.mat' ); toc
tic, S73c = load( 'S73c.mat' ); toc
%
%% After having tested the code several times
% Elapsed time is 8.652943 seconds.
% Elapsed time is 7.342954 seconds.
% Elapsed time is 7.094169 seconds.
% Elapsed time is 15.339989 seconds.
% whos S*
%   Name      Size             Bytes  Class     Attributes
%
%   S         1x1         1867360000  struct
%   S60       1x1         1867360176  struct
%   S70       1x1         1867360176  struct
%   S73c      1x1         1867360176  struct
%   S73n      1x1         1867360176  struct
%
%% Three runs directly after a restart of Matlab
% Elapsed time is 16.041436 seconds.
% Elapsed time is 7.532245 seconds.
% Elapsed time is 45.075487 seconds.
% Elapsed time is 37.160575 seconds.
% clearvars
% Elapsed time is 10.310060 seconds.
% Elapsed time is 10.681408 seconds.
% Elapsed time is 10.978731 seconds.
% Elapsed time is 19.676905 seconds.
% clearvars
% Elapsed time is 9.340518 seconds.
% Elapsed time is 8.342259 seconds.
% Elapsed time is 8.600243 seconds.
% Elapsed time is 17.223713 seconds.
I don't understand why the result of the first run after restart of Matlab differs so much from the rest. However, it looks like v7.0 is the best alternative for a 2GB structure dominated by text.
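One way to keep the first run apart from the later ones is to time several consecutive loads in a loop. The following is only a sketch of the bookkeeping. The small matrix x stands in for the real struct, and since the file has just been written here, the first load in this sketch is not a true cold-cache run. For that, the loop would have to run right after a restart.

```matlab
% Sketch: time several consecutive loads of the same file and report the
% first run separately from the later ones.
fname = [tempname '.mat'];
x = rand(1000);                 % small stand-in for the real struct
save(fname, 'x', '-v7');

nRuns = 5;
t = zeros(1, nRuns);
for k = 1:nRuns
    tStart = tic;
    tmp = load(fname);
    t(k) = toc(tStart);
end
fprintf('first run:  %.3f s\n', t(1));
fprintf('later runs: median %.3f s, min %.3f s\n', median(t(2:end)), min(t(2:end)));
```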
===================================================================================
HDF5
Caveat: I'm biased. I've made a Matlab class (an HDF-file wrapper), which writes and reads experimental time series to/from an HDF5-file. Previously, we loaded selected data into a structure and visualized the data with an interactive system. With the new class we skip the structure altogether and the visualisation tool gets data directly from the HDF-file. The HDF-file replaces the struct in our code. Admittedly, this new application is still in an experimental stage. To gain speed and avoid problems with varying length strings, I store text as uint8 and convert back and forth.
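To make the "store text as uint8 and convert back and forth" step concrete, here is a minimal sketch. It is not the wrapper class mentioned above. The file name and the dataset path /A00001 are made up, and it uses UTF-8 encoding via unicode2native instead of a plain uint8() cast, which is my own substitution so that characters outside the 0 to 255 range would also survive.

```matlab
% Sketch: store text as UTF-8 bytes (uint8) in an HDF5 dataset and read it back.
% File name and dataset path are hypothetical.
h5file = [tempname '.h5'];
txt    = 'Acme AG announces quarterly results.';

bytes = unicode2native(txt, 'UTF-8');              % 1xN uint8
h5create(h5file, '/A00001', size(bytes), 'Datatype', 'uint8');
h5write(h5file, '/A00001', bytes);

raw     = h5read(h5file, '/A00001');               % comes back as uint8
txtBack = native2unicode(reshape(raw, 1, []), 'UTF-8');
```

Because each dataset has a fixed numeric type and size, the varying lengths of the texts are handled by giving every announcement its own dataset.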
I found two tools in the File Exchange, which I tried.
  • HDF2Struct by Luca Amerio. In the first comment apple reported that the tool failed with a Nasa-file. I fixed the issue and read the file to a struct, which I call nasa.
  • EasyH5 by Qianqian Fang. This tool both reads and writes HDF5-files. Directly out of the box I loaded a 60MB "Nasa-file" and saved the result to a new h5-file. Both operations worked at first attempt. I inspected the result with HDFView 3.0.
Porting your system to HDF5 might include the following steps
  • Writing your struct to an HDF-file with EasyH5. I believe that will work without problems. The attribute feature of HDF5 will not be used.
  • Make some experiments to decide whether it's possible to reach your carefully formulated goals. Thanks to EasyH5 that should be possible to do in a limited amount of time.
  • Replace the struct in your m-code by the HDF-file. In the old days we referenced and assigned structs with the two functions getfield() and setfield(). The HDF read and write are very similar to these two Matlab functions. I think it's doable, but it might take more time than I hint here.
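The similarity between getfield()/setfield() and the HDF5 read/write functions can be sketched as follows. The field name N00001 and the file name are made up, and the sketch uses only the built-in h5create, h5write and h5read rather than EasyH5.

```matlab
% Sketch: the same read and update steps on a struct and on an HDF5 file.
S = struct('N00001', [1 2 3 4]);
h5file = [tempname '.h5'];
h5create(h5file, '/N00001', size(S.N00001));   % default datatype is double
h5write(h5file, '/N00001', S.N00001);

% Read: getfield on the struct, h5read on the file.
a = getfield(S, 'N00001');
b = h5read(h5file, '/N00001');

% Update: setfield on the struct, h5write on the file.
% The dataset must already exist and the new data must match its size.
S = setfield(S, 'N00001', [5 6 7 8]);
h5write(h5file, '/N00001', [5 6 7 8]);
c = h5read(h5file, '/N00001');
```

The main difference is that a new "field" in the file needs an h5create call first, whereas setfield creates the field on the fly.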
Added later
I performed an experiment to find out what kind of response times are possible with HDF5.
  • Create a 2GB structure, Num, containing 55000 "texts", each of 16000 random characters. The characters are represented by uint16. (EasyH5 stored character vectors with one character per element, which ruined the performance. Probably a user mistake.)
  • Save the structure, Num, with EasyH5.
  • Read separate texts with the Matlab function, h5read().
  • Convert the uint16 vector to character.
%% Store
chr = ['0':'9',' ','a':'z',' ','A':'Z'];
u16 = @() uint16(chr(randi([1,numel(chr)],1,1.6e4)));
%%
for jj = 1:55000
    Num.(sprintf('A%05d',jj)) = u16();
    Num.(sprintf('N%05d',jj)) = randi(1000,1,200);
end
%%
saveh5( Num, 'Num.h5', 'rootname','' ); % EasyH5
%%
val = getfield( Num, 'A01999' );
vh5 = h5read( 'Num.h5', '/A01999' );
vh5 = reshape( vh5, 1,[] );
%%
tic
vh5 = h5read( 'Num.h5', '/A51999' );
vh5 = h5read( 'Num.h5', '/A41999' );
vh5 = h5read( 'Num.h5', '/A31999' );
vh5 = h5read( 'Num.h5', '/A21999' );
vh5 = h5read( 'Num.h5', '/A11999' );
vh5 = h5read( 'Num.h5', '/A01999' );
toc
% Repeated three times
Elapsed time is 0.005964 seconds.
Elapsed time is 0.005729 seconds.
Elapsed time is 0.007962 seconds.
>> tic, txt = char(reshape(vh5,1,[])); toc
Elapsed time is 0.000155 seconds.
>> txt(1:64)
ans =
'sHsfkYBk WsxuNkqX8QWwWTp oGcqH18B07NdVOpfYoORfcO4cXphI6GqPo1q9vh'
The results indicate that loading one 16000-character announcement takes approximately 1 to 1.5 milliseconds (the six reads above took 6 to 8 milliseconds in total). Using uint8 instead of uint16 didn't increase the speed significantly. Loading a 1x200 double vector takes approximately the same amount of time.
Reading data in this experiment is based on prior knowledge of the "path" to the item. It's much more complicated and time consuming to find items based on some characteristic of the data, e.g. the mean value of the double vector exceeds a certain threshold. (That's complicated enough with a structure too, but faster.)
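To illustrate why a search by content is more expensive, here is a sketch that scans every dataset in a small file and keeps those whose mean exceeds a threshold. The dataset names, values and threshold are made up for the example.

```matlab
% Sketch: find datasets whose mean exceeds a threshold by scanning the file.
h5file = [tempname '.h5'];
vals = {[1 2 3], [100 200 300], [10 20 30]};
for k = 1:numel(vals)
    name = sprintf('/N%05d', k);
    h5create(h5file, name, size(vals{k}));
    h5write(h5file, name, vals{k});
end

info  = h5info(h5file);                 % information about the root group
names = {info.Datasets.Name};           % dataset names without leading '/'
threshold = 15;
hits = {};
for k = 1:numel(names)
    v = h5read(h5file, ['/' names{k}]); % every dataset has to be read
    if mean(v(:)) > threshold
        hits{end+1} = names{k}; %#ok<AGROW>
    end
end
```

Every dataset is read once, so the cost grows with the number of items. Keeping a small index of such characteristics in memory (or in a separate dataset) would avoid the full scan, at the price of keeping it up to date.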
===================================================================================
SQL (or other type of) database
Common wisdom says that 55000 "company announcements" should be stored in an SQL-database. Where did the struct come from? If your organisation already stores these announcements in a database somewhere, then an SQL-savvy person might want to help you get started.
It's difficult to compare apples with oranges.
Before I made my HDF-file wrapper we set up a MySQL database to store the same type of experimental time series. A large part of the job was done by a database consultant. We didn't reach my goals regarding reading speed and it didn't scale well. The time to read increased with the size of the database.
This probably has little relevance to your use case, though. Either way, you need to start with some realistic timing experiments.
  1 Comment
Moritz Scherrmann on 2 Nov 2020
Edited: Moritz Scherrmann on 20 Jan 2021
Wow, thank you so much! This is a much more detailed and helpful answer than I expected. I really appreciate that.
Since we are updating frequently, in the long run, a database solution would probably be the best. However, your other two solutions sound very interesting. I think I'll try the solution with the structures first, since it requires the least changes in my code.


More Answers (1)

Peter Perkins on 19 Nov 2020
Moritz, if "I stored these variables for all documents in a large struct array" is literally true, then that's your problem. I mean, you can use HDF or datastore or whatever, but you should consider using a table in a mat file instead of a struct array in a mat file. The struct array (assuming you do not mean a scalar struct OF arrays) is not an efficient way to store homogeneous "records". Consider the following:
>> t = array2table(rand(55000,10));
>> s = table2struct(t);
>> whos t s
  Name          Size              Bytes  Class     Attributes

  s         55000x1            61600640  struct
  t         55000x10            4402930  table
That is more than a factor of 10 (about 14 here). I have no idea what your data really look like or how fast they would load as a table in a mat file, but it's worth looking at. You will also find that a table makes selecting subsets of your data much easier than a struct array.
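As a rough sketch of what that could look like for announcement data: the variable names (Company, Date, Body) and the values below are made up, and the body text is kept in a cell column because each announcement has a string array of different length. For an existing struct array, struct2table would be the natural starting point instead of building the table by hand.

```matlab
% Sketch: one row per announcement, saved as a table in a MAT-file.
% Variable names and values are hypothetical.
Company = ["Acme AG"; "Beta SE"; "Acme AG"];
Date    = datetime([2019 5 2; 2020 1 15; 2020 7 30]);   % one date vector per row
Body    = {["line 1"; "line 2"]; "single line"; ["a"; "b"; "c"]};
T = table(Company, Date, Body);

fname = [tempname '.mat'];
save(fname, 'T', '-v7');
L = load(fname);

% Select a subset with logical indexing on the table variables.
sel = L.T(L.T.Company == "Acme AG" & L.T.Date >= datetime(2020,1,1), :);
```

Here sel contains only the rows for "Acme AG" dated 2020 or later, which with a struct array would need an explicit loop or arrayfun.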
