Store different data types efficiently

Question

Moritz Scherrmann on 29 Oct 2020

https://jp.mathworks.com/matlabcentral/answers/630033-store-different-data-types-efficiently

Edited: Moritz Scherrmann on 20 Jan 2021

Hi all,

I have a dataset of company announcements. After parsing, I end up with several variables for every announcement, such as the body text (string array), the company name (string), the announcement date (double), etc. Until now, I have stored these variables for all documents in one large struct array (I have 55000 documents, so the struct array is huge). Unfortunately, it takes very long to load this struct into the workspace, and MATLAB becomes very slow once it is loaded. Do you have a recommendation for how to solve this problem?

I would be very grateful for every hint.

Thank you!

6 comments (4 older comments not shown)

Moritz Scherrmann on 29 Oct 2020

Edited: Moritz Scherrmann on 29 Oct 2020

Thank you very much for your quick response!

Yes, the announcements consist only of text and numbers.
I use the structure in several ways. One step, for example, uses the announcement dates and the company ISINs in an event study regression on the respective stock data to get market-implied sentiment labels for the announcements. Another step embeds the announcements into vector spaces using sentence-BERT, which allows clustering the announcements into topics. More broadly, I have a specific workflow in which I need different parts of the structure in different steps. This means that I definitely do not need the entire "database" in memory simultaneously; each step only needs access to one or a few variables of the dataset.
I use MAT-file version '-v7.3'.
Thank you for your hints regarding HDF and SQLite; I will look into them. With the new information from this answer in mind, which option would you try first?

Mario Malic on 29 Oct 2020

Edited: Mario Malic on 29 Oct 2020

You can also take a look at the datastore function, which deals with large collections of data. This might be more efficient, especially when you only need some files in memory. Unfortunately, I can't give specific hints as I haven't worked with it yet.

Reading 55000 documents is probably what takes most of the time; parallelisation would speed things up, if applicable.

J. Alex Lee on 29 Oct 2020

I also vote for sqlite if you need fast/random access into the data and don't mind needing to do a slow one-time import plus update insertions every time you get a new announcement file.

I've played a bit with datastore, but I find it's not really for performance (slow). And if your data is mostly text and you don't need to do aggregation computations or otherwise operate on large virtual arrays/tables in a native matlab-like language, it's not really clear to me (out of my own ignorance) that datastore will be any more useful than just creating an sqlite database.

I've tried to study how a non-mat HDF file could help me in my own application; even if you did achieve some kind of better control over the data saving, I have not been able to figure out how to do a random access of only chunks of the HDF file.

Answer 1 (accepted answer)

per isakson on 31 Oct 2020

https://jp.mathworks.com/matlabcentral/answers/630033-store-different-data-types-efficiently#answer_529784

Edited: per isakson on 2 Nov 2020

Three solutions are mentioned in the comments on your question. They all have their pros and cons, and I'll try to summarize my view. I assume you have a collection of m-functions that operate on the structure.

I use R2018b, Win10, Intel i7, 32GB ram and data on a spinning HD.

===================================================================================

Matlab structure

Maybe, splitting the structure into one structure per year would help. If nothing else it will avoid the 2GB limit.
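A related way to split the data, shown here only as a minimal sketch: save each field of a scalar struct as its own variable with the '-struct' option of save, then load just the variables a given step needs. The field names (dates, names, bodies), the values and the file name are made up for illustration and are not the OP's actual variables.

```matlab
% Hypothetical miniature of the announcement data: one field per variable.
db.dates  = [737000; 737001; 737002];             % announcement dates as doubles (assumed)
db.names  = ["Acme AG"; "Beta SE"; "Gamma GmbH"];  % company names (assumed)
db.bodies = ["first text"; "second text"; "third text"];

% '-struct' stores every field of db as a separate variable in the MAT-file.
fname = fullfile(tempdir, 'announcements_demo.mat');
save(fname, '-struct', 'db', '-v7');

% A step that only needs dates and names never reads the body texts.
part = load(fname, 'dates', 'names');
disp(fieldnames(part))
```

Whether this actually reduces the loading time for a 2GB text-heavy data set would have to be measured; the sketch only shows the mechanism.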

I made a little experiment on R2018b and might have to revise my old rule of thumb. v7.3 compares better than I anticipated.

%%
save( 'v60.mat' , 'nasa', '-v6' )
save( 'v70.mat' , 'nasa', '-v7' )
save( 'v73n.mat', 'nasa', '-v7.3', '-nocompression' )
save( 'v73c.mat', 'nasa', '-v7.3' )
%%
tic, S60  = load( 'v60.mat'  ); toc
tic, S70  = load( 'v70.mat'  ); toc
tic, S73n = load( 'v73n.mat' ); toc
tic, S73c = load( 'v73c.mat' ); toc

outputs

Elapsed time is 0.135304 seconds.
Elapsed time is 0.826981 seconds.
Elapsed time is 0.346619 seconds.
Elapsed time is 0.912298 seconds.

The loaded structs are all 0.236 GB in size. The struct, nasa, was created in an experiment with HDF5; see below. v6 loads the fastest and v7.3 without compression comes second. The differences are significant. The struct, nasa, contains mainly numerical data, so the results with your struct, which is mainly text, might differ. Caveat: it's tricky to measure the time used to load files (especially smaller files), because (on Windows) one has no control over the content of the cache system.
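Given the caching caveat, one modest precaution is to repeat each load several times and look at the spread rather than at a single number. The following is a sketch under that assumption; it uses a small stand-in file created on the spot (timing_demo.mat is a made-up name), and it does not remove cache effects, it only makes them visible.

```matlab
% Small stand-in file; point fname at the MAT-file under test instead.
fname = fullfile(tempdir, 'timing_demo.mat');
x = rand(1000);
save(fname, 'x', '-v7');

% Time the same load several times and report the spread.
nRuns = 5;
t = zeros(1, nRuns);
for k = 1:nRuns
    tStart = tic;
    tmp = load(fname);
    t(k) = toc(tStart);
end
fprintf('load: median %.3f s, min %.3f s, max %.3f s\n', median(t), min(t), max(t));
```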

Pros

You don't need to adapt your m-functions.
No new tools to learn how to use
No extra cost in time or money

Cons

there is still a loading time

Added later

I repeated the test with a structure, S, which better represents OP's structure.

%% Create a struct, which is dominated by text. The number of "announcements" and 
% the total size agrees with OP's structure. 
chr = ['0':'9',' ','a':'z',' ','A':'Z'];
txt = @() chr(randi([1,numel(chr)],1,1.6e4));
%%
for jj = 1:55000
    S.(sprintf('A%05d',jj)) = txt();
    S.(sprintf('N%05d',jj)) = randi(1000,1,200); 
end
%%
save( 'S60.mat' , 'S', '-v6' )
save( 'S70.mat' , 'S', '-v7' )
save( 'S73n.mat', 'S', '-v7.3', '-nocompression' )
save( 'S73c.mat', 'S', '-v7.3' )
%%
tic, S60  = load( 'S60.mat'  ); toc
tic, S70  = load( 'S70.mat'  ); toc
tic, S73n = load( 'S73n.mat' ); toc
tic, S73c = load( 'S73c.mat' ); toc
%
%% After having tested the code several times
% Elapsed time is 8.652943 seconds.
% Elapsed time is 7.342954 seconds.
% Elapsed time is 7.094169 seconds.
% Elapsed time is 15.339989 seconds.
% whos S*
%   Name      Size                 Bytes  Class     Attributes
% 
%   S         1x1             1867360000  struct              
%   S60       1x1             1867360176  struct              
%   S70       1x1             1867360176  struct              
%   S73c      1x1             1867360176  struct              
%   S73n      1x1             1867360176  struct              
% 
%%  Three runs directly after a restart of Matlab 
% Elapsed time is 16.041436 seconds.
% Elapsed time is 7.532245 seconds.
% Elapsed time is 45.075487 seconds.
% Elapsed time is 37.160575 seconds.
% clearvars
% Elapsed time is 10.310060 seconds.
% Elapsed time is 10.681408 seconds.
% Elapsed time is 10.978731 seconds.
% Elapsed time is 19.676905 seconds.
% clearvars
% Elapsed time is 9.340518 seconds.
% Elapsed time is 8.342259 seconds.
% Elapsed time is 8.600243 seconds.
% Elapsed time is 17.223713 seconds.

I don't understand why the result of the first run after restart of Matlab differs so much from the rest. However, it looks like v7.0 is the best alternative for a 2GB structure dominated by text.

===================================================================================

HDF5

Caveat: I'm biased. I've made a Matlab class (a HDF-file wrapper), which writes and reads experimental time series to/from an HDF5-file. Before we loaded selected data into a structure and visualized the data with an interactive system. With the new class we skip the structure altogether and the visualisation tool gets data directly from the HDF-file. The HDF-file replaces the struct in our code. Admittedly, this new application is still in an experimental stage. To gain speed and avoid problems with varying length strings, I store text as uint8 and convert back and forth.
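The answer does not say which conversion is used for the uint8 round trip. One possibility, sketched here as an assumption, is an explicit UTF-8 encoding with unicode2native and native2unicode, which also survives characters outside the ASCII range. The sample text is invented.

```matlab
% Sample text with two non-ASCII characters: u-umlaut (252) and the euro sign (8364).
txt = ['Umsatz ', char(252), 'ber Plan, Kurs 5 ', char(8364)];

bytes = unicode2native(txt, 'UTF-8');    % char -> uint8 bytes
back  = native2unicode(bytes, 'UTF-8');  % uint8 bytes -> char

disp(class(bytes))
disp(isequal(back, txt))
```

A plain uint8(txt) cast is enough for pure ASCII text, but it saturates at 255 for characters with larger code points, so the original text could not be recovered in that case.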

I found two tools in the File Exchange, which I tried.

HDF2Struct by Luca Amerio. In the first comment apple reported that the tool failed with a Nasa-file. I fixed the issue and read the file to a struct, which I call nasa.
EasyH5 by Qianqian Fang. This tool both reads and writes HDF5-files. Directly out of the box I loaded a 60MB "Nasa-file" and saved the result to a new h5-file. Both operations worked at first attempt. I inspected the result with HDFView 3.0.

Porting your system to HDF5 might include the following steps

Write your struct to an HDF-file with EasyH5. I believe that will work without problems. The attribute feature of HDF5 will not be used.
Make some experiments to decide whether it's possible to reach your carefully formulated goals. Thanks to EasyH5 that should be possible to do in a limited amount of time.
Replace the struct in your m-code by the HDF-file. In the old days we referenced and assigned structs with the two functions getfield() and setfield(). The HDF read and write are very similar to these two Matlab functions. I think it's doable, but it might take more time than I hint here.
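To illustrate the getfield()/setfield() analogy in the last step, here is a minimal sketch that uses only the built-in functions h5create, h5write and h5read instead of EasyH5. The file name, the dataset path /A00001 and the text are placeholders.

```matlab
fname = fullfile(tempdir, 'announcements_demo.h5');
if exist(fname, 'file') == 2
    delete(fname);   % h5create errors if the dataset already exists
end

body = uint16('Example announcement text');   % text stored as numbers

% Counterpart of setfield(): create the dataset, then write the value.
h5create(fname, '/A00001', size(body), 'Datatype', 'uint16');
h5write(fname, '/A00001', body);

% Counterpart of getfield(): read one item by its path; the rest of the
% file is not loaded.
raw = h5read(fname, '/A00001');
txt = char(reshape(raw, 1, []));
disp(txt)
```

The main difference from a struct is that the "field name" is a path in the file, so the m-functions need a file name and a path where they previously took a struct and a field name.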

Added later

I performed an experiment to find out what kind of response times are possible with HDF5.

Create a 2GB structure, Num, containing 55000 "texts", each of 16000 random characters. The characters are represented by uint16. (EasyH5 stored character vectors with one character per element, which ruined the performance. Probably a user mistake.)
Save the structure, Num, with EasyH5.
Read separate texts with the Matlab function, h5read().
Convert the uint16 vector to character.

%% Store
chr = ['0':'9',' ','a':'z',' ','A':'Z'];
u16 = @() uint16(chr(randi([1,numel(chr)],1,1.6e4)));
%%
for jj = 1:55000
    Num.(sprintf('A%05d',jj)) = u16();
    Num.(sprintf('N%05d',jj)) = randi(1000,1,200); 
end
%%
saveh5( Num, 'Num.h5', 'rootname','' );     % EasyH5
%%
val = getfield( Num, 'A01999' );
vh5 = h5read( 'Num.h5', '/A01999' );
vh5 = reshape( vh5, 1,[] );
%%
tic
vh5 = h5read( 'Num.h5', '/A51999' );
vh5 = h5read( 'Num.h5', '/A41999' );
vh5 = h5read( 'Num.h5', '/A31999' );
vh5 = h5read( 'Num.h5', '/A21999' );
vh5 = h5read( 'Num.h5', '/A11999' );
vh5 = h5read( 'Num.h5', '/A01999' );
toc
% Repeated three times
% Elapsed time is 0.005964 seconds.
% Elapsed time is 0.005729 seconds.
% Elapsed time is 0.007962 seconds.
%%
tic, txt = char(reshape(vh5,1,[])); toc
% Elapsed time is 0.000155 seconds.
txt(1:64)
% ans =
%     'sHsfkYBk WsxuNkqX8QWwWTp oGcqH18B07NdVOpfYoORfcO4cXphI6GqPo1q9vh'

The results indicate that loading one 16000-character announcement takes roughly 1 to 1.5 milliseconds (six reads took 6 to 8 milliseconds in total). Using uint8 instead of uint16 didn't increase the speed significantly. Loading a 1x200 double vector takes approximately the same amount of time.

Reading data in this experiment is based on prior knowledge of the "path" to the item. It's much more complicated and time consuming to find items based on some characteristic of the data, e.g. the mean value of the double vector exceeds a certain threshold. (That's complicated enough with a structure too, but faster.)
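For the in-memory structure, a search by a characteristic of the data amounts to a scan over the fields. The sketch below uses a five-item miniature of the struct from the experiments; the values and the threshold are chosen only so that the result is easy to check.

```matlab
% Miniature of the test struct: A-fields hold text, N-fields hold doubles.
S = struct();
for jj = 1:5
    S.(sprintf('A%05d', jj)) = 'some text';
    S.(sprintf('N%05d', jj)) = jj * ones(1, 200);   % mean equals jj
end

% Scan all numeric fields and keep those whose mean exceeds the threshold.
names    = fieldnames(S);
numNames = names(startsWith(names, 'N'));
means    = cellfun(@(n) mean(S.(n)), numNames);
hits     = numNames(means > 3);                     % illustrative threshold
disp(hits)
```

With the HDF-file, the same scan would have to read every candidate dataset from disk first, which is where the extra time goes.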

===================================================================================

SQL (or other type of) database

Common wisdom says that 55000 "company announcements" should be stored in an SQL-database. Where did the struct come from? If your organisation already stores these announcements in a database somewhere, then a SQL-savvy person might want to help you getting started.

It's difficult to compare apples with oranges.

Before I made my HDF-file wrapper we set up a MySQL database to store the same type of experimental time series. A large part of the job was done by a database consultant. We didn't reach my goals regarding reading speed and it didn't scale well: the time to read increased with the size of the database.

This probably has little relevance to your use case, though. Either way, you need to start with some realistic timing experiments.

1 comment

Moritz Scherrmann on 2 Nov 2020

Edited: Moritz Scherrmann on 20 Jan 2021

Wow, thank you so much! This is a much more detailed and helpful answer than I expected. I really appreciate that.

Since we are updating frequently, in the long run, a database solution would probably be the best. However, your other two solutions sound very interesting. I think I'll try the solution with the structures first, since it requires the least changes in my code.

Answer 2

Peter Perkins on 19 Nov 2020

https://jp.mathworks.com/matlabcentral/answers/630033-store-different-data-types-efficiently#answer_549283

Moritz, if "I stored these variables for all documents in a large struct array" is literally true, then that's your problem. I mean, you can use HDF or datastore or whatever, but you should consider using a table in a mat file instead of a struct array in a mat file. The struct array (assuming you do not mean a scalar struct OF arrays) is not an efficient way to store homogeneous "records". Consider the following:

>> t = array2table(rand(55000,10));
>> s = table2struct(t);
>> whos t s
  Name          Size               Bytes  Class     Attributes
  s         55000x1             61600640  struct              
  t         55000x10             4402930  table               

That is a factor of about 14 (61600640 versus 4402930 bytes). I have no idea what your data really look like or how fast they would load as a table in a mat file, but it's worth looking at. You will also find that a table makes selecting subsets of your data much easier than a struct array.
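As a small illustration of the subset selection, here is a sketch with an invented three-row table; the variable names (company, annDate, body) and the values are assumptions, not the OP's data. An existing struct array could be converted with struct2table instead of building the table by hand.

```matlab
% Hypothetical announcement records, one row per announcement.
company = ["Acme AG"; "Beta SE"; "Acme AG"];
annDate = [737000; 737400; 737800];
body    = ["first text"; "second text"; "third text"];
t = table(company, annDate, body);

% Select rows by conditions on table variables.
recent = t(t.company == "Acme AG" & t.annDate > 737500, :);
disp(recent)

% The table goes into a MAT-file like any other variable.
fname = fullfile(tempdir, 'announcements_table_demo.mat');
save(fname, 't');
loaded = load(fname, 't');
```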
