filedatastore; read M mat files of size (1xL) into single tall array that is of size (1, L*M)

Alex Hogg on 13 May 2022
Commented: Alex Hogg on 16 May 2022
I have an extremely large collection of data distributed across multiple mat files, and I want to compute some statistics on it as a whole entity, so I need to load the files through a fileDatastore. The dimensions of the resulting tall array are stopping me from getting the exact stats I am after.
Before the question is inevitably asked: I can't read all the files normally, as they won't all fit in RAM at the same time. If they did, I would simply append the array from each file as a normal matrix operation.
Mat file format:
Each file is identical in size; there is a single variable in each file of size [1 x L].
In the directory consisting of the mat files for the datastore, the number of files is M.
Attempted datastore code:
So far I've managed to get a datastore created of my desired data format, however the dimensions of the tall array are preventing me from getting the exact stats I'm after.
%find files in directory with mat file extension (all matching formats)
Datastore_Files = Search_Files(Datastore_Directory, ".mat");
%create absolute file path to each individual file, make string array for datastore input
Datastore_File_List = string(fullfile({Datastore_Files.folder}, {Datastore_Files.name}));
%create single tall datastore from array of mat files
File_Data_Store = tall(fileDatastore(Datastore_File_List, 'ReadFcn', @(x)table2array(struct2table(load(x))), 'UniformRead', true));
%get data store size
File_Data_Store_Size = gather(size(File_Data_Store));
disp(File_Data_Store_Size)
M L
Stats issue:
My issue due to the datastore dimensions is that for example, if I then perform a function such as mean(), I end up with a mean value per-file, rather than getting a single value representing the mean value for the entire datastore as a whole.
%Returns mean value per-file; not a single value for the whole dataset.
Test1 = gather(mean(File_Data_Store))
disp(size(Test1))
1 M
%Also returns mean value per-file; not a single value for the whole dataset.
Test2 = gather(mean(File_Data_Store(:,:)))
disp(size(Test2))
1 M
Attempted workarounds:
As above, I can't seem to use the usual trick for a standard matrix: if you had a multidimensional array A and wanted a single mean() value representing the entire array, you could just use mean(A(:)).
I also can't use reshape, as you can't change the size of the first dimension of a tall array.
T = reshape(File_Data_Store, 1, File_Data_Store_Size(1)*File_Data_Store_Size(2))
Error using tall/reshape (line 17)
Reshaping the first dimension of tall arrays is not supported.
Question:
Is there a way for me to concatenate the output from each file during the datastore creation such that I end up with a single tall array of dimensions [M*L, 1] instead of [M, L]?
Alternatively, is there a way I am unaware of for performing operations on a tall array as a whole; rather than each column independently (each file)?
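For the mean specifically, one possible route on an M-by-L tall array is to reduce along the second dimension first and then along the tall dimension. The sketch below uses a small in-memory stand-in; the names `A_small` and `tA` and the sizes are illustrative and not from the original post. It only covers statistics that decompose into sums.

```matlab
% Small stand-in for the M-by-L tall array (hypothetical sizes and values).
M = 4; L = 5;
A_small = reshape(1:M*L, M, L);
tA = tall(A_small);

% Sum across columns first (tall M-by-1), then down the tall dimension (scalar).
total = gather(sum(sum(tA, 2), 1));

% Every file has the same length L, so the element count is simply M*L.
overall_mean = total / (M * L);
disp(overall_mean)
```

Statistics that cannot be built from sums and counts (a median, for example) would still need the data arranged as a single column.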
  1 Comment
dpb on 13 May 2022
See mapreduce and an example of mean() <Compute-mean-value-with-mapreduce>
There are also tall arrays and gather.
I've not used any of the above "in anger" so can only point at the doc...
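The sketch below is not mapreduce itself. It is a plain loop over a fileDatastore that follows the same idea of accumulating a sum and a count one file at a time, so only one file is in memory at once. The temporary files, the variable name `data`, and the sizes are hypothetical test data, not from the thread.

```matlab
% Create M small MAT-files, each holding one 1-by-L variable (hypothetical data).
M = 3; L = 4;
tmpDir = tempname;
mkdir(tmpDir);
for k = 1:M
    data = (1:L) + (k - 1) * L;
    save(fullfile(tmpDir, sprintf('file%02d.mat', k)), 'data');
end

% Read function taken from the question; each read returns one file's 1-by-L vector.
fcn = @(x) table2array(struct2table(load(x)));
ds = fileDatastore(tmpDir, 'ReadFcn', fcn, 'FileExtensions', '.mat');

% Accumulate the sum and the element count file by file.
runningSum = 0;
runningCount = 0;
while hasdata(ds)
    x = read(ds);
    runningSum = runningSum + sum(x(:));
    runningCount = runningCount + numel(x);
end
overallMean = runningSum / runningCount;

rmdir(tmpDir, 's');   % remove the temporary test files
```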


Accepted Answer

Jeremy Hughes on 13 May 2022
In general, you'll have better luck identifying where the problem lies by looking at each piece of the code separately.
fcn = @(x)table2array(struct2table(load(x))); % Issue may be here
ds = fileDatastore(Datastore_File_List, 'ReadFcn', fcn, 'UniformRead', true);
A = tall(ds)
If fcn returns a 1-by-L vector, then tall will eventually try to create an M-by-L array from that data. Calling mean on that array returns the mean of each column, i.e. a 1-by-L array.
A = rand(3,10)
A = 3×10
0.9645 0.9308 0.0643 0.4233 0.2423 0.7365 0.1519 0.0225 0.2807 0.4034
0.7564 0.6492 0.9459 0.3899 0.9306 0.3297 0.2200 0.2077 0.3861 0.1871
0.2675 0.1926 0.5899 0.9457 0.9813 0.3425 0.3606 0.6717 0.0494 0.0663
m = mean(A)
m = 1×10
0.6628 0.5909 0.5334 0.5863 0.7181 0.4696 0.2442 0.3006 0.2387 0.2189
I think what you are asking for is the mean of the whole array. For in-memory arrays, I would do this:
m = mean(A(:))
m = 0.4563
But tall probably won't like that.
If you modify that fcn to return the transpose,
fcn = @(x)table2array(struct2table(load(x)))'; % Note the added ' transpose character.
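Now each read returns an L-by-1 array instead, and the tall array should represent an (M*L)-by-1 array, on which you can call mean and get a single value.
A = tall(fileDatastore(Datastore_File_List, 'ReadFcn', fcn, 'UniformRead', true)); % rebuild the tall array with the transposed fcn
m = gather(mean(A))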
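To check the shape end to end, here is a minimal sketch that writes a few small MAT-files and reads them back with the transposed read function. The temporary directory, file names, variable name `data`, and sizes are hypothetical test data; only the read function and the datastore options come from the thread.

```matlab
% Write M small MAT-files, each holding one 1-by-L variable (hypothetical data).
M = 3; L = 4;
tmpDir = tempname;
mkdir(tmpDir);
for k = 1:M
    data = (1:L) + (k - 1) * L;   % row vector, 1-by-L
    save(fullfile(tmpDir, sprintf('file%02d.mat', k)), 'data');
end

% List the files and build absolute paths, as in the question.
files = dir(fullfile(tmpDir, '*.mat'));
fileList = string(fullfile({files.folder}, {files.name}));

% The read function returns each file's variable as an L-by-1 column.
fcn = @(x) table2array(struct2table(load(x)))';
ds = fileDatastore(fileList, 'ReadFcn', fcn, 'UniformRead', true);
A = tall(ds);

sz = gather(size(A));   % should come out as [M*L, 1]
m  = gather(mean(A));   % single mean over every sample in every file

rmdir(tmpDir, 's');     % remove the temporary test files
```

Because each read is now a column, the reads are concatenated along the tall (first) dimension, which is why the result is one long column rather than an M-by-L array.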
  1 Comment
Alex Hogg on 16 May 2022
This works perfectly; I attempted this but must have put the transpose on the internal table read, rather than on the matrix after conversion from the table.
Thank you :)

Release

R2021a
