Multi-thread parsing and loading thousands of csv files

Question

George Li on 12 Jun 2024


Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/2127781-multi-thread-parsing-and-loading-thousands-of-csv-files

Commented: George Li on 12 Jun 2024

I have a folder with 2500 CSV files, each 15 MB. My current script, shown at the bottom of this post, reads each file into a cell array one at a time.

Unfortunately, opening the files serially like this takes a very long time.

Ideally, I would like to multi-thread the import or open multiple CSV files in parallel. Each 'thread' could save into its own set of cell arrays, to be combined and sorted later, or everything could go into one big cell array as it does currently.

%% IMPORT FILES
directory = '\\headnode\userdata\George\ANSTO\ANSTO Day 2\Data\D14\';
datafiles = dir(append(directory,'*.csv'));
N=length(datafiles);
a = 0;
data = cell(1,N);
f = waitbar(a,'Importing Data...');
for i = 1:N
    data{i} = read_csv(strcat(datafiles(i).folder, '\', datafiles(i).name));
    waitbar(i/N,f);
end
waitbar(1,f);
close(f);

2 Comments

Stephen23 on 12 Jun 2024

Alternative: avoid loading them all into memory by using a datastore:

https://www.mathworks.com/help/matlab/large-files-and-big-data.html

https://www.mathworks.com/help/matlab/datastore.html

https://www.mathworks.com/help/matlab/tall-arrays.html
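
As a rough illustration of the datastore route, the sketch below reads one file at a time with a fileDatastore, so only a single file is held in memory. The function name countCsvRows and the row-counting step are hypothetical placeholders for whatever per-file processing is actually needed, and readmatrix is only assumed to be a suitable reader for these files.

```matlab
function totalRows = countCsvRows(directory)
% Hypothetical example: total number of rows across all CSV files in DIRECTORY.
% Only one file is held in memory at a time.
ds = fileDatastore(directory, ...
    'ReadFcn', @readmatrix, ...          % assumed reader; swap in a custom function
    'FileExtensions', '.csv');           % ignore non-CSV files in the folder
totalRows = 0;
while hasdata(ds)
    m = read(ds);                        % reads the next file with ReadFcn
    totalRows = totalRows + size(m, 1);  % placeholder for real per-file processing
end
end
```

Calling countCsvRows(directory) walks the folder file by file. This trades speed for a small memory footprint, so it is mainly of interest when the full data set does not fit in RAM.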

George Li on 12 Jun 2024

Thanks Stephen. I tried the datastore approach with a fileDatastore, using the above code in function form as the custom read function and readall with the parallel option on, and it was significantly slower than the parfor method. Fortunately, all the data fits into memory at the current stage.


Answer 1

Ganesh on 12 Jun 2024


Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/2127781-multi-thread-parsing-and-loading-thousands-of-csv-files#answer_1470766

Edited: Ganesh on 12 Jun 2024

Hi @George Li,

You can parallelize the import by replacing the "for" loop with a "parfor" loop. Running parfor in parallel requires a "Parallel Computing Toolbox" license. The implementation would look as follows:

%% IMPORT FILES IN PARALLEL
directory = '\\headnode\userdata\George\ANSTO\ANSTO Day 2\Data\D14\';
datafiles = dir(append(directory,'*.csv'));
N = length(datafiles);
data = cell(1, N);
if isempty(gcp('nocreate'))
    parpool; % Adjust the number of workers as needed, e.g., parpool(4)
end
% Using parfor for parallel processing
parfor i = 1:N
    % fullfile builds the path with the correct file separator for the platform
    data{i} = readmatrix(fullfile(datafiles(i).folder, datafiles(i).name));
end
% waitbar cannot be updated directly from inside parfor; consider alternative progress indication
disp('Data Import Complete');
delete(gcp('nocreate')); % You may choose to delete the parpool

The limitation of this approach is that you cannot update the "waitbar" directly from inside the loop, because the iterations run in parallel on the workers. You also need to make sure that you have enough RAM to hold the contents of all the ".csv" files. From your description, the files alone come to over 36 GB! The slowdown might also be due to the same reason.
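
To make the size estimate concrete: 2500 files at 15 MB each is 37,500 MB, which is 37.5 GB at 1000 MB per GB or about 36.6 GB at 1024 MB per GB. A quick way to check the actual figure is to sum the sizes reported by dir. The helper name csvFolderSizeGB below is hypothetical, and the numeric data in memory can be larger or smaller than the text on disk, so treat the result as a rough guide only.

```matlab
function totalGB = csvFolderSizeGB(directory)
% Hypothetical helper: combined on-disk size of the CSV files in DIRECTORY, in GB.
datafiles = dir(fullfile(directory, '*.csv'));
totalGB = sum([datafiles.bytes]) / 1e9;   % dir reports each file's size in bytes
end
```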

You might want to consider processing the CSV files in batches.
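
A minimal sketch of the batch idea, assuming each file can be reduced to a small result as soon as it has been read: the function name csvColumnMeansInBatches, the batchSize argument, and the column-mean reduction are all hypothetical stand-ins for the real processing.

```matlab
function colMeans = csvColumnMeansInBatches(directory, batchSize)
% Hypothetical example: read CSV files in batches and keep only a small
% per-file summary (column means), so at most batchSize files are in memory.
datafiles = dir(fullfile(directory, '*.csv'));
N = numel(datafiles);
colMeans = cell(1, N);
for first = 1:batchSize:N
    idx = first:min(first + batchSize - 1, N);   % file indices in this batch
    files = datafiles(idx);
    batch = cell(1, numel(idx));
    parfor k = 1:numel(idx)                      % read the batch in parallel
        batch{k} = readmatrix(fullfile(files(k).folder, files(k).name));
    end
    % Placeholder reduction: replace with the real per-file processing.
    colMeans(idx) = cellfun(@(m) mean(m, 1), batch, 'UniformOutput', false);
end
end
```

The batch size is a trade-off: larger batches keep the workers busier, while smaller batches lower the peak memory use.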

2 Comments

Sam Marshalik on 12 Jun 2024

Just wanted to mention that you can use a DataQueue to keep a waitbar with parfor or parfeval. You can learn more about it here: Send and listen for data between client and workers - MATLAB (mathworks.com). This lets you read the files in parallel and still keep track of how many files have been read and how many are left.
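
A minimal sketch of the DataQueue idea, assuming the same readmatrix-based import as in the answer: the function names importCsvWithProgress and tick are hypothetical, and keeping the counter in the waitbar figure's UserData is just one way to track progress on the client.

```matlab
function data = importCsvWithProgress(directory)
% Read every CSV in DIRECTORY in parallel and show progress on a waitbar.
datafiles = dir(fullfile(directory, '*.csv'));
N = numel(datafiles);
data = cell(1, N);

f = waitbar(0, 'Importing Data...');
f.UserData = 0;                              % completed-file counter, kept on the client
q = parallel.pool.DataQueue;                 % workers send to it, the client listens
afterEach(q, @(~) tick(f, N));               % callback runs on the client per message

parfor i = 1:N
    data{i} = readmatrix(fullfile(datafiles(i).folder, datafiles(i).name));
    send(q, i);                              % report that file i is finished
end
if isvalid(f), close(f); end
end

function tick(f, N)
% Increment the counter and redraw the waitbar.
if ~isvalid(f), return; end                  % waitbar already closed
f.UserData = f.UserData + 1;
waitbar(f.UserData / N, f);
end
```

The workers only send a message; all waitbar updates happen on the client, which is why this works even though the loop body itself cannot touch the figure.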

George Li on 12 Jun 2024

Thank you both. I tried parfor with a pool of 36 processes, and it now takes 178 s to ingest all the data, versus ~20 min with the original for loop! Thank you for your help, this is perfect.

Fortunately, this is short enough now that I don't really need a waitbar anymore, but I will try out the DataQueue method you posted anyway.
