Faster ways to deal with bigger data (1 to 10 TB ish)

Can Atalay on 26 Oct 2021
Commented: Can Atalay on 26 Oct 2021
I have a few thousand large .csv files (each is 8 GB max.) on my hard drive that I absolutely have to read top to bottom to do basic operations on them (see the attachment to get an idea of what's in them). I want to convert them to .mat files after reading them with readtable(), but reading them takes days, and I need them fast. Could you help optimize my plan for converting them to a more manageable format via MATLAB in a short time within my ~30 USD budget? I'm not expecting y'all to teach me things from scratch or give long answers, but if you have any links I could check out, or even a single bit of improvement, I'd be grateful. I'm just looking for some direction.
My current plan is to:
1- Upload the .csv files to my cloud storage from my hard drive
2- Get an EC2 instance with ~32GB RAM and download everything there
3- readtable() all of the .csv files in a for loop
4- convert the cell {"True";"False";..;"True"} columns to 1s and 0s for all tables (which would make everything a double)
5- split doubles by their columns for faster access in the future
6- save all (column) doubles as .mat files with a simple filename convention
7- upload all .mat files back to my cloud storage
8- download them back to my hard drive
Note 1: I have relatively fast upload/download speeds, but my PC overheats, so I can't really split the files and read them locally without breaking something. Hence the cloud + download idea, but I'm open to other suggestions.
Note 2: The 4th and 5th columns aren't always equal to each other, and the 7th and 8th aren't always true or always false respectively. They're all random.
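Steps 3 to 6 of the plan above can be sketched roughly as follows. This is only a minimal illustration, not a tuned solution: the folder names, the tiny demo file, the column names, and the `<file>_<column>.mat` naming convention are all hypothetical, and it assumes every text column holds only "True"/"False" while every other column is numeric.

```matlab
% Sketch of steps 3-6 for a folder of CSV files (all names here are hypothetical).
root   = tempname;                        % assumed scratch location
inDir  = fullfile(root, 'csv_in');        % assumed input folder
outDir = fullfile(root, 'mat_out');       % assumed output folder
mkdir(inDir);
mkdir(outDir);

% Tiny stand-in file so the sketch runs on its own; the real files are far larger.
demo = table([1; 2; 3], ["True"; "False"; "True"], 'VariableNames', {'x', 'flag'});
writetable(demo, fullfile(inDir, 'demo1.csv'));

files = dir(fullfile(inDir, '*.csv'));
for k = 1:numel(files)
    % Step 3: read the whole file, importing text columns as strings
    T = readtable(fullfile(files(k).folder, files(k).name), 'TextType', 'string');
    [~, base] = fileparts(files(k).name);
    names = T.Properties.VariableNames;
    for j = 1:numel(names)
        col = T.(names{j});
        if isstring(col)
            % Step 4 (assumption: string columns contain only "True"/"False")
            col = (col == "True");
        end
        data = double(col);               % step 5: one double vector per column
        % Step 6: one MAT-file per column, named <file>_<column>.mat
        save(fullfile(outDir, sprintf('%s_%s.mat', base, names{j})), 'data');
    end
end
```

Saving one column per file means a later `load` only touches the column it needs, at the cost of many small files per .csv.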
4 Comments
Ive J on 26 Oct 2021
Tall arrays backed by a datastore can be much faster than readtable when you're dealing with big data. Consider the following:
ds = tabularTextDatastore('sample.txt', 'TextType', 'string'); % strings are much more convenient to handle than cell arrays of char
% adjust other datastore options here if needed
tt = tall(ds); % tall table: evaluated lazily, in chunks
% do QC, filtering, etc. steps (you're safe, these are deferred and won't affect your RAM usage!):
% e.g.:
tt.(7) = tt.(7) == "True"; % "True" -> true, anything else -> false (already logical)
tt.(8) = tt.(8) == "True"; % similarly for column 8
T = gather(tt); % now read the clean table into memory (the result must fit in RAM)
% save to a MAT-file: by converting the table into a scalar struct and saving
% with '-struct', each table variable becomes its own variable in the file,
% so loading/accessing variables can be easier/more efficient: e.g. when you
% need only the second variable, you can just
% S = load("chunk1.mat", 'Var2'); Var2 = S.Var2;
s = table2struct(T, 'ToScalar', true);
save("chunk1.mat", '-struct', 's')
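Since gather needs the whole result to fit in memory, a variation that might suit 8 GB files is to read the datastore chunk by chunk and save each chunk as it arrives. The following is a minimal sketch under stated assumptions: the demo file, its column names (`x`, `flag`), the output folder, and the very small `ReadSize` are all made up for illustration.

```matlab
% Tiny stand-in file (hypothetical name and columns) so the sketch runs on its own.
csvFile = [tempname '.csv'];
demo = table((1:6)', ["True"; "False"; "True"; "True"; "False"; "False"], ...
    'VariableNames', {'x', 'flag'});
writetable(demo, csvFile);

ds = tabularTextDatastore(csvFile, 'TextType', 'string');
ds.ReadSize = 4;                  % max rows per read; use a much larger value for real data

outDir = tempname;                % assumed output folder
mkdir(outDir);

k = 0;
while hasdata(ds)
    T = read(ds);                 % next chunk as an ordinary in-memory table
    T.flag = (T.flag == "True");  % "True" -> true, anything else -> false
    k = k + 1;
    s = table2struct(T, 'ToScalar', true);
    % Each table variable becomes its own variable inside chunk<k>.mat
    save(fullfile(outDir, sprintf('chunk%d.mat', k)), '-struct', 's');
end
```

Only one chunk is held in memory at a time, so peak RAM use is governed by `ReadSize` rather than by the file size.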
Can Atalay on 26 Oct 2021
Thanks a bunch! This will help me big time working through the bigger ones :)

Release: R2021b
