speed up renamecats/categorical multiple columns

Question

Peng Li, 12 May 2020

Direct link to this question

https://jp.mathworks.com/matlabcentral/answers/525001-speed-up-renamecats-categorical-multiple-columns

Commented: Peng Li, 9 Oct 2020

I have a huge CSV file of about 16 GB with over 9k columns. Each column is initially filled with codes (either integers or strings), and I have a code book that gives the code and its meaning for each column. What I'm trying to do is translate the table so that it finally contains readable text instead of codes.

I can use either categorical or renamecats to "translate" them, but the issue is that it takes a substantially long time to loop through these columns. I'm wondering if there is a way to speed this up.

See below an example

% Example table: each column holds raw codes stored as strings.
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["123", "234", "345", "456", "567"]', ...
    'VariableNames', {'A', 'B'});
% Code book: one lookup table (Code -> Meaning) per column of tbl.
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["apple", "banana", "cat", "dog", "elephant"]', ...
    'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
    ["East", "West", "North", "South", "Middle"]', ...
    'VariableNames', {'Code', 'Meaning'});
% Convert each column to categorical, using the codes as the value set
% and the meanings as the category names.
Vars   = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
    tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
        dictionary.(Vars{iC}).Meaning);
end
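For comparison, the renamecats route mentioned above might look roughly like the sketch below. It is only an illustration on the same toy data: the variable name codebook is a stand-in for the code book, and the ismember filter is there because, as far as I know, renamecats raises an error when asked to rename a category that does not exist in the data.

```matlab
% Same toy data as in the example, with codes stored as strings.
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["123", "234", "345", "456", "567"]', ...
    'VariableNames', {'A', 'B'});
% Hypothetical code book (named codebook here for illustration).
codebook.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["apple", "banana", "cat", "dog", "elephant"]', ...
    'VariableNames', {'Code', 'Meaning'});
codebook.B = table(["123", "234", "345", "456", "567"]', ...
    ["East", "West", "North", "South", "Middle"]', ...
    'VariableNames', {'Code', 'Meaning'});

vars = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
    c  = categorical(tbl.(iC));          % categories are the unique codes
    cb = codebook.(vars{iC});
    % Only rename codes that actually occur in this column.
    present = ismember(cb.Code, string(categories(c)));
    tbl.(iC) = renamecats(c, cb.Code(present), cb.Meaning(present));
end
```

This still loops over the columns, so it does not remove the per-column cost; it only shows the second of the two approaches on the same data.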

Is it possible to avoid this loop, or are there any suggestions for speeding this up (considering that I have over 500k rows and 9k columns)?

Thank you!


Answer 1

Campion Loong, 9 Oct 2020

Direct link to this answer

https://jp.mathworks.com/matlabcentral/answers/525001-speed-up-renamecats-categorical-multiple-columns#answer_509771


Hi Peng,

It seems you have the Dictionary code book to begin with, and you already know which sets of codes go with which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).

In this case, why not create the table with categorical to begin with:

tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"],      dictionary.A.Code, dictionary.A.Meaning),...
            categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
            'VariableNames', {'A', 'B'});

There is no loop, and it is faster and much more readable.

3 Comments

Campion Loong, 9 Oct 2020

If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.

If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:
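A minimal sketch of that idea, assuming a small stand-in CSV file and a hypothetical code book (codeA, meanA, codeB, meanB are made up for illustration). detectImportOptions and setvartype are used so that the code columns arrive as categorical, and renamecats then swaps the codes for their meanings:

```matlab
% Write a tiny stand-in CSV so the sketch is self-contained.
csvFile = [tempname '.csv'];
writetable(table(["a1"; "b2"; "c3"], [123; 234; 345], ...
    'VariableNames', {'A', 'B'}), csvFile);

% Ask for both columns to be imported as categorical.
opts = detectImportOptions(csvFile);
opts = setvartype(opts, {'A', 'B'}, 'categorical');
tbl  = readtable(csvFile, opts);

% Hypothetical code book entries for the two columns.
codeA = ["a1", "b2", "c3"];    meanA = ["apple", "banana", "cat"];
codeB = ["123", "234", "345"]; meanB = ["East", "West", "North"];

% Renaming touches only the category names, not every row.
tbl.A = renamecats(tbl.A, codeA, meanA);
tbl.B = renamecats(tbl.B, codeB, meanB);

delete(csvFile);   % clean up the temporary file
```

Whether this is faster at the scale described in the question has not been measured here; the sketch only shows where the conversion to categorical can be moved into the import step.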

Peng Li, 9 Oct 2020

Hi Campion, thanks again for your attention. A while ago I tried different options: tall arrays, datastore, transforming a datastore, mapreduce, and readall on a server (over 380 GB of RAM). That part is easily manageable.

The issue is with the decoding part. It is simply too slow to do in a loop. And I guess ImportOptions can't help with decoding the actual data, as I have to load the data first and then do the decoding.

I've also tried using a transformed datastore. Basically, I do the decoding in the transform function and then write the datastore to disk. It works, but it is slow too.
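For readers who want to see the shape of that approach, here is a rough sketch under stated assumptions: a tiny stand-in CSV, a hypothetical one-column code book (cb), and an anonymous decode function. It is not my actual code, and the step that writes the result to disk is left out; the chunks are simply collected in memory.

```matlab
% Tiny stand-in CSV: column A holds codes, RowId is passed through.
csvFile = [tempname '.csv'];
writetable(table(["a1"; "b2"; "c3"; "a1"], (1:4)', ...
    'VariableNames', {'A', 'RowId'}), csvFile);

% Hypothetical code book for column A.
cb = table(["a1"; "b2"; "c3"], ["apple"; "banana"; "cat"], ...
    'VariableNames', {'Code', 'Meaning'});

% Read the file in chunks of two rows, with text imported as strings.
ds = tabularTextDatastore(csvFile, 'TextType', 'string');
ds.ReadSize = 2;

% Decode column A in each chunk; keep RowId unchanged.
decodeChunk = @(t) [table(categorical(t.A, cb.Code, cb.Meaning), ...
    'VariableNames', {'A'}), t(:, {'RowId'})];
tds = transform(ds, decodeChunk);

% Pull the decoded chunks one at a time and stack them.
decoded = table();
while hasdata(tds)
    decoded = [decoded; read(tds)]; %#ok<AGROW>
end

delete(csvFile);   % clean up the temporary file
```

With thousands of columns the decode step would need a named function that loops over the columns, which is exactly the part that stays slow.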

I have several workable solutions now, but none of them gives me the speed I would like. The single file is around 20 GB in CSV format, with over half a million rows and almost ten thousand columns. On my server this task takes over 24 hours, so I guess I just need to be a bit patient and let the server work while I'm doing something else.
