Quick conversion of a large (and with mixed data types) cell array into a numerical array

16 ビュー (過去 30 日間)
Sim
Sim 2020 年 2 月 17 日
編集済み: Sim 2020 年 2 月 17 日
Hi, I have a 800'000 x 20 cell array and I would like to convert it into a 800'000 x 20 double precision array.
My cell array has the following features:
  • it is a mix of data types, i.e. some columns are character vectors (which represent numbers) and some columns are numbers.
  • within a column of character vectors, the number of characters is different among the rows.
Here, an example/part of my cell array:
a =
10×3 cell array
{'1170790,889999999897555'} {'1009,697999999999979'} {[ 1]}
{'1170260,709000000031665'} {'869,886999999999944' } {[ 2]}
{'1181959,743999999947846'} {'504,125999999999976' } {[ 3]}
{'1174019,564000000013039'} {'674,142000000000053' } {[ 4]}
{'1173729,793000000063330'} {'689,635999999999967' } {[ 5]}
{'1181555,257999999914318'} {'509,391999999999996' } {[ 6]}
{'1180374,280000000027940'} {'538,126999999999953' } {[ 7]}
{'1180124,550999999977648'} {'568,719000000000051' } {[ 8]}
{'1179305,570999999996275'} {'626,528999999999996' } {[ 9]}
{'1177132,448000000091270'} {'705,443999999999960' } {[10]}
Any suggestion to quickly/efficiently convert this (mixed data type) cell array into a numerical array, possibly without using str2double() ?
Just for information: among my attempts for conversion, I found that str2double() is very slow and
>> cell2mat(a(:,2))
Error using cat
Dimensions of arrays being concatenated are not consistent.
Error in cell2mat (line 83)
m{n} = cat(1,c{:,n});
and
>> str2num(char(a(:,3)))
Error using char
Cell elements must be character arrays.
  2 件のコメント
Stephen23
Stephen23 2020 年 2 月 17 日
Your character vectors appear to use decimal commas, which are treated as thousands separators by str2double:
>> str2double('12,345')
ans =
12345
and will result in separate numbers if you use str2num:
>> str2num('12,345')
ans =
12 345
I doubt that either of those are very useful to you.
Sim
Sim 2020 年 2 月 17 日
Yes, you are right...
I replaced the commas with dots in other parts of my code with
strrep(C,',','.')
....And I should do it for this case too... :)

サインインしてコメントする。

採用された回答

Stephen23
Stephen23 2020 年 2 月 17 日
編集済み: Stephen23 2020 年 2 月 17 日
This is likely to be faster than str2double. Use a decimal point rather than a decimal comma for more speed.
>> C = {'1170790,889999999897555','1009,697999999999979',1;'1170260,709000000031665','869,886999999999944',2;'1181959,743999999947846','504,125999999999976',3;'1174019,564000000013039','674,142000000000053',4;'1173729,793000000063330','689,635999999999967',5;'1181555,257999999914318','509,391999999999996',6;'1180374,280000000027940','538,126999999999953',7;'1180124,550999999977648','568,719000000000051',8;'1179305,570999999996275','626,528999999999996',9;'1177132,448000000091270','705,443999999999960',10}
C =
'1170790,889999999897555' '1009,697999999999979' [ 1]
'1170260,709000000031665' '869,886999999999944' [ 2]
'1181959,743999999947846' '504,125999999999976' [ 3]
'1174019,564000000013039' '674,142000000000053' [ 4]
'1173729,793000000063330' '689,635999999999967' [ 5]
'1181555,257999999914318' '509,391999999999996' [ 6]
'1180374,280000000027940' '538,126999999999953' [ 7]
'1180124,550999999977648' '568,719000000000051' [ 8]
'1179305,570999999996275' '626,528999999999996' [ 9]
'1177132,448000000091270' '705,443999999999960' [10]
>> X = cellfun('isclass',C,'char'); % slower: cellfun(@ischar,C)
>> M = nan(size(C));
>> M(~X) = [C{~X}]; % numeric data
>> M(X) = sscanf(strrep(sprintf(' %s',C{X}),',','.'),'%f') % character data
M =
1170790.890 1009.698 1
1170260.709 869.887 2
1181959.744 504.126 3
1174019.564 674.142 4
1173729.793 689.636 5
1181555.258 509.392 6
1180374.280 538.127 7
1180124.551 568.719 8
1179305.571 626.529 9
1177132.448 705.444 10
If the intermediate character vector does not fit into memory then simply do the above for each column of C.
  3 件のコメント
Stephen23
Stephen23 2020 年 2 月 17 日
編集済み: Stephen23 2020 年 2 月 17 日
"is there any way to keep all the decimals of C in M? it looks like the output is rounded, as you showed:"
That is just how they are displayed. Do not confuse how numeric data are displayed with how numeric data are stored on a computer. Those values are converted and stored with complete double precision, which means around 15-16 decimal digits. All of that precision is stored even if it is not displayed (according to the format documentation the display routine does do rounding, but of course that makes absolutely no difference to the stored values).
Judging by the input character vectors, the original numeric data were also doubles, which means that you will get back the same numeric data. or at least something very close to it (the trailing digits after that precision are just trash anyway, they have no significance once stored as a double):
>> fprintf('%.42f\n',M(:,1))
1170790.889999999897554516792297363281250000000000
1170260.709000000031664967536926269531250000000000
1181959.743999999947845935821533203125000000000000
1174019.564000000013038516044616699218750000000000
1173729.793000000063329935073852539062500000000000
1181555.257999999914318323135375976562500000000000
1180374.280000000027939677238464355468750000000000
1180124.550999999977648258209228515625000000000000
1179305.570999999996274709701538085937500000000000
1177132.448000000091269612312316894531250000000000
You can also experiment with the format options to change how those values are displayed.
Sim
Sim 2020 年 2 月 17 日
Super! Thank you very much again!
(Yes, I have just changed the format, thanks!)

サインインしてコメントする。

その他の回答 (0 件)

カテゴリ

Help Center および File ExchangeNumeric Types についてさらに検索

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by