Why does the speed of histcounts vary so much?
6 ビュー (過去 30 日間)
古いコメントを表示
I have a vector with integer or floating point values and want to count the number of occurences.
histcounts() uses the lase element of the input Edge as upper limit for the last bin, so I have to append a number larger than the greatest element to count correctly. This slows down the processing by a factor of > 2, when the values are integers:
% Integer values:
x = round((1:1e6) / 1); % Equivalent for "/ 10" also
x = x(randperm(numel(x)));
ux = unique(x);
ux2 = [ux, Inf];
ux3 = [ux, 1e8]; % Any huge number
ux4 = [ux, ux(end) + 1];
timeit(@() histcounts(x, ux)) % 0.045
timeit(@() histcounts(x, ux2)) % 0.171 % Much slower!!!
timeit(@() histcounts(x, ux3)) % 0.199 % Slower than with Inf?
timeit(@() histcounts(x, ux4)) % 0.0787 % Fast again
% BinMethod=Integers is fast, but the number of bins is limited:
timeit(@() histcounts(x, 'BinMethod', 'Integers')) % 0.0059 Fast!
numel(histcounts(x, 'BinMethod', 'Integers')) % But just 10001 instead of 1e6
% According to the documentation, this should be 65536 ?!?
% Floating point values:
y = rand(1, 1e6);
uy = unique(y);
uy2 = [uy, Inf];
uy3 = [uy, uy(end) + 1];
timeit(@() histcounts(y, uy)) % 0.11148
timeit(@() histcounts(y, uy2)) % 0.2066
timeit(@() histcounts(y, uy3)) % 0.2100
% Local R2018b: similar behavior for integers, floating point values processed faster:
% 0.073 % Integer, ux
% 0.199 % [ux, Inf]
% 0.189 % [ux, 1e8]
% 0.061 % [ux, ux(end + 1)]
% 0.162 % Floating point, uy
% 0.167 % [uy, Inf]
% 0.315 % [uy, uy(end) + 1] % half speed!!!
I'd expect nearly the same speed, because it is almost the same amount of work.
By the way, a simple counting function is fast also:
timeit(@() simpleCount(x)) % 0.0602
timeit(@() simpleCount(y)) % 0.0628
function N = simpleCount(x)
S = sort(x);
q = [true, diff(S) ~= 0];
N = diff([find(q), numel(x) + 1]);
end
histcounts() with counting the last element to the previous bin is not useful, but attaching a final edge slows down the processing massively. I've sent an enhancement request to TWM already.
5 件のコメント
Adam Danz
2021 年 8 月 22 日
編集済み: Adam Danz
2021 年 8 月 22 日
Looks like groupcounts does not have this limitation
N = groupcounts((1:70000)');
fprintf('Numel: %d Last count: %d\n', numel(N), N(end))
And its timing for counting integers is similar to the [ux, ux(end) + 1] method but slower.
x = round((1:1e6) / 1);
x = x(randperm(numel(x)));
ux = [unique(x), inf];
ux2 = [unique(x), max(x)+1];
timeit(@() histcounts(x, ux))
timeit(@() histcounts(x, ux2))
timeit(@() groupcounts(x'))
isequal(histcounts(x,ux), histcounts(x, ux2), groupcounts(x')')
回答 (1 件)
Bruno Luong
2021 年 8 月 22 日
It's just an assumption but histcounts probably use binary search on finite edge values. The large last edge probably penalizes the speed. Not sure how Inf is processed with bonary search (possibly an extra comparison check?)
2 件のコメント
Bruno Luong
2021 年 8 月 22 日
"The BinMethod 'Integers' is useful for the case, when it can be guaranteed, that 2^16 bins are enough"
Just take your hand over and use accumarray instead.
参考
カテゴリ
Help Center および File Exchange で Logical についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!