Find and reduce a numeric array with identical columns

Dear Sir/Madam,
I would like to ask you the following question:
I have a data file like this
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
I would like re-group (or reduce) it with following conditions:
For any row, if column 1 AND column 3 of this row is identical with any column 1 AND column 3 of any other row. Then reduce to one new row with new value of column 2, this new value of column 2 is the sum of original values of column 2. Column 1 is kept the same, Column 4 is not important.
So, for above data, I expect to have the answer:
119 5 6832 0.20313 % 5+4-4=5
122 -48 6853 0.29688
125 -8 6853 0.4375 % -12+4=-8
121 12 6914 0.5625
119 48 6914 0.17188
What Matlab command to use? I would greatly appreciate it if you left your code and running output.
I am using MATLAB R2014a.
Thank you very much

3 件のコメント

Stephen23
Stephen23 2018 年 12 月 30 日
編集済み: Stephen23 2018 年 12 月 30 日
The order of the rows in your output is not clearly defined. What is the rule to get that order?
For example, both 121 and 122 each only occur once in the first column, but in the output matrix are listed neither in the sequence that they occur in the input matrix, nor in numeric order. How is this order supposed to be determined?
Image Analyst
Image Analyst 2018 年 12 月 30 日
I was wondering the same thing. Hopefully the order doesn't matter. I'm sure you could write the code afterwards in such a ways that it didn't matter.
John Smith
John Smith 2018 年 12 月 30 日
編集済み: John Smith 2018 年 12 月 30 日
In tmp (data file ), the order of the rows were randomly inputed by hand, no order at all.

サインインしてコメントする。

 採用された回答

Stephen23
Stephen23 2018 年 12 月 30 日

0 投票

>> [~,X,Y] = unique(tmp(:,[1,3]),'rows');
>> out = tmp(X,:);
>> out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
out =
119.000000 5.000000 6832.000000 0.023438
119.000000 48.000000 6914.000000 0.171880
121.000000 12.000000 6914.000000 0.562500
122.000000 -48.000000 6853.000000 0.296880
125.000000 -8.000000 6853.000000 0.437500

7 件のコメント

John Smith
John Smith 2018 年 12 月 30 日
編集済み: John Smith 2018 年 12 月 30 日
Thank you very much Stephen,
I need further the output file "out" to be re-rewriten as classified by the value of first column and column 3 of "out":
  1. If the values of the third column are not the same. then separated them: according to the value of third column:
out_new(:,:,1) =
119 5 6832 0.20313
0 0 0 0
out_new(:,:,2) =
122 -48 6853 0.29688
125 -8 6853 0.078125
out_new(:,:,3) =
119 48 6914 0.17188
121 12 6914 0.5625
2 then the output rewritten according to the value of first column and aligned all group according to the value of the first column.
3. The value of column 1 should be decreased order in value.
I expect the output file as the following format:
125 -8 6853 0.078125
122 -48 6853 0.29688
121 12 6914 0.5625
119 5 6832 0.20313 48 6914 0.17188
By the way, my code to obtain out_new is (may be not so good:)
length(out(:,1));
[C, ia, ic] = unique(out(:,3),'sorted');
L = length(C)
for j=1 : L
k=0;
for i=1 : length(out(:,1))
if C(j) == out(i,3)
k=k+1;
out_new(k,1,j) = out(i,1) ;
out_new(k,2,j) = out(i,2) ;
out_new(k,3,j) = out(i,3) ;
out_new(k,4,j) = out(i,4) ;
end
end
end
format shortG
out_new
Thank you very much
Image Analyst
Image Analyst 2018 年 12 月 31 日
Wow, this is starting to get very quirky and complicated. Can't you take it from here? Why do you need this super specialized format anyway? What's the use case?
John Smith
John Smith 2018 年 12 月 31 日
編集済み: John Smith 2019 年 1 月 1 日
Dear Sir/Madam,
Actually, I also need column 3 to be converted to date, then the requried output format is more meaningful: (column 4 is the costs of options)
out(:,3)
6832
6914
6914
6853
6853
datestr(out(:,3))
ans =
14-Sep-0018
05-Dec-0018
05-Dec-0018
05-Oct-0018
05-Oct-0018
then the requried output format should be:
125 -8 05-Oct-0018 0.078125
122 -48 05-Oct-0018 0.29688
121 12 05-Dec-0018 0.5625
119 5 14-Sep_0018 0.20313 48 05-Dec-0018 0.17188
Now you can see, the horizontal axis is time, the vertical axis (column 1) is strik-price of option.
In this way my option trading positions are:
e. g. for 119 strike price, I have two kinds of options: bought 5 options with expiration date 14-September, and bought 48 options with expirarion date 05-Dec.
for 122 strike price option, I have only one kind option: sold 48 options with expiration date 05-Oct.
That is why this kind of output format in important.
Thank you very much for your help.
Stephen23
Stephen23 2019 年 1 月 1 日
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
% Convert to character:
baz = num2cell(out(:,2:4)).';
baz(2,:) = cellstr(datestr(out(:,3)));
baz = regexp(sprintf('%3d %s %g\v',baz{:}),'[^\v]+','match');
% Arrange into matrix:
[U,~,X1] = unique(out(:,1));
[V,~,X3] = unique(out(:,3));
C = accumarray([X1,X3],baz,[],@(s)s);
L = cellfun('length',C(:));
C(L==0) = {''};
C = flipud([num2cell(U),C]).'
% Print:
fmt = sprintf('%%-%ds',2+max(L));
fmt = repmat(fmt,1,numel(V));
fmt = ['%4d',fmt,'\n'];
fprintf(fmt,C{:})
Prints this:
125 -8 05-Oct-0018 0.4375
122 -48 05-Oct-0018 0.29688
121 12 05-Dec-0018 0.5625
119 5 14-Sep-0018 0.023438 48 05-Dec-0018 0.17188
John Smith
John Smith 2019 年 1 月 1 日
編集済み: Image Analyst 2019 年 1 月 1 日
Dear Stephen,
Thank you very much for your help, and Happy NEW YEAR 2019. Running this:
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
% Convert to character:
baz = num2cell(out(:,2:4)).';
baz(2,:) = cellstr(datestr(out(:,3)));
baz = regexp(sprintf('%3d %s %g\v',baz{:}),'[^\v]+','match');
% Arrange into matrix:
[U,~,X1] = unique(out(:,1));
[V,~,X3] = unique(out(:,3));
C = accumarray([X1,X3],baz,[],@(s)s);
L = cellfun('length',C(:));
C(L==0) = {''};
C = flipud([num2cell(U),C]).'
% Print:
fmt = sprintf('%%-%ds',2+max(L));
fmt = repmat(fmt,1,numel(V));
fmt = ['%4d',fmt,'\n'];
fprintf(fmt,C{:})
I got this error:
Error using accumarray
Second input VAL must be a full numeric, logical, or char vector or scalar.
Error in main_ZN_Dec_7_2018_trade_4 (line 364)
C = accumarray([X1,X3],baz,[],@(s)s)
Stephen23
Stephen23 2019 年 1 月 1 日
編集済み: Stephen23 2019 年 1 月 1 日
Replace the line with accumarray with these three lines:
S = max([X1,X3]);
C = cell(S);
C(sub2ind(S,X1,X3)) = baz;
and a Happy New Year!
John Smith
John Smith 2019 年 1 月 2 日
編集済み: John Smith 2019 年 1 月 3 日
Dear Stephen,
By changing those three lines, you code works.
Back to the question in the very beginning, I said " Column 4 is not important". but now, I need to treat column 4 the same way as column 2:
tmp =
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438
out =
119 5 6832 0.265631 (%=0.20313+0.039063+0.023438)
119 48 6914 0.17188
121 12 6914 0.5625
122 -48 6853 0.29688
125 -8 6853 0.515625 (%=0.078125+0.4375)
how do you change your three line code:
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
I tried to modify the following way, it did not work:
out(:,2) = accumarray(Y,tmp(:,2), tmp(:,4), [],@sum)
However, when I use two lines (two time) then it worked:
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
out(:,4) = accumarray(Y,tmp(:,4),[],@sum)
Thank you very much.

サインインしてコメントする。

その他の回答 (1 件)

Image Analyst
Image Analyst 2018 年 12 月 30 日
編集済み: Image Analyst 2018 年 12 月 30 日

0 投票

What about using grpstats(), if you have the Statistics and Machine Learning Toolbox.
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
col5 = 10000*tmp(:, 1) + tmp(:, 3)
tmp = [tmp, col5];
% No sum in grpstats, so have to do it twice.
% Once to get the mean and once to get the count.
outputMean = grpstats(tmp, tmp(:, 5), 'mean')
outputNumel = grpstats(tmp, tmp(:, 5), 'numel')
% Crop off temporary 5th column
output = outputMean(:, 1:4) % Initialize
% Column 2 is the sum = mean * count
output(:, 2) = outputMean(:, 2) .* outputNumel(:, 2)
The output seems to be sorted by the first column though:
output =
119 5 6832 0.088544
119 48 6914 0.17188
121 12 6914 0.5625
122 -48 6853 0.29688
125 -8 6853 0.25781
That might be a problem for you. I'm not sure. Of course column 4 can be cropped off or ignored since you say it's not important.

カテゴリ

製品

リリース

R2014a

質問済み:

2018 年 12 月 30 日

編集済み:

2019 年 1 月 3 日

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by