Merging Table with Duplicate Dates

Question

Tommaso Belluzzo 2017 年 3 月 10 日

0
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/329213-merging-table-with-duplicate-dates

コメント済み: dpb 2017 年 3 月 13 日

Hi all! I'm writing this post in the hope that someone can help me out with a little problem that requires a good solution. I have a table with the following structure:

column1 = datenum | column2 = country | column3 = value1 | column4 = value2

Let's say I load the following dataset:

1	736561	'USA'		2752	251
2	736561	'USA'		184	53
3	736561	'USA'		40	0
4	736572	'England'	1	0
5	736573	'USA'		1	0
6	736575	'USA'		1	0
7	736576	'England'	1	0
8	736577	'USA'		2	0
9	736580	'USA'		1	1
10	736581	'USA'		1	0
11	736582	'USA'		1	0
12	736599	'USA'		1	0
13	736619	'USA'		5	0
14	736619	France'		1	1
15	736683	'USA'		1	0

Now, what I need to to is to merge together the rows with the same date. As you can see, this is the case for the rows in the intervals 1:3 and 13:14. I have to do this following a few simple criteria:

if the country in the duplicate rows is always the same, the final row should still show that country, otherwise it must show "Multiple";
value1 and value2 of the final row must be the sum of value1 and value2 of the duplicate rows. Following those criteria, the table in the above example should become:

1	736561	'USA'		2976	304
2	736572	'England'	1	0
3	736573	'USA'		1	0
4	736575	'USA'		1	0
5	736576	'England'	1	0
6	736577	'USA'		2	0
7	736580	'USA'		1	1
8	736581	'USA'		1	0
9	736582	'USA'		1	0
10	736599	'USA'		1	0
11	736619	'Multiple'	6	1
12	736683	'USA'		1	0

Thanks for your help!

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

Kelly Kearney 2017 年 3 月 10 日

2
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/329213-merging-table-with-duplicate-dates#answer_258218

MATLAB Online で開く

The accumarray function is designed for this sort of problem:

% Your data
data = {...
736561  'USA'    2752  251
736561  'USA'    184  53
736561  'USA'    40  0
736572  'England'  1  0
736573  'USA'    1  0
736575  'USA'    1  0
736576  'England'  1  0
736577  'USA'    2  0
736580  'USA'    1  1
736581  'USA'    1  0
736582  'USA'    1  0
736599  'USA'    1  0
736619  'USA'    5  0
736619  'France'        1  1
736683  'USA'    1  0};
data = cell2table(data, 'VariableNames', {'date', 'country', 'value1', 'value2'});
% Summing the values is the default of accumarray
[unqdate, ~, idx] = unique(data.date);
val1 = accumarray(idx, data.value1);
val2 = accumarray(idx, data.value2);
% Accumarray is very picky about its inputs, so analyzing a cell array
% requires a few extra steps...
[unqcountry, ~, cidx] = unique(data.country);
unqcountry = [unqcountry; 'Multiple'];
ctmp = accumarray(idx, cidx, [max(idx) 1], @(x) {unique(x)});
hasmult = cellfun(@(x) length(x) > 1, ctmp);
ctmp{hasmult} = max(cidx)+1;
ctmp = cat(1, ctmp{:});
newdata = table(unqdate, unqcountry(ctmp), val1, val2)

And the results:

newdata = 
       unqdate         Var2       val1    val2
      __________    __________    ____    ____
3656e+05    'USA'         2976    304 
3657e+05    'England'        1      0 
3657e+05    'USA'            1      0 
3658e+05    'USA'            1      0 
3658e+05    'England'        1      0 
3658e+05    'USA'            2      0 
3658e+05    'USA'            1      1 
3658e+05    'USA'            1      0 
3658e+05    'USA'            1      0 
       7.366e+05    'USA'            1      0 
3662e+05    'Multiple'       6      1 
3668e+05    'USA'            1      0

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

Answer 2

Peter Perkins 2017 年 3 月 10 日

2
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/329213-merging-table-with-duplicate-dates#answer_258243

MATLAB Online で開く

varfun with grouping variables does this in one line:

>> c = { 1 736561 'USA'  2752 251
         ...
        15 736683 'USA'  1 0};
>> t = cell2table(c,'VariableNames',{'ID' 'Date' 'Country' 'X' 'Y'});
>> t.Date = datetime(t.Date,'ConvertFrom','datenum','Format','dd-MMM-yyyy')
t =
  15×5 table
    ID       Date         Country      X       Y 
    __    ___________    _________    ____    ___
     1    19-Aug-2016    'USA'        2752    251
     2    19-Aug-2016    'USA'         184     53
     3    19-Aug-2016    'USA'          40      0
     4    30-Aug-2016    'England'       1      0
     5    31-Aug-2016    'USA'           1      0
     6    02-Sep-2016    'USA'           1      0
     7    03-Sep-2016    'England'       1      0
     8    04-Sep-2016    'USA'           2      0
     9    07-Sep-2016    'USA'           1      1
    10    08-Sep-2016    'USA'           1      0
    11    09-Sep-2016    'USA'           1      0
    12    26-Sep-2016    'USA'           1      0
    13    16-Oct-2016    'USA'           5      0
    14    16-Oct-2016    'France'        1      1
    15    19-Dec-2016    'USA'           1      0
>> tSum = varfun(@sum,t,'GroupingVariables',{'Date' 'Country'},'InputVariables',{'X' 'Y'})
tSum =
  13×5 table
       Date         Country     GroupCount    sum_X    sum_Y
    ___________    _________    __________    _____    _____
    19-Aug-2016    'USA'        3             2976     304  
    30-Aug-2016    'England'    1                1       0  
    31-Aug-2016    'USA'        1                1       0  
    02-Sep-2016    'USA'        1                1       0  
    03-Sep-2016    'England'    1                1       0  
    04-Sep-2016    'USA'        1                2       0  
    07-Sep-2016    'USA'        1                1       1  
    08-Sep-2016    'USA'        1                1       0  
    09-Sep-2016    'USA'        1                1       0  
    26-Sep-2016    'USA'        1                1       0  
    16-Oct-2016    'France'     1                1       1  
    16-Oct-2016    'USA'        1                5       0  
    19-Dec-2016    'USA'        1                1       0

That uses datetimes, not datenums, which is not crucial, but if you have R2014b or later, you're better off with datetimes. If you have R2016b or later, you could also use timetables. Actually, in 16b, there's a limitation where you cannot use both Date and Country as grouping variables, so you'd have to convert to a table. But in the recently-released R2017a, you can:

>> tt = table2timetable(t(:,2:end));
>> tSum = varfun(@sum,tt,'GroupingVariables',{'Date' 'Country'},'InputVariables',{'X' 'Y'})
tSum =
  13×4 timetable
        Date        Country     GroupCount    sum_X    sum_Y
    ___________    _________    __________    _____    _____
    19-Aug-2016    'USA'        3             2976     304  
    30-Aug-2016    'England'    1                1       0  
    31-Aug-2016    'USA'        1                1       0  
    02-Sep-2016    'USA'        1                1       0  
    03-Sep-2016    'England'    1                1       0  
    04-Sep-2016    'USA'        1                2       0  
    07-Sep-2016    'USA'        1                1       1  
    08-Sep-2016    'USA'        1                1       0  
    09-Sep-2016    'USA'        1                1       0  
    26-Sep-2016    'USA'        1                1       0  
    16-Oct-2016    'France'     1                1       1  
    16-Oct-2016    'USA'        1                5       0  
    19-Dec-2016    'USA'        1                1       0

You can also aggregate data in a timetable using retime, but varfun is the place to go to aggregate with respect to both time and other grouping variables.

Hope this helps.

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

Peter Perkins 2017 年 3 月 13 日

編集済み: Peter Perkins 2017 年 3 月 13 日

MATLAB Online で開く

Thanks, I hadn't noticed that. I think I'd probably still call varfun but group only on time, and also apply another function to the country names.

t = cell2table(c,'VariableNames',{'ID' 'Date' 'Country' 'X' 'Y'});
t.Date = datetime(t.Date,'Format','dd-MMM-yyyy');
t.Country = categorical(t.Country);
tSum1 = varfun(@sum,t,'GroupingVariables','Date','InputVariables',{'X' 'Y'});
tSum2 = varfun(@countryOrMultiple,t,'GroupingVariables','Date','InputVariables','Country');
tsum = [tSum2 tSum1(:,3:end)]
tsum =
  12×5 table
       Date        GroupCount    countryOrMultiple_Country    sum_X    sum_Y
    ___________    __________    _________________________    _____    _____
    19-Aug-2016    3             USA                          2976     304  
    30-Aug-2016    1             England                         1       0  
    31-Aug-2016    1             USA                             1       0  
    02-Sep-2016    1             USA                             1       0  
    03-Sep-2016    1             England                         1       0  
    04-Sep-2016    1             USA                             2       0  
    07-Sep-2016    1             USA                             1       1  
    08-Sep-2016    1             USA                             1       0  
    09-Sep-2016    1             USA                             1       0  
    26-Sep-2016    1             USA                             1       0  
    16-Oct-2016    2             Multiple                        6       1  
    19-Dec-2016    1             USA                             1       0

Given that I've made Country a categorical, here's a function to deal with 'Multiple':

function uc = countryOrMultiple(c)
% return either the scalar input categorical value, or the scalar
% categorical value 'Multiple'
uc = unique(c);
if ~isscalar(uc)
    uc = uc(1); % keep all the existing categories
    uc(1) = 'Multiple';
end
end

And because the grouping is on time alone, you could use either varfun (or retime, for that matter) on a timetable, in R2016b or later.

dpb 2017 年 3 月 13 日

That's an alternative, Peter; and good reminder that the function handle can be anything w/ the proper input/output arguments. I was thinking of an 'onerror' alternate function or value that could possibly cope with the case in one go as an enhancement.

Good practice for table, unfortunately to date I'm limited to R2014b owing to hardware limitations (32-bit) and haven't had opportunity to work with the class to any great extent. Converting to categorical is good; don't think of them too much yet, either...

サインインしてコメントする。

Answer 3

dpb 2017 年 3 月 10 日

1
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/329213-merging-table-with-duplicate-dates#answer_258225

編集済み: dpb 2017 年 3 月 11 日

MATLAB Online で開く

Almost same solution as above...the only real significant difference is that strcmp is cell-aware so don't need cellfun to process the country names.

d=readtable('tomm.dat','readvariablenames',0,'delimiter',' ');   % get the data
[u,ia,ib]=unique(d.Var1);       % unique dates, locations in both length arrays
[n,ix]=histc(ib,1:length(ia));  % count the number of each group to see who's duplicated
for i=find(n>1).'               % for those that are duplicates
  if ~all(strcmp(d.Var2(i==ix),d.Var2(i)))   % if not all same country
    d.Var2(i==ix)={'Multiple'};              % set country value to 'Multiple'
  end
end

Now with preparatory work out of way, build the output table. Rows wanted from existing are the unique from the ia index vector for the date/country as modified for country plus the same accumarray results for the data--

d=[d(ia,1:2) array2table(accumarray(ib,d.Var3),'variablenames',{'Var3'}) ...
             array2table(accumarray(ib,d.Var4),'variablenames',{'Var4'})];
>> d
d = 
     Var1          Var2       Var3    Var4
  __________    __________    ____    ____
  7.3656e+05    'USA'         2976    304 
  7.3657e+05    'England'        1      0 
  7.3657e+05    'USA'            1      0 
  7.3658e+05    'USA'            1      0 
  7.3658e+05    'England'        1      0 
  7.3658e+05    'USA'            2      0 
  7.3658e+05    'USA'            1      1 
  7.3658e+05    'USA'            1      0 
  7.3658e+05    'USA'            1      0 
   7.366e+05    'USA'            1      0 
  7.3662e+05    'Multiple'       6      1 
  7.3668e+05    'USA'            1      0 
>>

Seems like ought to be able to just append the arrays to the selected subsets of the table, but the casting function was only way seemed to work; seems like more work than what should be necessary; maybe that's an enhancement request for silent conversion, who knows...

ADDENDUM

I'm still pretty green with the table class, it is somewhat easier following Kelly's lead using table on the arrays jointly rather than individually--still have to force names though, it seems to not clash.

d=[d(ia,1:2) table(accumarray(ib,d.Var3),accumarray(ib,d.Var4),'variablenames',{'Var3','Var4'})]

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

Merging Table with Duplicate Dates

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

採用された回答

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

その他の回答 (2 件)

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

Community Treasure Hunt

Merging Table with Duplicate Dates

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

採用された回答

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

その他の回答 (2 件)

3 件のコメント 1 件の古いコメントを表示1 件の古いコメントを非表示

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

Community Treasure Hunt

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示