Compute group statistics when the group definition is an OR of values found in several columns of a table

Question

O.Hubert 2023 年 9 月 28 日

0
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/2026769-compute-group-statistics-when-the-group-definition-is-an-or-of-values-found-in-several-columns-of-a

編集済み: O.Hubert 2023 年 9 月 29 日

Good morning,

I would like to compute group statistics when the definition of the group spans several columns which are not mutually exclusive. Think of the following example (the code corresponding to that is found below): a professor wants to compute the individual mean for 4 assignments of three students: Bob, Emma, and John. There is no restriction on who they can work with, so they can work in groups of 1, 2, or 3 for each assignment.

I have written a brute force solution below (3 versions). The reason I am unhappy with the best solution so far (v2) is that it is not easily scalable because I have to input the OR operator and the column number manually.

Additionally, I would like to match the assignment number to the student and have a table by Assignment Number, Student and their corresponding grade.

I wonder if there exists an OR condition for grpstats or similar functions. Or is there a way to easily scale/adapt v2 to accommodate more dynamic conditions? I was thinking of eval, but I know it should be avoided whenever possible.

% Mean per person when groups are not the same

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})

T = 6×5 table

Assignment Member1 Member2 Member3 Grade __________ ________ __________ __________ _____ 1 {'John'} {'Emma' } {0×0 char} 10 1 {'Bob' } {0×0 char} {0×0 char} 9 2 {'Emma'} {'Bob' } {'John' } 8 3 {'Emma'} {0×0 char} {0×0 char} 8 3 {'Bob' } {'John' } {0×0 char} 7 4 {'Bob' } {0×0 char} {0×0 char} 10

allnames=[T.Member1 ; T.Member2 ; T.Member3];

uniquenames=unique(allnames);

timespentv1=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

% Brute force

% allnames=[T.Member1 ; T.Member2 ; T.Member3];

% uniquenames=unique(allnames);

tic

for i=2:numel(uniquenames)

gradesforthatperson=[];

for j=1:3

personfound=ismember(T.(strcat('Member',num2str(j))),uniquenames{i});

gradesforthatperson=[gradesforthatperson ; T.Grade(personfound,1)];

end

averageforthatperson(i)=nanmean(gradesforthatperson);

end

timespentv1=[timespentv1 toc];

end

uniquenames'

ans = 1×4 cell array

{0×0 char} {'Bob'} {'Emma'} {'John'}

averageforthatperson

averageforthatperson = 1×4

0 8.5000 8.6667 8.3333

% v2

timespentv2=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

% Brute force v2

% allnames=[T.Member1 ; T.Member2 ; T.Member3];

% uniquenames=unique(allnames);

tic

for i=2:numel(uniquenames)

gradesforthatperson=[];

personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;

gradesforthatperson=[T.Grade(personfound,1)];

averageforthatperson(i)=nanmean(gradesforthatperson);

end

timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments

end

% v3

timespentv3=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

% Brute force v3

% allnames=[T.Member1 ; T.Member2 ; T.Member3];

% uniquenames=unique(allnames);

tic

for i=2:numel(uniquenames)

gradesforthatperson=[];

personfound= max(ismember([T.(strcat('Member',num2str(1))),T.(strcat('Member',num2str(2))),T.(strcat('Member',num2str(3)))],uniquenames{i}),[],2);

gradesforthatperson=[T.Grade(personfound,1)];

averageforthatperson(i)=nanmean(gradesforthatperson);

end

timespentv3=[timespentv3 toc];

end

figure;plot([timespentv1' timespentv2' timespentv3']);legend('v1','v2','v3','location','NorthWest');

title('Time required for each version');xlabel('Size table x100.000');ylabel('seconds');

Thank you in advance.

Best regards,

P.S.: I am using R2015b.

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

O.Hubert 2023 年 9 月 29 日

Accepted version: Bruno's v6.

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

Bruno Luong 2023 年 9 月 28 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/2026769-compute-group-statistics-when-the-group-definition-is-an-or-of-values-found-in-several-columns-of-a#answer_1320909

編集済み: Bruno Luong 2023 年 9 月 28 日

MATLAB Online で開く

I submit a method without loop. I expected to be faster but to my surprise it is slower than v2. But at least it is scalable.

% Mean per person when groups are not the same
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
    {'Emma','','Bob',''    ,'John',''   }', ...
    {''    ,'','John',''    ,'',''    }' ...
    ,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})
allnames=[T.Member1 ; T.Member2 ; T.Member3];
uniquenames=unique(allnames);
% v2
timespentv2=[];
for k=1:100000:1000000
    T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
        {'Emma','','Bob',''    ,'John',''   }', ...
        {''    ,'','John',''    ,'',''    }' ...
        ,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
    T=repmat(T,k,1);
    tic
    for i=2:numel(uniquenames)
        gradesforthatperson=[];
        personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;
        gradesforthatperson=[T.Grade(personfound,1)];
        averageforthatperson(i)=mean(gradesforthatperson,'omitnan');
    end
    timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments
end
averageforthatperson
% v4
timespentv4=[];
for k=1:100000:1000000
    T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
        {'Emma','','Bob',''    ,'John',''   }', ...
        {''    ,'','John',''    ,'',''    }' ...
        ,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
    T=repmat(T,k,1);
    [tf, membercol]=ismember({'Member1' 'Member2' 'Member3'}, T.Properties.VariableNames);
    membercol = membercol(tf);
    tic
    
    averageforthatperson = zeros(1,numel(uniquenames));
    i=2:numel(uniquenames);
    Tmembercol = T{:,membercol};
    [tf,loc] = ismember(Tmembercol,uniquenames(i));
    g=repmat(T.Grade,1,size(loc,2));
    averageforthatperson(i) = accumarray(loc(tf),g(tf))./accumarray(loc(tf),1);
    timespentv4=[timespentv4 toc]; % advantage is that it keeps the order of the assignments
end
averageforthatperson
figure;
plot([timespentv2; timespentv4]');
legend('v2','v4','location','NorthWest');
title('Time required for each version');
xlabel('Size table x100.000');
ylabel('seconds');

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

O.Hubert 2023 年 9 月 28 日

MATLAB Online で開く

Thank you Bruno for your answer.

I was surprised as well that your loopless method was not quicker than two loops.

In terms of scalability, I have to go against the MATLAB good practice and use eval. There is still a loop, but it takes advantage of the speed of the ismember function applied to a vector (as in v2) while ensuring scalability in the number of conditions.

I may turn this into a function that takes OR, NOT, and AND conditions.

Best,

Olivier.

% Mean per person when groups are not the same

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})

T = 6×5 table

Assignment Member1 Member2 Member3 Grade __________ ________ __________ __________ _____ 1 {'John'} {'Emma' } {0×0 char} 10 1 {'Bob' } {0×0 char} {0×0 char} 9 2 {'Emma'} {'Bob' } {'John' } 8 3 {'Emma'} {0×0 char} {0×0 char} 8 3 {'Bob' } {'John' } {0×0 char} 7 4 {'Bob' } {0×0 char} {0×0 char} 10

allnames=[T.Member1 ; T.Member2 ; T.Member3];

uniquenames=unique(allnames);

% v2

timespentv2=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

tic

for i=2:numel(uniquenames)

gradesforthatperson=[];

personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;

gradesforthatperson=[T.Grade(personfound,1)];

averageforthatperson(i)=mean(gradesforthatperson,'omitnan');

end

timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments

end

% v55 - eval

timespentv55=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

% Brute force v55

% allnames=[T.Member1 ; T.Member2 ; T.Member3];

% uniquenames=unique(allnames);

tic

sizecond=3;

tobeevaluated='';

for i=1:sizecond

tobeevaluated=strcat(tobeevaluated,{' '},'ismember(T.(',num2str(i),'),uniquenames{i})');

end

for i=2:numel(uniquenames)

gradesforthatperson=[];

eval(strcat('personfound= max([',tobeevaluated{1},'],[],2);'))

gradesforthatperson=[T.Grade(personfound,1)];

averageforthatperson(i)=nanmean(gradesforthatperson);

end

timespentv55=[timespentv55 toc]; % advantage is that it keeps the order of the assignments

end

figure;plot([timespentv2 ; timespentv55]');legend('v2','Eval','location','NorthWest');

title('Time required for each version');xlabel('Size table x100.000');ylabel('seconds');

Bruno Luong 2023 年 9 月 28 日

MATLAB Online で開く

You don't need EVAL, here is the v6 that is as fast as v2

% Mean per person when groups are not the same

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})

T = 6×5 table

Assignment Member1 Member2 Member3 Grade __________ ________ __________ __________ _____ 1 {'John'} {'Emma' } {0×0 char} 10 1 {'Bob' } {0×0 char} {0×0 char} 9 2 {'Emma'} {'Bob' } {'John' } 8 3 {'Emma'} {0×0 char} {0×0 char} 8 3 {'Bob' } {'John' } {0×0 char} 7 4 {'Bob' } {0×0 char} {0×0 char} 10

allnames=[T.Member1 ; T.Member2 ; T.Member3];

uniquenames=unique(allnames);

% v2

timespentv2=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

tic

for i=2:numel(uniquenames)

gradesforthatperson=[];

personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;

gradesforthatperson=[T.Grade(personfound,1)];

averageforthatperson(i)=mean(gradesforthatperson,'omitnan');

end

timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments

end

averageforthatperson

averageforthatperson = 1×4

0 8.5000 8.6667 8.3333

% v6

timespentv6=[];

for k=1:100000:1000000

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...

{'Emma','','Bob','' ,'John','' }', ...

{'' ,'','John','' ,'','' }' ...

,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});

T=repmat(T,k,1);

tic

for i=2:numel(uniquenames)

personfound = false;

for j=1:3

personfound = personfound | ismember(T.(sprintf('Member%d',j)),uniquenames{i});

end

gradesforthatperson=[T.Grade(personfound,1)];

averageforthatperson(i)=mean(gradesforthatperson,'omitnan');

end

timespentv6=[timespentv6 toc]; % advantage is that it keeps the order of the assignments

end

averageforthatperson

averageforthatperson = 1×4

0 8.5000 8.6667 8.3333

figure;

plot([timespentv2; timespentv6]');

legend('v2','v6','location','NorthWest');

title('Time required for each version');

xlabel('Size table x100.000');

ylabel('seconds');

O.Hubert 2023 年 9 月 29 日

Great! I didn't think of sprintf instead of eval, and the looping of the OR condition is very neat. It also allows to have other types of conditions (AND, NOT, etc) in a similar way.

I'll accept your answer.

Thanks again.

サインインしてコメントする。

Answer 2

Jeff Miller 2023 年 9 月 29 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/2026769-compute-group-statistics-when-the-group-definition-is-an-or-of-values-found-in-several-columns-of-a#answer_1321419

MATLAB Online で開く

Maybe reformat the table with stack:

T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
    {'Emma','','Bob',''    ,'John',''   }', ...
    {''    ,'','John',''    ,'',''    }' ...
    ,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})
T2 = stack(T,{'Member1','Member2','Member3'},'NewDataVariableName','Student')
tblstats = grpstats(T2,'Student','mean','DataVars','Grade')

to get

T =
  6×5 table
    Assignment    Member1      Member2       Member3      Grade
    __________    ________    __________    __________    _____
        1         {'John'}    {'Emma'  }    {0×0 char}     10  
        1         {'Bob' }    {0×0 char}    {0×0 char}      9  
        2         {'Emma'}    {'Bob'   }    {'John'  }      8  
        3         {'Emma'}    {0×0 char}    {0×0 char}      8  
        3         {'Bob' }    {'John'  }    {0×0 char}      7  
        4         {'Bob' }    {0×0 char}    {0×0 char}     10  
T2 =
  18×4 table
    Assignment    Grade    Student_Indicator     Student  
    __________    _____    _________________    __________
        1          10           Member1         {'John'  }
        1          10           Member2         {'Emma'  }
        1          10           Member3         {0×0 char}
        1           9           Member1         {'Bob'   }
        1           9           Member2         {0×0 char}
        1           9           Member3         {0×0 char}
        2           8           Member1         {'Emma'  }
        2           8           Member2         {'Bob'   }
        2           8           Member3         {'John'  }
        3           8           Member1         {'Emma'  }
        3           8           Member2         {0×0 char}
        3           8           Member3         {0×0 char}
        3           7           Member1         {'Bob'   }
        3           7           Member2         {'John'  }
        3           7           Member3         {0×0 char}
        4          10           Member1         {'Bob'   }
        4          10           Member2         {0×0 char}
        4          10           Member3         {0×0 char}
tblstats =
  3×3 table
            Student     GroupCount    mean_Grade
            ________    __________    __________
    John    {'John'}        3           8.3333  
    Emma    {'Emma'}        3           8.6667  
    Bob     {'Bob' }        4              8.5  
>> 

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

O.Hubert 2023 年 9 月 29 日

編集済み: O.Hubert 2023 年 9 月 29 日

Thank you Jeff for the loopless and very short code. I did not know about the stack function.

It does what I want and stays within the table environment. However, this is painfully slow (we are talking about 15-20x slower than the looped version.) Nevertheless, it has the advantage to be transparent. I may prefer that solution if time is not important.

Thanks again for the suggestion.

サインインしてコメントする。

Compute group statistics when the group definition is an OR of values found in several columns of a table

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

採用された回答

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

その他の回答 (1 件)

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

Compute group statistics when the group definition is an OR of values found in several columns of a table

1 件のコメント -1 件の古いコメントを表示-1 件の古いコメントを非表示

採用された回答

3 件のコメント 1 件の古いコメントを表示1 件の古いコメントを非表示

その他の回答 (1 件)

1 件のコメント -1 件の古いコメントを表示-1 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示