Finding groups based on matching multiple values in a column

Question

0 投票

I'd like to find groups in my table based on the values of multiple columnns- but I'd like to have the groups allow multiple specific values in one of the columns.

For the "I'd like to find groups in my table based on the values of multiple columnns" part, I can do that:

>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
  6×4 table
        Store          Country        Manager       Revenue
    ______________    __________    ____________    _______
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123  
    {'Lowes'     }    {'China' }    {'Johnson' }      456  
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789  
    {'Menards'   }    {'France'}    {'Miller'  }     1010  
    {'Menards'   }    {'USA'   }    {'Williams'}     1112  
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314  
    
>> [G,~,idx]=unique(T(:,[1 2]),'stable') %want 'stable' option so not using findgroups
G =
  5×2 table
        Store          Country  
    ______________    __________
    {'Home Depot'}    {'USA'   }
    {'Lowes'     }    {'China' }
    {'Home Depot'}    {'Canada'}
    {'Menards'   }    {'France'}
    {'Menards'   }    {'USA'   }
idx =
     1
     2
     3
     4
     5
     3

But what I'd like to do is introduce an 'or' grouping rule to designate the region that Country is in. In this case, I want to group the Country by region as well. I can also do that... though not sure my converting to cell method is the best way, but it works.

     >> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
     1
     3
     5
     6

But ultimately, I want to do this at the same time- show groups that match both the Store as well as the Region. I am not sure the best way to go about this- my actual table is very large. I thought of creating a new varibale for Region and then match off that- is that the way to go or can I combine these into one sort? I was also considering looping methods, but not sure the most efficient way to proceed. My desired result would be:

  T =
  6×4 table
        Store          Country        Manager       Revenue
    ______________    __________    ____________    _______
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123  
    {'Lowes'     }    {'China' }    {'Johnson' }      456  
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789  
    {'Menards'   }    {'France'}    {'Miller'  }     1010  
    {'Menards'   }    {'USA'   }    {'Williams'}     1112  
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314     
    
    idx =
     1
     2
     1
     3
     4
     1
     
     

Thanks for any insight.

3 件のコメント
1 件の古いコメントを表示 1 件の古いコメントを非表示

Adam Danz 2021 年 2 月 6 日

Sounds like you want groupsummary as Ive J mentioned.

Marcus Glover 2021 年 2 月 6 日

編集済み: Marcus Glover 2021 年 2 月 6 日

MATLAB Online で開く

Thanks. Something like the binning in groupsummary may work- but I'll have to take a look and see how I can do that with strings.

My question is more of how to do this 'better' or 'correctly'. I only ask because I have a huge table and I am not the strongest coder, but really waiting 2 minutes instead of 2 seconds is not a major issue for me. Perhaps an example will make it more clear-the following does what I want but I think it is not efficient and is probably poor code. I feel like there may be a way to do it all at once instead of creating a new variable.

%starting Table
>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
  6×4 table
        Store          Country        Manager       Revenue
    ______________    __________    ____________    _______
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123  
    {'Lowes'     }    {'China' }    {'Johnson' }      456  
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789  
    {'Menards'   }    {'France'}    {'Miller'  }     1010  
    {'Menards'   }    {'USA'   }    {'Williams'}     1112  
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314  
    
 % adds new variable for the 'Region' which makes subgroups of column 2
 
 >> T.Region=repmat({''},size(T,1),1);
 
 T =
  6×5 table
        Store          Country        Manager       Revenue      Region  
    ______________    __________    ____________    _______    __________
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123      {0×0 char}
    {'Lowes'     }    {'China' }    {'Johnson' }      456      {0×0 char}
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789      {0×0 char}
    {'Menards'   }    {'France'}    {'Miller'  }     1010      {0×0 char}
    {'Menards'   }    {'USA'   }    {'Williams'}     1112      {0×0 char}
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314      {0×0 char}
    
    %Now I want to assign group values to the variable 'Region' based on value in 'Country'
    %This step I would like to roll in to one sort on columns 1 and 2 without
    % creating column 5, but maybe it is fine. My real table is very very large.
    
>> Europe=find(ismember(table2cell(T(:,2)),{'UK','France','Gremany'}))
Europe =
     4
     
>> T(Europe,5)={'Europe'};
     
>> Asia=find(ismember(table2cell(T(:,2)),{'China','Japan','SouthKorea'}))
Asia =
     2
     
     
>> T(Asia,5)={'Asia'}
     
     
 >> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
     1
     3
     5
     6
     
 >> T(NA,5)={'NA'};
 
 >> T
 
 T =
  6×5 table
        Store          Country        Manager       Revenue      Region  
    ______________    __________    ____________    _______    __________
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123      {'NA'    }
    {'Lowes'     }    {'China' }    {'Johnson' }      456      {'Asia'  }
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789      {'NA'    }
    {'Menards'   }    {'France'}    {'Miller'  }     1010      {'Europe'}
    {'Menards'   }    {'USA'   }    {'Williams'}     1112      {'NA'    }
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314      {'NA'    }
 
 % Now I find groups that are unique in columns 1 and 5
 
 
 >> [G,~,idx]=unique(T(:,[1 5]),'stable');
 
 >> idx
idx =
     1
     2
     1
     3
     4
     1
     
     

サインインしてコメントする。

サインインしてこの質問に回答する。

サインインしてアクティビティをフォロー

Answer 1

Adam Danz 2021 年 2 月 6 日

編集済み: Adam Danz 2021 年 2 月 6 日

MATLAB Online で開く

0 投票

I recommend adding the region column to your tables. I doubt it will have an impact on computation time relative to using a separate column and it will keep the data tidy.

I prefer to work with catgegories over string when working with categorical data.

T=table(categorical({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Menards'}),...
    categorical({'ISS';'Mexico';'Canada';'France';'USA';'Germany'}),...
    categorical({'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'}),...
    [123;456;789;1010;1112;1314],...
    'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'})  % modified example
T = 6x4 table
      Store       Country    Manager     Revenue
    __________    _______    ________    _______

    Home Depot    ISS        Smith         123  
    Lowes         Mexico     Johnson       456  
    Home Depot    Canada     Jones         789  
    Menards       France     Miller       1010  
    Menards       USA        Williams     1112  
    Menards       Germany    Brown        1314  

Add region column for regions (assuming you have a key already)

% Column 1 is region name
% Column 2 is a list of countries in the region
regions = {
    categorical({'North America'}), categorical({'Canada', 'USA', 'Mexico'});
    categorical({'Europe'}),        categorical({'France', 'Germany', 'Bulgaria'});
    };
T.Region = repmat(categorical({'NA'}), height(T),1); % Default is NA
for i = 1:size(regions,1); 
    T.Region(ismember(T.Country, regions{i,2})) = regions{i,1};
end
disp(T)
      Store       Country    Manager     Revenue       Region    
    __________    _______    ________    _______    _____________

    Home Depot    ISS        Smith         123      NA           
    Lowes         Mexico     Johnson       456      North America
    Home Depot    Canada     Jones         789      North America
    Menards       France     Miller       1010      Europe       
    Menards       USA        Williams     1112      North America
    Menards       Germany    Brown        1314      Europe       

Summary of Stores by region using

G = groupsummary(T,groupvars,method,datavars)

Tsummary = groupsummary(T,["Store","Region"],["sum","mean"], "Revenue")
Tsummary = 5x5 table
      Store          Region        GroupCount    sum_Revenue    mean_Revenue
    __________    _____________    __________    ___________    ____________

    Home Depot    NA                   1             123             123    
    Home Depot    North America        1             789             789    
    Lowes         North America        1             456             456    
    Menards       North America        1            1112            1112    
    Menards       Europe               2            2324            1162    

Time test

I replicated the demo table many times to create a single table with 184,320 rows and times this solution 1000 times using tic/toc (including the construction of the regions array but not including the disp()). The median of the distribution of durations was 0.07 seconds.

2 件のコメント
なしを表示なしを非表示

Marcus Glover 2021 年 2 月 6 日

編集済み: Marcus Glover 2021 年 2 月 6 日

Thank you, that is good advice. I am constructing the table from a SQL database using select. I believe I can use setoptions to make the appropriate variables categorical.

I was looking for the indices of the rows for each group, but it seems like I will be able to pass a function handle as the method in groupsummary and skip my intended plan to loop through each group of indices. That may or may not save me time but it certianly seems nicer.

Looking at your time test, I guess I was worried for nothing! Out of curiosity, what is the reson for the (admittedly small) distribution- just whatever also may be happening in the background of your computer?

Adam Danz 2021 年 2 月 6 日

> I believe I can use setoptions to make the appropriate variables categorical.

Perhaps (I haven't looked into it) but you could also convert from char to categorical the way I did.

> That may or may not save me time but it certianly seems nicer.

I think it will save you time.

> what is the reson for the (admittedly small) distribution

There will always be variation. Here are some reasons.

Random fluxuation
JIT compilation (see this answer)
Other stuff my computer is doing that may limit resources.
Other stuff I don't know about :D

サインインしてコメントする。

Finding groups based on matching multiple values in a column

3 件のコメント
1 件の古いコメントを表示 1 件の古いコメントを非表示

採用された回答

2 件のコメント
なしを表示なしを非表示

その他の回答 (0 件)

カテゴリ

タグ

Community Treasure Hunt

Finding groups based on matching multiple values in a column

3 件のコメント 1 件の古いコメントを表示 1 件の古いコメントを非表示

採用された回答

2 件のコメント なしを表示 なしを非表示

その他の回答 (0 件)

カテゴリ

タグ

参考

Community Treasure Hunt

3 件のコメント
1 件の古いコメントを表示 1 件の古いコメントを非表示

2 件のコメント
なしを表示なしを非表示