use unstack to reshape table with dummy variable (edited: alternative crosstab method)

After I sort table according to a group variable, I like to put the group value as the 'header' column and put its members in the other columns in the same row.
Data for the sake of demostration:
C = {'q1', 'q1', 'q2', 'q3', 'q3', 'q3';
'apple', 'appl', 'banana', 'orange', 'orang', 'orange'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'});
[GroupsID, Groups] = findgroups(T.code);
unique_groupID = unique(GroupsID);
gT = table('Size', [10,4], 'VariableTypes', {'string', 'string', 'string', 'string'});
Method 1. (edited.) brutal for-loop that I don't like, and its result needs more processing to remove redundancy in each row.
for k=1:size(unique_groupID)
% extract group elements from 'fruit'
tmp = T.fruit(GroupsID==unique_groupID(k));
l = size(tmp',2);
gT(k,1) = {Groups(k)};
gT(k,2:l+1) = tmp';
end
rmmissing(gT, "MinNumMissing", 3)
ans = 3×4 table
Var1 Var2 Var3 Var4 ____ ________ _________ _________ "q1" "apple" "appl" <missing> "q2" "banana" <missing> <missing> "q3" "orange" "orang" "orange"
Method 2 using unstack
I created a dummy variable for this method in order to use unstack( ). The code is shorter but doesn't give me the result I want.
D = {'dm1', 'dm2', 'dm3', 'dm4', 'dm5', 'dm6';
'q1', 'q1', 'q2', 'q3', 'q3', 'q3';
'apple', 'appl', 'banana', 'orange', 'orang', 'orange'}';
T = cell2table(D, 'VariableNames', {'dummy', 'code', 'fruit'});
unstack(T, "fruit", "dummy")
ans = 3×7 table
code dm1 dm2 dm3 dm4 dm5 dm6 ______ __________ __________ __________ __________ __________ __________ {'q1'} {'apple' } {'appl' } {0×0 char} {0×0 char} {0×0 char} {0×0 char} {'q2'} {0×0 char} {0×0 char} {'banana'} {0×0 char} {0×0 char} {0×0 char} {'q3'} {0×0 char} {0×0 char} {0×0 char} {'orange'} {'orang' } {'orange'}
Edited. Method 3 using crosstab. This method works nicely, but I wish I don't have to use a for-loop. The result from this method is exactly what I want.
[tb,~,~,lbs] = crosstab(T.code, T.fruit);
For-loop to create the intended table:
m = size(tb,1);
header = lbs(1:m,1);
fruits = lbs(:,2);
gT = table('Size',[6,4], 'VariableTypes', {'string','string','string','string'});
for i=1:m
tmp = fruits(tb(i,:)>0)';
l = size(tmp,2);
gT(i,"Var1") = header(i);
gT(i, 2:l+1) = tmp;
end
rmmissing(gT, "MinNumMissing",4)
ans = 3×4 table
Var1 Var2 Var3 Var4 ____ ________ _________ _________ "q1" "apple" "appl" <missing> "q2" "banana" <missing> <missing> "q3" "orange" "orang" <missing>
Edited. After I post the above code, I thought about that Method 3 may be made leaner.
ix = find(tb>0);
[rows,cols]=ind2sub([3,4],ix);
% then for-loop through rows and cols to populate the final table.
% I still can't avoid for-loop.

2 件のコメント

the cyclist
the cyclist 2023 年 4 月 1 日
I'm confused about the organizing principle of the output you want. Is the following correct?
  • Currently, you have one row in your table for each code/fruit pair.
  • Instead, you one row for each code
  • The columns are "1st fruit", "2nd fruit", "3rd fruit"
  • In any row, you want the list of the fruits for that code
  • If the code doesn't have 3 fruits, have an empty entry in the table
Also, your Solution 1 and Solution 3 are different, but you say they are both correct. That's confusing to me.
If this is all correct, I'd be curious what your downstream step is. Your data are currently stored in what is known as tidy format, and that is almost always better for analysis.
Simon
Simon 2023 年 4 月 1 日
Thanks for your quick response. The result from Solution 1 needs more processing to remove redundancy in each row. Solution 3 has the correct result.
My real data have more than hundreds of thousands of rows. Its first column stores 'account codes', the second column, 'account definition', and the third column is financial numerical values. For example, '1XXXXX' is the code, and 'Assets' is the account definition. These two should have a perfect 1-1 relationship. However, owing to human factor, the actual entry for 'account definition' for a given account code might slightly vary. For example, 'Assets' may be keyed in as 'Assest', or 'Aset'.
The code's downstream step is for me to visually check if there is anything odd about the 'account codes'-'account definition'. There are only around a hundred of unique 'account codes', which is more managable to human inspection than the original super-tall table.

サインインしてコメントする。

 採用された回答

Stephen23
Stephen23 2023 年 4 月 1 日
編集済み: Stephen23 2023 年 4 月 27 日
Using UNSTACK is quite a neat solution because it will automatically pad different-length data to the same number of columns, adding "missing" values as required. This is otherwise fiddly to replicate. But to use UNSTACK, we need to add a variable that tells UNSTACK which columns to move the data into:
C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
code fruit ______ __________ {'q1'} {'apple' } {'q1'} {'appl' } {'q2'} {'banana'} {'q3'} {'orange'} {'q3'} {'orang' } {'q3'} {'orange'}
U = unique(T,'rows');
G = findgroups(U.code); % note1
F = @(n)(1:nnz(n==G)).'; % note1
U.count = cell2mat(arrayfun(F,unique(G),'uni',0)) % note1
U = 5×3 table
code fruit count ______ __________ _____ {'q1'} {'apple' } 1 {'q1'} {'appl' } 2 {'q2'} {'banana'} 1 {'q3'} {'orange'} 1 {'q3'} {'orang' } 2
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
code x1 x2 ______ __________ __________ {'q1'} {'apple' } {'appl' } {'q2'} {'banana'} {0×0 char} {'q3'} {'orange'} {'orang' }
note1: This just gives a unique index to each element of a group. Astonishingly there does not seem to be an easy inbuilt way to achieve this... does anyone have any tips?: e.g. [1,1,1,2,2,1] -> [1,2,3,1,2,4] .
EDIT: I found a neater way:
G = findgroups(U.code);
U.count = grouptransform(ones(size(G)),G,@cumsum);

11 件のコメント

Simon
Simon 2023 年 4 月 2 日
編集済み: Simon 2023 年 4 月 3 日
I appreciate very much with your codes, which are very informative, and I have learned several new things from. As for the challenge you mention about creating within-group indexing, I try to get to that with a for-loop. ....(edited. my codes are distraction here, so I edited it out.) Thanks again for your help!
Simon
Simon 2023 年 4 月 3 日
When I applied your code to my real data, it fails. My data have around 160,000 rows and unique 1700 groups. So far I found one group, whose within-group indices are not right. Normally, they should be [1;2;3], but your codes generate [1;2;1], resulting unstack error.
However, if I sortrows( U) first, the within-group indices are correct. Why is that? Is there a bug in arrayfun( )?
Stephen23
Stephen23 2023 年 4 月 3 日
編集済み: Stephen23 2023 年 4 月 3 日
"However, if I sortrows( U) first, the within-group indices are correct."
Try UNIQUE without the STABLE option:
U = unique(T,'rows');
Note that the algorithm that I gave assumes unique rows.
Thanks for pointing out the solution. The codes work fine now. Testing with the more unordered data below, I see why 'stable' option will cause the error. When U is not sorted, cell2mat(cell array of 'count') will put counts in wrong group.
C = {'q1', 'q3','q1', 'q2', 'q3', 'q3', 'q1', 'q3';
'apple', 'orange','appl', 'banana', 'orang','orange', 'apple', 'oranges'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'});
U = unique(T,'rows','stable');
G = findgroups(U.code);
F = @(n)(1:nnz(n==G))';
count = arrayfun(F,unique(G),'UniformOutput',false);
% count = {1;2};{1};{1;2;3} is correct.
{2×1 double} {[ 1]} {3×1 double}
U.count = cell2mat(count)
U = 6×3 table
code fruit count ______ ___________ _____ {'q1'} {'apple' } 1 {'q3'} {'orange' } 2 {'q1'} {'appl' } 1 {'q2'} {'banana' } 1 {'q3'} {'orang' } 2 {'q3'} {'oranges'} 3
% when U is not sorted and cell2mat( ) does
% what it's supposed to do, error occurs.
Stephen23
Stephen23 2023 年 4 月 4 日
@Simon: good point, I removed STABLE from my answer.
Here is another UNSTACK-based approach, generating the group indices using ACCUMARRAY:
C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
code fruit ______ __________ {'q1'} {'apple' } {'q1'} {'appl' } {'q2'} {'banana'} {'q3'} {'orange'} {'q3'} {'orang' } {'q3'} {'orange'}
U = unique(T,'rows');
G = findgroups(U.code);
F = @(a){cumsum(a)};
U.count = cell2mat(accumarray(G,ones(size(G)),[],F))
U = 5×3 table
code fruit count ______ __________ _____ {'q1'} {'appl' } 1 {'q1'} {'apple' } 2 {'q2'} {'banana'} 1 {'q3'} {'orang' } 1 {'q3'} {'orange'} 2
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
code x1 x2 ______ __________ __________ {'q1'} {'appl' } {'apple' } {'q2'} {'banana'} {0×0 char} {'q3'} {'orang' } {'orange'}
Simon
Simon 2023 年 4 月 6 日
That algorithm is exemplary for clean code! I tried accumary( ) before, but being the first time using this function, I couldn't figure out how to create a solution out of it. I would suggest you to wrap up the algorithm and codes you create for this question and load them up to File Exchange. I believe it will be beneficial recurrently to Matlab community. I personally made very enjoyable progress from going through your solutions.
Simon
Simon 2023 年 4 月 9 日
編集済み: Simon 2023 年 4 月 9 日
I have tried on my real data with the three algorithms, one by myself, the other twos by Stephen23, and organized them into three functions. So it would be more convenient for anyone to use them as practical solution or as learning materials.
The data is a table with about 160,000 rows and 1600 unique 'ifcode'. Each algorithm took elapsed time as
crosstab_forloop: 1.81 seconds
unstack_applyfun: 1.47 seconds
unstack_accumarray: 1.50 seconds.
Here are the three functions:
% T is a table, with one column called 'ifcode',
% the other column called 'account'.
function gT = crosstab_forloop(T)
[tb,~,~,lbs] = crosstab(T.ifcode, T.account);
m = size(tb,1);
header = lbs(1:m,1);
accounts = lbs(:,2);
gT = table('Size',[4000,4], 'VariableTypes', {'string','string','string','string'}, ...
'VariableNames',{'ifcode', 'x1', 'x2', 'x3'});
for i=1:m
tmp = accounts(tb(i,:)>0)';
l = size(tmp,2);
gT(i,"ifcode") = header(i);
gT(i, 2:l+1) = tmp;
end
gT = rmmissing(gT, "MinNumMissing",4);
end
function U = unstack_arrayfun(T)
U = unique(T, 'rows');
G = findgroups(U.ifcode);
F = @(n)(1:nnz(n==G))';
U.withindex = cell2mat(arrayfun(F,unique(G),'UniformOutput',false)); % note1
U = unstack(U,"account","withindex","VariableNamingRule","modify");
% note1: This just gives a unique index to each element of a group.
% algorithm credit belongs to Stephen23
end
function U = unstack_accumarray(T)
U = unique(T,'rows');
G = findgroups(U.ifcode);
F = @(a){cumsum(a)};
U.withindex = cell2mat(accumarray(G,ones(size(G)),[],F));
U = unstack(U,"account","withindex","VariableNamingRule","modify");
% algorithm credit belongs to Stephen23
end
Stephen23
Stephen23 2023 年 4 月 10 日
編集済み: Stephen23 2023 年 4 月 11 日
"So it would be more convenient for anyone to use them as practical solution or as learning materials.
Most likely the CELL2MAT slows them down... you never wrote that you needed a particularly fast approach, so I did not consider that in my code (instead aiming for "reasonably compact", which is what most users on this forum seem to want). For a "reasonably fast" approach try replacing CELL2MAT with a comma-separated list.
Something like this might be even faster:
C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
code fruit ______ __________ {'q1'} {'apple' } {'q1'} {'appl' } {'q2'} {'banana'} {'q3'} {'orange'} {'q3'} {'orang' } {'q3'} {'orange'}
[U,~,X] = unique(T.code);
for k = 1:numel(U)
V = unique(T{k==X,'fruit'});
U(k,2:1+numel(V)) = V;
end
W = cell2table(U)
W = 3×3 table
U1 U2 U3 ______ __________ ____________ {'q1'} {'appl' } {'apple' } {'q2'} {'banana'} {0×0 double} {'q3'} {'orang' } {'orange' }
Simon
Simon 2023 年 4 月 13 日
When run over my real data, this soulution took elassed time of only 3.67 seconds. Your verctorized codes have both elegance and speed superiority over this forloop algorithm. Probably, the unique( ) funttion in the for-loop runs slows thing down, or maybe it's the array segmental assignment (U(k,2:...) = V. In my experience, value assignment to a segment of array, cell, and table causes sluggishness.
Great thanks for it nevertheless.
Stephen23
Stephen23 2023 年 4 月 27 日
編集済み: Stephen23 2023 年 4 月 27 日
I thought of another approach based on GROUPTRANSFORM:
As mentioned in my answer, the desired transformation is [1,1,1,2,2,1] -> [1,2,3,1,2,4] .
G = [1;1,;1;2;2;1]; % must be column vector
Y = grouptransform(ones(size(G)),G,@cumsum)
Y = 6×1
1 2 3 1 2 4
Nice, it seems to work as we want. However in this case G luckily consists of integers 1..N. In all other cases we need to use e.g. FINDGROUPS first. Lets try it with the fake data that I used in my answer:
C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
code fruit ______ __________ {'q1'} {'apple' } {'q1'} {'appl' } {'q2'} {'banana'} {'q3'} {'orange'} {'q3'} {'orang' } {'q3'} {'orange'}
U = unique(T,'rows');
G = findgroups(U.code);
U.count = grouptransform(ones(size(G)),G,@cumsum)
U = 5×3 table
code fruit count ______ __________ _____ {'q1'} {'appl' } 1 {'q1'} {'apple' } 2 {'q2'} {'banana'} 1 {'q3'} {'orang' } 1 {'q3'} {'orange'} 2
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
code x1 x2 ______ __________ __________ {'q1'} {'appl' } {'apple' } {'q2'} {'banana'} {0×0 char} {'q3'} {'orang' } {'orange'}

サインインしてコメントする。

その他の回答 (1 件)

Peter Perkins
Peter Perkins 2023 年 4 月 5 日
I'm having trouble understanding the desireed output, but others have created what is essentially a crosstabulation of counts, so, new in R2023a
>> T = cell2table(C, 'VariableNames', {'code', 'fruit'});
>> pivot(T,Rows="code",Columns="fruit")
ans =
3×7 table
code appl apple banana orang orange oranges
______ ____ _____ ______ _____ ______ _______
{'q1'} 1 2 0 0 0 0
{'q2'} 0 0 1 0 0 0
{'q3'} 0 0 0 1 2 1
As cyclist points out, there are a bunch of empty bins, so the original "tidy" format may be more useful. To me, this looks like a case of "fruit ought to be categorical, and you ought to apply mergecats to clean up those typos/different spellings".

7 件のコメント

Simon
Simon 2023 年 4 月 6 日
編集済み: Simon 2023 年 4 月 6 日
The real data I work on have more than hundreds of unique 'codes', some of which have more than two or three 'definitions' (or 'fruit' in my testing data set). So the pivot( ) will create more than a thousand columns, which would be too wide for human eyes to use, so I need to lump them into 3 columns. In the end, I want to have table variables as {'code', 'defition1', 'definition2', 'definition3'}, and the rows will be like
'1100', 'account receivable', 'acct receivable', 'some serious error';
'2100', 'accounts payable', 'accounts pay net'';
...
In an ideal situaltion, there should be only one definition for only one code. However, human reporting errors occur. The purpose of the codes is for human inspector to see how seriouly the definition typos occur, and this would help to investigate data corruption at large.
I will try mergecat to clean up the typos. Thanks for suggesting pivot( ) and mergecats. I didn't know those functions before.
Simon
Simon 2023 年 4 月 6 日
@Peter Perkins I just play around with mergecats( ), but couldn't see how it could be useful. In the Method 3 above, I used forloop to squeeze the 'fruit' columns for each 'code' to the left. Could you suggest me the codes using mergecats( ) to achieve the same result? I would be very grateful for that because I really want to learn how to use mergecats( ) in this question or any method that does not rely on forloop (for the love of Matlab vectorization).
Peter Perkins
Peter Perkins 2023 年 4 月 6 日
編集済み: Peter Perkins 2023 年 4 月 6 日
I'm not saying that this is definitely better, just that it's worth considering. One of the purposes of categorical is to make it simpler to clean up data like this.
C = {'q1', 'q3','q1', 'q2', 'q3', 'q3', 'q1', 'q3';
'apple', 'orange','appl', 'banana', 'orang','orange', 'apple', 'oranges'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'})
T = 8×2 table
code fruit ______ ___________ {'q1'} {'apple' } {'q3'} {'orange' } {'q1'} {'appl' } {'q2'} {'banana' } {'q3'} {'orang' } {'q3'} {'orange' } {'q1'} {'apple' } {'q3'} {'oranges'}
T = convertvars(T,["code" "fruit"],"categorical")
T = 8×2 table
code fruit ____ _______ q1 apple q3 orange q1 appl q2 banana q3 orang q3 orange q1 apple q3 oranges
categories(T.fruit)
ans = 6×1 cell array
{'appl' } {'apple' } {'banana' } {'orang' } {'orange' } {'oranges'}
T.fruit = mergecats(T.fruit,["apple" "appl"]);
T.fruit = mergecats(T.fruit,["orange" "orang" "oranges"]);
T
T = 8×2 table
code fruit ____ ______ q1 apple q3 orange q1 apple q2 banana q3 orange q3 orange q1 apple q3 orange
categories(T.fruit)
ans = 3×1 cell array
{'apple' } {'banana'} {'orange'}
pivot(T,Rows="code",Columns="fruit")
ans = 3×4 table
code apple banana orange ____ _____ ______ ______ q1 3 0 0 q2 0 1 0 q3 0 0 4
Simon
Simon 2023 年 4 月 7 日
Thanks for the code. Unfortunately it won't work for my real data because there are more than hundred different 'fruits', each one of which could have various 'flavors' of entries, typos or outright wrong entries. I couldn't know what they might be in advance. That's why I used forloop in Method 3 to gather all 'flavors' for each fruit AFTER crosstabing code x fruit. Nevertheless, your codes have taught me something new. And I am glad Matlabe has been updated with new functionalities for data cleaning. (I used Pythan/Pandas for that. I couldn't get used to Panda's indexing syntax, so I switched to Matlab. But I must say that Panda functionalities for group-based data wrangling are quite well-rounded.)
Peter Perkins
Peter Perkins 2023 年 4 月 9 日
Simon, I suspect that it would work simply by creating a categorical and looking at its categories. Hundreds is not a lot of categories. Some people have millions.
Simon
Simon 2023 年 4 月 11 日
Hi Peter, I think I agree with you. I did have tried with categorical, like categorical(A) .* categorical(B). I had a huntch that it would work. Then the thought of using unstack( ) took hold of me, so I started pursuing that idea. I had used stack( ) and unstack( ) to solved other problems and really enjoyed using them. Then, I ran into a wall, so I posted the problem here. Thankfully, a very neat solution crafted from unstack( ) was soon provided by Stenpen23.
Simon
Simon 2023 年 5 月 13 日
| > EDIT: I found a neater way:
|> G = findgroups(U.code);
|> U.count = grouptransform(ones(size(G)),G,@cumsum);
Sorry about being late response. I was overwhelmed by things. This is indeed a very neat solution. I used to think grouptransform( ) is quite limited in its function. But when it is put to work over a dummy/extra variable, it could be quite versatile in problem solving.

サインインしてコメントする。

カテゴリ

ヘルプ センター および File ExchangeLoops and Conditional Statements についてさらに検索

製品

リリース

R2023a

質問済み:

2023 年 4 月 1 日

コメント済み:

2023 年 5 月 13 日

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by