Vectorizing multiple string comparison

Paolo Binetti on 26 Jan 2017
Commented: Paolo Binetti on 28 Jan 2017
Is there a way to significantly speed up this loop, perhaps by vectorizing it? Inputs in attachment. I do not have a Matlab version with "string" functions.
d = a';
for i = 1:numel(a)
d{i} = c(strcmp(a{i}, b), :);
end
I tried working my way out from the inner part with cellfun, but either I am not getting it right or it is not the right approach:
aux = cellfun(@strcmp, a, b); % does not work
2 Comments
Walter Roberson on 27 Jan 2017
That file is an Octave file that would take a bunch of work to read in MATLAB.
This is the wrong resource to be asking about performance improvement for Octave.
Paolo Binetti on 27 Jan 2017
You are right. R2016 does not run on the PC I mostly use, an old beast that still works perfectly, but on XP. So until I buy a new computer, I am stuck with either a much older version of Matlab or Octave, which does run on XP. I could have generated the input with my older Matlab. And your answer below gives me one more motivation to buy a new computer soon!


Accepted Answer

Guillaume on 26 Jan 2017
One obvious minor speed-up is to get rid of the find that serves absolutely no purpose. You can directly use the logical vector returned by strcmp:
d{i} = c(strcmp(a{i}, b), :);
For some reason, I cannot load your mat file. I'm going to assume that a is a cell array of strings, and so is b (otherwise the loop would not be needed). Assuming that there are no repeated strings in b:
assert(numel(unique(b)) == numel(b), 'This code does not work when there are duplicate values in b');
d = cell(size(a))';
[isfound, loc] = ismember(a, b);
d(isfound) = num2cell(c(loc(isfound), :), 2);  % wrap each matched row of c in its own cell
If it's guaranteed that all elements of a are found in b, then you can simplify even further to:
assert(numel(unique(b)) == numel(b), 'This code does not work when there are duplicate values in b');
[isfound, loc] = ismember(a, b);
assert(all(isfound), 'The next line only works if all elements of a are in b');
d = num2cell(c(loc, :), 2);
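As a small illustration of the ismember approach, here is a minimal sketch on made-up inputs (the values of a, b and c below are hypothetical, not the ones from the attachment). It assumes b has no duplicates and that row k of c belongs to b{k}. The matched rows are wrapped with num2cell so that they can be assigned into the cells of d:

```matlab
% Hypothetical inputs: b has no duplicates, row k of c belongs to b{k}
a = {'AAG' 'CTA' 'XXX'};
b = {'CTA' 'AAG' 'GAT'};
c = ['TAA'; 'AGA'; 'ATT'];

[isfound, loc] = ismember(a, b);   % isfound = [1 1 0], loc = [2 1 0]
d = cell(numel(a), 1);             % unmatched entries stay as []
d(isfound) = num2cell(c(loc(isfound), :), 2);
```

Note that an element of a with no match ends up as [] here, whereas the original strcmp loop would give a 0-by-3 empty char array.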
2 Comments
Paolo Binetti on 27 Jan 2017
Edited: Paolo Binetti on 27 Jan 2017
Thank you.
  • Good on you for getting rid of "find". I have edited the question accordingly.
  • I am sorry that you could not download my inputs.mat file. I have uploaded it again, I have tested it and it seems to work for me.
  • Nevertheless, all of your assumptions were right, except that b does actually contain repeated strings, unfortunately (if it did not, the "intersect" function would make it possible to vectorize the loop).
Guillaume on 27 Jan 2017
Edited: Guillaume on 27 Jan 2017
According to Walter, your mat file is an octave file that matlab can't open.
If there are duplicate values in b, then you don't have a choice but to use a loop, either explicitly as you have done or with cellfun:
d = cellfun(@(aa) c(strcmp(aa, b), :), a, 'UniformOutput', false);
It's very possible that the cellfun may be slower than the explicit loop (due to the anonymous function call).
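Whether cellfun is actually slower is easy to check on your own machine. A rough timing sketch, assuming the small sample data posted elsewhere in this thread and an arbitrary repeat count nRep (the names nRep, dLoop, dFun, tLoop and tFun are made up for this example); the numbers will depend on the version and on the size of the real inputs:

```matlab
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];
nRep = 1000;   % arbitrary repeat count, only there to get a measurable time

tic
for k = 1:nRep
    dLoop = cell(numel(a), 1);
    for i = 1:numel(a)
        dLoop{i} = c(strcmp(a{i}, b), :);   % explicit loop
    end
end
tLoop = toc;

tic
for k = 1:nRep
    % same work through cellfun and an anonymous function
    dFun = cellfun(@(aa) c(strcmp(aa, b), :), a, 'UniformOutput', false);
end
tFun = toc;

fprintf('loop: %.4f s, cellfun: %.4f s\n', tLoop, tFun);
```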
edit: in matlab R2016b there is an extremely easy way to vectorise the string comparison, using the new string class:
string(a) == string(b)'
but you'd still need a loop or cellfun afterward to create the d cell array:
d = cellfun(@(r) c(r, :), num2cell(string(a) == string(b)', 1), 'UniformOutput', false)
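For versions without the string class (older MATLAB, or Octave), a similar logical comparison matrix can probably be built from ismember and bsxfun instead. This is a different technique from the one above, shown only as a sketch; it assumes a contains no duplicates, and the names tf, idx and match are introduced here for illustration:

```matlab
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];

[tf, idx] = ismember(b, a);                 % idx(k): position of b{k} in a, 0 if absent
match = bsxfun(@eq, idx(:), 1:numel(a));    % match(k, j) is true when b{k} equals a{j}

% One cell per column of match, i.e. per element of a
d = cellfun(@(r) c(r, :), num2cell(match, 1), 'UniformOutput', false);
```

The string comparison itself happens once, inside ismember; the cellfun only does logical indexing into c.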


More Answers (1)

Walter Roberson on 27 Jan 2017
ismember can be used between cell arrays of strings. The two-output version can be used to find the indices, which you can then use to index into c.
3 Comments
Walter Roberson on 27 Jan 2017
Flip the order around, ismember(b, a) .
Paolo Binetti on 28 Jan 2017
I had a feeling I was missing an obvious point. Thank you for pointing it out! The modified code, below, runs much faster. I tried to vectorize the remainder of the loop, to no avail, but at least the costly string comparison is out of the loop.
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];
[temp, idx] = ismember(b, a);
d = a';
for i = 1:numel(a)
d{i} = c(i == idx, :);
end
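The remainder of the loop can probably be vectorized as well. A minimal sketch on the sample data above: sort the indices returned by ismember so that the rows of c belonging to the same element of a become contiguous, then cut them into blocks with mat2cell. The names tf, rows, grp, order and counts are introduced here for illustration. It relies on sort keeping equal values in their original order, so the rows inside each cell should come out in the same order as in the loop:

```matlab
a = { 'AAG' 'AGA' 'ATT' 'CTA' 'CTC' 'GAT' 'TAA' 'TCT' 'TTC' };
b = { 'AAG' 'AGA' 'GAT' 'ATT' 'TTC' 'TCT' 'CTC' 'TCT' 'CTA' 'TAA' 'AAG' };
c = [ 'AGA';'GAT';'ATT';'TTC';'TCT';'CTC';'TCT';'CTA';'TAA';'AAG';'AGA' ];

[tf, idx] = ismember(b, a);       % idx(k): position of b{k} in a, 0 if absent
rows = find(tf);                  % skip entries of b that do not occur in a
[grp, order] = sort(idx(rows));   % group the rows of c by element of a
counts = accumarray(grp(:), 1, [numel(a) 1]);   % rows of c per element of a

% Split the reordered rows of c into one block per element of a
d = mat2cell(c(rows(order), :), counts, size(c, 2));
```

With this data every element of a occurs in b, so all counts are at least 1. If some element of a had no match, its count would be 0; I would expect mat2cell to return an empty block for it, but that is worth checking on your version.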
