Replacing characters with integers in a very long string

Paolo Binetti on 17 Dec 2016
Commented: Star Strider on 18 Dec 2016
I have a string of a few million characters and want to replace it with a vector of integers according to simple rules, such as 'C' = -1 and so forth. My implementation works but takes forever and uses gigabytes of memory, mainly because of the str2num function, as far as I understand. Is there a more efficient way to do this?
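sequence = fileread('sourcefile.txt'); % Read the whole file as one char vector
sequence_num = strrep(sequence, 'A', '0 '); % 'A' -> 0
sequence_num = strrep(sequence_num,'C','-1 '); % 'C' -> -1
sequence_num = strrep(sequence_num,'G', '1 '); % 'G' -> 1
sequence_num = strrep(sequence_num,'T', '0 '); % 'T' -> 0
sequence_num = regexprep(sequence_num,'\r\n',''); % Remove Windows line breaks
sequence_num = str2num(sequence_num); % Parse the text into numbers (the suspected bottleneck)
sequence_num = int32(sequence_num); % Convert to int32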
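For readers following along, the same character-to-integer rule can be applied without building an intermediate text string at all. The sketch below uses a lookup table indexed by character code. It is only an illustration under stated assumptions: the short input string stands in for the contents of 'sourcefile.txt', the name lut is made up here, and the thread does not say that this is the approach that was finally used.

```matlab
% Hypothetical short input standing in for fileread('sourcefile.txt')
sequence = sprintf('ACGT\r\nGGCA\r\n');

% Keep only the four bases; this also drops the line breaks
sequence = sequence(ismember(sequence, 'ACGT'));

% Lookup table indexed by character code; every entry defaults to 0
lut = zeros(1, 256, 'int32');
lut(double('C')) = -1;   % 'C' -> -1
lut(double('G')) =  1;   % 'G' -> +1 ('A' and 'T' stay 0)

% Index the table with the character codes in one vectorized pass
sequence_num = lut(double(sequence));
```

Because the table is already int32, the result needs no further conversion, and no call to str2num is involved.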

Accepted Answer

Star Strider on 17 Dec 2016
I don’t know what structure ‘sequence’ has. I created it as a cell array here:
bases = {'A','C','T','G'}; % Cell Array
sequence = bases(randi(4, 1, 20)); % Create Data
skew = zeros(1, length(sequence)+1,'int32'); % Preallocate
Cix = find(ismember(sequence, 'C')); % Indices Of 'C'
Gix = find(ismember(sequence, 'G')); % Indices Of 'G'
skew(Cix+1) = -1; % Set Element After Each 'C' To -1
skew(Gix+1) = +1; % Set Element After Each 'G' To +1
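Since fileread returns a char vector rather than a cell array, the same indexing idea can be written with a direct character comparison. This is a minimal sketch, assuming the line breaks have already been removed; the short input string is hypothetical and the thread does not confirm that this exact form was used.

```matlab
% Hypothetical char-vector input (line breaks already removed)
sequence = 'ACGTGGCA';

skew = zeros(1, length(sequence)+1, 'int32');  % Preallocate, same layout as above
skew(find(sequence == 'C') + 1) = -1;          % Element after each 'C' becomes -1
skew(find(sequence == 'G') + 1) = +1;          % Element after each 'G' becomes +1
```

The comparison sequence == 'C' yields a logical vector in a single pass, so no string replacement or text parsing is needed.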
7 Comments
Paolo Binetti on 18 Dec 2016
Thank you @Star and @Jan. All in all, your help sped up my code about 700 times; it now takes 0.17 s for a bacterial genome. About 250 times of that is thanks to @Star's suggestions, and a further 3 times thanks to @Jan's final simplification.
Star Strider on 18 Dec 2016
Our pleasure!
It is always more gratifying to help with real-world research. We wish you well!
