Replacing characters with integers in a very long string

5 ビュー (過去 30 日間)
Paolo Binetti
Paolo Binetti 2016 年 12 月 17 日
コメント済み: Star Strider 2016 年 12 月 18 日
I have a string of a few millions characters, want to replace it with a vector of integers according to simple rules, such as 'C' = -1 and so forth. My implementation works but takes forever and uses gigabytes of memory, in particular due to the str2num function, to my understanding. Is there a way to go more efficiently?
sequence = fileread('sourcefile.txt');
sequence_num = strrep(sequence, 'A', '0 ');
sequence_num = strrep(sequence_num,'C','-1 ');
sequence_num = strrep(sequence_num,'G', '1 ');
sequence_num = strrep(sequence_num,'T', '0 ');
sequence_num = regexprep(sequence_num,'\r\n','');
sequence_num = str2num(sequence_num);
sequence_num = int32(sequence_num);

採用された回答

Star Strider
Star Strider 2016 年 12 月 17 日
I don’t know what structure ‘sequence’ has. I created it as a cell array here:
bases = {'A','C','T','G'}; % Cell Array
sequence = bases(randi(4, 1, 20)); % Create Data
skew = zeros(1, length(sequence)+1,'int32'); % Preallocate
Cix = find(ismember(sequence, 'C')); % Logical Vector
Gix = find(ismember(sequence, 'G')); % Logical Vector
skew(Cix+1) = -1; % Replace With Integer
skew(Gix+1) = +1; % Replace With Integer
  7 件のコメント
Paolo Binetti
Paolo Binetti 2016 年 12 月 18 日
Thank you @Star and @Jan. All in your help sped up my code 700x times, now 0.17 s for a bacterium genome. About 250 times thanks to @Star suggestions, and 3 more times thanks to @Jan final simplification.
Star Strider
Star Strider 2016 年 12 月 18 日
Our pleasure!
It is always more gratifying to help with real-world research. We wish you well!

サインインしてコメントする。

その他の回答 (0 件)

カテゴリ

Help Center および File ExchangeLow-Level File I/O についてさらに検索

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by