Finding the repeated substrings

Reshma Ravi on 1 Jun 2017
Answered: Steven Lord on 14 Aug 2019
I have a DNA sequence, AAGTCAAGTCAATCG, which I split into overlapping substrings of length 4: AAGT, AGTC, GTCA, TCAA, CAAG, AAGT, and so on. I then need to find the repeated substrings and their frequency counts. Here AAGT is repeated twice, so I want to get AAGT - 2. How can I do this?
  2 Comments
Stephen23 on 1 Jun 2017
See Andrei Bobrov's answer for an efficient solution.
Andrei Bobrov on 2 Jun 2017
Thank you Stephen!


Accepted Answer

KSSV on 1 Jun 2017
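str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'};
% For each unique substring, find the positions where it occurs in str
idx = cellfun(@(x) find(strcmp(str, x)), unique(str), 'UniformOutput', false);
L = cellfun(@numel, idx);   % number of occurrences of each unique substring
Ridx = find(L > 1);         % unique substrings that occur more than once
for i = 1:numel(Ridx)
    st = str(idx{Ridx(i)}); % index with Ridx(i), not the whole vector Ridx
    fprintf('%s string repeated %d times\n', st{1}, numel(idx{Ridx(i)}))
end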
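The same counts can also be obtained without calling strcmp once per unique substring. The following is a minimal sketch, not part of the original answer, that uses the third output of unique together with accumarray; the variable names u, c, isRepeated, repeated and repeatedCounts are chosen here for illustration only.

```matlab
% Substrings taken from the question
str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'};

% u: sorted unique substrings; c: for each element of str, its index into u
[u, ~, c] = unique(str);

% Count how many times each unique substring occurs
counts = accumarray(c(:), 1);

% Keep only the substrings that occur more than once
isRepeated = counts > 1;
repeated = u(isRepeated);
repeatedCounts = counts(isRepeated);

% Print in the "AAGT - 2" form requested in the question
for k = 1:numel(repeated)
    fprintf('%s - %d\n', repeated{k}, repeatedCounts(k));
end
```

This needs no toolbox and makes a single pass over the data, which may matter when the list of substrings is long.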

More Answers (2)

Andrei Bobrov on 1 Jun 2017
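A = 'AAGTCAAGTCAATCG';
% Each row of B is one overlapping 4-character window of A
B = hankel(A(1:end-3),A(end-3:end));
% a: unique windows in order of first appearance; c: row-to-unique index
[a,~,c] = unique(B,'rows','stable');
% Tabulate each unique window with its number of occurrences
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});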
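For readers who find the hankel call opaque, here is a sketch of the same sliding-window idea written with explicit indexing. It is an illustration under assumptions rather than a replacement for the answer above: the window length k and the index matrix idx are names introduced here, and the implicit expansion in the idx line assumes R2016b or later.

```matlab
A = 'AAGTCAAGTCAATCG';
k = 4;                          % window length (assumed parameter name)
n = numel(A) - k + 1;           % number of overlapping windows

% Row i of idx holds the positions i, i+1, ..., i+k-1
idx = (1:n).' + (0:k-1);        % implicit expansion, assumes R2016b or later
B = A(idx);                     % n-by-k char matrix, one window per row

% Unique windows in order of first appearance, with their counts
[a, ~, c] = unique(B, 'rows', 'stable');
counts = accumarray(c, 1);
```

Changing k is then enough to count repeated substrings of a different length.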
  5 Comments
Stephen23 on 26 Aug 2018
tabulate requires the Statistics and Machine Learning Toolbox, which not everyone has.
Ivan Savelyev on 14 Aug 2019
Hi.
I have a question. Sometimes I get ladder-like results (nested sequences) like this:
AAAAAAAAA, which will be counted (with frame size 3) as 6 AAAA sequences, which is not correct in some cases (the same goes for ATATATA-type sequences). Is there a solution or an algorithm to filter nested repeats?
Thanks a lot.
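One possible way to approach this, offered only as a sketch: if "nested repeats" is taken to mean occurrences of the same window that overlap each other (an assumption, since the comment does not define it), each window can be counted greedily from left to right, skipping any occurrence that overlaps the previously counted one. The names k, starts and lastEnd are introduced here for illustration.

```matlab
A = 'AAAAAAAAA';
k = 4;                                   % window length (assumed)
n = numel(A) - k + 1;
B = A((1:n).' + (0:k-1));                % all overlapping windows, one per row
[a, ~, c] = unique(B, 'rows', 'stable');

counts = zeros(size(a, 1), 1);
for j = 1:size(a, 1)
    starts = find(c == j).';             % start positions of window j, ascending
    lastEnd = 0;                         % end position of the last counted match
    for s = starts
        if s > lastEnd                   % skip matches that overlap the last counted one
            counts(j) = counts(j) + 1;
            lastEnd = s + k - 1;
        end
    end
end
```

With this input the six overlapping AAAA windows are reduced to two non-overlapping ones (positions 1 to 4 and 5 to 8). Whether greedy non-overlapping counting is the right notion of "not nested" depends on the application.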



Steven Lord on 14 Aug 2019
For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
1×6 categorical array
AAGT AGTC GTCA TCAA CAAG AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
2 1 1 1 1
uniquevalues =
1×5 cell array
{'AAGT'} {'AGTC'} {'CAAG'} {'GTCA'} {'TCAA'}
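To go from these outputs to the "AAGT - 2" result asked for in the question, the counts can be used as a logical mask. This is a small sketch building on the histcounts call above; repeatedValues and repeatedCounts are names introduced here.

```matlab
C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'});
[counts, uniquevalues] = histcounts(C);

% Keep only the categories that occur more than once
repeatedValues = uniquevalues(counts > 1);
repeatedCounts = counts(counts > 1);

for k = 1:numel(repeatedValues)
    fprintf('%s - %d\n', repeatedValues{k}, repeatedCounts(k));
end
```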
