Read text file lines and analyze

1 回表示 (過去 30 日間)
Lmm3
Lmm3 2017 年 7 月 24 日
回答済み: OCDER 2017 年 9 月 9 日
I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:
>Rosalind_4949
ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA
>Rosalind_7490
AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT
>Rosalind_8337
CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA
I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.
I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.
fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
templine = fgetl(fid);
a = strcmp(templine, '>');
if a == 0
G = length(strfind(templine,'G'));
C = length(strfind(templine,'C'));
z = length(templine);
%Per = (G+C)*100/z
end
end
Per = (G+C)*100/z

採用された回答

Lmm3
Lmm3 2017 年 9 月 9 日
The following code is what I used to read from the data file and determine %GC:
fid = fopen('rosalind_gc.txt');
n = 1;
G = 0;
C = 0;
z = 1;
while ~feof(fid)
templine = fgetl(fid);
a = strfind(templine, '>');
TF = isempty(a);
if TF == 1;
n= n+1;
G(1) = 0;
C(1) = 0;
z(1) = 0;
G(n) = length(strfind(templine,'G'));
C(n) = length(strfind(templine,'C'));
z(n) = length(templine);
G(n) = G(n) + G(n-1);
C(n) = C(n) + C(n-1);
z(n) = z(n) + z(n-1);
continue
% Per(n) = (G(n)+C(n))*100/z(n)
else TF == 0 ;
Per = (G(end)+C(end))*100/z(end)
disp(templine)
G(:,:) = [];
C(:,:) = [];
z (:,:)=[];
continue
end
end
Per =(G(end)+C(end))*100/z(end)

その他の回答 (2 件)

KSSV
KSSV 2017 年 7 月 24 日
編集済み: KSSV 2017 年 7 月 24 日
Let data.txt be your text file...You can count the number of G in your file as below:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
N = N+length(strfind(S{i}, 'G'));
end
Without loop :
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;
N = sum(cellfun(@numel,Ni)) ;
  1 件のコメント
Lmm3
Lmm3 2017 年 7 月 25 日
KSSV thank you for your response. Could you explain to me what the line S = S{1} is doing? The code returns the total number of "G" occurrences for the data file, but do you have a suggestion how to get the "G" occurrences between each of the headers that begin with ">Rosalind"? For example, in the data set above, I would like to get 3 values, the number of G occurrences between (“>Rosalind_4949” and “>Rosalind_7490”) between (“>Rosalind_7490” and “>Rosalind_8337”) and G occurrences below (">Rosalind_8337).

サインインしてコメントする。


OCDER
OCDER 2017 年 9 月 9 日
If you deal with a lot of fasta files, look into fastaread (Matlab Bioinformatics Toolbox) or readFasta (a code I made for another project).
Also, cellfun and regexp become pretty handy tools.
To get GC %:
[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
48.5876
45.0000
55.1724

カテゴリ

Help Center および File ExchangeCell Arrays についてさらに検索

タグ

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by