How to read specific lines from a text file and store them in an array?

1 回表示 (過去 30 日間)
Rasif Ajwad
Rasif Ajwad 2015 年 10 月 20 日
コメント済み: Rasif Ajwad 2015 年 10 月 20 日
I have a text file containing an Multiple Sequence Alignment (MSA) which has protein sequences stored in it. The contents of the file is like this:
>gi|73961569|ref|XP_547536.2| osteocalcin [C. lupus familiaris]
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGL
GAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>gi|27806301|ref|NP_776674.1| osteocalcin preproprotein
MRTPMLLALLALAT--LCLAGRADAKPGDAESGK-GAAFVSKQEGSEVVKRLRRYLDHWL
GAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV-
From this file I just want to extract the lines containing the actual sequences (ones NOT starting with '>' symbol) and store them in an array for future use. One thing to mention is that line 2 and line 3 is one single sequence, so I also need to make them a single string and store it in one single position of an array. How can I do that?
I wanted to use 'fileread' but it reads all the file at a time, so it's not helpful.
  3 件のコメント
TastyPastry
TastyPastry 2015 年 10 月 20 日
編集済み: TastyPastry 2015 年 10 月 20 日
To clarify, are your sequences supposed to be formatted like this, where there are two separate sequences starting with >gi?
>gi|73961569|ref|XP_547536.2| osteocalcin [C. lupus familiaris]
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGL
GAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>gi|27806301|ref|NP_776674.1| osteocalcin preproprotein
MRTPMLLALLALAT--LCLAGRADAKPGDAESGK-GAAFVSKQEGSEVVKRLRRYLDHWL
GAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV-
Rasif Ajwad
Rasif Ajwad 2015 年 10 月 20 日
Yes. sequences will start with '>gi', but the actual sequence is starting from the next line: 'MRSLM...'

サインインしてコメントする。

回答 (1 件)

per isakson
per isakson 2015 年 10 月 20 日
Try
>> out = cssm
out =
[1x104 char] [1x104 char] [1x104 char] [1x104 char] [1x104 char] [1x104 char]
>> out{3}
ans =
MRSLMVLALLAVAALCLCLAGPADAKPSSAESRKGGATFVSKREGSEVVRRLRRYLDSGLGAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV-
>>
where
function out = cssm
str = fileread( 'cssm.txt' );
cac = regexp( str, '(?<=>gi[^\n]+\n).+?(?=\n>gi|$)', 'match' );
out = cell(1,length(cac));
for jj = 1 : length( cac )
out{jj} = regexprep( cac{jj}, '\n', '' );
end
end
and cssm.txt contains three copies of the string of your question.

カテゴリ

Help Center および File ExchangeCharacters and Strings についてさらに検索

タグ

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by