Find common 4-letter substrings in a list of strings

Mihaela Mihailescu on 15 Feb 2023
Edited: dpb on 16 Feb 2023
How do I find ANY common 4-letter pattern (substring) among the strings in the ID column of the table below, and create a separate table for each common pattern found?
Note that I only have a license for 2018a, so I cannot use newer functions such as 'lettersPattern'.
I have attached an excerpt of the .csv file.
Thanks!
  6 Comments
Walter Roberson on 15 Feb 2023
So, because of lines 6 and 7, you would like the output to include GNNR, NNRP, NRPY, RPYI, and so on? Should any 4-character substring that also occurs anywhere else be reported, and if two entries have 5 characters in a row in common, does that count as two 4-character patterns?
Or will the user ask to search for a particular set of 4 characters, with the twist that if it is found only once you report "not found", but if it occurs at least twice you locate all occurrences?
Stephen23 on 15 Feb 2023
"How do I find ANY (4-letter) common pattern (substring) among all strings in the ID column in the table below."
There are no 4-letter substrings that occur in all of the strings in the ID column.
Or do you really mean something like "...that occur in two or more of the strings in the ID column"?


Accepted Answer

dpb on 15 Feb 2023
Edited: dpb on 16 Feb 2023
There may be something more clever, but the "dead-ahead" way, looking within one string at a time, is something like this:
tAMP=readtable('AMPdb_short.csv');   % bring the data in...
SUBSTRLEN=4;                         % the substring length
L=SUBSTRLEN-1;
for j=1:height(tAMP)                 % iterate over all strings
  S=tAMP.ID{j};                      % convenient temporary
  for i=1:length(S)-L                % keep in bounds of string
    s=S(i:i+L);                      % the ith substring in the string
    ix=strfind(S,s);                 % find all locations of s in this string
    if numel(ix)>1                   % if it occurs more than once, display...do whatever you wish here
      fprintf(['%3d' '%5s' repmat('%3d',1,numel(ix)) '\n'],i,s,ix)
    end
  end
end
That finds all the repeats within each string. To search across all strings, wrap it in another layer so that the ith substring of the jth string is compared not only against that string but also against all the others in the collection. That is left as an "exercise for the student"...
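One possible way to approach that outer layer, as a rough sketch only and not necessarily what the answer had in mind: collect every 4-letter substring in a containers.Map (available in 2018a) keyed by the substring, with the row numbers in which it occurs as the value. The variable IDs below is a made-up stand-in for tAMP.ID, and the names hits and patterns are purely illustrative.

```matlab
% Sketch: which 4-letter substrings occur in two or more strings?
% IDs is made-up sample data standing in for tAMP.ID (cell array of char).
IDs = {'GNNRPVYI'; 'AGNNRPY'; 'KKLLKK'};
SUBSTRLEN = 4;
L = SUBSTRLEN-1;

hits = containers.Map('KeyType','char','ValueType','any');  % substring -> row numbers
for j = 1:numel(IDs)
    S = IDs{j};
    if length(S) < SUBSTRLEN, continue, end   % too short to hold any substring
    subs = cell(length(S)-L,1);
    for i = 1:length(S)-L
        subs{i} = S(i:i+L);                   % the ith substring of string j
    end
    subs = unique(subs);                      % count each pattern once per string
    for k = 1:numel(subs)
        key = subs{k};
        if isKey(hits,key)
            hits(key) = [hits(key) j];        % seen before: append this row
        else
            hits(key) = j;                    % first time this pattern is seen
        end
    end
end

% report only the patterns shared by at least two strings
patterns = keys(hits);
for k = 1:numel(patterns)
    rows = hits(patterns{k});
    if numel(rows) > 1
        fprintf('%s  rows:%s\n', patterns{k}, sprintf(' %d',rows));
    end
end
```

With this sample data, GNNR and NNRP are the patterns shared by rows 1 and 2. Against the real table, indexing with the stored row numbers, for example tAMP(rows,:), should give the sub-table for one pattern.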
For the longest string in the sample dataset (60 characters), the inner loop above, run locally over that string alone, produced:
>> for i=1:length(S)-3
     s=S(i:i+3);
     ix=strfind(S,s);
     if numel(ix)>1,fprintf(['%3d' '%5s' repmat('%3d',1,numel(ix)) '\n'],i,s,ix),end
   end
  4 RPRP  4 12 14
 11 PRPR 11 13
 12 RPRP  4 12 14
 13 PRPR 11 13
 14 RPRP  4 12 14
 15 PRPL 15 29 43 57
 16 RPLP 16 30 44
 17 PLPF 17 31 45
 18 LPFP 18 32 46
...
 50 RPGP 22 36 50
 51 PGPR 23 37 51
 52 GPRP 24 38 52
 53 PRPI 25 39 53
 54 RPIP 26 40 54
 55 PIPR 27 41 55
 56 IPRP 28 42 56
 57 PRPL 15 29 43 57
>>
The above, of course, reports the same pattern more than once, so the actual set of patterns is the unique set of those. Let's see:
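smatch={};                                 % collect the repeated substrings here
for i=1:length(S)-3
  s=S(i:i+3);
  ix=strfind(S,s);
  if numel(ix)>1,smatch=[smatch;s];end     % keep s if it occurs more than once
end
smatch=unique(smatch)                      % drop the duplicate entries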
smatch =
16×1 cell array
{'FPRP'}
{'GPRP'}
{'IPRP'}
{'LPFP'}
{'PFPR'}
{'PGPR'}
{'PIPR'}
{'PLPF'}
{'PRPG'}
{'PRPI'}
{'PRPL'}
{'PRPR'}
{'RPGP'}
{'RPIP'}
{'RPLP'}
{'RPRP'}
>>
You can be a little more clever by keeping the running array of matches during the loop and checking whether the next substring is already in it; if so, skip ahead instead of searching again for the same pattern.
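A minimal sketch of that idea, assuming the intent is to skip the strfind call for a pattern that has already been recorded. The string S here is a made-up example rather than one from the dataset, and the check uses strcmp against the running cell array smatch.

```matlab
% Sketch: skip substrings that have already been recorded.
% S is a made-up example string, not one from the dataset.
S = 'ARPRPKRPRPL';
smatch = {};                        % repeated substrings found so far
for i = 1:length(S)-3
    s = S(i:i+3);
    if any(strcmp(s, smatch))       % already recorded, no need to search again
        continue
    end
    if numel(strfind(S,s)) > 1      % s occurs more than once in S
        smatch{end+1,1} = s;
    end
end
smatch = sort(smatch)               % same ordering that unique would give
```

Because each pattern is added at most once, the final call to unique is no longer needed; sort is used only to keep the alphabetical ordering.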
  1 Comment
Mihaela Mihailescu on 15 Feb 2023
Wow! That really works - thanks!
