String subsequence tools

Version 1.1.0.0 (4.97 KB) by John D'Errico
Identify common substrings of a pair of strings
Downloads: 1.2K
Updated 1 Jun 2010


Zhiping XU's submission on the File Exchange (FEX) intrigued me, and I knew the search had to be doable more efficiently. Long strings are common in practice, so efficient code matters. You might find these tools useful for inspecting strings of DNA bases, or for checking a student's homework submission for plagiarized content. Surely there are other uses too.

The commonsubstring.m function does this search fairly efficiently (though I am sure it too can be enhanced).

Generate a pair of long random letter sequences, then determine the longest common substring between them. In the following example, each string has 10^5 random elements.

bases = 'acgt';
str1 = bases(ceil(rand(1,100000)*4));
str2 = bases(ceil(rand(1,100000)*4));

tic,[substr,ind1,ind2] = commonsubstring(str1,str2);toc
Elapsed time is 16.650532 seconds.

Two substrings of the maximum length (16 characters) were found. They start at locations 22189 and 74425 in str1, and at locations 64948 and 32833, respectively, in str2.

substr =
gctttagggcgtacgc
cttcggataccttgtt

ind1 =
[22189]
[74425]

ind2 =
[64948]
[32833]
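
To make the idea of a longest common substring concrete, here is a minimal brute-force sketch on two short strings. It uses a plain dynamic-programming table and is not the algorithm inside commonsubstring.m; the variable names (s1, s2, L, start1, start2) are illustrative assumptions. The table needs numel(s1)*numel(s2) entries, so this approach is only practical for short strings, not for the 10^5-element sequences above.

```matlab
% Sketch only: brute-force longest common substring via dynamic programming.
% This is NOT the method used by commonsubstring.m.
s1 = 'xabcdy';
s2 = 'zzbcdw';

n1 = numel(s1);
n2 = numel(s2);

% L(i+1,j+1) = length of the common run that ends at s1(i) and s2(j)
L = zeros(n1 + 1, n2 + 1);
for i = 1:n1
    for j = 1:n2
        if s1(i) == s2(j)
            L(i + 1, j + 1) = L(i, j) + 1;
        end
    end
end

maxlen = max(L(:));                  % length of the longest common substring
[iend, jend] = find(L == maxlen);    % table rows/columns where maximal runs end
start1 = iend - maxlen;              % start index in s1 (table is offset by 1)
start2 = jend - maxlen;              % start index in s2
substr = s1(start1(1):start1(1) + maxlen - 1);
```

For these inputs the single longest shared run is 'bcd', starting at index 3 in both strings. If the strings share no characters, maxlen is 0 and the start indices are meaningless, so a real function would need to guard against that case.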

For a second example, commonsubstring can find all common substrings of a given fixed length.

str1 = char('a' + round(rand(1,100)*1.5))
str1 =
bbbabbbbbabbbbbbabbbbbabbababbbabbabbbbbbabbbbbbbbbbbbbabbaaabbbbbaabbbbbbbbbbabbbbbaaabbabbaabbbbbb

str2 = 'aaabbabbb';

[substr,ind1,ind2] = commonsubstring(str1,str2,3)
substr =
aaa
aab
abb
bab
bba
bbb

ind1 =
[1x2 double]
[1x4 double]
[1x15 double]
[1x11 double]
[1x15 double]
[1x19 double]

ind2 =
[ 1]
[ 2]
[1x2 double]
[ 5]
[ 4]
[ 7]
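
The fixed-length search can be illustrated with built-in functions only. The sketch below builds every length-k window of each string, intersects the two sets, and then collects the starting locations of each shared substring. It is an assumption-laden illustration, not the code in commonsubstring.m; the names w1, w2, common, loc1 and loc2 are hypothetical. bsxfun is used so that the sketch also runs on older releases.

```matlab
% Sketch only: distinct length-k substrings shared by two strings,
% with the starting locations of each occurrence.
% Not the implementation in commonsubstring.m.
s1 = 'bbabbba';
s2 = 'aaabbabbb';
k = 3;

% Row r of each window matrix holds the substring that starts at index r
idx1 = bsxfun(@plus, (1:(numel(s1) - k + 1)).', 0:(k - 1));
idx2 = bsxfun(@plus, (1:(numel(s2) - k + 1)).', 0:(k - 1));
w1 = s1(idx1);
w2 = s2(idx2);

% Sorted, distinct substrings that occur in both strings
common = intersect(w1, w2, 'rows');

% Starting locations of every occurrence of each shared substring
loc1 = cell(size(common, 1), 1);
loc2 = cell(size(common, 1), 1);
for r = 1:size(common, 1)
    loc1{r} = strfind(s1, common(r, :));
    loc2{r} = strfind(s2, common(r, :));
end
```

Here common contains 'abb', 'bab', 'bba' and 'bbb'. The locations are stored in cell arrays because a substring can occur a different number of times in each string, which mirrors the cell display of ind1 and ind2 above.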

I've also included the function substrings.m, which returns a list of all substrings of a given string. In the example below, it returns all distinct substrings of length 3 from str1 above.

substrings(str1,3,1)
ans =
aaa
aab
aba
abb
baa
bab
bba
bbb
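
A rough equivalent of the distinct-substring listing can be sketched with unique. This is an illustration only and makes no claim about how substrings.m works internally; the variable names are assumptions. As a small extension, it also counts how often each distinct substring occurs.

```matlab
% Sketch only: distinct substrings of length k, with occurrence counts.
% Not the implementation in substrings.m.
s = 'aaabbabbb';
k = 3;

% Each row of windows is one length-k substring of s
starts = (1:(numel(s) - k + 1)).';
windows = s(bsxfun(@plus, starts, 0:(k - 1)));

% distinct is sorted; grp maps each window to its row in distinct
[distinct, ~, grp] = unique(windows, 'rows');
counts = accumarray(grp(:), 1);
```

For this input, distinct holds 'aaa', 'aab', 'abb', 'bab', 'bba' and 'bbb', and only 'abb' occurs twice.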

Cite As

John D'Errico (2024). String subsequence tools (https://www.mathworks.com/matlabcentral/fileexchange/27460-string-subsequence-tools), MATLAB Central File Exchange.

MATLAB Release Compatibility
Created with R2010a
Compatible with any release
Platform Compatibility
Windows macOS Linux
Categories
Characters and Strings
Version: 1.1.0.0
Release notes: Bug repair for an indexing problem between the strings.