Levenshtein distance: String scanning

18 ビュー (過去 30 日間)
Marcel Dorer
Marcel Dorer 2018 年 12 月 13 日
コメント済み: per isakson 2018 年 12 月 14 日
I'm in desperate need of help with creating a Levenshtein distance function with scanning behavior.
Basically there is a target string (s) which is supposed to scan a longer experimental string (s2). For example a target string (s) of length m = 3 should be compared to the first 3 entries of an experimental string (s2). Afterwards the target string (s) should be "moved" by 1 step to be compared to the entries 2-4 of the experimental string (s2) and so on.
Creating such a scanning behavior is not a problem, the real problems are insertions and deletions. After an insertion or deletion I still want to compare 2 strings of equal length. Therefore when inserting another entry, the last entry of the experimental string (s2) has to be removed, and when deleting an entry the next value of the experimental string should be added. Simply spoken the string lengths (m and n) should always be constant and have the same length.
Here a simple example:
For the strings s = [1 2 3] and s2 = [1 3 2 3 ...] I want to compare parts of the same length m = 3, therefore the first comparison should be between s = [1 2 3] and s2 = [1 3 2]. A deletion at position #2 should now change s2 to [entry1 entry3 entry4], in this case to s2 = [1 2 3], leading to an edit distance of 1 (with a deletion cost of 1). Here, an insertion of the value 2 at position #2 should lead to s2 = [entry1 insertion entry2] and therefore also s2 = [1 2 3].
Any ideas how I could achieve this?
I'm using an Levenshtein function similar to this:
% Variables
s = [1 2 3];
m=length(s);
s2 = [1 3 2 3 1 2 3];
t = s2(1:1:m);
n=length(t);
% Edit Matrix
mat=zeros(m+1,n+1);
for i=1:1:m
mat(i+1,1)=i;
end
for j=1:1:n
mat(1,j+1)=j;
end
for i=1:m
for j=1:n
if (s(i) == t(j))
mat(i+1,j+1)=mat(i,j);
else
mat(i+1,j+1)= min([mat(i+1,j)+1,mat(i,j+1)+1,mat(i,j)+1]);
end
end
end
%Edit distance
d = mat(m+1,n+1);
  5 件のコメント
Marcel Dorer
Marcel Dorer 2018 年 12 月 14 日
編集済み: Marcel Dorer 2018 年 12 月 14 日
Ok sorry maybe I should've been even clearer. I meant:
I want to know if the long string (s2) contains the short string (s) or a variation of s which lies within my predefined Levenshtein distance threshold.
This is the reasoning behind me using the Levenshtein distance in the first place: To allow for some slight variation. As I wrote above: "I am mainly interested in the edit distance outputs (d), more specifically if those values are below a predefined threshold."
So my goal is to get those values d, which should be the result of comparisons of 2 strings of equal length, with Insertions and Deletions not modifying the length but instead removing the surplus or refilling to length(s) with the next values of s1.
My code above works for basic Levenshtein. The scanning of one string with the other is not the problem. The problem is the Operations Insertion and Deletion altering the length of one of the strings. String length should still be the same after every operation.
By "simply calculating the distance" the algotirhm changes those lengths because it adds and removes values as part of its process and this should not happen. Sure your variables aren't overwritten, but the edit matrix is based on adding and deleting values of your strings.
Hopefully this will make my problem clear:
The edit matrix is always a square, and Deletions and Insertions are vertical and horizontal movements. Therefore for every Deletion there has to be an Insertion later on and for every Insertion there needs to be a Deletion later on. This should not happen, instead, after those operations the algorithm should continue with a new edit matrix by either appending t with the next value of s2 (in case of Deletion) or removing the last value of t (in case of Insertion).
Basically the suplementary Deletion/Insertion (which is vertical or horizontal movement) following every Insertion/Deletion should be done automatically by continuing from a new edit distance matrix.
Edited with additional clarification
per isakson
per isakson 2018 年 12 月 14 日

サインインしてコメントする。

回答 (0 件)

カテゴリ

Help Center および File ExchangeGet Started with MATLAB についてさらに検索

製品


リリース

R2018b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by