How to search for very similar strings?

Question

pietro on 25 Jan 2019

https://jp.mathworks.com/matlabcentral/answers/441549-how-to-searh-for-very-similar-strings

Commented: O.Hubert on 1 Feb 2024

Hi all,

I am doing a bibliometric analysis and, in particular, I need to search for article titles in the reference lists of the citing papers. Here is my code:

for iMS=1:length(MS)
   Cit{iMS}=contains({MSCit.References},MS(iMS).Title,'IgnoreCase',true);
end

The code works pretty well. However, the data that I can export from Scopus is not perfect: article names are not consistent, so an exact match does not always work. Here are two examples:

Case 1:

Real article name: 'Biomethane production from different crop systems of cereals in Northern Italy'

Article name in the reference: 'Biomethane production from different crop systems of cereals in Nothern Italy'

Case 2:

Real article name: 'Methodology for the realisation of accelerated structural tests on tractors'

Article name in the reference: 'Methodology for the realization of accelerated structural tests on tractors'

As you can see, the two titles in each case differ by a single character. Since I have more than 20000 papers and fixing them by hand would be time-consuming, is there any way to programmatically search for very similar strings? Note that the strings might also differ in length.

Thank you,

Cheers


Answer 1

John D'Errico on 25 Jan 2019

https://jp.mathworks.com/matlabcentral/answers/441549-how-to-searh-for-very-similar-strings#answer_358076

Edited: John D'Errico on 25 Jan 2019

You probably want to do some reading here:

https://en.wikipedia.org/wiki/Levenshtein_distance

Plus, I see lots of code already available:

https://blogs.mathworks.com/cleve/2017/08/14/levenshtein-edit-distance-between-strings/

https://www.mathworks.com/matlabcentral/fileexchange/17585-calculation-of-distance-between-strings

https://www.mathworks.com/matlabcentral/fileexchange/60855-jarowinkler

https://www.mathworks.com/matlabcentral/fileexchange/36981-find-nearest-matching-string-from-a-set

I'm sure some of those are better than others. And I would never count out anything written by Cleve.
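To make the idea behind the links above concrete, here is a minimal sketch of the Levenshtein distance in plain MATLAB. The function name levenshteinDistance is a hypothetical helper chosen for illustration, not a built-in and not the code from any of the linked submissions, and lowercasing the inputs is an assumption that mirrors the 'IgnoreCase' option used in the question.

```matlab
function d = levenshteinDistance(a, b)
% LEVENSHTEINDISTANCE  Edit distance between two character vectors.
%   Counts the minimum number of single-character insertions, deletions
%   and substitutions needed to turn a into b.
%   Hypothetical helper name, not a built-in MATLAB function.
    a = lower(char(a));          % assumption: compare case-insensitively
    b = lower(char(b));
    n = numel(b);
    prev = 0:n;                  % distances from the empty prefix of a
    for i = 1:numel(a)
        curr = zeros(1, n + 1);
        curr(1) = i;             % delete the first i characters of a
        for j = 1:n
            cost = double(a(i) ~= b(j));   % 0 when the characters match
            curr(j + 1) = min([prev(j + 1) + 1, ...   deletion
                               curr(j) + 1, ...       insertion
                               prev(j) + cost]);    % substitution
        end
        prev = curr;             % keep only the previous row in memory
    end
    d = prev(end);
end
```

For the two cases in the question, levenshteinDistance returns 1 for each pair of titles (one missing 'r' in 'Nothern', one 's' replaced by 'z' in 'realization'), so a small threshold such as d <= 2 would accept both. Note that this compares two whole strings; matching a title against a long References field needs an approximate substring search instead, which is a different variant of the same dynamic programme.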


pietro on 26 Jan 2019

Hi all,

thanks for your valuable feedback. MSCit is a struct array of 21,000 records and MS is a struct array of 2,000 records. Each 'References' field of MSCit contains about 10,000 characters, while the 'Title' field of MS contains about 100 characters. So I thought of using a fuzzy search approach, which works, but I have to use a double for loop (the code below is the inner loop), so the computation time is very long.

ProvaCit = [];
for iCit = 1:length(MSCit)
    % iMS is the index of the outer loop over MS (not shown here)
    [d, A] = fzsearch(lower(MSCit(iCit).References), lower(MS(iMS).Title));
    if d < 3
        ProvaCit = [ProvaCit, iCit];
    end
end

I also thought of doing the following:

[d A] = fzsearch({lower(MSCit(iCit).References)},lower(MS(iMS).Title));

but there was no real change. How could I speed up the code? I thought of using a more stable field to limit the calls to fzsearch. So I tried to use contains to search the references for articles with similar authors, and then run fzsearch only on those articles. However, the author names are not consistent either. For example, I have found both 'González' and 'Gonzalez'. Is there an easy and fast way to deal with this type of situation?

O.Hubert on 1 Feb 2024

Certainly too late, but you could remove the accents and special characters from the strings prior to running fzsearch.

Similar to this response. It would need adapting to MATLAB, but you get the idea.
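As a rough illustration of the accent-removal idea, here is a minimal sketch in plain MATLAB. The function name stripAccents and the mapping table are assumptions made for this example; the table only covers common accented Latin letters and is not a complete Unicode normalisation.

```matlab
function out = stripAccents(in)
% STRIPACCENTS  Replace common accented Latin letters with plain ones.
%   Hypothetical helper; the mapping below is a small illustrative
%   table and should be extended for the characters in your own data.
    accented = 'àáâãäåçèéêëìíîïñòóôõöùúûüýÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ';
    plain    = 'aaaaaaceeeeiiiinooooouuuuyAAAAAACEEEEIIIINOOOOOUUUUY';
    out = char(in);
    [found, idx] = ismember(out, accented);   % locate accented characters
    out(found) = plain(idx(found));           % swap in their plain forms
end
```

Applying stripAccents to both the References and the author or title fields before the contains prefilter would make 'González' and 'Gonzalez' compare as equal, so the slower fuzzy search only has to run on the records that pass the prefilter.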
