Match names from two different columns - Comparing strings of different lengths

Question

Maria, 6 September 2014


https://jp.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths

Commented: Maria, 6 September 2014

I have a cell variable A with about 4000 rows and 5 columns:

     c1    c2                   c3        c4          c5
  A={1997  'Michelle Applebaum' 'Salmon'  'BASIC'     'STEEL'
     1997  'Jambardella Arnold' 'Butter'  'BASIC'     'STEEL'
     1999  'Cai von Rumohr'     'Cow'     'CAPITAL'   'AEROPLANE'
     2011  'Pierre Smith'       'Milk'    'GOOD'      'AEROPLANE'
     2004  'Jinder Kauffman'    'Star'    'CAPITAL'   'PHONE'}

And I have a second cell variable B with about 200 000 rows and 8 columns:

     c1    c2     c3    c4      c5     c6                 c7     c8
  B={2013  29     2225  'ELD1'  29     'SMITH         P'  4817   'HAYWOOD'
     2013  70     2628  'CCRN'  70     'FRANCE        J'  11688  'CANTORFZ'
     2013  02     952   'ABFS'  02     'KAUFFMAN      J'  356    'BUCK'
     2013  20297  157   'DUK'   20297  'ARNOLD        J'  1382   'LAWRENCE'
     2013  78362  260   'APA'   78362  'ARIF          A'  2213   'STIFEL'}

The focus is on c2 of A and on c6 of B.

C2 of A gives the complete first name and last name (and sometimes other names in between) of an individual.
C6 of B gives the last name (in capital letters) and only the initial of the first name of an individual.

I am trying to match the two columns. Where the last names are the same (or similar), I would like to append the columns of the matching row of B to the row of A:

        c1    c2                   c3        c4         c5
     A={1997  'Michelle Applebaum' 'Salmon'  'BASIC'    'STEEL'
        1997  'Jambardella Arnold' 'Butter'  'BASIC'    'STEEL'      2013  20297  157   'DUK'   20297  'ARNOLD  J'   1382  'LAWRENCE'
        1999  'Cai von Rumohr'     'Cow'     'CAPITAL'  'AEROPLANE'
        2011  'Pierre Smith'       'Milk'    'GOOD'     'AEROPLANE'  2013  29     2225  'ELD1'  29     'SMITH P'     4817  'HAYWOOD'
        2004  'Jinder Kauffman'    'Star'    'CAPITAL'  'PHONE'      2013  02     952   'ABFS'  02     'KAUFFMAN J'  356   'BUCK'}

I have never matched cells by comparing names before. The comparison has to be case insensitive, ignore periods, commas and spaces, and also ignore the same letter appearing twice in a row. I say this because I typed variable A by hand, so the last name may not be exactly the same in both variables. The `strrep` and `strtrim` functions may help to solve the problem.

Can someone help me please? Thank you.


Answer 1

Guillaume, 6 September 2014


https://jp.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths#answer_150945


You've not specified what happens to compound surnames (like 'von Rumohr' in your example) or compound first names (like 'Jean-Pierre'). It also appears that the fields in B are fixed width, so what happens to a very long surname? So I've assumed that only the last part of a compound surname is in B, that only the initial of a compound first name is in B, and that for a very long surname the field in B is wider. I'm also ignoring the case where names don't match exactly:

% Extract names from A and convert them to uppercase 'SURNAME FIRSTINITIAL'
% (strtrim guards against stray leading or trailing spaces):
namesfromA = upper(regexprep(strtrim(A(:,2)), '([A-Z]).* ([A-Z][a-z]+)', '$2 $1'));
% Extract names from B and collapse the padding to a single space:
namesfromB = regexprep(strtrim(B(:,6)), '([A-Z]+) +([A-Z])', '$1 $2');
% Find the intersection and the positions of the matches.
% Assumes there is only ever one match per name:
[~, ia, ib] = intersect(namesfromA, namesfromB);
% Append the matching rows of B to A (columns 6 to 13):
A(ia, 6:13) = B(ib, :);
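To see what the two `regexprep` calls produce, here is a small check on a few of the sample names from the question. The variables `sampleA`, `sampleB`, `keysA` and `keysB` are illustrative names only and are not part of the original code.

```matlab
% Sample names taken from the question (illustrative subset).
sampleA = {'Michelle Applebaum'; 'Cai von Rumohr'; 'Pierre Smith'};
sampleB = {'SMITH         P'; 'ARNOLD        J'};

% 'First [middle] Last' -> 'LAST F'
keysA = upper(regexprep(sampleA, '([A-Z]).* ([A-Z][a-z]+)', '$2 $1'));
% 'LAST      F' -> 'LAST F' (collapse the fixed-width padding)
keysB = regexprep(sampleB, '([A-Z]+) +([A-Z])', '$1 $2');

disp(keysA)   % APPLEBAUM M, RUMOHR C, SMITH P
disp(keysB)   % SMITH P, ARNOLD J

% Only 'SMITH P' appears in both lists: row 3 of sampleA, row 1 of sampleB.
[common, ia, ib] = intersect(keysA, keysB);
```

Note that the greedy `.*` is what drops middle names and the 'von' particle, so 'Cai von Rumohr' becomes 'RUMOHR C'.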

If surnames don't match exactly, then it gets a lot more complicated. There are a number of algorithms (Damerau-Levenshtein, Hamming) that let you measure how similar two strings are, but I don't think any of them are built into MATLAB.


Maria, 6 September 2014


Thank you very much for all the information. I know that getting all the matches will be a problem. I was first thinking of doing something like this:

A(:, end+1) = lower(strrep(strtrim(A(:,2)),' ',''));
B(:,end+1)= lower(strrep(strtrim(B(:,6)),' ',''));

But then I cannot use strcmpi or strncmpi because I am working with different sizes & content even if there is a match. So I was trying to build something that for instance would do a match if they find 5 letters in a row that are equal.

But anyway, your code works and is really helpful! I will read more about those algorithms. Thank you again.
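The idea of declaring a match when two names share five consecutive letters could be sketched roughly as follows. The helper name `sharesRun` and the cutoff `n` are hypothetical, and the inputs are assumed to be already lowercased with spaces removed, as in the two lines above.

```matlab
% Hypothetical helper: true if a and b share a run of n consecutive characters.
% It slides a window of length n over a and looks for each window inside b.
sharesRun = @(a, b, n) any(arrayfun( ...
    @(k) ~isempty(strfind(b, a(k:k+n-1))), 1:numel(a)-n+1));

% Inputs are assumed to be normalised already (lowercase, no spaces).
tf1 = sharesRun('jinderkauffman', 'kauffmanj', 5);    % shares 'kauff'
tf2 = sharesRun('pierresmith', 'smithp', 5);          % shares 'smith'
tf3 = sharesRun('michelleapplebaum', 'smithp', 5);    % no common run of 5
```

One limitation of this rule is that surnames shorter than `n` letters can never match, and common fragments in long names may produce false positives.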


Answer 2

Stephen23, 6 September 2014


https://jp.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths#answer_150944

You are trying to match words in two cell arrays, including allowing for punctuation characters and possibly slightly different spellings... not an easy task to perform! You will also need to consider initials and name order.

It really depends on how different the strings might be: if the differences only include repeated characters, then you might be able to get away with creating some regexp pattern to help with that.
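For the repeated-character case, one possible pattern is sketched below. The name `normalise` is hypothetical; the sketch assumes plain ASCII names, so accented letters would be stripped along with punctuation.

```matlab
% Hypothetical normaliser: lowercase, keep only the letters a-z,
% then collapse any run of the same letter into a single letter.
normalise = @(s) regexprep(regexprep(lower(s), '[^a-z]', ''), '(.)\1+', '$1');

n1 = normalise('KAUFFMAN      J');     % 'kaufmanj'
n2 = normalise('St. Pierre,  Anne');   % 'stpiereane'

% lower and regexprep also accept a cell array of names.
n3 = normalise({'Kaufman'; 'KAUFFMAN'});   % both become 'kaufman'
```

Because spaces are removed before runs are collapsed, letters on either side of a word boundary can merge ('Ross Smith' becomes 'rosmith'), so it may be safer to apply this to the surname alone.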

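If the differences are more complicated, then you need a similarity measure. Some common ones are the Levenshtein distance, the Needleman–Wunsch alignment score, and the Hamming distance (which is only defined for strings of equal length). The Rabin-Karp algorithm is a substring search algorithm rather than a similarity measure, but it can help to find shared substrings.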
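As an illustration of one of these measures, here is a minimal sketch of the plain Levenshtein distance (insertions, deletions and substitutions, without the transpositions of the Damerau variant) using the standard dynamic-programming table. The inputs `s` and `t` are illustrative values, not data from the question.

```matlab
% Levenshtein distance between two character vectors (illustrative inputs).
s = 'KAUFFMAN';
t = 'KAUFMAN';

m = numel(s);
n = numel(t);

% D(i+1, j+1) holds the distance between s(1:i) and t(1:j).
D = zeros(m + 1, n + 1);
D(:, 1) = (0:m)';   % turning s(1:i) into '' takes i deletions
D(1, :) = 0:n;      % turning '' into t(1:j) takes j insertions

for i = 1:m
    for j = 1:n
        cost = double(s(i) ~= t(j));            % 0 if the characters agree
        D(i+1, j+1) = min([D(i, j+1) + 1, ...   % deletion
                           D(i+1, j) + 1, ...   % insertion
                           D(i, j) + cost]);    % substitution or match
    end
end

d = D(m + 1, n + 1);   % 1 here: delete one 'F'
```

A small threshold on `d` (for example 1 or 2, chosen by inspection of the data) could then decide whether two surnames count as the same. With about 4000 by 200 000 pairs, comparing only names that share a first letter would cut the work considerably.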

You will also find submissions on MATLAB File Exchange that support several of these measures for analyzing string similarity, and plenty of examples online.


Maria, 6 September 2014

Thank you very much for all the information. I will do some research on similarity measures. Thank you.
