I want to read similar column headers and standardize their names

Question

Amanda Figueroa 2019 年 2 月 4 日

0
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/443172-i-want-to-read-similar-column-headers-and-standardize-their-names

回答済み: dpb 2019 年 2 月 11 日

Let's say I have a few csv files with column headers that are similar, but they actually represent the same data (the file header names (VariableNames) change per data file and I can't change this). How can I standardize the filenames so I can use detectImportOptions?

Here are the two column headers:

Data1= {'AnnotatedSequence','x_ProteinGroups','x_Proteins','x_MissedCleavages', ...
        'Charge','m_z_Da_','MH__Da_','DeltaM_ppm_','Deltam_z_Da_', ...
        'IonInjectTime_ms_', 'IsolationInterference___','RT_min_','PercolatorPEP'}
Data2 ={'AnnotatedSequence','Modifications','NumberOfProteinGroups', ...
        'NumberOfProteins', 'NumberOfMissedCleavages','Charge','mzInDa', ...
        'MHplusInDa','TheoMHplusInDa', 'DeltaMInPpm','DeltaMzInDa', ...
        'Intensity','IsolationInterferenceInPercent', 'IonInjectTimeInMs', ...
        'RTInMin','ApexRTInMin','PercolatorQ_Value','PercolatorPEP', ...
        'PercolatorSVMScore'}

As you can see, some column headers in Data1 (e.g. x_ProteinGroups) contain the same type of data as Data2 (e.g. NumberOfProteinGroups). How can I standardize the name of the header so detectImportOptions can read any of these files?

I tried using strtok(Data1,'_') but that just removes the underscore from the files and I still have other string components that need to be removed. strcmpi would be useful if it didn't have a size requirement.

Your help is greatly appreciated.

9 件のコメント
7 件の古いコメントを表示7 件の古いコメントを非表示

dpb 2019 年 2 月 5 日

MATLAB Online で開く

Is the following how you would match the two?

Data2=                                     Data1=      
 {'AnnotatedSequence'; ...             1  'AnnotatedSequence'; ...
  'Modifications'; ...
  'NumberOfProteinGroups'; ...         2  'x_ProteinGroups'; 
  'NumberOfProteins'; ...              3  'x_Proteins'; ...
  'NumberOfMissedCleavages'; ...       4  'x_MissedCleavages'; ...
  'Charge'; ...                        5  'Charge'; ...
  'mzInDa'; ...                        6  'm_z_Da_'; ...
  'MHplusInDa'; ...                    7  'MH__Da_'; ...
  'TheoMHplusInDa'; ... 
  'DeltaMInPpm'; ...                   8  'DeltaM_ppm_'; ...
  'DeltaMzInDa'; ...                   9  'Deltam_z_Da_'; ... 
  'Intensity'; ...
  'IsolationInterferenceInPercent';   11  'IsolationInterference___'; ...
  'IonInjectTimeInMs'; ...            10  'IonInjectTime_ms_'; ...
  'RTInMin'; ...                      12  'RT_min_'; ...
  'ApexRTInMin'; ...
  'PercolatorQ_Value'; ...
  'PercolatorPEP'; ...                13  'PercolatorPEP'}
  'PercolatorSVMScore'}

What would you choose as the preferred nomenclature?

I, personally, detest extremely long names so I'd probably shorten them all significantly, but whatever you pick can work...the problem will be if there are any number of other perturbations, then it gets harder.

dpb 2019 年 2 月 8 日

編集済み: dpb 2019 年 2 月 8 日

Well, you can't solve a problem without a clear definition of what the problem really is... :)

If it were me, I'd start with simple and only get more complex as it was demonstrated necessary to be so.

I'd also go complain to the source of the data to have them provide a roadmap if not consistency when they generate/supply the data.

What I was really suggesting (and used the rearranging of the supplied data to illustrate) is that there are matching patterns between the two sets that are unique for certain variables on a 1:1 basis for a subset of the variables, then another correlation for another subset, and so on...I just left the indices with the two to ensure you were in agreement that I had identified the match correctly to make further reference more simple to a given label.

The simple-minded way is to itemize those patterns on each side in lookup tables and just do substitution of each to the other to whatever is chosen as the final substring for that pattern combination to use in the final variable name.

In some cases just a simple startswith() will be enough; in some others will take some thinking about regular expressions to hit the "In"<-->"_" places where it's embedded in the string and to ensure you don't inadvertently find another string match beside the ones you're really looking for.

Adam Danz 2019 年 2 月 8 日

MATLAB Online で開く

It sounds like a mess but it's not uncommon. Often times data comes from different sources with different formats and the cleaning stage is inevitable.

You'll probably need to use a mix of strategies all arranged in a set of well thought-out conditionals. For example, start by search for embedded key words. If there are no matches, you might be able to search by matching expected values in the data. If there is more than one match, you'll need a second conditional to eliminate the incorrect match.

Some data might be integers (categorical data, counts, etc). Some data might have expected values between 0 and 1. Some data might be expected to be sparse, filled with mostly 0s. All of these are testable and you should include lots of sanity checks in this process. Here's an example sanity check

%These data should all be positive, non-zero integers.  
if mod(x, 1) ~= 0 | x <=0 
    error('Sanity Check Failure.  Data are not positive integers.')
end

Amanda Figueroa 2019 年 2 月 8 日

編集済み: Amanda Figueroa 2019 年 2 月 8 日

Hi dbp,

Yeah, I get your drift and I did write these down on paper and found the patterns. The problem is that many variables start and/or end the same ('DeltaMInPpm' , 'DeltaMInDa' , 'mzInDa','DeltaScore')... So I keep hitting these roadblocks.

I know the problem is the naming convention. I can talk to the company to fix this moving forward, but the data I have is already stored with these column names (countless experiments). I need to analyze these files that are already finished.

Thanks for the help, I will keep working on fixing this issue.

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

dpb 2019 年 2 月 11 日

1
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/443172-i-want-to-read-similar-column-headers-and-standardize-their-names#answer_360436

MATLAB Online で開く

Picking up where left off above with two string arrays of quoted string column labels from the two shown-to-be-different files...

>> [lia,lib]=ismember(s3,s1);        % see who does match (shorter file first this time)
>> whos li*
  Name       Size            Bytes  Class      Attributes
  lia       31x1                31  logical              
  lib       31x1               248  double               
>> s3(lia)                          % the list of who does match between the two...
ans = 
  19×1 string array
    "Checked"
    "Confidence"
    "Identifying Node"
    "PSM Ambiguity"
    "Annotated Sequence"
    "Modifications"
    "Master Protein Accessions"
    "Protein Accessions"
    "Charge"
    "Rank"
    "Search Engine Rank"
    "Activation Type"
    "MS Order"
    "First Scan"
    "Spectrum File"
    "Ions Matched"
    "XCorr"
    "Percolator q-Value"
    "Percolator PEP"
>>
Now see who doesn't match from this file...
>> s3(~lia)
ans = 
  12×1 string array
    "# Protein Groups"
    "# Proteins"
    "# Missed Cleavages"
    "DeltaScore"
    "DeltaCn"
    "m/z [Da]"
    "MH+ [Da]"
    "DeltaM [ppm]"
    "Deltam/z [Da]"
    "Isolation Interference [%]"
    "Ion Inject Time [ms]"
    "RT [min]"
>>
Then reverse the seearch longer array first...
    
>> [lia,lib]=ismember(s1,s3);
>> s1(~lia)             % we alread knows who matcches
ans = 
  31×1 string array
    "Identifying Node Type"
    "Search ID"
    "Identifying Node No"
    "Sequence"
    "Number of Protein Groups"
    "Number of Proteins"
    "Protein Descriptions"
    "Number of Missed Cleavages"
    "Original Precursor Charge"
    "Delta Score"
    "Delta Cn"
    "Concatenated Rank"
    "mz in Da"
    "MHplus in Da"
    "Theo MHplus in Da"
    "Delta M in ppm"
    "Delta mz in Da"
    "Matched Ions"
    "Total Ions"
    "Intensity"
    "Isolation Interference in Percent"
    "Ion Inject Time in ms"
    "RT in min"
    "Last Scan"
    "Master Scans"
    "Annotation"
    "Peptides Matched"
    "Area"
    "Apex RT in min"
    "Percolator SVMScore"
    "Precursor Quan Result ID"
>> 

there are the two lists and where the matches/mismtches are located between the two files. All you need to doo is to line those back up again and use the string of one to find the matching and replace both with the wanted name going forward.

I'd probably replace all the blanks in the names with underscores as the first step before the anything else but that's just one way to do it.

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

Answer 2

Amanda Figueroa 2019 年 2 月 8 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/443172-i-want-to-read-similar-column-headers-and-standardize-their-names#answer_360220

編集済み: Amanda Figueroa 2019 年 2 月 10 日

I found out what happened. The software output has the exact same column names (which I discovered by dragging each type of file into the workspace.

I import these files into matlab with detectImportOptions and it seems like detectImportOptions is what adds the "_" values

I've been working on fixing this for a week now to find out that it is a function. Lol

7 件のコメント
5 件の古いコメントを表示5 件の古いコメントを非表示

dpb 2019 年 2 月 10 日

編集済み: dpb 2019 年 2 月 11 日

MATLAB Online で開く

fid=fopen('buffer.txt','r');
l=fgetl(fid);
fid=fclose(fid);
l=textscan(l,'%q');
s1=string(l{:});
fid=fopen('DMSO.txt','r');
l=fgetl(fid);
fid=fclose(fid);
l=textscan(l,'%q');
s2=string(l{:});
fid=fopen('NBA.txt','r');
l=fgetl(fid);
title3=textscan(l,'%q')
fid=fclose(fid);
l=textscan(l,'%q');
s3=string(l{:});

Preliminaries out of way having read the header lines, let's see something about what we got...

>> all(ismember(s1,s2))==1
ans =
       1
>>

So, the first two files are identical with 50 columns. The third is the odd man out as it only had 30 columns, but we proved the other two are actually the same, not just happen to have the same number of columns.

>> [s1(1:30) s3(1:30)]

displays the first 30 of the one and all the third and illustrates that the problem is there are blanks in the column names and differences in what they are labelled besides.

Given the limited number of mismatches, I'd just build the manual lookup table of one to the other as outlined above and write a little routine that reads the headers and matches them up as wanted similarly as to how the above does and then use ismember or other set functions to grab the ones wanted and match 'em up. Simplest would probably be to go to table and create a shorter mnemonic name for all variables than these.

If this is the set of types of files, it shouldn't be hard at all and no point in getting too complicated...even if there are others that show up eventually, it probably still will be faster to just follow the same idea with each than trying to write totally generic code.

Amanda Figueroa 2019 年 2 月 10 日

Thanks! This is very helpful.

サインインしてコメントする。

I want to read similar column headers and standardize their names

9 件のコメント
7 件の古いコメントを表示7 件の古いコメントを非表示

採用された回答

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

その他の回答 (1 件)

7 件のコメント
5 件の古いコメントを表示5 件の古いコメントを非表示

参考

カテゴリ

タグ

Community Treasure Hunt

I want to read similar column headers and standardize their names

9 件のコメント 7 件の古いコメントを表示7 件の古いコメントを非表示

採用された回答

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

その他の回答 (1 件)

7 件のコメント 5 件の古いコメントを表示5 件の古いコメントを非表示

参考

カテゴリ

タグ

Community Treasure Hunt

9 件のコメント
7 件の古いコメントを表示7 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

7 件のコメント
5 件の古いコメントを表示5 件の古いコメントを非表示