I want to read similar column headers and standardize their names

3 ビュー (過去 30 日間)
Amanda Figueroa
Amanda Figueroa 2019 年 2 月 4 日
回答済み: dpb 2019 年 2 月 11 日
Let's say I have a few csv files with column headers that are similar, but they actually represent the same data (the file header names (VariableNames) change per data file and I can't change this). How can I standardize the filenames so I can use detectImportOptions?
Here are the two column headers:
Data1= {'AnnotatedSequence','x_ProteinGroups','x_Proteins','x_MissedCleavages', ...
'Charge','m_z_Da_','MH__Da_','DeltaM_ppm_','Deltam_z_Da_', ...
'IonInjectTime_ms_', 'IsolationInterference___','RT_min_','PercolatorPEP'}
Data2 ={'AnnotatedSequence','Modifications','NumberOfProteinGroups', ...
'NumberOfProteins', 'NumberOfMissedCleavages','Charge','mzInDa', ...
'MHplusInDa','TheoMHplusInDa', 'DeltaMInPpm','DeltaMzInDa', ...
'Intensity','IsolationInterferenceInPercent', 'IonInjectTimeInMs', ...
'RTInMin','ApexRTInMin','PercolatorQ_Value','PercolatorPEP', ...
'PercolatorSVMScore'}
As you can see, some column headers in Data1 (e.g. x_ProteinGroups) contain the same type of data as Data2 (e.g. NumberOfProteinGroups). How can I standardize the name of the header so detectImportOptions can read any of these files?
I tried using strtok(Data1,'_') but that just removes the underscore from the files and I still have other string components that need to be removed. strcmpi would be useful if it didn't have a size requirement.
Your help is greatly appreciated.
  9 件のコメント
Adam Danz
Adam Danz 2019 年 2 月 8 日
It sounds like a mess but it's not uncommon. Often times data comes from different sources with different formats and the cleaning stage is inevitable.
You'll probably need to use a mix of strategies all arranged in a set of well thought-out conditionals. For example, start by search for embedded key words. If there are no matches, you might be able to search by matching expected values in the data. If there is more than one match, you'll need a second conditional to eliminate the incorrect match.
Some data might be integers (categorical data, counts, etc). Some data might have expected values between 0 and 1. Some data might be expected to be sparse, filled with mostly 0s. All of these are testable and you should include lots of sanity checks in this process. Here's an example sanity check
%These data should all be positive, non-zero integers.
if mod(x, 1) ~= 0 | x <=0
error('Sanity Check Failure. Data are not positive integers.')
end
Amanda Figueroa
Amanda Figueroa 2019 年 2 月 8 日
編集済み: Amanda Figueroa 2019 年 2 月 8 日
Hi dbp,
Yeah, I get your drift and I did write these down on paper and found the patterns. The problem is that many variables start and/or end the same ('DeltaMInPpm' , 'DeltaMInDa' , 'mzInDa','DeltaScore')... So I keep hitting these roadblocks.
I know the problem is the naming convention. I can talk to the company to fix this moving forward, but the data I have is already stored with these column names (countless experiments). I need to analyze these files that are already finished.
Thanks for the help, I will keep working on fixing this issue.

サインインしてコメントする。

採用された回答

dpb
dpb 2019 年 2 月 11 日
Picking up where left off above with two string arrays of quoted string column labels from the two shown-to-be-different files...
>> [lia,lib]=ismember(s3,s1); % see who does match (shorter file first this time)
>> whos li*
Name Size Bytes Class Attributes
lia 31x1 31 logical
lib 31x1 248 double
>> s3(lia) % the list of who does match between the two...
ans =
19×1 string array
"Checked"
"Confidence"
"Identifying Node"
"PSM Ambiguity"
"Annotated Sequence"
"Modifications"
"Master Protein Accessions"
"Protein Accessions"
"Charge"
"Rank"
"Search Engine Rank"
"Activation Type"
"MS Order"
"First Scan"
"Spectrum File"
"Ions Matched"
"XCorr"
"Percolator q-Value"
"Percolator PEP"
>>
Now see who doesn't match from this file...
>> s3(~lia)
ans =
12×1 string array
"# Protein Groups"
"# Proteins"
"# Missed Cleavages"
"DeltaScore"
"DeltaCn"
"m/z [Da]"
"MH+ [Da]"
"DeltaM [ppm]"
"Deltam/z [Da]"
"Isolation Interference [%]"
"Ion Inject Time [ms]"
"RT [min]"
>>
Then reverse the seearch longer array first...
>> [lia,lib]=ismember(s1,s3);
>> s1(~lia) % we alread knows who matcches
ans =
31×1 string array
"Identifying Node Type"
"Search ID"
"Identifying Node No"
"Sequence"
"Number of Protein Groups"
"Number of Proteins"
"Protein Descriptions"
"Number of Missed Cleavages"
"Original Precursor Charge"
"Delta Score"
"Delta Cn"
"Concatenated Rank"
"mz in Da"
"MHplus in Da"
"Theo MHplus in Da"
"Delta M in ppm"
"Delta mz in Da"
"Matched Ions"
"Total Ions"
"Intensity"
"Isolation Interference in Percent"
"Ion Inject Time in ms"
"RT in min"
"Last Scan"
"Master Scans"
"Annotation"
"Peptides Matched"
"Area"
"Apex RT in min"
"Percolator SVMScore"
"Precursor Quan Result ID"
>>
there are the two lists and where the matches/mismtches are located between the two files. All you need to doo is to line those back up again and use the string of one to find the matching and replace both with the wanted name going forward.
I'd probably replace all the blanks in the names with underscores as the first step before the anything else but that's just one way to do it.

その他の回答 (1 件)

Amanda Figueroa
Amanda Figueroa 2019 年 2 月 8 日
編集済み: Amanda Figueroa 2019 年 2 月 10 日
I found out what happened. The software output has the exact same column names (which I discovered by dragging each type of file into the workspace.
I import these files into matlab with detectImportOptions and it seems like detectImportOptions is what adds the "_" values
I've been working on fixing this for a week now to find out that it is a function. Lol
  7 件のコメント
dpb
dpb 2019 年 2 月 10 日
編集済み: dpb 2019 年 2 月 11 日
fid=fopen('buffer.txt','r');
l=fgetl(fid);
fid=fclose(fid);
l=textscan(l,'%q');
s1=string(l{:});
fid=fopen('DMSO.txt','r');
l=fgetl(fid);
fid=fclose(fid);
l=textscan(l,'%q');
s2=string(l{:});
fid=fopen('NBA.txt','r');
l=fgetl(fid);
title3=textscan(l,'%q')
fid=fclose(fid);
l=textscan(l,'%q');
s3=string(l{:});
Preliminaries out of way having read the header lines, let's see something about what we got...
>> all(ismember(s1,s2))==1
ans =
1
>>
So, the first two files are identical with 50 columns. The third is the odd man out as it only had 30 columns, but we proved the other two are actually the same, not just happen to have the same number of columns.
>> [s1(1:30) s3(1:30)]
displays the first 30 of the one and all the third and illustrates that the problem is there are blanks in the column names and differences in what they are labelled besides.
Given the limited number of mismatches, I'd just build the manual lookup table of one to the other as outlined above and write a little routine that reads the headers and matches them up as wanted similarly as to how the above does and then use ismember or other set functions to grab the ones wanted and match 'em up. Simplest would probably be to go to table and create a shorter mnemonic name for all variables than these.
If this is the set of types of files, it shouldn't be hard at all and no point in getting too complicated...even if there are others that show up eventually, it probably still will be faster to just follow the same idea with each than trying to write totally generic code.
Amanda Figueroa
Amanda Figueroa 2019 年 2 月 10 日
Thanks! This is very helpful.

サインインしてコメントする。

カテゴリ

Help Center および File ExchangeMATLAB Mobile Fundamentals についてさらに検索

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by