I want to read similar column headers and standardize their names
3 ビュー (過去 30 日間)
古いコメントを表示
Let's say I have a few csv files with column headers that are similar, but they actually represent the same data (the file header names (VariableNames) change per data file and I can't change this). How can I standardize the filenames so I can use detectImportOptions?
Here are the two column headers:
Data1= {'AnnotatedSequence','x_ProteinGroups','x_Proteins','x_MissedCleavages', ...
'Charge','m_z_Da_','MH__Da_','DeltaM_ppm_','Deltam_z_Da_', ...
'IonInjectTime_ms_', 'IsolationInterference___','RT_min_','PercolatorPEP'}
Data2 ={'AnnotatedSequence','Modifications','NumberOfProteinGroups', ...
'NumberOfProteins', 'NumberOfMissedCleavages','Charge','mzInDa', ...
'MHplusInDa','TheoMHplusInDa', 'DeltaMInPpm','DeltaMzInDa', ...
'Intensity','IsolationInterferenceInPercent', 'IonInjectTimeInMs', ...
'RTInMin','ApexRTInMin','PercolatorQ_Value','PercolatorPEP', ...
'PercolatorSVMScore'}
As you can see, some column headers in Data1 (e.g. x_ProteinGroups) contain the same type of data as Data2 (e.g. NumberOfProteinGroups). How can I standardize the name of the header so detectImportOptions can read any of these files?
I tried using strtok(Data1,'_') but that just removes the underscore from the files and I still have other string components that need to be removed. strcmpi would be useful if it didn't have a size requirement.
Your help is greatly appreciated.
9 件のコメント
Adam Danz
2019 年 2 月 8 日
It sounds like a mess but it's not uncommon. Often times data comes from different sources with different formats and the cleaning stage is inevitable.
You'll probably need to use a mix of strategies all arranged in a set of well thought-out conditionals. For example, start by search for embedded key words. If there are no matches, you might be able to search by matching expected values in the data. If there is more than one match, you'll need a second conditional to eliminate the incorrect match.
Some data might be integers (categorical data, counts, etc). Some data might have expected values between 0 and 1. Some data might be expected to be sparse, filled with mostly 0s. All of these are testable and you should include lots of sanity checks in this process. Here's an example sanity check
%These data should all be positive, non-zero integers.
if mod(x, 1) ~= 0 | x <=0
error('Sanity Check Failure. Data are not positive integers.')
end
採用された回答
dpb
2019 年 2 月 11 日
Picking up where left off above with two string arrays of quoted string column labels from the two shown-to-be-different files...
>> [lia,lib]=ismember(s3,s1); % see who does match (shorter file first this time)
>> whos li*
Name Size Bytes Class Attributes
lia 31x1 31 logical
lib 31x1 248 double
>> s3(lia) % the list of who does match between the two...
ans =
19×1 string array
"Checked"
"Confidence"
"Identifying Node"
"PSM Ambiguity"
"Annotated Sequence"
"Modifications"
"Master Protein Accessions"
"Protein Accessions"
"Charge"
"Rank"
"Search Engine Rank"
"Activation Type"
"MS Order"
"First Scan"
"Spectrum File"
"Ions Matched"
"XCorr"
"Percolator q-Value"
"Percolator PEP"
>>
Now see who doesn't match from this file...
>> s3(~lia)
ans =
12×1 string array
"# Protein Groups"
"# Proteins"
"# Missed Cleavages"
"DeltaScore"
"DeltaCn"
"m/z [Da]"
"MH+ [Da]"
"DeltaM [ppm]"
"Deltam/z [Da]"
"Isolation Interference [%]"
"Ion Inject Time [ms]"
"RT [min]"
>>
Then reverse the seearch longer array first...
>> [lia,lib]=ismember(s1,s3);
>> s1(~lia) % we alread knows who matcches
ans =
31×1 string array
"Identifying Node Type"
"Search ID"
"Identifying Node No"
"Sequence"
"Number of Protein Groups"
"Number of Proteins"
"Protein Descriptions"
"Number of Missed Cleavages"
"Original Precursor Charge"
"Delta Score"
"Delta Cn"
"Concatenated Rank"
"mz in Da"
"MHplus in Da"
"Theo MHplus in Da"
"Delta M in ppm"
"Delta mz in Da"
"Matched Ions"
"Total Ions"
"Intensity"
"Isolation Interference in Percent"
"Ion Inject Time in ms"
"RT in min"
"Last Scan"
"Master Scans"
"Annotation"
"Peptides Matched"
"Area"
"Apex RT in min"
"Percolator SVMScore"
"Precursor Quan Result ID"
>>
there are the two lists and where the matches/mismtches are located between the two files. All you need to doo is to line those back up again and use the string of one to find the matching and replace both with the wanted name going forward.
I'd probably replace all the blanks in the names with underscores as the first step before the anything else but that's just one way to do it.
0 件のコメント
その他の回答 (1 件)
Amanda Figueroa
2019 年 2 月 8 日
編集済み: Amanda Figueroa
2019 年 2 月 10 日
7 件のコメント
dpb
2019 年 2 月 10 日
編集済み: dpb
2019 年 2 月 11 日
fid=fopen('buffer.txt','r');
l=fgetl(fid);
fid=fclose(fid);
l=textscan(l,'%q');
s1=string(l{:});
fid=fopen('DMSO.txt','r');
l=fgetl(fid);
fid=fclose(fid);
l=textscan(l,'%q');
s2=string(l{:});
fid=fopen('NBA.txt','r');
l=fgetl(fid);
title3=textscan(l,'%q')
fid=fclose(fid);
l=textscan(l,'%q');
s3=string(l{:});
Preliminaries out of way having read the header lines, let's see something about what we got...
>> all(ismember(s1,s2))==1
ans =
1
>>
So, the first two files are identical with 50 columns. The third is the odd man out as it only had 30 columns, but we proved the other two are actually the same, not just happen to have the same number of columns.
>> [s1(1:30) s3(1:30)]
displays the first 30 of the one and all the third and illustrates that the problem is there are blanks in the column names and differences in what they are labelled besides.
Given the limited number of mismatches, I'd just build the manual lookup table of one to the other as outlined above and write a little routine that reads the headers and matches them up as wanted similarly as to how the above does and then use ismember or other set functions to grab the ones wanted and match 'em up. Simplest would probably be to go to table and create a shorter mnemonic name for all variables than these.
If this is the set of types of files, it shouldn't be hard at all and no point in getting too complicated...even if there are others that show up eventually, it probably still will be faster to just follow the same idea with each than trying to write totally generic code.
参考
カテゴリ
Help Center および File Exchange で Data Preprocessing についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!