compare and remove duplicates, not unique alone!

Hi all :) I have an excel file(3 columns, many rows): there are duplicates here, by the third columns (CANONICAL_SMILES), but what i want is not only to remove the duplicates but also check: 1. if the ratio between the values of the second columns (STANDARD_VALUE) > 50, i'll delete both rows 2. if its smaller i'll keep the row with the largest value
*could be more than 2 rows which have duplicates, so i have to compare 3 or 4 rows each time and decide ahich one to keep .. any help here, thank you
**test file attached

4 件のコメント

José-Luis
José-Luis 2016 年 12 月 21 日
Start by loading your excel file to Matlab ( xlsread() , readtable() ), and then give it a shot. We'll be glad to help you if/when you get stuck.
Shayma
Shayma 2016 年 12 月 21 日
well i stuck exactly in what i asked,
[M,text,d]=xlsread(input_file); % M is numeric date
% text is the text :-)
% all data is both M + text
%find the column which have the activity value
activity= strmatch ('STANDARD_VALUE', text(1,:));
smiles= strmatch ('CANONICAL_SMILES', text(1,:));
name=strmatch ('CMPD_CHEMBLID', text(1,:));
%check the validity of name columns
if isempty(activity) || isempty(smiles) || isempty(name)
errordlg('Check your columns headers','Error','modal');
return;
end
%sort descending the activity value
[s,id]=sort([d{2:end,activity}],'descend');
sorted_d= d([1, id+1],:);
%check the names
[~,i]=unique(sorted_d(:,name),'stable');
wd=sorted_d(i,:);
%check the smiles
[h,k,j]=unique(wd(:,smiles),'stable');
dup_indx=setdiff(1:size(wd,1), k( sum(bsxfun(@eq,j,(1:max(j))))<=1 ))
if wd(dup_indx,smiles
José-Luis
José-Luis 2016 年 12 月 21 日
編集済み: José-Luis 2016 年 12 月 21 日
What you asked was not clear at all, at least to me. Help us help you. A good start is a self-contained, minimum working example.
Case in point, I didn't even try to make sense of:
" CMPD_CHEMBLID STANDARD_VALUE CANONICAL_SMILES CHEMBL201383 1234 Brc1ccc(cc1)C1N(c2ccc(Cl)cc2)C(=O)N(C1=O)c1ccc(Cl)cc1 CHEMBL256753 220 Brc1ccc(cc1S(=O)(=O)N1CCCC1)C(=O)N[C@H]1[C@@]2(CC[C@@H](C2)C1(C)C)C CHEMBL256752 280 Brc1ccc(cc1S(=O)(=O)N1CCCCC1)C(=O)N[C@H]1[C@@]2(CC[C@@H](C2)C1(C)C)C CHEMBL256962 30000 Brc1ccc(cc1S(=O)(=O)N1CCCCC1)C(=O)N[C@H]1[C@@]2(CC[C@@H](C2)C1(C)C)C CHEMBL383287 195 Brc1ccc(N2C(c3ccccc3)C(=O)N(c3ccc(Br)cc3)C2=O)cc1 CHEMBL381922 1624 Brc1ccc(N2C(c3ccccc3)C(=O)N(c3ccc(Br)cc3)C2=O)cc1 CHEMBL570683 10000 Clc1c(cccc1Cl)-c1cc(C(=O)N2CCOCC2)c(nc1)N1CCCCCC1 CHEMBL576079 10000 Clc1c(cccc1Cl)-c1ccc(nc1)N1CCCCCC1 "
because it is so poorly formatted and the variable names (if that's what they are) are nightmare-inducing.
Also, the snippet you posted subsequently is incomplete. Even if it was complete, we don't have the data to run it. Spurious errors might be easy to spot, but debugging is more efficient with actual data.
Shayma
Shayma 2016 年 12 月 21 日
i agree that the format in which i posted earlier was not clear, that's why i attached an example file...
the code is not complete, because i didn't know how to proceed with that

サインインしてコメントする。

回答 (1 件)

Guillaume
Guillaume 2016 年 12 月 21 日
編集済み: Guillaume 2016 年 12 月 21 日

0 投票

"if the ratio between the values of the second columns > 50"
"could be more than 2 rows which have duplicates, so i have to compare 3 or 4 rows"
What is the definition of the ratio when there are more than 2 rows to compare?
Ignoring that for now, here is how I would do it:
t = readtable('test1.xls'); %a lot more powerful than xlsread
assert(all(ismember({'CMPD_CHEMBLID', 'STANDARD_VALUE', 'CANONICAL_SMILES'}, t.Properties.VariableNames)), ...
'Check your columns headers');
[~, ~, subs] = unique(t.CANONICAL_SMILES, 'stable');
rows = (1:height(t)).';
rowstokeep = accumarray(subs, rows, [], @(r) filterrows(t, r));
tfiltered = t(nonzeros(rowstokeep), :)
With a separate function filterrows as follow:
function rowtokeep = filterrows(t, identicalrows)
if numel(identicalrows) == 1
%no duplicate. always keep
rowtokeep = identicalrows;
else
standardvalues = t.STANDARD_VALUE(identicalrows);
[minsv, minrow] = min(standardvalues);
standardvalues(minrow) = []; %never kept anyway
identicalrows(minrow) = []; %and remove from index
ratio = standardvalues / minsv;
if all(ratio > 50)
rowtokeep = 0; %don't keep anything. return 0 instead. 0 will be filtered by main function
else
%keep the row with the greatest ratio but less than 50? Don't know if that's what you want.
identicalrows(ratio > 50) = [];
ratio(ratio > 50) = [];
[~, maxrow] = max(ratio);
rowtokeep = identicalrows(maxrow);
end
end
end
You can tailor the filter function to whichever definition you have. It must return a scalar index of the row to keep for duplicate, or 0 if none is to be kept.

3 件のコメント

Shayma
Shayma 2016 年 12 月 22 日
Hi if there are more than 2 rows, so i have to check the ratio between all the "duplicates", if for one couple> 50 it will delete all the relevant rows, if its smaller for all of the duplicates, it will keep only one row with the largest value of "standard value"
i'll check your code now to see if it's work, thank you
Guillaume
Guillaume 2016 年 12 月 22 日
That actually makes the filtering simpler:
function rowtokeep = filterrows(t, identicalrows)
if numel(identicalrows) == 1
%no duplicate. always keep
rowtokeep = identicalrows;
else
standardvalues = t.STANDARD_VALUE(identicalrows);
[maxsv, maxrow] = max(standardvalues);
if maxsv / min(standardvalues) > 50
rowtokeep = 0; %don't keep anything. return 0 instead. 0 will be filtered by main function
else
rowtokeep = maxrow;
end
end
end
Shayma
Shayma 2016 年 12 月 22 日
Hi I did some modification to the first function which works well, but in the condition line: %check if there is any value > 50 if any (diff>50) rowtokeep = 0;
even that in the second run of the file, i have in the diff matrix one value >50, but it skips the condition and continue saving the row!! any hint why?
function rowtokeep = filterrows(t, identicalrows)
if numel(identicalrows) == 1
%no duplicate. always keep
rowtokeep = identicalrows;
else
standardvalues = t.STANDARD_VALUE(identicalrows);
%divide all the elements of the vector by each other
diff = ( bsxfun( @ldivide, standardvalues, standardvalues') ) ;
%check if there is any value > 50
if any (diff>50)
rowtokeep = 0;
else
[maxsv, maxrow]= max(standardvalues);
rowtokeep = identicalrows(maxrow);
end
end
end

サインインしてコメントする。

カテゴリ

ヘルプ センター および File ExchangeCharacters and Strings についてさらに検索

質問済み:

2016 年 12 月 21 日

コメント済み:

2016 年 12 月 22 日

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by