delete rows- tanimoto matrix

1 view (last 30 days)
Shayma
Shayma on 7 Dec 2014
Edited: dpb on 11 Dec 2014
Hi everyone;
with help of some of you, we could write two scripts to delete rows, in tanimoto matrix which we create based on RDKit fingerprints.
the idea was to delete those rows (= molecules) which are similar to another molecules if :
1.tanimoto index > 0.7 but less than 1.
2. the sum of all tanimoto indexes in a row is larger than the compared molecule.
*in the first solution we used for loop, with a vector called "good_ones"; the rows that got zero we delete and we don't compare them again :
[M,text,alldata]=xlsread('test2.csv');
[r c]=size(M);
S=sum(M,2); %sum rows
good_ones = ones(1,size(M,1));
% loop over rows
for row=1:r;
for col=row+1:c;
if good_ones(row)==1
if (M(row,col) >= 0.7) & (M(row,col) <1.0 ); %if the value between 0.7 to 1 then we compare the sum column
if S(row)>S(col) % if the sum of the i line is larger we delete this line
good_ones(row) = 0;
else % the sum of the j line is larger so we delete the other line
good_ones(col) = 0;
end
end
% mark lines for deletion afterwards
end
end
end
new_M = M(find(good_ones),:);
  • in the second script we used the vector way, but in this solution, rows that we compare and mark to delete, are compared again to another molecules, and in this way we loose more molecules,so the question how i can add a condition here like i did previous and check only the rows that i didn't have check before???:*
an example file of the matrix attached
[M,text,alldata]=xlsread('test2.csv.csv') ;% M is numeric date
% text is the text :-)
% all data is both M + text
[i j]=size(M);
S=sum(M,2);
triM = tril(M); %lower diagonal
[r, c] = find(triM >= 0.7 & triM < 1.0); %find position of all values in range
deleterow = S(r) > S(c); %compare row r with respective row c
%a true in deleterow means delete respective r, otherwise delete respective c
todelete = unique([r(deleterow) ;c(~deleterow)]);
nM = M;
nM(todelete, :) = [];
text2=text(2:i+1,1);
text2(todelete, :) = [];
  7 Comments
dpb
dpb on 9 Dec 2014
OK, I do now grok the summation and yes, the "1" is immaterial as long as it's simply a ranking and not an absolute value.
"...them, in this case it will be the second molecule, maybe here we can add an additional condition to decide..."
In that case, change the > to >= and kill the row that lets you abort the loop earlier. It seems that one still needs to finish out the row otherwise.
I won't have time until perhaps the weekend (and can't promise that, even) to devote to trying the vector solution but given the condition of the decision process I don't think it's likely to be a case where it will save a lot of time. With the JIT compiler, loops aren't always the overhead they used to be. Unless you actually run into severe computational bottlenecks I'd suspect your best utilization of time/resources is to just "get on with it" at this point.
dpb
dpb on 9 Dec 2014
Another possible modification that might save some time would be to test whether there are any more on the row greater than the limit on the alternative test instead of continuing the loop on a "by one" step.
Oh, alternatively, instead of the inner loop, use a logical array test on the row at once...
if ~any(M(row,:)>0.7) % this row passes
break
else
% deal with them here...
end

Sign in to comment.

Accepted Answer

dpb
dpb on 10 Dec 2014
Edited: dpb on 11 Dec 2014
"...but if you have time i would like to see your solution :)"
Well, I've not tested this extensively but it does reproduce your script results for the sample dataset --
Presuming
a) S is the summation column vector, and
b) M is the tridiagonal form w/o the diagonal
>> M=M>0.7; % logical array
>> R=squareform(pdist(S,@(x,y) rdivide(x,y)))>1; % sum ratios to match
>> good = ~any(M&R,2) & ~any(M&~R,1).';
>> all(good==Good)
ans =
1
  1 Comment
dpb
dpb on 10 Dec 2014
Edited: dpb on 11 Dec 2014
From the command line session I used to debug...
>> [~any(MR&R,2) ~any(MR&~R,1).' [~any(MR&R,2) & ~any(MR&~R,1).'] Good]
ans =
1 1 1 1
0 1 0 0
1 1 1 1
0 0 0 0
1 1 1 1
1 1 1 1
1 0 0 0
1 1 1 1
1 1 1 1
>> all(ans(:,3)==ans(:,4))
ans =
1
>>
Good is a local variable from the output from your script that I saved for comparison purposes. MR is MR=M>0.7 logical array; while testing I kept the second variable but there's no need in the actual code it would seem.
As noted above, I have no extensive debugging here; it just seems like the correct logic to match up the rows/columns that are the result of the if...else...end clause but that's as far as I went in depth.
You might want to reverse some of the logic so it's not a negative in the end...that is particularly M<0.7 instead of M>0.7; the sum ratio is a wash. Altho it occurred to me that perhaps if kept the full array M (excluding the diagonal, of course) that perhaps R and 1/R for the two triangulars just might simplify it but again I didn't pursue that thought for lack of time to devote to the problem.
One last thought if you have a problem with this back on the looping solution. One could potentially do something like
pool=find(any(M>0.7,2)).';
for r=pool % only rows with high magnitude, not all
...
How much difference, if any, one might see would be highly dependent upon the dataset I think.
Also, however, one optimization that should make some difference is to convert the M array to the ratio before the loop instead of doing the calculation every pass...
M=M>0.7;
for r=1:nR
...
for c=r+1:nC
if M(r,c) % logical flag check instead of computation
...
I don't know if it would make any real difference in timing but I'd think the conversion I showed earlier of using a logical array instead of the 0,1 numeric and explicit comparison might save a few cycles...if nothing else it saves a find when you're done because can use the logic array directly as in indexing expression instead of finding the positions.

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!