using PCA to filter outliers in MATLAB
8 views (last 30 days)
Show older comments
A bit of a silly question (I suspect). I am using MATLAB to preprocess some bulk RNAseq data. I am calling PCA on my normalized counts just as an initial way to identify outliers I didn't automatically reject. There are indeed some clear outliers but I am not sure how to use the score matrix output by pca to filter my original data because it is now in descending order of explained variance. How can I use the pca scores to return to the indicies of problematic samples. Thank you so much! Info below:
[coeff,score,latent,tsquared,explained] = pca(normCounts);
size(normCounts) 136 samples with 24427 measured genes
ans =
136 24427
2 Comments
Accepted Answer
John D'Errico
on 21 May 2024
Edited: John D'Errico
on 22 May 2024
I've attached my quick shot at an n-dimensional outlier tool, based on Mahalanobis distance. I am pretty sure it has some serious issues, and since I refuse to accept the label of statistician, I'll choose to ignore them. :)
For some random data, with 10% potential outliers...
X = randn(100,2)*randn(2,3) + randn(100,3)/50;
X = [X;randn(10,3)*50+5];
As you should see, the last 10 of the 110 points will be potential outliers, though some of those points might just randomly have fallen inside the upper cloud. Use it in a sequential fashion, removing the worst point, IF the flag is greater than 1.
retainset = 1:110;
outlierlist = [];
continueflag = true;
tol = 0.001;
while continueflag
outflag = mahalOutliers(X(retainset,:),tol);
[maxflag,maxind] = max(outflag);
if maxflag > 1;
outlierlist(end+1) = retainset(maxind);
retainset(maxind) = [];
else
continueflag = false;
end
end
outlierlist
So, with a tolerance set at 0.001, the tool identified points numbered 105, 108, 102, etc., in sequence, then happily stopped after it found 10 outliers. Coincidentally, those were the 10 points I added to the set.
As I said, the idea surely has flaws. And I would seriously agree with everything said by @the cyclist.
More Answers (1)
the cyclist
on 21 May 2024
I frankly did not read all your code, but you might also be interested in this tutorial-style answer I wrote about using MATLAB's pca function. In particular, if you are trying to remove observations that are outliers along the principal component axes, you should be able to
- apply pca
- apply rmoutliers on the variable I call dataInPrincipalComponentSpace
- transform that variable back to the original space
5 Comments
John D'Errico
on 21 May 2024
Edited: John D'Errico
on 21 May 2024
I was kind of vague, partly because I'm not really a true sadistician. I hope I spelled that correclty. ;-)
But there are two issues I see. One is the question of lack of normality. For example, consider the set:
X = exp(randn(1000,3));
X is lognormally distributed. And I am sure any PCA on X would identify some of those points as outliers. But if you consider the log of that sample, you would recognize there are no outliers at all. Everything is as perfectly well behaved as possible in that domain. Even worse, suppose I added in one new point:
X2 = [X;exp([-8 -8 -8])];
I would conjecture that this new point will not even be remotely observed as an outlier by any PCA based scheme, but back in the log space, that last point is out in the weeds. The problem in all of this is PCA implicitly presumes normality. And when you lack normality, PCA will fail to perform well. Worse, any data that contains outliers explicitly fails the presumption of normally distributed data.
The second issue I was thinking of is how a PCA will treat data with an outlier in it. The scores will all be strange, and potentially might miss an outlier. I think I need to come up with an example of what I am thinking, but my proposal below would seem to eliminate that issue anyway.
Anyway, if you want to do this, I would work in terms of a leave-one-out Mahalanobis distance, instead of using a PCA. Thus for each point, compute the mean and convariance matrix of the set, leaving that point out. Now, compute the distance for the point you left out. Repeat for every point in your set. Choose the one with the largest distance measure, IF that distance is sufficiently large in terms of Mahalanobis distance. Drop it from your set. Now lather, rinse and repeat, until none of the points are sufficiently far out to be considered an outlier by this scheme.
Will that scheme fail for the lognormal example I showed before? Of course.
Finally, does rmoutliers do something like this already? I'd need to check on that, but I don't see that it does, at least not at a glance.
the cyclist
on 21 May 2024
The transformation to, and back from, the principal components will not change the order of the observations. It will only change the coordinate system.
The cautions you received from @John D'Errico are valid (of course). I took the "low road", in just telling you how to accomplish what you asked, rather than the high road of giving you the many things to think about in what you are doing. I have a 90-page book on how to detect and handle outliers, and it's just difficult for me to know where to draw the line on what to share here. :-)
I think the most important short lesson is to ask yourself why are you dropping an observation. The simple fact that it lies away from the bulk of the distribution -- without any hypothesis about why this may have happened -- is not a very good reason to do so.
See Also
Categories
Find more on Dimensionality Reduction and Feature Extraction in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!