Detect datapoints deviating from the main, curved, cluster. Outlier detection
Dear all, I need to identify some anomalous data points in my data. I have thousands of them, so I need to automate the process, finding the best trade-off between time, accuracy, and sensitivity.
Anomalous points should be defined as those deviating from the main cluster, or like in the case below, from the main curve.
This is a typical scatter plot of my x,y data (please find xy data attached):
What I am mostly interested in is identifying the data points with a positive deviation, namely those circled in red below:
The ones circled in blue might be 'anomalous', but I understand they may be too close to the main cluster to be clearly (and/or statistically) picked out as anomalous. The ones with a negative deviation (i.e., those circled in green) can also be flagged as anomalous, but I am not too interested in them.
What I am trying to achieve is something like the graph below (although any other approach is more than welcome). Basically, I would like to fit a curve that passes through the main cluster and isolate the data points within this main cluster. Then I can flag those falling outside these hypothetical boundaries as potentially anomalous. Please note that the boundaries (as depicted by the red shaded area) do not need to be equally spaced along the curve; they can vary with the degree of spread of the points, if that makes sense.
Any help is greatly appreciated!
Answers (1)
Bruno Luong
on 4 Sep 2023
Edited: Bruno Luong
on 4 Sep 2023
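load('xy.mat')              % xy holds x in column 1 and y in column 2
x = xy(:,1);
y = xy(:,2);
% Split x into n-1 bins holding roughly equal numbers of distinct x values
n = 65;
xs = unique(x);
m = numel(xs);
edges = interp1(1:m, xs, linspace(1,m,n));
loc = discretize(x, edges);
% The median of x and y in each bin gives a robust estimate of the main curve
xa = accumarray(loc(:), x(:), [], @median);
ya = accumarray(loc(:), y(:), [], @median);
% Residual of each point relative to the interpolated median curve
dy = y-interp1(xa,ya,x,'linear','extrap');
% Flag outlying residuals (default: more than 3 scaled MAD from the median).
% Alternative: b = abs(dy) > 3.4; adjust the threshold to your needs
b = isoutlier(dy);
close all
plot(x,y,'.')
hold on
%plot(xa,ya,'g')            % uncomment to show the median curve
plot(x(b), y(b), 'or')      % flagged points as red circles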
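Since the question is mostly about positive deviations, one possible variation is to keep only the flagged points whose residual is above the median curve. The sketch below is an illustration rather than part of the original answer: it uses synthetic data as a stand-in for xy.mat (the curve, noise level, and planted points are all made up), and the name bpos is hypothetical.

```matlab
% Synthetic stand-in for xy.mat (assumption: made-up curve and noise)
rng(0);
x = rand(2000,1)*10;
y = 0.05*x.^2 + 0.05*randn(size(x));   % curved main cluster
idx = randperm(numel(x), 20);
y(idx) = y(idx) + 1;                   % planted positive deviations

% Binned-median baseline, same construction as in the answer
n = 65;
xs = unique(x);
m = numel(xs);
edges = interp1(1:m, xs, linspace(1,m,n));
loc = discretize(x, edges);
xa = accumarray(loc(:), x(:), [], @median);
ya = accumarray(loc(:), y(:), [], @median);
dy = y - interp1(xa, ya, x, 'linear', 'extrap');

% Keep only the outliers that lie above the baseline
bpos = isoutlier(dy) & dy > 0;

plot(x, y, '.')
hold on
plot(x(bpos), y(bpos), 'or')
```

With the real data, only the line defining bpos would need to be added after dy is computed; the negative outliers are simply dropped by the dy > 0 condition.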
2 Comments
Bruno Luong
on 4 Sep 2023
Edited: Bruno Luong
on 5 Sep 2023
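This version also takes the local point density into account: a point is flagged only if it is an outlier and also lies in a sparsely populated region of the scatter plot.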
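load('xy.mat')              % xy holds x in column 1 and y in column 2
x = xy(:,1);
y = xy(:,2);
% Split x into n-1 bins holding roughly equal numbers of distinct x values
n = 65;
xs = unique(x);
m = numel(xs);
edges = interp1(1:m, xs, linspace(1,m,n));
loc = discretize(x, edges);
% The median of x and y in each bin gives a robust estimate of the main curve
xa = accumarray(loc(:), x(:), [], @median);
ya = accumarray(loc(:), y(:), [], @median);
% Residual of each point relative to the interpolated median curve
dy = y-interp1(xa,ya,x,'linear','extrap');
b = isoutlier(dy,"median");
% Count the points in each cell of a 50-by-50 2-D histogram and keep an
% outlier only if its cell holds fewer than densitythres points
densitythres = 20; % tune this parameter according to your preference
[N,~,~,binX,binY] = histcounts2(x,y,[50 50]);
b = b & N(sub2ind(size(N),binX,binY)) < densitythres;
close all
plot(x,y,'.')
hold on
plot(x(b), y(b), 'or')      % flagged points as red circles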
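The question also asks for boundaries that widen or narrow with the spread of the points. A minimal sketch of one way to get that, assuming the same binned-median baseline, is to estimate a scaled MAD of the residuals in each bin and flag points lying more than k local MADs from the curve. This is not from the original answer: the synthetic data stand in for xy.mat, and the names sa, s, k, bup and blo are hypothetical.

```matlab
% Synthetic stand-in for xy.mat (assumption: spread grows with x)
rng(1);
x = rand(3000,1)*10;
y = 0.05*x.^2 + (0.02 + 0.02*x).*randn(size(x));
idx = randperm(numel(x), 20);
y(idx) = y(idx) + 3;                   % planted positive deviations

% Binned-median baseline, same construction as in the answer
n = 65;
xs = unique(x);
m = numel(xs);
edges = interp1(1:m, xs, linspace(1,m,n));
loc = discretize(x, edges);
xa = accumarray(loc(:), x(:), [], @median);
ya = accumarray(loc(:), y(:), [], @median);
dy = y - interp1(xa, ya, x, 'linear', 'extrap');

% Local spread: scaled MAD of the residuals within each bin
sa = accumarray(loc(:), dy(:), [], @(r) 1.4826*median(abs(r - median(r))));
s = sa(loc);                           % spread of the bin each point falls in

k = 4;                                 % band half-width in local MADs (tunable)
bup = dy >  k*s;                       % positive deviations
blo = dy < -k*s;                       % negative deviations

plot(x, y, '.')
hold on
plot(x(bup), y(bup), 'or')
plot(x(blo), y(blo), 'og')
```

Using the per-bin spread directly gives a stepwise band; interpolating sa over xa would give a smoother one, at the cost of having to guard against non-positive extrapolated values at the ends.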