Detect datapoints deviating from the main, curved, cluster. Outlier detection

Dear all, I need to identify some anomalous data points in my data. I have thousands of them, so I need to automatise the process, finding the best trade-off between time, accuracy, and sensitivity.
Anomalous points should be defined as those deviating from the main cluster, or like in the case below, from the main curve.
This is a typical scatter plot of my x,y data (please find xy data attached):
What I am mostly interested in is identifying the data points with a positive deviation, namely those circled in red below:
The ones circled in blue might be 'anomalous', but I understand they may be too close to the main cluster to be clearly (and/or statistically) picked as anomalous. The ones with a negative deviation (i.e., those circled in green) can also be flagged as anomalous, but I am not too interested in them.
What I am trying to achieve is something like the graph below (although any other approach is more than welcome). Basically, I would like to fit a curve that passes through the main cluster and isolate the data points within this main cluster. Then I can flag those falling outside these hypothetical boundaries as potentially anomalous. Please note, the boundaries (as depicted by the red shaded area) do not need to be equally spaced along the curve; they can vary with the degree of spreading of the points, if that makes sense.
Any help is greatly appreciated!

Answers (1)

Bruno Luong on 4 Sep 2023
Edited: Bruno Luong on 4 Sep 2023
load('xy.mat')
x = xy(:,1);
y = xy(:,2);
n = 65;                        % number of bin edges
xs = unique(x);
m = numel(xs);
% Bin edges placed at equal steps through the sorted unique x values
edges = interp1(1:m, xs, linspace(1,m,n));
loc = discretize(x, edges);
% Median x and y in each bin trace the main curve
xa = accumarray(loc(:), x(:), [], @median);
ya = accumarray(loc(:), y(:), [], @median);
% Residual of each point from the interpolated median curve
dy = y-interp1(xa,ya,x,'linear','extrap');
b = isoutlier(dy); % alternative: abs(dy) > 3.4; adjust the threshold to your needs
close all
plot(x,y,'.')
hold on
%plot(xa,ya,'g')
plot(x(b), y(b), 'or')
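Since the question is mostly about positive deviations, one possible tweak is to keep only the flagged points whose residual is above the median curve. This is a minimal sketch, not tested on the attached data; the short dy vector is a made-up stand-in for the residuals computed in the code above.

```matlab
% Hypothetical residuals standing in for dy from the code above
dy = [0.1; -0.2; 0.05; 6.0; -5.5; 0.15; -0.1; 0.0; 0.2; -0.05];

% isoutlier flags both large positive and large negative residuals;
% the extra condition keeps only the points above the median curve
b = isoutlier(dy) & dy > 0;
disp(find(b))
```

With the real data, the same line would replace b = isoutlier(dy); in the code above.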
  2 Comments
Simone A. on 4 Sep 2023
Hi Bruno, thanks for getting back! I have tried to play around with the code you kindly wrote, but I am not able to isolate only the points circled in red in the second figure of my question. If I keep a lower threshold, I also get the data points between 270 and 275 on the x axis (which should not be flagged as potentially anomalous). If instead I raise the threshold, I can exclude those between 270 and 275, but I also exclude several anomalous points between ~260 and 267.5. The thing is, the points between 270 and 275 are clustered together, which is why I do not want them flagged as potential outliers. On the other hand, those between ~260 and 267.5 are sparsely scattered. I hope that makes it a bit clearer.
Bruno Luong on 4 Sep 2023
Edited: Bruno Luong on 5 Sep 2023
You can take the local point density into account:
load('xy.mat')
x = xy(:,1);
y = xy(:,2);
n = 65;
xs = unique(x);
m = numel(xs);
edges = interp1(1:numel(xs), xs, linspace(1,m,n));
loc = discretize(x, edges);
xa = accumarray(loc(:), x(:), [], @median);
ya = accumarray(loc(:), y(:), [], @median);
dy = y-interp1(xa,ya,x,'linear','extrap');
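b = isoutlier(dy,"median");
densitythres = 20; % tune this parameter according to your preference
% Count the points in each cell of a 50-by-50 grid over the (x,y) plane
[N,~,~,binX,binY] = histcounts2(x,y,[50 50]);
% Keep an outlier only if its grid cell holds fewer than densitythres points
b = b & N(sub2ind(size(N),binX,binY)) < densitythres;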
close all
plot(x,y,'.')
hold on
plot(x(b), y(b), 'or')
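The question also asks for boundaries that widen or narrow with the local spread of the points. One way this could be sketched is to compute a robust spread (scaled median absolute deviation) of the residuals in each x bin and flag only the points lying more than a few of those spreads above the median curve. Everything below is an assumption for illustration: the synthetic data stands in for xy.mat, the bins are equally spaced for simplicity rather than built from the unique x values, and the multiplier k is a hypothetical tuning parameter.

```matlab
% Hypothetical stand-in data; with the real file use x = xy(:,1); y = xy(:,2);
rng(1);
x = sort(255 + 30*rand(3000,1));
sigma = 0.2 + 0.05*(x - 255);              % spread grows with x
y = 0.02*(x - 255).^2 + sigma.*randn(size(x));
idx = (50:150:3000).';
y(idx) = y(idx) + 8*sigma(idx);            % inject a few positive outliers

% Binned-median curve and residuals, same idea as the code above
n = 65;
edges = linspace(min(x), max(x), n);       % equally spaced bins (simplification)
loc = discretize(x, edges);
xa = accumarray(loc, x, [], @median);
ya = accumarray(loc, y, [], @median);
dy = y - interp1(xa, ya, x, 'linear', 'extrap');

% Robust spread of the residuals in each bin (scaled MAD)
c = 1.4826;                                % scales the MAD to sigma for normal data
s = accumarray(loc, dy, [], @(v) c*median(abs(v - median(v))));

k = 4;                                     % hypothetical multiplier, tune as needed
b = dy > k*s(loc);                         % positive deviations only

plot(x, y, '.')
hold on
plot(x(b), y(b), 'or')
```

Because the threshold k*s(loc) follows the spread in each bin, tightly clustered regions get a narrow band and scattered regions a wide one. It can be combined with the density test above by also requiring N(sub2ind(size(N),binX,binY)) < densitythres.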



Release

R2022b
