minimum distances of data points from Fitted curve and Gaussian Mixture Model Clustering... not working.

2 views (last 30 days)
Hi everybody I would expect minimum distances of points from the curve follow a gaussian trend (as my professor aims) but I can't get a good graph at the end of the Gaussian Mixture Model..
May you have a check to my code? Do you see something wrong?
If you would be available to run my script I leave you the data.txt file and the matrix I use in the script at the following link: https://mega.nz/folder/goRHHLgY#fwnNkiLuFj-oZdaSPZmJ2Q
flowdata = readtable("flodata.txt");
% column 1: pressure at pipe start [bars]
% column 2: acoustic noise at start [standard deviation]
% column 3: pressure at pipe end (10km from start) [bars]
% column 4: acoustic noise at pipe end[standard deviation]
% column 5: flow rate [m3/hour]
load('flowdata.mat');
% summary(flowdata)
% Vectors of Pressures
pressureIN = flowdata.pressureAtPipeStart_bars_;
pressureOUT= flowdata.pressureAtPipeEnd_10kmFromStart__bars_;
% Vectors of Acoustic Noises at IN and at OUT
aNoiseIN=flowdata.acousticNoiseAtStart_standardDeviation_;
aNoiseOUT=flowdata.acousticNoiseAtPipeEnd_standardDeviation_;
% Vector of flowrate
flowrate = flowdata.flowRate_m3_hour_;
% % Table of flowRate
% flowRate = table(flowrate);
% rawData = [pressureIN aNoiseIN pressureOUT aNoiseOUT flowrate];
sample_distance = 10;
%t = (linspace(0, 7948790, 794879))/86400;
[samples,channels] = size(flowdata);
PHeadLoss = pressureIN-pressureOUT;
t = linspace(0, samples*sample_distance, samples)/86400;
figure()
plot(flowrate, PHeadLoss, '.')
densityplot(flowrate, PHeadLoss, [50, 50])
% histogram on raw data
hist(flowrate, 5000)
hold on
%hist(flowrate + 0.1*randn(numel(flowrate), 1), 5000) % leggera perturbazione rispetto ai valori originali
hist(PHeadLoss, 5000)
%flowrate
% all main curves overlaid
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time BEFORE EVERY CLEANING');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
rawData = [pressureIN pressureOUT flowrate];
% Removing the NaN values
% Creo un array di indici logici che indica quali righe contengono almeno un NaN
idx_nan = any(isnan(rawData), 2);
cleanData = rawData;
% Elimino le righe selezionate
cleanData(idx_nan, :) = [];
[samples,channels] = size(cleanData);
t = linspace(0, samples*sample_distance, samples)/86400;
% Vectors of Pressures
pressureIN = cleanData(:,1);
pressureOUT= cleanData(:,2);
flowrate = cleanData(:,3);
PHeadLoss = pressureIN - pressureOUT;
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
hold on
% After having removed NaN
densityplot(flowrate, PHeadLoss, [50,50]);
% Now I remove the negative values of pressures and flowrate (No physical
% meaning)
% Creo un array di indici logici che indica quali righe hanno almeno un valore negativo
idx_neg = any(cleanData < 0, 2);
% Elimino le righe selezionate
cleanData(idx_neg, :) = [];
[samples,channels] = size(cleanData);
t = linspace(0, samples*sample_distance, samples)/86400;
pressureIN = cleanData(:,1);
pressureOUT= cleanData(:,2);
flowrate = cleanData(:,3);
PHeadLoss = pressureIN - pressureOUT;
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
%hold on
densityplot(flowrate, PHeadLoss, [50,50]);
sample_distance = 10;
%t = (linspace(0, 7948790, 794879))/86400;
[samples,channels] = size(cleanData);
t = linspace(0, samples*sample_distance, samples)/86400;
% all main curves overlaid
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
% First hypothesis
% large variations on the output pressure can be seen at days 1 to 15 ,
% not related to input pression
% we should not use that portion of the data
% It's a brutal cleaning but...
% I am sceptical about what happens during the first 15 days :
% lots of peaks are seen on the output pressure side that are
% not reflected / correlated to the input side ,
% so I prefer to discard the first 20 days (yes it's brutal !)
ind = t>13;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
PHeadLoss = pressureIN-pressureOUT;
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
densityplot(flowrate, PHeadLoss, [50,50]);
% Second Hypothesis
% keep data (valid) for gradient below a certain threshold (and non zero
% FR too btw)
% i also don't like those sudden drops to zero ,
% because to me its like dividing zero by zero ,
% when you want to make a relationship between two variables ,
% I avoid using background noise data , it will blurr the result.
% This function gradient calculates the gradient of a signal,
% which is the rate of change of the signal over time.
pressureIN_grad = abs(gradient(pressureIN));
pressureOUT_grad = abs(gradient(pressureOUT));
flowrate_grad = abs(gradient(flowrate));
threshold = 0.0073;
ind1 = pressureIN_grad<threshold*max(pressureIN_grad);
ind2 = pressureOUT_grad<threshold*max(pressureOUT_grad);
ind3 = flowrate_grad<threshold*max(flowrate_grad);
ind4 = flowrate> 0;
ind = ind1 & ind2 & ind3 & ind4;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
figure()
plot(t, pressureIN, '.g',t, pressureOUT, '.r',t, flowrate, '.b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
PHeadLoss = pressureIN - pressureOUT;
densityplot(flowrate, PHeadLoss, [50,50])
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
densityplot(flowrate, PHeadLoss, [50,50]);
% Third Hypothesis
% PRESSURE HEAD LOSS (APPROX _no fattore di attrito, flusso del fluido e
% densità del fluido)
PHeadLoss = pressureIN - pressureOUT;
% consider only positive values
ind = PHeadLoss> 0;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
PHeadLoss = pressureIN - pressureOUT;
densityplot(flowrate, PHeadLoss, [50,50])
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
densityplot(flowrate, PHeadLoss, [50,50]);
%%%
%flowrate = flowrate + 0.1*randn(numel(flowrate), 1)
figure()
plot(flowrate, PHeadLoss, '*k')
hold on
% Now I create the FITTING CURVE
x2 = 0:0.01:max(flowrate);
%y2 = 0.002941548*x2.^2+45;
y2 = 0.00294150.*x2.^2+45;
plot(x2, y2)
% Now I compute the distances from each point to the fitted curve
[xy,distance_abs,t_a ]= distance2curve([x2' y2'], [flowrate PHeadLoss]);
hold on
plot(xy(:,1), xy(:,2), '.r') % projections of minimum distances points on the fitted curve
figure()
hist(distance_abs, 100) % istogramma su 100 bin
dx = xy(:,1)-flowrate;
dy = xy(:,2)-PHeadLoss;
dx = (dx-mean(dx));
dx = dx/prctile(abs(dx), 95);
dy = (dy-mean(dy));
dy = dy/prctile(abs(dy), 95);
figure
hist(dx, 100)
xlabel('dx')
figure()
hist(dy, 100)
ylabel('dy')
figure()
histogram2(dx, dy, 100)
xlabel('dx');
ylabel('dy');
figure()
plot(dx, dy, '.b')
xlabel('dx');
ylabel('dy');
figure()
scatter(dx, dy, 'filled')
xlabel('dx');
ylabel('dy');
densityplot(dx, dy, [50,50])
xlabel('dx');
ylabel('dy');
% GMM
%% Machine Learning - Clustering
bond.x = dx';
bond.y = dy';
%% Visualize Data
% Visualization of the slice data in the component "corp".
% Coupon Rate vs. YTM plot, differentiated by credit rating
gscatter(bond.x,bond.y)
% Label the plot
xlabel('bondx')
ylabel('bondy')
%% Select Features to use for Clustering
% Getting the data from the colums Coupon rate, YTM, currente Yeld and
% credit rating.
% Features
% Number of Clusters to create
%eva=evalclusters([bond.x' bond.y'],'kmeans','CalinskiHarabasz','KList',[1:6]);
eva=evalclusters([bond.x' bond.y'],'gmdistribution','CalinskiHarabasz','KList',[1:6]);
numClust = eva.OptimalK;
%% Gaussian Mixture Models (GMM)
gmobj = fitgmdist([bond.x' bond.y'],numClust);
gidx = cluster(gmobj,[bond.x' bond.y']);
gscatter(bond.x', bond.y', gidx)
% Visualize results
%figure()
%I = find(gidx == 1);
%plot(dx(I), dy(I), '*r');
%hold on
%I = find(gidx == 2);
%plot(dx(I), dy(I), '*b')
densityplot(dx, dy, [50,50])
%figure()
%I = find(gidx == 1);
%plot(bond.x(I), bond.y(I), '*r');
%hold on
%figure()
%I = find(gidx == 2);
%plot(bond.x(I), bond.y(I), '*b')
figure()
gscatter(bond.x',bond.y', gidx)
figure()
scatter3(bond.x',bond.y', gidx)
%
Thanks a lot in advance !

Answers (2)

Abhishek Chakram
Abhishek Chakram on 12 Feb 2024
Hi Giuseppe Zumbo,
Based on my observations on the code provided, you probably want to go through these when trying to fit a Gaussian Mixture Model (GMM) to the minimum distances of points from a curve to observe a Gaussian trend:
  • Experiment with different histogram bin sizes.
  • Verify the appropriateness of the quadratic curve.
  • Review the normalization process.
  • Visualize the Gaussian components of the GMM.
Here is corrected code snippet for the same:
% Load Data
% Assuming that the data is stored in 'flowdata.mat' with variable 'data'
load('flowdata.mat'); % Replace with the actual variable name and path
% If 'data' is not the variable name, replace it with the correct one
% For example, if the variable is named 'distances', use 'distances' instead of 'data'
% Prepare the data
% Assuming 'data' contains two columns corresponding to x and y coordinates
bond.x = data(:, 1);
bond.y = data(:, 2);
% Visualize Data
figure;
scatter(bond.x, bond.y);
xlabel('bond x');
ylabel('bond y');
title('Initial Data Scatter Plot');
% Determine the number of clusters using the Calinski-Harabasz criterion
eva = evalclusters([bond.x bond.y],'gmdistribution','CalinskiHarabasz','KList',[1:6]);
numClust = eva.OptimalK;
% Fit Gaussian Mixture Models (GMM)
gmobj = fitgmdist([bond.x bond.y], numClust);
% Assign data points to clusters
gidx = cluster(gmobj, [bond.x bond.y]);
% Visualize results with GMM cluster index
figure;
gscatter(bond.x, bond.y, gidx);
xlabel('bond x');
ylabel('bond y');
title('GMM Clustering');
% Visualize Gaussian components
hold on;
gmPDF = @(x,y) reshape(pdf(gmobj, [x(:) y(:)]), size(x));
fcontour(gmPDF, [min(bond.x) max(bond.x) min(bond.y) max(bond.y)]);
title('Gaussian Mixture Contours');
hold off;
% ... (any additional code)
You can refer to the following documentation to know more about the functions used:
Best Regards,
Abhishek Chakram

Giuseppe Zumbo
Giuseppe Zumbo on 15 Feb 2024
@Abhishek Chakram Thanks a lot for the answer. What I don't understand deeply is why I don't see gaussian shapes in gscatter plot, why should I add the code you wrote? Shound'nt it be visible also in the gscatter plot?
Thanks in advance for your answer

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!