Treat and handle missing hourly data (with daily profile), that might have large gaps
2 views (last 30 days)
I want to treat huge missy temperature data with many missing values (presented as 999.9).
If there is few missing data within the day, I would take average from data before and after. But if I have large missing clusters (almost full-day missing, or up to 100 values in a row), I would take average of 1PM temperature from yesterday and 1PM temperature from tomorrow to get 1PM value for today, and same goes for all hours.
Note: I don't wish to change valid assigned tempratures linked to hours (like what interp1 would do with values order).
What can I use to handle these data?
08/09/2016 4:00:00 26
08/09/2016 5:00:00 26
08/09/2016 6:00:00 25
08/09/2016 6:00:00 999.9
08/09/2016 7:00:00 24
08/09/2016 8:00:00 25
08/09/2016 9:00:00 24
08/09/2016 9:00:00 999.9
08/09/2016 10:00:00 23
Jiri Hajek on 24 Nov 2022
To identify the clusters of outliers, one may use logical indexing and the time vector. This is just a skeletal draft of the algorithm, but you can get the idea.
timeColumn % your datatime values
temperatureColumnRaw % your original temperatures
outlierPoints = temperatureColumnRaw > 900;
outlierTimes = timeColumn(outlierPoints);
timeDifsOfOutliers = diff(outlierTimes);
clusterStartsLogical = [1; timeDifsOfOutliers > mode(diff(timeColumn))];
clusterStartTimes = outlierTimes(clusterStartsLogical);
nClusters = length(clusterStart);
if nClusters > 1
clusterStartIndices = find(clusterStartsLogical);
clusterEndPoints = [clusterStartIndices(2:end)-1;length(outlierTimes)];
clusterEndTimes = outlierTimes(clusterEndPoints);
clusterDurations = clusterEndTimes-clusterStartTimes;
shortClusterIndices = clusterDurations > hours(3); % you define, what is a short cluster