Treat and handle missing hourly data (with daily profile), that might have large gaps

Question

Anwaar Alghamdi on 24 Nov 2022

0
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/1861273-treat-and-handle-missing-hourly-data-with-daily-profile-that-might-have-large-gaps

Edited: Anwaar Alghamdi on 24 Nov 2022

I want to treat huge missy temperature data with many missing values (presented as 999.9).

If there is few missing data within the day, I would take average from data before and after. But if I have large missing clusters (almost full-day missing, or up to 100 values in a row), I would take average of 1PM temperature from yesterday and 1PM temperature from tomorrow to get 1PM value for today, and same goes for all hours.

Note: I don't wish to change valid assigned tempratures linked to hours (like what interp1 would do with values order).

What can I use to handle these data?

08/09/2016 	4:00:00	 26
08/09/2016 	5:00:00	 26
08/09/2016 	6:00:00	 25
08/09/2016 	6:00:00	 999.9
08/09/2016 	7:00:00	 24
08/09/2016 	8:00:00	 25
08/09/2016 	9:00:00	 24
08/09/2016 	9:00:00	 999.9
08/09/2016 	10:00:00 23

5 Comments
Show 3 older commentsHide 3 older comments

Anwaar Alghamdi on 24 Nov 2022

@Jiri Hajek

Also, if I do linear interpolation, the non-999 values will be missed up (at least their order). I don't want to touch the temperatures assigned for each hour. Only estimate the 999 values.

Jiri Hajek on 24 Nov 2022

As for the cluster identification, I can give you some hints - will put them below into an answer. As for the handling of large missing clusters, I would leave themo out, i.e. constrain the scope.

Sign in to comment.

Sign in to answer this question.

Answer 1

Jiri Hajek on 24 Nov 2022

0
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/1861273-treat-and-handle-missing-hourly-data-with-daily-profile-that-might-have-large-gaps#answer_1110293

Open in MATLAB Online

To identify the clusters of outliers, one may use logical indexing and the time vector. This is just a skeletal draft of the algorithm, but you can get the idea.

timeColumn  % your datatime values
temperatureColumnRaw % your original temperatures
outlierPoints = temperatureColumnRaw > 900;
outlierTimes = timeColumn(outlierPoints);
timeDifsOfOutliers = diff(outlierTimes);
clusterStartsLogical = [1; timeDifsOfOutliers > mode(diff(timeColumn))];
clusterStartTimes = outlierTimes(clusterStartsLogical);
nClusters = length(clusterStart); 
if nClusters > 1
    clusterStartIndices = find(clusterStartsLogical);
    clusterEndPoints = [clusterStartIndices(2:end)-1;length(outlierTimes)];
    clusterEndTimes = outlierTimes(clusterEndPoints);
end
clusterDurations = clusterEndTimes-clusterStartTimes;
shortClusterIndices = clusterDurations > hours(3);   % you define, what is a short cluster