Reducing nonuniform-timed data to x-minute averages

2 views (last 30 days)
I have an incredible amount of data recorded at nonuniform time intervals. I'll like to boil the data down to a multi-minute averages, which can be defined by a variable within the code, say "windowAvg" or "interval". I also have large gaps within the data where no data was recorded, and I'd need to represent that data as well with zeroes or "NaN" or something similar. Here's an example of what the data may look like:
  • DATE TIME (HH:MM:SS) DATA
  • 1-Mar-12 21:40:58 7.714284452
  • 1-Mar-12 21:40:59 7.076270192
  • 1-Mar-12 21:41:00 6.852246256
  • 1-Mar-12 21:41:01 8.065426585
  • 1-Mar-12 21:41:11 6.17580769
  • 1-Mar-12 21:41:12 8.216123385
  • 1-Mar-12 21:41:14 7.811008001
  • 1-Mar-12 21:41:15 6.888230101....etc
And then what I need it to look like (for example, 5-min avg):
  • DATE TIME (HH:MM:SS) DATA
  • 1-Mar-12 21:40:00 7.825590127
  • 1-Mar-12 21:45:00 7.488390079
  • 1-Mar-12 21:50:00 NaN
  • 1-Mar-12 21:55:00 7.314028628
My general plan was to create a full day's worth of a timestamp variable (I already have it in MATLAB serial time form), and then "find" where the data fall into the different time intervals, average, and then map the averaged value to the appropriate interval. I have a feeling, though, that there's an easier way to do this with obscure functions that I don't know about. I would appreciate any help.
Thanks, Nic

Accepted Answer

Walter Roberson
Walter Roberson on 14 Jun 2012
Use the two-output version of histc() on the serial date numbers in order to find out which interval each time is in. Feed the bin numbers and the actual data into accumarray(), specifying @mean as the function it should use, and specifying whichever fill value you want (default is 0). When you get the accumarray() result, use the row index to look up the serial date number of the bin (the one you handed to histc).
The code is quite straight forward once you understand histc() and accumarray()
  5 Comments
Walter Roberson
Walter Roberson on 26 Jun 2012
What that sounds like to me is roundoff error. 1/(24*60*60) is not representable as a finite binary number, so there could be round-off depending on how exactly the values were generated.
You could experiment to see if you get more accuracy by using datevec() on the string times, as that breaks out the minutes field individually.
ndb
ndb on 26 Jun 2012
Here's what I do have, for the "brute force" method:
clear all; clc; close all;
%get files in current dir
inputFile_870 = dir('*_870.txt');
%%Other input files
%define the number of files
size_870 = size(inputFile_870);
%%Other variables
%set up 5 minute interval bins
BIN{1} = 0:4;
BIN{2} = 5:9;
BIN{3} = 10:14;
BIN{4} = 15:19;
BIN{5} = 20:24;
BIN{6} = 25:29;
BIN{7} = 30:34;
BIN{8} = 35:39;
BIN{9} = 40:44;
BIN{10} = 45:49;
BIN{11} = 50:54;
BIN{12} = 55:59;
%preallocate variables for the instrument
data_870 = cell(size_870);
babs_avg_870 = cell(size_870);
%%Other Variables
for i=1:size_870
%import data from the i_th file
data_870{i} = importdata(inputFile_870(i).name);
data_870{i} = data_870{i}.data;
%create a flag variable for each day
%flag will be a number defining which bin your date falls into
flag = nan(length(data_870{i}),1);
%search through file and define where each measurement falls into each
%5 minute bin
for j = 1:length(data_870{i})
if isempty(find(BIN{1} == data_870{i}(j,5), 1)) == 0
flag(j) = 0;
elseif isempty(find(BIN{2} == data_870{i}(j,5), 1)) == 0
flag(j) = 5;
elseif isempty(find(BIN{3} == data_870{i}(j,5), 1)) == 0
flag(j) = 10;
elseif isempty(find(BIN{4} == data_870{i}(j,5), 1)) == 0
flag(j) = 15;
elseif isempty(find(BIN{5} == data_870{i}(j,5), 1)) == 0
flag(j) = 20;
elseif isempty(find(BIN{6} == data_870{i}(j,5), 1)) == 0
flag(j) = 25;
elseif isempty(find(BIN{7} == data_870{i}(j,5), 1)) == 0
flag(j) = 30;
elseif isempty(find(BIN{8} == data_870{i}(j,5), 1)) == 0
flag(j) = 35;
elseif isempty(find(BIN{9} == data_870{i}(j,5), 1)) == 0
flag(j) = 40;
elseif isempty(find(BIN{10} == data_870{i}(j,5), 1)) == 0
flag(j) = 45;
elseif isempty(find(BIN{11} == data_870{i}(j,5), 1)) == 0
flag(j) = 50;
elseif isempty(find(BIN{12} == data_870{i}(j,5), 1)) == 0
flag(j) = 55;
end
end
%assign a datenumber to each valid date of data
dnum = datenum(data_870{i}(:,1),data_870{i}(:,2),data_870{i}(:,3),...
data_870{i}(:,4),flag(:,1),0);
%find the unique date values
uniqDate = unique(dnum);
%preallocate
babs_avg_870{i} = nan(length(uniqDate),1);
%%Other Variables
%find and perform operations on data
for j = 1:length(uniqDate);
loc = find(dnum == uniqDate(j));
babs_avg_870{i}(j) = mean(data_870{i}(loc,7));
%%Other operations
end
end
----------------
The problem with doing it this way is, of course, speed, because I have many large data files and variables to parse and to perform operations on. Secondly, I liked how the method that you suggested inserted zeros or NaNs or whatever for periods of time that were missing from the data. I need that. Thanks for all your suggestions

Sign in to comment.

More Answers (0)

Categories

Find more on Dates and Time in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!