Filtering max values within every 3 hours of data
1 view (last 30 days)
Show older comments
Hey everyone,
I attached the first 50 rows of a sample dataset. My time data is in the format of '2000-01-01 01:00' where the last 5 characters hold the hour (HH) and minute (MM) values respectively. I would like to have a code that goes through the time column, and assigns the maximum value of the second column between the 01,02,03 hours to the 03 hour timestep. I need to then have this done for the 04 - 06 time period and so forth until 22 - 00.
At the end the cell array would not need to keep data from other columns and should just be XXXX x 2 cell array and just reassigns the maximums to the 03,06,09 ...... 00 timesteps. There may also be missing hours, so the script would need to accommodate for that.
example:
'1984-01-01 00:00' 10.80 240 20.06 231 0.8
'1984-01-01 01:00' 11.83 230 20.06 6 0.6
'1984-01-01 02:00' 11.31 230 18.51 100 0.6
'1984-01-01 03:00' 10.80 230 18.51 212 0.7
'1984-01-01 04:00' 9.77 240 18.65 348 0.8
'1984-01-01 05:00' 9.25 240 16.97 406 0.8
'1984-01-01 06:00' 9.77 230 17.48 524 0.7
'1984-01-01 07:00' 9.77 230 16.97 612 0.7
would then become:
'1984-01-01 03:00' 11.83
'1984-01-01 06:00' 9.77
I am completely lost in writing this and any help would be appreciated!
1 Comment
Cedric
on 11 May 2014
Well, not all answers below will work with missing data, but if you work for understanding them all (with their limitations), you'll learn a lot about data manipulation in MATLAB ;-)
Accepted Answer
Cedric
on 11 May 2014
Edited: Cedric
on 13 May 2014
You could build an approach along the following line:
% - Build array of time components.
dv = datevec( datatoupload(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build logical index of relevant/boundary hours.
hrId = rem( dv(:,4), 3 ) == 0 ;
hrId(end) = true ; % Last entry always relevant.
% - Build vector of (time-) group IDs.
hrGrId = cumsum( [true; hrId(1:end-1)] ) ;
% - Get statistics per group.
grMax = accumarray( hrGrId, cell2mat(datatoupload(:,2)), [], @max ) ;
% - Build output.
output = [datatoupload(hrId,1), num2cell(grMax)] ;
Applied to your MAT file, this gives:
>> output
output =
'1984-01-01 00:00' [10.8024]
'1984-01-01 03:00' [11.8312]
'1984-01-01 06:00' [ 9.7736]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 12:00' [ 8.7448]
'1984-01-01 15:00' [ 8.7448]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
'1984-01-03 01:00' [10.8024]
EDIT 1: the following solution should work with missing data, but you'll have to understand it and fully test it. In the beginning, I am building a test data set with missing data.
load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:
% > Hours 4,5,6 have only data for hours 5,6.
% > Hours 10,11,12 have only data for hours 10, 11.
% > Hours 13,14,15 have no data.
data([16,15,14,13,5], :) = [] ;
% - Build array of time components.
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
% 1. Absolute hour ID.
hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
% 2. Contiguous group start pos.
grStart = [true; diff(floor(hrId/3)) ~= 0] ;
% 3. Group start pos.
grEnd = [grStart(2:end); true] ;
% 4. Contiguous group IDs.
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
output = [data(grEnd,1), num2cell(grMax)] ;
Looking at the output
>> output
output =
'1984-01-01 00:00' [10.8024]
'1984-01-01 03:00' [11.8312]
'1984-01-01 06:00' [ 9.7736]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 11:00' [ 8.7448]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
'1984-01-03 01:00' [10.8024]
you see that it seems to be working when missing data hit boundary hours or not, and when they cover full period(s).
EDIT 2: the following solution removes data from groups/periods which don't include 3 entries/measures. Again, you have to fully understand it and test it. You may find an easier way to eliminate entries, I didin't really optimize.
load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:
% > Hours 4,5,6 have only data for hours 5,6.
% > Hours 10,11,12 have only data for hours 10, 11.
% > Hours 13,14,15 have no data.
data([16,15,14,13,5], :) = [] ;
% - Build array of time components.
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
% 1. Absolute hour ID.
hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
% 2. Contiguous group start pos.
grStart = [true; diff(floor(hrId/3)) ~= 0] ;
% 3. Contiguous group IDs, 1st shot, any size taken into account.
grId = cumsum( grStart ) ;
% 4. Flag data which do not belong to a full group (= 3 entries).
grIsValid = accumarray( grId, ones(size(grId)) ) == 3 ;
entryIsValid = ismember( grId, find(grIsValid) ) ;
% 5. Eliminate invalid entries, rebuild valid group starts and IDs.
data(~entryIsValid,:) = [] ;
grStart(~entryIsValid,:) = [] ;
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
grEnd = [grStart(2:end); true] ;
output = [data(grEnd,1), num2cell(grMax)] ;
With that, we obtain..
>> output
output =
'1984-01-01 03:00' [11.8312]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
which don't contain summaries when fewer than 3 entries are available.
8 Comments
Cedric
on 13 May 2014
Edited: Cedric
on 13 May 2014
My pleasure, I'm glad it helped!
You can favor simpler approaches when you don't know how to proceed. Basically, you know very well how to proceed (easily) by hand: you would take and discard the first entries until you find the beginning of a group, check that the group is full, take the max, associate it with the time stamp, and then repeat the process. This can quite easily be implemented with a loop and a few IF statements .. it may not be nice to see, not be too efficient either (even though, loops can be quite efficient nowadays), but you'd make it. The code would look like what follows in the great lines:
nRows = size( data, 1 ) ;
times = cell( nRows, 1 ) ; % Overshoot prealloc => trunc in the end.
maxes = zeros( nRows, 1 ) ; % Overshoot prealloc => trunc in the end.
inId = 0 ;
outId = 0 ;
while inId <= nRows
inId = inId + 1 ;
if "not a start hour"
continue ;
end
if "next 2 entries not contiguous"
continue ;
end
outId = outId + 1 ;
% Compute max of current and next 2 entries, store in maxes(outId)
...
% Store time stamp in times{outId}.
...
end
times = times(1:outId) ; % Truncate to relevant size.
maxes = maxes(1:outId) ; % Truncate to relevant size.
More Answers (2)
dpb
on 11 May 2014
An alternate approach...
>> dat=circshift(dat(:,1:2),-1); % place the first of groups of three on first row
>> ddat=cell2mat(dat(:,2)); % convert numeric to array for convenience(+)
>> [~,imx]=max(reshape(ddat(1:fix(length(dat)/3)*3),3,[])); % find max for each group
>> res=dat(imx+[0:3:(length(imx)-1)*3],1:2) % select those from full array
res =
'1984-01-01 01:00' [11.8300]
'1984-01-01 04:00' [ 9.7700]
>>
I didn't read Cedric's answer carefully enough to unequivocally answer your question re: missing values; I'd suggest trying it for a sample dataset and see how it works.
For correct operation of the above you will need to fill in missing data "holes" if they do exist similarly as to how Cedric built a date vector from your first to last to ensure the "divide by 3" works to build the right number of columns.
0 Comments
Andrei Bobrov
on 11 May 2014
Edited: Andrei Bobrov
on 11 May 2014
load upload.mat
a = datatoupload;
[~,~,~,hh] = datevec(a(:,1));
hh(hh==0) = 24;
h2 = (rem(hh-1,3)+1)==1;
h2(1) = 1;
ii = cumsum(h2);
[~,b] = unique(ii,'last');
out = [a(b,1) accumarray(ii,cat(1,a{:,2}),[],@(x){max(x)})];
add variant
t = datenum(1984,1,1,3*(0:ceil(diff(datenum(a([1,end],1)))*24/3))',0,0);
t0 = datenum(a(:,1));
t1=t+eps(1e6);
[~,ii] = histc(t0,t1);
out = [cellstr(datestr(t,'yyyy-mm-dd HH:MM')),...
accumarray(ii+1,cat(1,a{:,2}),[],@(x){max(x)})];
0 Comments
See Also
Categories
Find more on Data Distribution Plots in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!