Data Smoothing and Outlier Detection
Data smoothing refers to techniques for eliminating unwanted noise or behaviors in data, while outlier detection identifies data points that are significantly different from the rest of the data.
Moving Window Methods
Moving window methods are ways to process data in smaller batches at a time, typically in order to statistically represent a neighborhood of points in the data. The moving average is a common data smoothing technique that slides a window along the data, computing the mean of the points inside of each window. This can help to eliminate insignificant variations from one data point to the next.
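To make the windowing concrete, the following sketch computes a centered moving mean by hand and compares it with movmean. The vector x and the window length k are made-up illustration values. The loop assumes an odd window length and truncates the window at the ends of the data, which corresponds to the default 'shrink' endpoint handling of movmean.

```matlab
% Hypothetical sample data and window length (illustration only)
x = [4 8 6 -1 -2 -3 -1 3 4 5];
k = 3;                       % odd window length, centered on each point
halfWidth = floor(k/2);
n = numel(x);

manual = zeros(1,n);
for i = 1:n
    first = max(1, i - halfWidth);   % truncate the window at the left edge
    last  = min(n, i + halfWidth);   % truncate the window at the right edge
    manual(i) = mean(x(first:last)); % mean of the points inside the window
end

builtin = movmean(x,k);
maxDifference = max(abs(manual - builtin))
```

The difference should be zero up to floating-point round-off, which suggests that the built-in function is doing nothing more exotic than averaging each neighborhood.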
For example, consider wind speed measurements taken every minute for about 3 hours. Use the movmean function with a window size of 5 minutes to smooth out high-speed wind gusts.
load windData.mat
mins = 1:length(speed);
window = 5;
meanspeed = movmean(speed,window);
plot(mins,speed,mins,meanspeed)
axis tight
legend('Measured Wind Speed','Average Wind Speed over 5 min Window', ...
    'location','best')
xlabel('Time')
ylabel('Speed')
Similarly, you can compute the median wind speed over a sliding window using the movmedian function.
medianspeed = movmedian(speed,window);
plot(mins,speed,mins,medianspeed)
axis tight
legend('Measured Wind Speed','Median Wind Speed over 5 min Window', ...
    'location','best')
xlabel('Time')
ylabel('Speed')
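As a rough illustration of how the two window statistics respond to a single gust-like spike, the sketch below applies both to a small made-up vector (gust is a hypothetical name, not part of windData.mat). The mean spreads the spike across neighboring windows, while the median ignores one extreme value in a window of three.

```matlab
% Hypothetical signal: steady readings with one isolated spike
gust = [1 1 1 10 1 1 1];

viaMean   = movmean(gust,3)     % [1 1 4 4 4 1 1]: spike is spread out
viaMedian = movmedian(gust,3)   % [1 1 1 1 1 1 1]: spike is suppressed
```

Which behavior is preferable depends on whether short spikes are noise to discard or features to keep in attenuated form.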
Not all data is suitable for smoothing with a moving window method. For example, create a sinusoidal signal with injected random noise.
t = 1:0.2:15;
A = sin(2*pi*t) + cos(2*pi*0.5*t);
Anoise = A + 0.5*rand(1,length(t));
plot(t,A,t,Anoise)
axis tight
legend('Original Data','Noisy Data','location','best')
Use a moving mean with a window size of 3 to smooth the noisy data.
window = 3;
Amean = movmean(Anoise,window);
plot(t,A,t,Amean)
axis tight
legend('Original Data','Moving Mean - Window Size 3')
The moving mean achieves the general shape of the data, but doesn't capture the valleys (local minima) very accurately. Since the valley points are surrounded by two larger neighbors in each window, the mean is not a very good approximation to those points. If you make the window size larger, the mean eliminates the shorter peaks altogether. For this type of data, you might consider alternative smoothing techniques.
Amean = movmean(Anoise,5);
plot(t,A,t,Amean)
axis tight
legend('Original Data','Moving Mean - Window Size 5', ...
    'location','best')
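The effect on valleys described above can be checked on a tiny noise-free example. The vector v below is a made-up V-shaped signal whose true minimum is 0. Because every window around the minimum contains larger neighbors, the moving mean lifts the valley, and a wider window lifts it further.

```matlab
% Hypothetical V-shaped data with a single valley at the center
v = [2 1 0 1 2];

m3 = movmean(v,3);   % center window is [1 0 1]
m5 = movmean(v,5);   % center window is [2 1 0 1 2]

valleyTrue = v(3)    % 0
valleyWin3 = m3(3)   % 2/3, pulled up by two larger neighbors
valleyWin5 = m5(3)   % 1.2, pulled up further by four larger neighbors
```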
Common Smoothing Methods
The smoothdata function provides several smoothing options such as the Savitzky-Golay method, which is a popular smoothing technique used in signal processing. By default, smoothdata chooses a best-guess window size for the method depending on the data.
Use the Savitzky-Golay method to smooth the noisy signal Anoise, and output the window size that it uses. This method provides a better valley approximation compared to movmean.
[Asgolay,window] = smoothdata(Anoise,'sgolay');
plot(t,A,t,Asgolay)
axis tight
legend('Original Data','Savitzky-Golay','location','best')
window = 3
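One way to see why a Savitzky-Golay style filter tends to preserve peaks and valleys is to fit a low-order polynomial in each window and evaluate it at the window center. The sketch below is an illustration of that idea under simplifying assumptions, not a description of how smoothdata is implemented: it uses polyfit with a quadratic, a fixed window of 5 points, and leaves the first and last two points unchanged. The data y is a made-up, exactly quadratic curve.

```matlab
% Hypothetical noise-free quadratic data (illustration only)
x = 1:11;
y = 0.5*x.^2 - 3*x + 2;

k = 5;                        % window length (assumed odd)
halfWidth = floor(k/2);

% Local quadratic fit evaluated at each window center
yLocalFit = y;                % endpoints are left as-is in this sketch
for i = 1+halfWidth : numel(y)-halfWidth
    idx = i-halfWidth : i+halfWidth;
    p = polyfit(x(idx), y(idx), 2);      % least-squares quadratic in the window
    yLocalFit(i) = polyval(p, x(i));     % smoothed value at the center
end

% Moving mean over the same window, for comparison
yMovMean = movmean(y,k);

errLocalFit = max(abs(yLocalFit - y))    % round-off only: curvature is kept
errMovMean  = yMovMean(6) - y(6)         % 1: the mean is biased by curvature
```

For curved data the local fit reproduces the curve while the moving mean is offset, which is consistent with the better valley approximation noted above.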
The robust Lowess method is another smoothing method that is particularly helpful when outliers are present in the data in addition to noise. Inject an outlier into the noisy data, and use robust Lowess to smooth the data, which eliminates the outlier.
Anoise(36) = 20;
Arlowess = smoothdata(Anoise,'rlowess',5);
plot(t,Anoise,t,Arlowess)
axis tight
legend('Noisy Data','Robust Lowess')
Outlier Detection
Outliers in data can significantly skew data processing results and other computed quantities. For example, if you try to smooth data containing outliers with a moving median, you can get misleading peaks or valleys.
Amedian = smoothdata(Anoise,'movmedian');
plot(t,Anoise,t,Amedian)
axis tight
legend('Noisy Data','Moving Median')
The isoutlier function returns a logical 1 when an outlier is detected. Verify the index and value of the outlier in Anoise.
TF = isoutlier(Anoise);
ind = find(TF)
ind = 36
Aoutlier = Anoise(ind)
Aoutlier = 20
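With no method specified, isoutlier flags elements that are more than three scaled median absolute deviations (MAD) from the median. The sketch below reproduces that rule by hand on a small made-up vector and compares the result with the built-in call. The names sample, scaledMAD, and manualTF are illustrative only.

```matlab
% Hypothetical data with one obvious outlier at position 7
sample = [1 2 3 2 1 2 50 2 3 1];

% Scaling constant that makes the MAD comparable to a standard deviation
c = -1/(sqrt(2)*erfcinv(3/2));           % approximately 1.4826

center    = median(sample);
scaledMAD = c*median(abs(sample - center));

% Flag points more than three scaled MADs away from the median
manualTF  = abs(sample - center) > 3*scaledMAD;
builtinTF = isoutlier(sample);

sameResult = isequal(manualTF, builtinTF)
outlierIndex = find(manualTF)
```

Because the rule is built on medians, a single extreme value barely moves the threshold, which is why it can detect the very point that would distort a mean-based rule.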
You can use the filloutliers function to replace outliers in your data by specifying a fill method. For example, fill the outlier in Anoise with the value of its neighbor immediately to the right.
Afill = filloutliers(Anoise,'next');
plot(t,Anoise,t,Afill)
axis tight
legend('Noisy Data with Outlier','Noisy Data with Filled Outlier')
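The fill method determines which value replaces each detected outlier. As a small comparison, the sketch below applies three of the available fill methods to a made-up vector whose outlier sits between the values 2 and 4 (sample is a hypothetical name used only here).

```matlab
% Hypothetical data: the outlier 50 has neighbors 2 (left) and 4 (right)
sample = [1 2 3 2 1 2 50 4 3 1];

fillNext     = filloutliers(sample,'next');      % value to the right
fillPrevious = filloutliers(sample,'previous');  % value to the left
fillLinear   = filloutliers(sample,'linear');    % interpolated from both

replacements = [fillNext(7) fillPrevious(7) fillLinear(7)]   % [4 2 3]
```

The 'next' and 'previous' methods copy an existing measurement, while 'linear' produces a value that may not appear anywhere in the data, so the choice depends on whether filled values need to be real observations.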
Nonuniform Data
Not all data consists of equally spaced points, which can affect methods for data processing. Create a datetime vector that contains irregular sampling times for the data in Airreg. The time vector represents samples taken every minute for the first 30 minutes, then hourly over two days.
t0 = datetime(2014,1,1,1,1,1);
timeminutes = sort(t0 + minutes(1:30));
timehours = t0 + hours(1:48);
time = [timeminutes timehours];
Airreg = rand(1,length(time));
plot(time,Airreg)
axis tight
By default, smoothdata smooths with respect to equally spaced integers, in this case, 1,2,...,78. Since integer time stamps do not coordinate with the sampling of the points in Airreg, the first half hour of data still appears noisy after smoothing.
Adefault = smoothdata(Airreg,'movmean',3);
plot(time,Airreg,time,Adefault)
axis tight
legend('Original Data','Smoothed Data with Default Sample Points')
Many data processing functions in MATLAB®, including smoothdata and filloutliers, allow you to provide sample points, ensuring that data is processed relative to its sampling units and frequencies. To remove the high-frequency variation in the first half hour of data in Airreg, use the 'SamplePoints' option with the time stamps in time.
Asamplepoints = smoothdata(Airreg,'movmean', ...
    hours(3),'SamplePoints',time);
plot(time,Airreg,time,Asamplepoints)
axis tight
legend('Original Data','Smoothed Data with Sample Points')
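The same idea can be seen on a small numeric example. In the sketch below, the sample locations x and values A are made up: three closely spaced points followed by two isolated ones. Without sample points, a window of 3 means "three consecutive elements". With the 'SamplePoints' option, it means "an interval of length 3 in the units of x", so the isolated points are no longer averaged with distant neighbors.

```matlab
% Hypothetical unevenly spaced sample locations and values
x = [0 1 2 10 20];
A = [1 2 3 10 20];

% Window of 3 consecutive elements, ignoring where the samples lie
byIndex = movmean(A,3)                       % [1.5 2 5 11 15]

% Window of length 3 measured along x
byLocation = movmean(A,3,'SamplePoints',x)   % [1.5 2 2.5 10 20]
```

The points at x = 10 and x = 20 have no neighbors within 1.5 units, so their windows contain only themselves and their values pass through unchanged.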