how to get rid of NaN column by column
11 views (last 30 days)
Show older comments
Dear members,
I have a program that allows me to what I set out to do on a single time series. Now that I’m satisfied with that, I’ve got to roll it out on a whole bunch of time series. Say I have an indicator for many countries, where data availability differs, so lots of NaN-there are plenty of members posting answers and questions about getting rid of NaN in case of vectors, or, as far as matrices are concerned, getting rid of entire rows or columns whenever a NaN appears.
For instance:
- https://uk.mathworks.com/matlabcentral/answers/179142-how-can-i-remove-nan-values-from-a-matrix
- https://stackoverflow.com/questions/5202680/matlab-how-to-efficiently-remove-nan-elements-from-matrix#
- http://uk.mathworks.com/matlabcentral/fileexchange/41941-snip-m-snip-elements-out-of-vectors-matrices
However, I don’t want to get rid of whole rows or columns, but simply to purge each column independently of its NaN regardless of what happens to the others. How would I go about doing this? Any ideas would be much appreciated.
I started with something like this:
% clearvars
data = xlsread('DMdata.xlsx','B2:AA270','Sheet1');
datacolumns=size(data,2);
for bycolmn =1:length(datacolumns);
And then I got stuck.
This:
data(isnan(data))= []
deletes the NaN and returns my data into one big long vector. Clearly something of this fashion ought to go in the loop above, but I'm not sure how.
Many thanks in advance
6 Comments
dpb
on 17 Sep 2017
What is an "indicator" and what is meaning of the variables? The data in the attached spreadsheet shows
>> sum(isfinite(DM))
ans =
Columns 1 through 17
214 203 192 230 249 189 230 230 220 108 100 108 230 249 230 230 182
Columns 18 through 26
230 125 218 164 230 170 185 248 249
>>
so that there are no records of full length of 250 (albeit a few are close). The base methods for spectral estimation in Matlab presume uniform sampling; if the 250 observations are the time series then you'll need other techniques than simply analyzing the finite elements of each column; getting those is the least of your problems.
Accepted Answer
Cedric
on 17 Sep 2017
Edited: Cedric
on 18 Sep 2017
I think that the major problem is basic data manipulation, and once you have that right you'll be fine for the rest. Here are a few hints and we'll see how it goes from there.
Loading data : you could use the other outputs of XLSREAD; the first is the numeric part, the second the text, and the third raw data (a mix of text and numbers).
[num,txt,raw] = xlsread( 'DMdata.xls' ) ;
If we look at what we get with that, indexing a 3 by 6 block (so we see the top-left part of the output of XLSREAD):
>> raw(1:3,1:6)
ans =
3×6 cell array
Columns 1 through 4
[ NaN] 'AU UNEMPLOYMENT R…' 'AU UNEMPLOYMENT:R…' 'OE UNEMPLOYMENT R…'
'Q2 1950' 'NA' 'NA' 'NA'
'Q3 1950' 'NA' 'NA' 'NA'
Columns 5 through 6
'BG UNEMPLOYMENT R…' 'CN UNEMPLOYMENT:R…'
'NA' 'NA'
'NA' 'NA'
We see that in cell array raw we have everything and the NA values are strings. Doing the same thing for the num array, we get:
>> num(1:3,1:6)
ans =
NaN NaN NaN NaN 4.8000 NaN
NaN NaN NaN NaN 4.4300 NaN
NaN NaN NaN NaN 4.0700 NaN
Well, that's interesting: it appears that the first row doesn't correspond to Q2 1950 but to Q1 1995, so XLSREAD outputs in fact the smallest block that encompasses all numeric values. Good to know, because if you need to associate a date to each row, you'll have to "sync" with what XLSREAD does if you use its numeric output.
Processing row and column headers : taking a little detour, let's imagine that we need the dates and the column headers. The cell array of all dates is column 1 of cell array raw, elements 2 to the end: raw(2:end,1). If we need the quarter IDs and the years as distinct numbers, many approaches are possible. I propose the following: we can use SSCANF to read/convert numbers from a string; let's see how it works on the first date entry: raw{2,1}:
>> sscanf( raw{2,1}, 'Q%d %d' )
ans =
2
1950
Simple, now we need to see if we can apply this to each element of raw(2:end,1). We could loop, or use CELLFUN as follows (more concise):
>> temp = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;
where
- @(x) sscanf( x, 'Q%d %d' ) : is an anonymous function of x. It is almost the same as creating an M-File: function numValues = myScan( x ), numValues = sscanf( x, 'Q%d %d' ) ; end and passing its handle to CELLFUN (handles are roughly "function pointers" in MATLAB, they are one way to pass a function to another function), with the advantage that we can declare it inline with no name.
- raw(2:end,1) : cell array on which CELLFUN must operate. As a reminder, {} indexing accesses the content of a cell, and () indexing is normal block indexing: it returns the (sub-) cell array indexed (and not the content of cells).
- 'UniformOutput', false : tell CELLFUN to output a cell array of outputs of the anonymous function, which is required because the function outputs an array (e.g. [2; 1950] and not a scalar). You'll have to experiment a little with CELLFUN to see what are its limitations.
Now we can for example concatenate the output and create a vertical numeric array out of it:
>> dates = horzcat( temp{:} ).' ;
dates =
2 1950
3 1950
4 1950
1 1951
...
We can also extract the column headers, which are stored in the first row of raw, elements 2 to the end: raw(1,2:end). This could be useful for creating legends, labels, etc.
>> headers = raw(1,2:end)
headers =
1×26 cell array
'AU UNEMPLOYMENT R…' 'AU UNEMPLOYMENT:R…' 'OE UNEMPLOYMENT R…' ...
This is interesting, because now we have all the dates, all the column headers, and we could work without hard coding boundaries like in your code.
>> nCols = numel( headers ) ;
>> nDates = size( dates, 1 ) ;
which can be used when defining loop boundaries:
for colId = 1 : nCols
...
end
Using dates, we can also build vectors of logicals for extracting relevant data, e.g. for a given period:
>> selectYears = dates(:,2) >= 1951 & dates(:,2) <= 1953
selectYears =
269×1 logical array
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
...
Processing data : we saw that the numeric array num is defined in a way that can be difficult to use, as XLSREAD truncates it to the smallest block that encompasses all numeric values. For this reason I generally prefer working with raw data. Assuming that you just want raw data, we can update the first call to XLSREAD by the way:
[~,~,raw] = xlsread( 'DMdata.xls' ) ;
so we don't have num and txt that we won't use. Working with raw, the data are in the block raw(2:end,2:end). Let's see how it looks at a place with a few data and a few 'NA':
>> raw(19:22,5:7)
ans =
4×3 cell array
'NA' 'NA' 'NA'
'NA' 'NA' 'NA'
'NA' [4.8000] 'NA'
'NA' [4.4300] 'NA'
we see that it contains strings and numbers. We can check it:
>> class( raw{21,6} )
ans =
double
>> class( raw{21,7} )
ans =
char
There are multiple ways to proceed here; I propose to replace all entries of type/class char by NaN, and then to convert the cell array into a numeric array. To detect if a cell content is char we can use ISCHAR. We also saw how to apply a function to all elements of a cell array:
>> temp = raw(2:end,2:end) ;
>> selectChar = cellfun( @ischar, temp ) ;
where
- @ischar is a shortcut for @(x)ischar(x) (there is a difference but it doesn't matter now).
- There is no need for the 'UniformOutput' argument, because ISCHAR outputs a scalar.
We can see that selectChar is a 269x26 array of logical
>> whos selectChar
Name Size Bytes Class Attributes
selectChar 269x26 6994 logical
that we can use to index (logical indexing) cell array temp for a replacement:
>> temp(selectChar) = {NaN} ;
Now we see what 'NA' strings were replaced with numeric NaN:
>> temp(18:21,4:6)
ans =
4×3 cell array
[NaN] [ NaN] [NaN]
[NaN] [ NaN] [NaN]
[NaN] [4.8000] [NaN]
[NaN] [4.4300] [NaN]
and we can convert this cell array to a numeric array:
>> data = cell2mat( temp ) ;
If you display data, you will see that it contains all the rows and columns present in your Excel file and that there is no truncation.
Summary: I developed a lot so it is clear, but the whole thing until here can be condensed into:
[~,~,raw] = xlsread( 'DMdata.xls' ) ;
dates = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;
dates = horzcat( dates{:} ).' ;
headers = raw(1,2:end) ;
data = raw(2:end,2:end) ; % Cell array mixed str/values.
data(cellfun( @ischar, data )) = {NaN} ; % Cell array NaN/values
data = cell2mat( data ) ; % Numeric array.
[nDates, nCols] = size( data ) ;
which outputs the following when run from scratch:
>> whos
Name Size Bytes Class Attributes
data 269x26 55952 double
dates 269x2 4304 double
headers 1x26 5050 cell
nCols 1x1 8 double
nDates 1x1 8 double
raw 270x27 871220 cell
This approach has the advantage that it outputs "clean" arrays, whose size do match: dates has 269 rows that correspond to the 269 rows of data, headers has 26 columns that correspond to the 26 columns of data, there was no truncation, and we know the size of everything programmatically (no need for hard coding "weird" boundaries).
Further processing: now we can work with this: first, many functions can work with NaN and/or have a special version and/or special parameters that make them support NaNs. However, if we needed to extract non-NaN values, e.g. from column 3, we could proceed as follows:
select = ~isnan( data(:,3) ) ; % ~ = NOT, can be used for logical indexing
values = data(select,3) ;
or directly
values = data(~isnan( data(:,3) ),3) ;
This can easily be used in a loop over columns:
for colId = 1 : nCols
values = data(~isnan( data(:,colId) ),colId) ;
... do something with values
end
Note that if we needed corresponding dates and/or quarters, we could easily get them as well:
select = ~isnan( data(:,3) ) ;
values = data(select,3) ;
datesForValues = dates(select,:) ;
or again in a loop, using the relevant header for a title:
for colId = 1 : nCols
select = ~isnan( data(:,colId ) ) ;
values = data(select,colId ) ;
datesForValues = dates(select,:) ;
results = myAnalyzer( dates(select,:), values ) ;
plot( ... ) ;
title( headers{colId} ) ;
end
My guess is that if you go over these explanations, understand logical indexing, how to operate on the various types of arrays, etc, your problem will disappear!
Cheers, Cedric
3 Comments
See Also
Categories
Find more on Data Distribution Plots in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!