How to use setvaropts for strangely-formatted text file
12 views (last 30 days)
Show older comments
Hello All,
Long time reader, first time asker - appreciate the great wealth of knowledge here!
I'm trying to use the "readtable" funciton to import data from a large (~600 MB) CSV file. The file contains separate columns for Date, Time, and a bunch of instrument channels. I've successfully used "detectImportOptions" to get the channel listing, eliminate unwanted columns using "SelectedVariableNames", and set the "VaribleTypes" for all of the desired columns. Note: by default, the function correclty flags the "Time" column as a duration, but the "Date" and individual data columns get flagged as "char" by default.
Here's an example line from the file, showing the header row, units row, and subsequent columns. I'm wanting to import "Date" (as a date/time), "Time" (as the default duration), and each of hte "CHU..." columns as a data channel.
Number,Date,Time,us,CHU1-1,CHU1-2,CHU1-3,CHU1-4,CHU1-5...
NO.,Date,Time,us,"uE","uE","uE","uE","uE",...
1,2014/10/30,17:56:41, 0,+ 194 ,+ 210 ,+ 196 ,- 180 ,+ 17 ,...
For the "Date" variable, I'm attempting to use the following code:
opts=setvaropts(opts,'Date','DatetimeFormat','yyyy/MM/dd');
However, despite the above, all of the dates end up importing as NaT. For the "CHU..." channels, I have changed the VariableTypes to double:
idx=find(contains(vnms,'CHU'));
ivars=[ivars idx];
opts.VariableTypes(idx)={'double'};
However, this also is not working, as I'm getting NaN for all values.
Can someone help me with the correct format/inputs to the "setvaropts" function to import the following?
- "2014/10/30" as Date/Time
- "+ 194 " as Signed Double.
Thanks much for any help!
Mark
------------------
Edit: attached a copy of a smaller-sized file (cannot upload the full 600 MB CSV that I'm actually trying to read, but I have verified that this one is formatted the same and returns similar results).
Here's the current version of my import code ('fpath' and 'fname' variables are identified from "uigetfile" call earlier in the script):
%determine import options and adjust
warning off;
opts=detectImportOptions(fullfile(fpath,fname));
ivars=[];
vnms=opts.VariableNames;
idx=find(ismember(vnms,'Date'));
ivars=[ivars idx];
opts.VariableTypes{idx}='datetime';
opts=setvaropts(opts,'Date','DatetimeFormat','yyyy/MM/dd');
idx=find(ismember(vnms,'Time'));
ivars=[ivars idx];
opts.VariableTypes{idx}='duration';
idx=find(contains(vnms,'CHU'));
ivars=[ivars idx];
opts.VariableTypes(idx)={'double'};
opts.SelectedVariableNames=vnms(ivars);
%read from CSV table
tt=readtable(fullfile(fpath,fname),opts);
4 Comments
Stephen23
on 9 Feb 2024
Edited: Stephen23
on 9 Feb 2024
Just to confirm: the block of data you wish to import starts on line 120 (header) & 122 (data) to the EOF ?
Those are... rather strangely formatted numbers. Out of personal curiosity: what produced that number format? Do you know what language or formatting commands were used?
Accepted Answer
Stephen23
on 9 Feb 2024
Edited: Stephen23
on 9 Feb 2024
First lets see if READTABLE et al can ignore those space characters on a simple test file:
type testformat.csv
readtable('testformat.csv','Delimiter',',', 'TrimNonNumeric',true) % no minus :(
Hmm, not quite. Nor do I see any other options that would obviously help with those oddly formatted numbers. I get the feeling, that you might need to import those values as text, do some (simple) text munging, and then convert to numeric.
Note that CONVERTVARS could probably do this for you in one call:
- define the 2nd input as a PATTERN object that matches "CH" at the start of the variable name.
- define the 3rd input as a function that accepts a text array (string or cell of char), removes those spaces, and then STR2DOUBLEs the whole thing.
This is a very useful approach that avoids the need to e.g. hardcode column numbers or the like:
T = readtable('testformat.csv','Delimiter',',') % backup option
P = "CH" + wildcardPattern;
F = @(txt)str2double(regexprep(txt,'\s+',''));
T = convertvars(T,P,F)
DATETIME: I think that as long as you set the delimiter READTABLE should be able to figure out that DATETIME format. If it doesn't, first use SETVARTYPE to set the "Date" variable to DATETIME, then use SETVAROPTS to set the format (as you showed).
I need to take a break now, but that should get you started. I will take a look later and see if you need support with the DATETIME.
More Answers (1)
Cris LaPierre
on 9 Feb 2024
You need to set the 'InputFormat' instead of the 'DatetimeFormat'.
opts = detectImportOptions('chuSampleFile.csv');
opts = setvartype(opts,'Date','datetime');
opts = setvaropts(opts,'Date','InputFormat','yyyy/MM/dd');
As for converting your numeric data to doubles, I think that is easiest done after the table has been imported. To capture the sign and number, read these in first as strings.
idx=contains(opts.VariableNames,'CHU');
opts.VariableTypes(idx)={'string'};
data = readtable('chuSampleFile.csv',opts)
data(:,idx) = varfun(@(x)erase(x,' '),data,"InputVariables",idx)
data = convertvars(data,idx,'double')
3 Comments
Cris LaPierre
on 9 Feb 2024
Edited: Cris LaPierre
on 9 Feb 2024
The 'can't convert cell to double' happens if you do not set the numbers to import as strings.
idx=contains(opts.VariableNames,'CHU');
opts.VariableTypes(idx)={'string'};
You could instead just set the text type to be string
opts = detectImportOptions('DEFAULT_CP3.CSV','TextType','string');
opts = setvartype(opts,'Date','datetime');
opts = setvaropts(opts,'Date','InputFormat','yyyy/MM/dd');
data = readtable('DEFAULT_CP3.CSV',opts);
idx = contains(data.Properties.VariableNames,'CHU');
data(:,idx) = varfun(@(x)erase(x,' '),data,"InputVariables",idx);
data = convertvars(data,idx,'double')
See Also
Categories
Find more on Text Files in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!