Read and seperate CSV data

Hello everyone,
Im back again with a simple question which I cant fathom out.
Have a CSV file that Im writing a script to:
  1. open and read the file
  2. seperate the data i.e have tail in column A, type_ref in column B, date in column C etc
  3. plot against date and amp
% [data, txtdata] = xlsread('Radial.csv');
length = size(data);
length = length(1,1);
rowoffset = 11 % from first csv row of numerical data
I hope this makes sense as I don't know how else to explain this:
The first current line puts the csv into a [64x1] cell but does not seperate the data.
eg.
'tail,type_ref,datetime,seq,seq_no,sys_ref,ref,reg_ref,res_seq,ci_ref,chan_ref,component_ref,passband_ref,amp'
'FPG123,FLYDATA,2013-12-16 11:21:51.110,3,3,FLYDATA Engines,FLYDATA 170 No 1 Engine (9),FLYDATA Engine Ground,7,FLYDATA SERIAL1,FLYDATA No 1 Engine Radial,FLYDATA No 1 Engine Shaft ,FLYDATA No Freq Passband,0.72825962305069'
The above example shows all the headings in one row and column whilse all the other data follows under from row two onwards.
I have been trying to get the data to show in either seperate cells ie:
tail type_ref datetime seq etc......... ( 14 columns altogether)
FPG123 FLYDATA 2013-12-16 11:21:51.110 3
FPG126 FLYDATA 2014-12-17 11:28:32.101 3
FPG128 FLYDATA 2015-12-02 11:28:47.100 3
I have tried using the delimiter by calling each heading as in :
% [tail type_ref datetime seq] = csvimport('Radial.csv', columns, [1:4],'delimiter',',') % only for 4 headings and data
or show all the data on one sheet. Really could appreciate some help on this. Many thanks.

5 Comments

I can't find a Matlab function called csvimport...
Please attach a sample csv file, the first few lines.
Image Analyst
Image Analyst on 8 Feb 2016
Edited: Image Analyst on 8 Feb 2016
Can you try to avoid having the whole line enclosed in single quotes? If so, then try readtable(). If not, then perhaps textscan() with the proper format specifier string. Attach the file so people might try some things.
Stephen23
Stephen23 on 9 Feb 2016
Edited: Stephen23 on 9 Feb 2016
Here is an almost identical question, where each line was written as one string between quotation marks (and the user also thought it was a CSV file):
What tool creates these terrible file formats? Who writes something that can't even get a CSV file right, and is so buggy that it places the whole line within one set of quotation marks (and ignored fields that really were strings). Very puzzling.
dpb
dpb on 9 Feb 2016
Edited: dpb on 9 Feb 2016
I don't know, but my first guess would be Excel if somebody pasted the whole thing record-by-record into a column without using the import tool to split columns...I think that's what it would then do on output to csv as each cell would have to be a string.
Well, let's just try...hmmm, ok, that's what happens excepting--this version of Excel, at least, used double quotes instead of single ones as the delimiter. Not sure if that's a user-settable option or dependent on locale, maybe.

Sign in to comment.

Answers (2)

You have a problem that your file (at least the fixedtest.csv that you posted) is UTF-16 Little Endian, which is an encoding format that MATLAB does not support. It needs to be translated. I wrote a routine that detects UTF encoding and offers a conversion routine from bytes; I have attached the source.
Using that routine:
input_file = 'fixedtest.csv';
[file_encoding, bytes_per_char, BOM_size, bytes2char] = detect_UTF_encoding(input_file);
if isempty(file_encoding)
error('No usable input file');
end
fid = fopen(input_file,'rt');
fread(fid, [1, BOM_size], '*uint8'); %skip the Byte Order Mark
thisbuffer = fgets(fid);
extra = mod(length(thisbuffer), bytes_per_char);
if extra ~= 0
%in little-endian modes, newline would be found in first byte and the 0's after need to be read
thisbuffer = [thisbuffer, fread(fid, [1, bytes_per_char - extra], '*uint8')];
end
thisline = bytes2char(thisbuffer);
data_cell = textscan(thisline, '%s', 'delimiter', ','); %will ignore the end of lines
header_fields = reshape(data_cell{1}, 1, []);
num_field = length(header_fields);
thisbuffer = fread(fid, [1 inf], '*uint8');
extra = mod(length(thisbuffer), bytes_per_char);
if extra ~= 0
thisbuffer = [thisbuffer, zeros(1, bytes_per_char - extra, 'uint8')];
end
thisline = bytes2char(thisbuffer);
fmt = repmat('%s', 1, num_field);
data_cell = textscan(thisline, fmt, 'delimiter', ',');
data_fields_text = horzcat(data_cell{:});
data_fields = data_fields_text;
data_fields(:,3) = num2cell(datenum(data_fields_text(:,3))); %or use datetime() instead of datenum
data_fields(:,4) = num2cell(str2double(data_fields_text(:,4)));
data_fields(:,5) = num2cell(str2double(data_fields_text(:,5)));
data_fields(:,9) = num2cell(str2double(data_fields_text(:,9)));
data_fields(:,13) = num2cell(str2double(data_fields_text(:,13)));
data_table = cell2table(data_fields, 'VariableNames', header_fields);

4 Comments

Good morning Walter, many thanks for your help on this. Need to ask a couple of questions.
Just to make sure, i am running Matlab version 2015b.
I have run your routine to detect UTF encoding and need to make sure I am running this correctly (please excuse my slow brain).
The first line of the source code:
function [encoding, bytes_per_char, BOM_size, byte2char] = detect_UTF_encoding(FILENAME)
With the routine, when you press the run button do you enter the FILENAME 'fixedtest.csv'.
Question 1
I have tried this and this error comes up :-
>> Unicodedecoder( 'fixedtest.csv')
Error: File: Unicodedecoder.m Line: 69 Column: 1
This statement is not inside any function.
(It follows the END that terminates the definition of the function "detect_UTF_encoding".)
Reason for asking as I have tried the filename that 'fixedtest.csv' is currently stored in and this spits out...
undefined function or variable.
Question 2
With the sourse code in place do i include and run the second routine you have included i.e
input_file = 'fixedtest.csv';
[file_encoding, bytes_per_char, BOM_size, bytes2char] = detect_UTF_encoding(input_file);
if isempty(file_encoding)
error('No usable input file');
end
etc...etc
Does this code run after the first source?
Reason why I ask a silly question is that
input_file
comes up with
Parse error at 'input_file': usage might be invalid MATLAB syntax.
When I run the whole sourse the following error comes up:
>> Unicodedecoder('fixedtest.csv')
Error: File unicodedecoder.m Line 70 Column: 1
This statment is not inside any function.
(It follwos the END that terminates the definition of the function " Unicodedecoder".)
I hope I have explained the issues I am having, again I thank you for your help.
"Unicodedecoder" is not my source code. My detect_UTF_encoding code only has 59 lines and the script I posted only has 41 lines. If you stored that script into Unicodedecoder.m then you must have added a "function" line to it as what I posted was a script that would not take arguments.
If you did create such a function and if you put the detect_UTF_encoding into the same file instead of its own file, then you might not have created your function properly. In particular my detect_UTF_encoding.m uses the newer syntax in which there is an "end" matching each "function" line; if you put that code onto the bottom of a "function" you created which did not use that syntax, then you could get that error about the end being out of place (but more correct would have been an error about attempting to mix static functions with legacy functions.) Just leave it in its own file and you will not have to worry about whether it uses that syntax or not.
Here is a simple way I made to detect and write BOM to unicode files:
Usage:
CurrentBOM = BOM(UnicodeFileName, Encoding, FileToWrite)

Sign in to comment.

dpb
dpb on 8 Feb 2016
Edited: dpb on 8 Feb 2016
Your basic problem is the file was created as a text file of multiple strings which include the comma delimiters within them, not as a csv file, strictly speaking.
But, you can deal with it fairly simply...
>> fid=fopen('dan.csv'); % open a copy of your two lines
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
hdr =
{14x1 cell}
>> hdr{:}
ans =
'tail'
'type_ref'
'datetime'
...[elided middle values for brevity]...
'passband_ref'
'amp''
>> dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
dat =
{1x13 cell} [7.2826e-01]
>> fid=fclose(fid);
Only tricky part is the addition of the single quote mark into the whitespace list; otherwise you'll end up with an extra tick for the first and last header names and potentially the last floating point value will error (altho I didn't test it; I'm sure it'll work for the first record, whether the internal record position will be ok to do subsequent lines or not I don't know for certain. The sure way is to ensure it's skipped as above.)

16 Comments

Dan Howard
Dan Howard on 9 Feb 2016
Edited: dpb on 9 Feb 2016
Many thanks dpb,
Im rather new to Matlab so trying to understand your functions in your code.
I have run the code breaking down each section but there are a few errors and whilst 2015a version helps you fix this Im not sure what the code is doing.
Apologies for asking stupid questions, shall break down each below:
Question one.
First part I understand, does it make a difference if you use fid or fileID?
% fid = fopen('dan.csv');
Question two.
hdr, does this refer to reading the column header from the worksheet? Could you brielfly explain what this pience of code is doing? Unfortunately Matlab threw an error here stating:
(Error using _textscan_
Invalid file identifier. Use fopen to generate a valid file identifier.) - here used fopen but failed with error stating too many input arguments.
% hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
Question three.
Could you explain what the code is doing in a little more detail?
% dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' '''']
I sincerely apologize for being rather ignorant and I'm guessing I may have swum straight into the deep end.
Again, I thank you for your help.
The choice of variable name to hold the file handle (the result of fopen) is up to you as is naming any variable (ignoring the problem of aliasing supplied functions, etc., for the time being as a distraction). You do, however, have to use the same variable later as you choose initially to have the result (which I'd think obvious?).
I generally opt for shorter names, particularly for temporary variables or for ones which are kinda' common like x, y for data to plot, for example. Hence, I almost always start with fid for a file handle and modify it w/ a trailing character/number if need more than one at a time. You can use whatever naming convention seems most natural to you; there's no absolute right or wrong here.
Your error indicates that fopen failed or that you did use a different variable than fid; don't know which as you didn't show the code in sequence as you ran it. I note you've got the lines commented above so they weren't executed as is...make sure first you do get the file opened by testing the return value from fopen using whatever variable you chose.
Yes the textscan line is returning the header line from the file -- as noted, I took the two sample lines you posted and put them in a file and the above line reads the first line of 14 character strings and stores them in a cell array hdr.
dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' '''']
The last line reads the next (and subsequent if there were more data in the file) records and returns another cell array which will consist of 13 cellstring values plus one double per record. The
[repmat('%s',1,13) '%f']
builds the format string of 13 string values followed by a floating point number that matches the file record structure. 'delimiter',',' should be obvious; it's telling textscan it's a comma-delimited file. 'whitespace',[' \b\t' ''''] is just a little tricky, but it's the "fixup" for the badly-formed file format you have of each record being a single string, not a csv file. It says "white space" characters are the default space, backspace, and tab plus a single quote mark. Since Matlab uses the single quote to delimit strings, to have a string containing one, you must "double-up" it and then surround it by the delimiters. That ends up with four single quotes in a row to result in one. Fortunately, as the Dorito ad says, "We'll make more!"
Read the documentation for fopen and friends and textscan carefully as well as looking up the details of the other functions...it's straightforward but is, granted, a lot of detail to absorb initially. The unfortunate thing is that 'puters despite all the amenities of RAD systems like Matlab, still are very picky about the details. Also, whoever saddled you with the file structure as is didn't do you any favors; as IA notes it would be better if it were generated as a "real" csv file.
if true
% fid = fopen('dan.csv');
hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
end
This line is working fine, i made a slight error. The output is displayed as all headings are in one column.
My question is how do you get the corresponding data to match up with each heading as shown in the excel picture.
Many thanks.
dpb
dpb on 9 Feb 2016
Edited: dpb on 9 Feb 2016
By default textscan returns the character cell array as an Nx1 column array as shown by the result at the command line example I gave earlier...
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
hdr =
{14x1 cell}
>>
If you want it oriented the other way, then append the transpose operator .' after the call--
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1).'
hdr =
{1x1 cell}
>>
Then you get one cell but it contains a subarray of 14 elements so you address them as, for example,
>> hdr{1}
ans =
Columns 1 through 11
'tail' 'type_ref' 'datetime' 'seq' 'seq_no' 'sys_ref' 'ref' 'reg_ref' 'res_seq' 'ci_ref' 'chan_ref'
Columns 12 through 14
'component_ref' 'passband_ref' 'amp''
>> hdr{1}(1)
ans =
'tail'
>>
As for display, Matlab is not a spreadsheet. I think the newer releases with the table class have some presentation features but I have only to R2012b installed here so can only refer you to the documentation...
ADDENDUM I note that even the 'whitespace' argument didn't remove the trailing quote on the header line. That appears to be a bug in textscan or perhaps it's another one of those inscrutable C fscanf rules that it behaves "as expected" by the C Standard even though it's not what one would expect from the syntax.
I also tried the explicit format statement to ignore a specific character and it had the same result--
q=char(39); % ' character shorthand
fmt=[q repmat('%s',1,14) q]; % format string to skip leading/trailing "'"
It works for the first column header but not for the last; if that's an annoyance looks like you'll have to do a strrep on that element in the header (or, of course, fix the input file format)
Thank you agian for all your help on this.
Managed to get eveything working interms of headings but now need to get the data from the csv file under the corresdonding headings...
Can you load the data from the csv into the headings?
as an example below:
if true
% tail type_ref op_seq
ZZ123 13 Engine09
ZZ123 13 Engine09
ZZ123 13 Engine09
end
dpb
dpb on 11 Feb 2016
Edited: dpb on 11 Feb 2016
Don't know what "working in terms of headings" means, sorry. What specific steps have you taken?
It still appears you're trying to turn Matlab into a spreadsheet -- what is the end objective you're trying to accomplish? If it's to do some analysis on the data, then keeping that data as numeric and doing whatever on it is the way to proceed. If it's to output for human consumption a printed or displayed table, that's something else again.
ADDENDUM On reflection, I presume you likely mean that you did get the header strings imported into a table and wonder about the data to go with it. I think the best thing since your data are mixed up between categorical and numeric is to fix the original file. Then you should be able to read it with readtable and manipulate it. I don't, however, have a version recent enough to include it so can't test all the way through. However, it's pretty simple to fix up the file if you can't create it correctly to begin with, just read it as is and remove the delimiting single quotes and rewrite...there are two ways to do this, one is to treat it as text and read as cell strings, the other will be faster on larger files just treats the content as bytes...
A)
f=textread('yourbadlyformatted.csv','%s','delimiter','\n'); % cell string array
f=strrep(f,'''',''); % remove the single quotes
fid=fopen('yournewcsvfile.csv','wt'); % make a new file
for i=1:size(f,1)
fprintf(fid,'%s\n',char(f(i));
end
fid=fclose(fid);
B) For really large files will be fast as can get and is somewhat less code
fid=fopen('yourbadlyformatted.csv','r'); % open the bum file
f=fread(fid,'*char'); % read as string of bytes
fclose(fid) % done with it
f(f=='''',''); % remove the quotes again
fid=fopen('newgoodfile.csv','w'); % get ready to write new one
fwrite(fid,f); % and write the remaining bytes
fid=fclose(fid); % done
Now use readtable on the new file with the appropriate formatting string and using the first record as the header line and you likely will have what you wish.
Many thanks again.
I let my grammer get the best of me, guess I was getting annoyed.
I have tried the your option B as I will eventually be working on large files of data ranging from 100 lines to 200 000.
Reason for pulling up the data like this as I will be plotting the analysis and plotting graphs. The analysis will be on datetime and amp to calculate the mean and standard deviation. Then use the data to plot graphs.
The lonog term is to use Matlab interface to upload the file, automatically run the analysis, filter data, run the graphs and edit as needed. So taking small steps for now to get there. I appreciate your help with this.
fid=fopen('fixedtest.csv','r'); % open the bum file
f=fread(fid,'*char'); % read as string of bytes
fclose(fid) % done with it
f(f=='''',''); % remove the quotes again
fid=fopen('newfile.csv','w'); % get ready to write new one
fwrite(fid,f); % and write the remaining bytes
fid=fclose(fid);
When using readtable....
f = readtable('newfile.csv');
This gives an error..
Error using readtable
Cannot interpret data in the file 'newfile.csv'. Found 2 variable names but 1 data columns. You many need to specify a different format string, delimiter, or number of header of header lines.
Just to ask, when using readtable would I have to run this first before using readtable..
fid=fopen('newfile.csv'); % open a copy
hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
Have included the csv file fixedtest.csv as a test bench. my initial plan was to split the data up with the corresponding headings that will be displayed in the workspace, then the data can be manipulated from there.
Reason for having all the data displayed on one page, i thought it would help create a more automated working area and filter out the necessary data that is needed.
By using your option B this outputs f as 1 X 7650 char.
Still struggling to get
tail_no ac_type-ref op_datetime .....etc
xxxxx xxxxxxxxx 2015-04-22 16:59:41.840
xxxxx xxxxxxxxx 2015-04-22 16:59:41.840
xxxxx xxxxxxxxx 2015-04-22 16:59:41.840
xxxxx xxxxxxxxx 2015-04-22 16:59:41.840
By using readtable would this output the above example as:
tail_no 1x4 cell
ac_type_ref 1x4 cell
op_datetime 1x4 cell
Stephen23
Stephen23 on 12 Feb 2016
Edited: Stephen23 on 12 Feb 2016
@Dan Howard: just a tip for using this forum: first select the entire code text, and then click the {} Code button. When you do this backwards you get this filler text:
if true
% code
end
which just makes your code harder to follow. Do you see that you have this text in all of your code? That filler text is not markup and is not required to display text as code. It is intended to show where code can go. (In fact all you really need is to put two spaces at the start of each code line, as this is all the button really does)
Thank you Stephen.
You again don't show what you do, specifically, so I'm left guessing from the description which is difficult at best to infer from but...
A) If you use readtable then, no, you don't first try to read the same file with textscan.
B) "Option B" is simply a filter to first remove the surrounding quote mark in the original file so any file i/o function has a chance't to work. f is not fit for human consumption; it's just a string of bytes w/o those marks to be written to a new file which is then the one you want to use to try to read.
C) The attached file doesn't have those quote marks in it -- so is that because it was the real file and you pasted the former from somewhere else or was pasted from the command window and included the quote marks from Matlab that it puts around a cell string for display and so were never there to begin with? I'm beginning to think the latter. If that were the case, "Option B" will have no effect and therefore isn't needed. OTOH, if the "fixed" in the name implies it is a corrected file and there was another "unfixed" version with the quotes, then if shows the filter did its job and this is the file you should use for input as outlined.
D) I, unfortunately, do not have a release that includes the table class so I can't actually read the file that way here but the above file should be directly importable by it from my understanding of it by reading the doc's.
I sincerely apologise to you sir, I am very new to Matlab and its taking me time to understand. I have clearly made an error somwhere that has messed things up and confused things.
I dont know how to resolve this.
dpb
dpb on 15 Feb 2016
Edited: dpb on 15 Feb 2016
Well, post the code and any errors and the explanation of the uncertainties outlined above, starting with the "real" answer regarding the existence of or lack of the spurious quote marks in the input file...
Ok, thanks dpb.
I think I have a better understanding of what you have said.
By using your method below does exactly what you said and opens and reads a new file:
f=textread('fixedtest.csv','%s','delimiter','\n'); % cell string array
f=strrep(f,'''',''); % remove the single quotes
fid=fopen('newfile.csv','wt'); % make a new file
for i=1:size(f,1)
fprintf(fid,'%s\n',char(f(i));
end
fid=fclose(fid);
Can I ask that when I get the error below, is this due to the file (fixedtest.csv) not having quote marks in each string and numeric value?
1. When i run the code above I get an error stating..
Error using strrep
Cell elements must be character arrays.
Error in filerun (line18)
f=strrep(f,'''','');
2. If i uncomment
%f=strrep(f,'''','');
This gives an error but still opens the newfile.csv.
Error using char
Cell elements must be character arrays.
Error in filerun (line 21)
fprint(fid,'%s\n',char(f(i)));
So I thought I would try using readtable with newfile.csv:
>> readtable('newfile.csv')
ans =
empty O-by-O table
I have included each step in what I have done.
Do you know why this is returning an empty file?
dpb
dpb on 18 Feb 2016
Edited: dpb on 18 Feb 2016
See Walter Roberson's Answer -- he recognized a problem I didn't as I just opened the above file in Excel to see whether it did or did not have the quote marks as given in your very first post. That encoding keeps the Matlab normal i/o functions from functioning correctly.
I can't answer why you're getting the strrep error; it works here w/o error but on trying on the actual file textread only returns a few characters instead of the whole file.
Looking back, I note that you used xlsread originally and successfully read the text; the problem is the data weren't separated by cell in Excel. Thus xlsread can read the file as it is stored so you can either use Walter's converter script/function or stick with an Excel solution.
Probably still the simplest since you have an Excel spreadsheet and your apparent level of familiarity with the tools (assuming you do have Excel and not just the file) is to use the import tool in it when you open the file with it and convert the single cell per row into a regular spreadsheet format and save that then read that file as a "real" .xls file. Then xlsread will return the values separated by column to be imported into the table class.
Or, once you've done the conversion, presuming as noted that it's not something about your OS/locale creating the specific file encoding, if you save the spreadsheet as a .csv file where each column is a cell, then the direct readtable solution should work.
Either way, you've got to work around the Matlab limitation for the file encoding as Walter discovered.
I'm presuming that the resulting file will not also be UTF-16 Little Endian here...if that's inherent in your system for some reason owing to a locale setting, or somesuch, you're going to have much difficulty with Matlab it would seem.

Sign in to comment.

Asked:

on 8 Feb 2016

Commented:

on 21 May 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!