Read and seperate CSV data

Question

0 votes

Hello everyone,

Im back again with a simple question which I cant fathom out.

Have a CSV file that Im writing a script to:

open and read the file
seperate the data i.e have tail in column A, type_ref in column B, date in column C etc
plot against date and amp

% [data, txtdata] = xlsread('Radial.csv');
length = size(data);
length = length(1,1);
rowoffset = 11 % from first csv row of numerical data

I hope this makes sense as I don't know how else to explain this:

The first current line puts the csv into a [64x1] cell but does not seperate the data.

eg.

'tail,type_ref,datetime,seq,seq_no,sys_ref,ref,reg_ref,res_seq,ci_ref,chan_ref,component_ref,passband_ref,amp'
'FPG123,FLYDATA,2013-12-16 11:21:51.110,3,3,FLYDATA Engines,FLYDATA 170 No 1 Engine (9),FLYDATA Engine Ground,7,FLYDATA SERIAL1,FLYDATA No 1 Engine Radial,FLYDATA No 1 Engine Shaft ,FLYDATA No Freq Passband,0.72825962305069'

The above example shows all the headings in one row and column whilse all the other data follows under from row two onwards.

I have been trying to get the data to show in either seperate cells ie:

tail       type_ref     datetime                   seq     etc......... ( 14 columns altogether)
FPG123     FLYDATA      2013-12-16 11:21:51.110    3
FPG126     FLYDATA      2014-12-17 11:28:32.101    3
FPG128     FLYDATA      2015-12-02 11:28:47.100    3

I have tried using the delimiter by calling each heading as in :

% [tail type_ref datetime seq] = csvimport('Radial.csv', columns, [1:4],'delimiter',',')   % only for 4 headings and data

or show all the data on one sheet. Really could appreciate some help on this. Many thanks.

5 Comments
Show 3 older comments Hide 3 older comments

Stephen23 on 9 Feb 2016

Edited: Stephen23 on 9 Feb 2016

Here is an almost identical question, where each line was written as one string between quotation marks (and the user also thought it was a CSV file):

http://www.mathworks.com/matlabcentral/answers/265753-importing-a-csv-to-matlab-delimiter-problem

What tool creates these terrible file formats? Who writes something that can't even get a CSV file right, and is so buggy that it places the whole line within one set of quotation marks (and ignored fields that really were strings). Very puzzling.

dpb on 9 Feb 2016

Edited: dpb on 9 Feb 2016

I don't know, but my first guess would be Excel if somebody pasted the whole thing record-by-record into a column without using the import tool to split columns...I think that's what it would then do on output to csv as each cell would have to be a string.

Well, let's just try...hmmm, ok, that's what happens excepting--this version of Excel, at least, used double quotes instead of single ones as the delimiter. Not sure if that's a user-settable option or dependent on locale, maybe.

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Walter Roberson on 15 Feb 2016

Open in MATLAB Online

1 vote

detect_UTF_encoding.m

You have a problem that your file (at least the fixedtest.csv that you posted) is UTF-16 Little Endian, which is an encoding format that MATLAB does not support. It needs to be translated. I wrote a routine that detects UTF encoding and offers a conversion routine from bytes; I have attached the source.

Using that routine:

input_file = 'fixedtest.csv';
[file_encoding, bytes_per_char, BOM_size, bytes2char] = detect_UTF_encoding(input_file);
if isempty(file_encoding)
   error('No usable input file');
end
fid = fopen(input_file,'rt');
fread(fid, [1, BOM_size], '*uint8');   %skip the Byte Order Mark
thisbuffer = fgets(fid);
extra = mod(length(thisbuffer), bytes_per_char);
if extra ~= 0
  %in little-endian modes, newline would be found in first byte and the 0's after need to be read
  thisbuffer = [thisbuffer, fread(fid, [1, bytes_per_char - extra], '*uint8')];
end
thisline = bytes2char(thisbuffer);
data_cell = textscan(thisline, '%s', 'delimiter', ',');   %will ignore the end of lines
header_fields = reshape(data_cell{1}, 1, []);
num_field = length(header_fields);
thisbuffer = fread(fid, [1 inf], '*uint8');
extra = mod(length(thisbuffer), bytes_per_char);
if extra ~= 0
  thisbuffer = [thisbuffer, zeros(1, bytes_per_char - extra, 'uint8')];
end
thisline = bytes2char(thisbuffer);
fmt = repmat('%s', 1, num_field);
data_cell = textscan(thisline, fmt, 'delimiter', ',');
data_fields_text = horzcat(data_cell{:});
data_fields = data_fields_text;
data_fields(:,3) = num2cell(datenum(data_fields_text(:,3)));   %or use datetime() instead of datenum
data_fields(:,4) = num2cell(str2double(data_fields_text(:,4)));
data_fields(:,5) = num2cell(str2double(data_fields_text(:,5)));
data_fields(:,9) = num2cell(str2double(data_fields_text(:,9)));
data_fields(:,13) = num2cell(str2double(data_fields_text(:,13)));
data_table = cell2table(data_fields, 'VariableNames', header_fields);

4 Comments
Show 2 older comments Hide 2 older comments

Dan Howard on 17 Feb 2016

Edited: Dan Howard on 17 Feb 2016

Open in MATLAB Online

Good morning Walter, many thanks for your help on this. Need to ask a couple of questions.

Just to make sure, i am running Matlab version 2015b.

I have run your routine to detect UTF encoding and need to make sure I am running this correctly (please excuse my slow brain).

The first line of the source code:

function [encoding, bytes_per_char, BOM_size, byte2char] = detect_UTF_encoding(FILENAME)

With the routine, when you press the run button do you enter the FILENAME 'fixedtest.csv'.

Question 1

I have tried this and this error comes up :-

>> Unicodedecoder( 'fixedtest.csv')
Error: File: Unicodedecoder.m Line: 69 Column: 1
This statement is not inside any function.
 (It follows the END that terminates the definition of the function "detect_UTF_encoding".)

Reason for asking as I have tried the filename that 'fixedtest.csv' is currently stored in and this spits out...

undefined function or variable.

Question 2

With the sourse code in place do i include and run the second routine you have included i.e

   input_file = 'fixedtest.csv';
[file_encoding, bytes_per_char, BOM_size, bytes2char] = detect_UTF_encoding(input_file);
if isempty(file_encoding)
   error('No usable input file');
end
etc...etc

Does this code run after the first source?

Reason why I ask a silly question is that

input_file

comes up with

Parse error at 'input_file': usage might be invalid MATLAB syntax.

When I run the whole sourse the following error comes up:

   >> Unicodedecoder('fixedtest.csv')
   Error: File unicodedecoder.m Line 70 Column: 1
   This statment is not inside any function.
     (It follwos the END that terminates the definition of the function " Unicodedecoder".)

I hope I have explained the issues I am having, again I thank you for your help.

Walter Roberson on 2 Mar 2016

See a better method http://uk.mathworks.com/matlabcentral/answers/270857-how-to-read-16-bit-text-with-matlab#answer_211879

Noam Greenboim on 21 May 2020

Open in MATLAB Online

Here is a simple way I made to detect and write BOM to unicode files:

https://www.mathworks.com/matlabcentral/fileexchange/75708-update-bom-for-unicode-text-files

Usage:

CurrentBOM = BOM(UnicodeFileName, Encoding, FileToWrite)

Sign in to comment.

Answer 2

dpb on 8 Feb 2016

Edited: dpb on 8 Feb 2016

Open in MATLAB Online

0 votes

Your basic problem is the file was created as a text file of multiple strings which include the comma delimiters within them, not as a csv file, strictly speaking.

But, you can deal with it fairly simply...

>> fid=fopen('dan.csv');  % open a copy of your two lines
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
hdr = 
  {14x1 cell}
>> hdr{:}
ans = 
  'tail'
  'type_ref'
  'datetime'
  ...[elided middle values for brevity]...
  'passband_ref'
  'amp''
>> dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
dat = 
  {1x13 cell}    [7.2826e-01]
>> fid=fclose(fid);

Only tricky part is the addition of the single quote mark into the whitespace list; otherwise you'll end up with an extra tick for the first and last header names and potentially the last floating point value will error (altho I didn't test it; I'm sure it'll work for the first record, whether the internal record position will be ok to do subsequent lines or not I don't know for certain. The sure way is to ensure it's skipped as above.)

16 Comments
Show 14 older comments Hide 14 older comments

Dan Howard on 9 Feb 2016

Edited: dpb on 9 Feb 2016

Open in MATLAB Online

Many thanks dpb,

Im rather new to Matlab so trying to understand your functions in your code.

I have run the code breaking down each section but there are a few errors and whilst 2015a version helps you fix this Im not sure what the code is doing.

Apologies for asking stupid questions, shall break down each below:

Question one.

First part I understand, does it make a difference if you use fid or fileID?

% fid = fopen('dan.csv'); 
 Question two.

hdr, does this refer to reading the column header from the worksheet? Could you brielfly explain what this pience of code is doing? Unfortunately Matlab threw an error here stating:

(Error using _textscan_
Invalid file identifier. Use fopen to generate a valid file identifier.) - here used fopen but failed with error stating too many input arguments.
% hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
 Question three.

Could you explain what the code is doing in a little more detail?

% dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' '''']

I sincerely apologize for being rather ignorant and I'm guessing I may have swum straight into the deep end.

Again, I thank you for your help.

dpb on 9 Feb 2016

Open in MATLAB Online

The choice of variable name to hold the file handle (the result of fopen) is up to you as is naming any variable (ignoring the problem of aliasing supplied functions, etc., for the time being as a distraction). You do, however, have to use the same variable later as you choose initially to have the result (which I'd think obvious?).

I generally opt for shorter names, particularly for temporary variables or for ones which are kinda' common like x, y for data to plot, for example. Hence, I almost always start with fid for a file handle and modify it w/ a trailing character/number if need more than one at a time. You can use whatever naming convention seems most natural to you; there's no absolute right or wrong here.

Your error indicates that fopen failed or that you did use a different variable than fid; don't know which as you didn't show the code in sequence as you ran it. I note you've got the lines commented above so they weren't executed as is...make sure first you do get the file opened by testing the return value from fopen using whatever variable you chose.

Yes the textscan line is returning the header line from the file -- as noted, I took the two sample lines you posted and put them in a file and the above line reads the first line of 14 character strings and stores them in a cell array hdr.

dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' '''']

The last line reads the next (and subsequent if there were more data in the file) records and returns another cell array which will consist of 13 cellstring values plus one double per record. The

[repmat('%s',1,13) '%f']

builds the format string of 13 string values followed by a floating point number that matches the file record structure. 'delimiter',',' should be obvious; it's telling textscan it's a comma-delimited file. 'whitespace',[' \b\t' ''''] is just a little tricky, but it's the "fixup" for the badly-formed file format you have of each record being a single string, not a csv file. It says "white space" characters are the default space, backspace, and tab plus a single quote mark. Since Matlab uses the single quote to delimit strings, to have a string containing one, you must "double-up" it and then surround it by the delimiters. That ends up with four single quotes in a row to result in one. Fortunately, as the Dorito ad says, "We'll make more!"

Read the documentation for fopen and friends and textscan carefully as well as looking up the details of the other functions...it's straightforward but is, granted, a lot of detail to absorb initially. The unfortunate thing is that 'puters despite all the amenities of RAD systems like Matlab, still are very picky about the details. Also, whoever saddled you with the file structure as is didn't do you any favors; as IA notes it would be better if it were generated as a "real" csv file.

dpb on 9 Feb 2016

Edited: dpb on 9 Feb 2016

Open in MATLAB Online

By default textscan returns the character cell array as an Nx1 column array as shown by the result at the command line example I gave earlier...

>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
hdr = 
{14x1 cell}
>>

If you want it oriented the other way, then append the transpose operator .' after the call--

>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1).'
hdr = 
{1x1 cell}
>>

Then you get one cell but it contains a subarray of 14 elements so you address them as, for example,

>> hdr{1}
ans = 
Columns 1 through 11
  'tail'    'type_ref'    'datetime'    'seq'    'seq_no'    'sys_ref'    'ref'    'reg_ref'    'res_seq'    'ci_ref'    'chan_ref'
Columns 12 through 14
  'component_ref'    'passband_ref'    'amp''
>> hdr{1}(1)
ans = 
  'tail'
>>

As for display, Matlab is not a spreadsheet. I think the newer releases with the table class have some presentation features but I have only to R2012b installed here so can only refer you to the documentation...

ADDENDUM I note that even the 'whitespace' argument didn't remove the trailing quote on the header line. That appears to be a bug in textscan or perhaps it's another one of those inscrutable C fscanf rules that it behaves "as expected" by the C Standard even though it's not what one would expect from the syntax.

I also tried the explicit format statement to ignore a specific character and it had the same result--

q=char(39);                   % ' character shorthand
fmt=[q repmat('%s',1,14) q];  % format string to skip leading/trailing "'"

It works for the first column header but not for the last; if that's an annoyance looks like you'll have to do a strrep on that element in the header (or, of course, fix the input file format)

dpb on 11 Feb 2016

Edited: dpb on 11 Feb 2016

Open in MATLAB Online

Don't know what "working in terms of headings" means, sorry. What specific steps have you taken?

It still appears you're trying to turn Matlab into a spreadsheet -- what is the end objective you're trying to accomplish? If it's to do some analysis on the data, then keeping that data as numeric and doing whatever on it is the way to proceed. If it's to output for human consumption a printed or displayed table, that's something else again.

ADDENDUM On reflection, I presume you likely mean that you did get the header strings imported into a table and wonder about the data to go with it. I think the best thing since your data are mixed up between categorical and numeric is to fix the original file. Then you should be able to read it with readtable and manipulate it. I don't, however, have a version recent enough to include it so can't test all the way through. However, it's pretty simple to fix up the file if you can't create it correctly to begin with, just read it as is and remove the delimiting single quotes and rewrite...there are two ways to do this, one is to treat it as text and read as cell strings, the other will be faster on larger files just treats the content as bytes...

A)

f=textread('yourbadlyformatted.csv','%s','delimiter','\n');  % cell string array
f=strrep(f,'''','');   % remove the single quotes
fid=fopen('yournewcsvfile.csv','wt');  % make a new file 
for i=1:size(f,1)
  fprintf(fid,'%s\n',char(f(i));
end
fid=fclose(fid);

B) For really large files will be fast as can get and is somewhat less code

fid=fopen('yourbadlyformatted.csv','r');  % open the bum file
f=fread(fid,'*char');                     % read as string of bytes
fclose(fid)                               % done with it
f(f=='''','');                            % remove the quotes again
fid=fopen('newgoodfile.csv','w');         % get ready to write new one
fwrite(fid,f);                            % and write the remaining bytes
fid=fclose(fid);                          % done

Now use readtable on the new file with the appropriate formatting string and using the first record as the header line and you likely will have what you wish.

Dan Howard on 12 Feb 2016

Edited: Dan Howard on 12 Feb 2016

Open in MATLAB Online

fixedtest.csv

Many thanks again.

I let my grammer get the best of me, guess I was getting annoyed.

I have tried the your option B as I will eventually be working on large files of data ranging from 100 lines to 200 000.

Reason for pulling up the data like this as I will be plotting the analysis and plotting graphs. The analysis will be on datetime and amp to calculate the mean and standard deviation. Then use the data to plot graphs.

The lonog term is to use Matlab interface to upload the file, automatically run the analysis, filter data, run the graphs and edit as needed. So taking small steps for now to get there. I appreciate your help with this.

fid=fopen('fixedtest.csv','r');           % open the bum file
f=fread(fid,'*char');                     % read as string of bytes
fclose(fid)                               % done with it
f(f=='''','');                            % remove the quotes again
fid=fopen('newfile.csv','w');             % get ready to write new one
fwrite(fid,f);                            % and write the remaining bytes
fid=fclose(fid);

When using readtable....

f = readtable('newfile.csv');

This gives an error..

     Error using readtable
     Cannot interpret data in the file 'newfile.csv'. Found 2 variable names but 1 data columns. You many need to specify a different format string, delimiter, or number of header of header lines.

Just to ask, when using readtable would I have to run this first before using readtable..

   fid=fopen('newfile.csv');  % open a copy 
   hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
   dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',['   \b\t' ''''],'collectoutput',1)

Have included the csv file fixedtest.csv as a test bench. my initial plan was to split the data up with the corresponding headings that will be displayed in the workspace, then the data can be manipulated from there.

Reason for having all the data displayed on one page, i thought it would help create a more automated working area and filter out the necessary data that is needed.

dpb on 12 Feb 2016

You again don't show what you do, specifically, so I'm left guessing from the description which is difficult at best to infer from but...

A) If you use readtable then, no, you don't first try to read the same file with textscan.

B) "Option B" is simply a filter to first remove the surrounding quote mark in the original file so any file i/o function has a chance't to work. f is not fit for human consumption; it's just a string of bytes w/o those marks to be written to a new file which is then the one you want to use to try to read.

C) The attached file doesn't have those quote marks in it -- so is that because it was the real file and you pasted the former from somewhere else or was pasted from the command window and included the quote marks from Matlab that it puts around a cell string for display and so were never there to begin with? I'm beginning to think the latter. If that were the case, "Option B" will have no effect and therefore isn't needed. OTOH, if the "fixed" in the name implies it is a corrected file and there was another "unfixed" version with the quotes, then if shows the filter did its job and this is the file you should use for input as outlined.

D) I, unfortunately, do not have a release that includes the table class so I can't actually read the file that way here but the above file should be directly importable by it from my understanding of it by reading the doc's.

Dan Howard on 18 Feb 2016

Open in MATLAB Online

Ok, thanks dpb.

I think I have a better understanding of what you have said.

By using your method below does exactly what you said and opens and reads a new file:

f=textread('fixedtest.csv','%s','delimiter','\n');  % cell string array
f=strrep(f,'''','');   % remove the single quotes
fid=fopen('newfile.csv','wt');  % make a new file 
for i=1:size(f,1)
fprintf(fid,'%s\n',char(f(i));
end
fid=fclose(fid);

Can I ask that when I get the error below, is this due to the file (fixedtest.csv) not having quote marks in each string and numeric value?

1. When i run the code above I get an error stating..

Error using strrep
Cell elements must be character arrays.
Error in filerun (line18)
f=strrep(f,'''','');

2. If i uncomment

%f=strrep(f,'''','');

This gives an error but still opens the newfile.csv.

Error using char
Cell elements must be character arrays.
Error in filerun (line 21)
fprint(fid,'%s\n',char(f(i)));

So I thought I would try using readtable with newfile.csv:

>> readtable('newfile.csv')
ans = 
empty O-by-O table

I have included each step in what I have done.

Do you know why this is returning an empty file?

dpb on 18 Feb 2016

Edited: dpb on 18 Feb 2016

See Walter Roberson's Answer -- he recognized a problem I didn't as I just opened the above file in Excel to see whether it did or did not have the quote marks as given in your very first post. That encoding keeps the Matlab normal i/o functions from functioning correctly.

I can't answer why you're getting the strrep error; it works here w/o error but on trying on the actual file textread only returns a few characters instead of the whole file.

Looking back, I note that you used xlsread originally and successfully read the text; the problem is the data weren't separated by cell in Excel. Thus xlsread can read the file as it is stored so you can either use Walter's converter script/function or stick with an Excel solution.

Probably still the simplest since you have an Excel spreadsheet and your apparent level of familiarity with the tools (assuming you do have Excel and not just the file) is to use the import tool in it when you open the file with it and convert the single cell per row into a regular spreadsheet format and save that then read that file as a "real" .xls file. Then xlsread will return the values separated by column to be imported into the table class.

Or, once you've done the conversion, presuming as noted that it's not something about your OS/locale creating the specific file encoding, if you save the spreadsheet as a .csv file where each column is a cell, then the direct readtable solution should work.

Either way, you've got to work around the Matlab limitation for the file encoding as Walter discovered.

I'm presuming that the resulting file will not also be UTF-16 Little Endian here...if that's inherent in your system for some reason owing to a locale setting, or somesuch, you're going to have much difficulty with Matlab it would seem.

Sign in to comment.

Read and seperate CSV data

5 Comments
Show 3 older comments Hide 3 older comments

Answers (2)

4 Comments
Show 2 older comments Hide 2 older comments

16 Comments
Show 14 older comments Hide 14 older comments

Categories

Tags

Community Treasure Hunt

Read and seperate CSV data

5 Comments Show 3 older comments Hide 3 older comments

Answers (2)

4 Comments Show 2 older comments Hide 2 older comments

16 Comments Show 14 older comments Hide 14 older comments

Categories

Tags

See Also

Community Treasure Hunt

5 Comments
Show 3 older comments Hide 3 older comments

4 Comments
Show 2 older comments Hide 2 older comments

16 Comments
Show 14 older comments Hide 14 older comments