Read and separate CSV data
Hello everyone,
I'm back again with a simple question which I can't fathom out.
I have a CSV file that I'm writing a script to:
- open and read the file
- separate the data, i.e. have tail in column A, type_ref in column B, date in column C, etc.
- plot amp against date
% [data, txtdata] = xlsread('Radial.csv');
nrows = size(data, 1); % number of data rows (avoids shadowing the built-in length)
rowoffset = 11; % from first csv row of numerical data
I hope this makes sense, as I don't know how else to explain it:
The xlsread line above puts the csv into a [64x1] cell but does not separate the data, e.g.
'tail,type_ref,datetime,seq,seq_no,sys_ref,ref,reg_ref,res_seq,ci_ref,chan_ref,component_ref,passband_ref,amp'
'FPG123,FLYDATA,2013-12-16 11:21:51.110,3,3,FLYDATA Engines,FLYDATA 170 No 1 Engine (9),FLYDATA Engine Ground,7,FLYDATA SERIAL1,FLYDATA No 1 Engine Radial,FLYDATA No 1 Engine Shaft ,FLYDATA No Freq Passband,0.72825962305069'
The example above shows all the headings in one row and column, while all the other data follows underneath from row two onwards.
I have been trying to get the data to show in separate cells, i.e.:
tail type_ref datetime seq etc......... ( 14 columns altogether)
FPG123 FLYDATA 2013-12-16 11:21:51.110 3
FPG126 FLYDATA 2014-12-17 11:28:32.101 3
FPG128 FLYDATA 2015-12-02 11:28:47.100 3
I have tried using the delimiter by calling each heading, as in:
% [tail, type_ref, datetime, seq] = csvimport('Radial.csv', 'columns', 1:4, 'delimiter', ',') % only the first 4 headings and data
or to show all the data on one sheet. I'd really appreciate some help on this. Many thanks.
5 Comments
dpb
on 8 Feb 2016
I can't find a Matlab function called csvimport...
Walter Roberson
on 8 Feb 2016
Please attach a sample csv file, the first few lines.
Image Analyst
on 8 Feb 2016
Edited: Image Analyst
on 8 Feb 2016
Can you try to avoid having the whole line enclosed in single quotes? If so, then try readtable(). If not, then perhaps textscan() with the proper format specifier string. Attach the file so people might try some things.
Here is an almost identical question, where each line was written as one string between quotation marks (and the user also thought it was a CSV file):
What tool creates these terrible file formats? Who writes something that can't even get a CSV file right, and is so buggy that it places the whole line within one set of quotation marks (and ignored fields that really were strings). Very puzzling.
I don't know, but my first guess would be Excel if somebody pasted the whole thing record-by-record into a column without using the import tool to split columns...I think that's what it would then do on output to csv as each cell would have to be a string.
Well, let's just try... hmm, OK, that's what happens, except that this version of Excel, at least, used double quotes instead of single ones as the delimiter. Not sure whether that's a user-settable option or locale-dependent, maybe.
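For a well-formed CSV (one without the stray quotes), the readtable() route mentioned above is essentially a one-liner; a minimal sketch, with the file name from the question assumed:

```matlab
% Sketch only -- assumes 'Radial.csv' is a properly delimited csv file.
T = readtable('Radial.csv');   % header row becomes the variable names
% or, with textscan and an explicit format (13 strings plus one double):
fid = fopen('Radial.csv', 'rt');
hdr = textscan(fid, '%s', 14, 'delimiter', ',');
dat = textscan(fid, [repmat('%s',1,13) '%f'], 'delimiter', ',');
fclose(fid);
```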
Answers (2)
Walter Roberson
on 15 Feb 2016
You have a problem that your file (at least the fixedtest.csv that you posted) is UTF-16 Little Endian, which is an encoding format that MATLAB does not support. It needs to be translated. I wrote a routine that detects UTF encoding and offers a conversion routine from bytes; I have attached the source.
Using that routine:
input_file = 'fixedtest.csv';
[file_encoding, bytes_per_char, BOM_size, bytes2char] = detect_UTF_encoding(input_file);
if isempty(file_encoding)
error('No usable input file');
end
fid = fopen(input_file,'rt');
fread(fid, [1, BOM_size], '*uint8'); %skip the Byte Order Mark
thisbuffer = fgets(fid);
extra = mod(length(thisbuffer), bytes_per_char);
if extra ~= 0
%in little-endian modes, newline would be found in first byte and the 0's after need to be read
thisbuffer = [thisbuffer, fread(fid, [1, bytes_per_char - extra], '*uint8')];
end
thisline = bytes2char(thisbuffer);
data_cell = textscan(thisline, '%s', 'delimiter', ','); %will ignore the end of lines
header_fields = reshape(data_cell{1}, 1, []);
num_field = length(header_fields);
thisbuffer = fread(fid, [1 inf], '*uint8');
extra = mod(length(thisbuffer), bytes_per_char);
if extra ~= 0
thisbuffer = [thisbuffer, zeros(1, bytes_per_char - extra, 'uint8')];
end
thisline = bytes2char(thisbuffer);
fmt = repmat('%s', 1, num_field);
data_cell = textscan(thisline, fmt, 'delimiter', ',');
data_fields_text = horzcat(data_cell{:});
data_fields = data_fields_text;
data_fields(:,3) = num2cell(datenum(data_fields_text(:,3))); %or use datetime() instead of datenum
data_fields(:,4) = num2cell(str2double(data_fields_text(:,4)));
data_fields(:,5) = num2cell(str2double(data_fields_text(:,5)));
data_fields(:,9) = num2cell(str2double(data_fields_text(:,9)));
data_fields(:,14) = num2cell(str2double(data_fields_text(:,14))); %amp, the numeric measurement column
data_table = cell2table(data_fields, 'VariableNames', header_fields);
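With the fields parsed, the plot the question asks for (amp against date) would be along these lines; a sketch only, using the data_fields_text cell array from above and assuming amp is the last column:

```matlab
% Sketch: plot amp against date from the parsed text columns.
t   = datenum(data_fields_text(:,3));     % datetime column as datenums
amp = str2double(data_fields_text(:,14)); % amp, the final column
plot(t, amp, '-o');
datetick('x');
ylabel('amp');
```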
4 Comments
Dan Howard
on 17 Feb 2016
Edited: Dan Howard
on 17 Feb 2016
Walter Roberson
on 29 Feb 2016
"Unicodedecoder" is not my source code. My detect_UTF_encoding code only has 59 lines and the script I posted only has 41 lines. If you stored that script into Unicodedecoder.m then you must have added a "function" line to it as what I posted was a script that would not take arguments.
If you did create such a function, and if you put detect_UTF_encoding into the same file instead of its own file, then you might not have created your function properly. In particular, my detect_UTF_encoding.m uses the newer syntax in which there is an "end" matching each "function" line; if you pasted that code onto the bottom of a "function" you created that did not use that syntax, you could get that error about the end being out of place (though a more accurate error would have been about attempting to mix the two function styles). Just leave it in its own file and you will not have to worry about which syntax it uses.
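The "end"-per-function style Walter refers to looks like this minimal sketch (hypothetical function names); within any one file, either every function closes with an end or none does, and mixing the two styles is an error:

```matlab
% Every function in this file closes with 'end' (the newer style).
function demo_main()
    disp(double_it(21));
end
function y = double_it(x)
    y = 2 * x;
end
```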
Walter Roberson
on 2 Mar 2016
Noam Greenboim
on 21 May 2020
Here is a simple way I made to detect and write BOM to unicode files:
Usage:
CurrentBOM = BOM(UnicodeFileName, Encoding, FileToWrite)
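The core of BOM detection is just inspecting the first bytes of the file; a minimal sketch (not Noam's or Walter's actual code), with the file name assumed from earlier in the thread:

```matlab
% Hypothetical check: does the file start with a UTF-16 LE BOM (0xFF 0xFE)?
fid = fopen('fixedtest.csv', 'r');
firstBytes = fread(fid, [1 2], '*uint8');
fclose(fid);
if isequal(firstBytes, uint8([255 254]))
    disp('UTF-16 Little Endian BOM detected');
end
```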
Your basic problem is that the file was created as a text file of strings which include the comma delimiters within them, not as a csv file, strictly speaking.
But, you can deal with it fairly simply...
>> fid=fopen('dan.csv'); % open a copy of your two lines
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
hdr =
{14x1 cell}
>> hdr{:}
ans =
'tail'
'type_ref'
'datetime'
...[elided middle values for brevity]...
'passband_ref'
'amp''
>> dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
dat =
{1x13 cell} [7.2826e-01]
>> fid=fclose(fid);
The only tricky part is the addition of the single quote mark to the whitespace list; otherwise you'll end up with an extra tick on the first and last header names, and the last floating-point value could potentially error (although I didn't test it; I'm sure it'll work for the first record, but whether the internal record position will be OK for subsequent lines I don't know for certain. The sure way is to ensure the quote is skipped, as above).
16 Comments
Dan Howard
on 9 Feb 2016
Edited: dpb
on 9 Feb 2016
dpb
on 9 Feb 2016
The choice of variable name to hold the file handle (the result of fopen) is up to you, as with naming any variable (ignoring, for the time being, the distraction of aliasing supplied functions). You do, however, have to use the same variable later that you initially chose to hold the result (which I'd think obvious?).
I generally opt for shorter names, particularly for temporary variables or for ones which are kinda' common, like x, y for data to plot. Hence I almost always start with fid for a file handle and modify it with a trailing character/number if I need more than one at a time. You can use whatever naming convention seems most natural to you; there's no absolute right or wrong here.
Your error indicates either that fopen failed or that you used a different variable than fid; I don't know which, as you didn't show the code in the sequence you ran it. I note you've got the lines commented above, so they weren't executed as-is... first make sure you actually get the file opened, by testing the return value from fopen in whatever variable you chose.
Yes the textscan line is returning the header line from the file -- as noted, I took the two sample lines you posted and put them in a file and the above line reads the first line of 14 character strings and stores them in a cell array hdr.
dat=textscan(fid,[repmat('%s',1,13) '%f'],'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
This last line reads the next record (and subsequent ones, if there were more data in the file) and returns another cell array consisting of 13 cellstring values plus one double per record. The
[repmat('%s',1,13) '%f']
builds the format string of 13 string values followed by a floating-point number, matching the file's record structure. 'delimiter',',' should be obvious; it tells textscan it's a comma-delimited file. 'whitespace',[' \b\t' ''''] is just a little tricky, but it's the "fixup" for the badly-formed file format you have, where each record is a single string rather than proper csv fields. It says the "white space" characters are the default space, backspace, and tab, plus a single quote mark. Since Matlab uses the single quote to delimit strings, to have a string containing one you must double it up and then surround it with the delimiters; that ends up with four single quotes in a row to produce one. Fortunately, as the Dorito ad says, "We'll make more!"
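The quote-doubling rule described above can be checked directly at the command line:

```matlab
s = '''';             % four quotes in the source produce one quote character
length(s)             % ans = 1
ws = [' \b\t' ''''];  % the whitespace list: space, \b, \t, plus '
```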
Read the documentation for fopen and friends and for textscan carefully, and look up the details of the other functions... it's straightforward but, granted, a lot of detail to absorb initially. The unfortunate thing is that computers, despite all the amenities of RAD systems like Matlab, are still very picky about the details. Also, whoever saddled you with the file structure as-is didn't do you any favors; as IA notes, it would be better if it were generated as a "real" csv file.
Dan Howard
on 9 Feb 2016
By default textscan returns the character cell array as an Nx1 column array as shown by the result at the command line example I gave earlier...
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1)
hdr =
{14x1 cell}
>>
If you want it oriented the other way, then append the transpose operator .' after the call--
>> hdr=textscan(fid,'%s',14,'delimiter',',','whitespace',[' \b\t' ''''],'collectoutput',1).'
hdr =
{1x1 cell}
>>
Then you get one cell but it contains a subarray of 14 elements so you address them as, for example,
>> hdr{1}
ans =
Columns 1 through 11
'tail' 'type_ref' 'datetime' 'seq' 'seq_no' 'sys_ref' 'ref' 'reg_ref' 'res_seq' 'ci_ref' 'chan_ref'
Columns 12 through 14
'component_ref' 'passband_ref' 'amp''
>> hdr{1}(1)
ans =
'tail'
>>
As for display, Matlab is not a spreadsheet. I think the newer releases with the table class have some presentation features, but I only have R2012b installed here, so I can only refer you to the documentation...
ADDENDUM: I note that even the 'whitespace' argument didn't remove the trailing quote on the header line. That appears to be a bug in textscan, or perhaps it's another of those inscrutable C fscanf rules where it behaves "as expected" by the C Standard even though it's not what one would expect from the syntax.
I also tried the explicit format statement to ignore a specific character and it had the same result--
q=char(39); % ' character shorthand
fmt=[q repmat('%s',1,14) q]; % format string to skip leading/trailing "'"
It works for the first column header but not for the last; if that's an annoyance, it looks like you'll have to do a strrep on that element of the header (or, of course, fix the input file format).
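That strrep fix-up might look like the following sketch, assuming the hdr cell array returned by the earlier textscan call:

```matlab
% Remove any stray single quotes left in the header names.
hdr{1} = strrep(hdr{1}, '''', '');   % e.g. the trailing tick on 'amp'
```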
Dan Howard
on 11 Feb 2016
Dan Howard
on 11 Feb 2016
Don't know what "working in terms of headings" means, sorry. What specific steps have you taken?
It still appears you're trying to turn Matlab into a spreadsheet -- what is the end objective you're trying to accomplish? If it's to do some analysis on the data, then keeping that data as numeric and doing whatever on it is the way to proceed. If it's to output for human consumption a printed or displayed table, that's something else again.
ADDENDUM: On reflection, I presume you likely mean that you did get the header strings imported into a table and are wondering about the data to go with it. Since your data are mixed between categorical and numeric, I think the best thing is to fix the original file. Then you should be able to read it with readtable and manipulate it. I don't, however, have a version recent enough to include readtable, so I can't test all the way through. It's pretty simple to fix up the file if you can't create it correctly to begin with: just read it as-is, remove the delimiting single quotes, and rewrite. There are two ways to do this: one treats the content as text and reads cell strings; the other, which will be faster on larger files, just treats the content as bytes...
A)
f=textread('yourbadlyformatted.csv','%s','delimiter','\n'); % cell string array, one line per cell
f=strrep(f,'''',''); % remove the single quotes
fid=fopen('yournewcsvfile.csv','wt'); % make a new file
for i=1:size(f,1)
fprintf(fid,'%s\n',f{i}); % write each cleaned line back out
end
fid=fclose(fid);
B) For really large files this will be about as fast as it gets, and is somewhat less code:
fid=fopen('yourbadlyformatted.csv','r'); % open the bum file
f=fread(fid,'*char'); % read as string of bytes
fclose(fid); % done with it
f(f=='''')=[]; % remove the quotes again
fid=fopen('newgoodfile.csv','w'); % get ready to write new one
fwrite(fid,f); % and write the remaining bytes
fid=fclose(fid); % done
Now use readtable on the new file with the appropriate formatting string and using the first record as the header line and you likely will have what you wish.
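Once the file is rewritten without the stray quotes, the final read and plot is short; a sketch, with the new file name from above and the column names assumed from the header row:

```matlab
% Sketch: read the cleaned file and plot amp against date.
T = readtable('yournewcsvfile.csv');   % first record becomes the header line
plot(datenum(T.datetime), T.amp, '-o');
datetick('x');
ylabel('amp');
```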
Dan Howard
on 12 Feb 2016
Edited: Dan Howard
on 12 Feb 2016
Dan Howard
on 12 Feb 2016
Edited: Dan Howard
on 12 Feb 2016
@Dan Howard: just a tip for using this forum: first select the entire code text, and then click the {} Code button. When you do this backwards you get this filler text:
if true
% code
end
which just makes your code harder to follow. Do you see that you have this text in all of your code? That filler text is not markup and is not required to display text as code; it is only intended to show where code can go. (In fact, all you really need is to put two spaces at the start of each code line, as that is all the button really does.)
Dan Howard
on 12 Feb 2016
dpb
on 12 Feb 2016
Again you don't show specifically what you do, so I'm left guessing from the description, which is difficult at best, but...
A) If you use readtable then, no, you don't first try to read the same file with textscan.
B) "Option B" is simply a filter to first remove the surrounding quote marks in the original file so any file I/O function has a chance to work. f is not fit for human consumption; it's just the string of bytes, without those marks, to be written to a new file, which is then the one you want to try to read.
C) The attached file doesn't have those quote marks in it. Is that because it is the real file and you pasted the earlier sample from somewhere else, or was the sample pasted from the command window, so that it included the quote marks Matlab puts around a cell string for display and they were never in the file to begin with? I'm beginning to think the latter. If that's the case, "Option B" has no effect and isn't needed. On the other hand, if the "fixed" in the name implies this is a corrected file and there was another "unfixed" version with the quotes, then the filter did its job and this is the file you should use for input, as outlined.
D) Unfortunately I do not have a release that includes the table class, so I can't actually read the file that way here, but from my reading of the docs the above file should be directly importable by it.
Dan Howard
on 15 Feb 2016
Dan Howard
on 18 Feb 2016
See Walter Roberson's Answer -- he recognized a problem I didn't, as I had just opened the above file in Excel to see whether it did or did not have the quote marks shown in your very first post. That encoding keeps Matlab's normal I/O functions from working correctly.
I can't answer why you're getting the strrep error; it works here without error, but on the actual file textread only returns a few characters instead of the whole file.
Looking back, I note that you used xlsread originally and successfully read the text; the problem was that the data weren't separated by cell in Excel. So xlsread can read the file as stored, and you can either use Walter's converter script/function or stick with an Excel solution.
Probably simplest, since you have an Excel spreadsheet, and given your apparent familiarity with the tools (assuming you do have Excel and not just the file), is to use Excel's import tool when you open the file: convert the single cell per row into a regular spreadsheet format, save that, and then read the result as a "real" .xls file. xlsread will then return the values separated by column, ready to be imported into the table class.
Or, once you've done the conversion (presuming, as noted, that it's not something about your OS/locale creating the specific file encoding), if you save the spreadsheet as a .csv file where each column is a cell, the direct readtable solution should work.
Either way, you've got to work around the Matlab limitation for the file encoding as Walter discovered.
I'm presuming that the resulting file will not also be UTF-16 Little Endian here...if that's inherent in your system for some reason owing to a locale setting, or somesuch, you're going to have much difficulty with Matlab it would seem.