Extract rectangular data from a non-rectangular file with header and convert to a structure of column vectors where field names are the second row of the rectangular data

I am trying to read a text file that has a header of varying length due to some options that can be turned on. Below the header is rectangular data.
The first row of the rectangular data is unimportant to me and can be removed. The second row contains information that corresponds with the columns below it. I would like each of the strings in the second row of rectangular data to become field names for my structure.
Then I would like the corresponding numbers in the columns of data from the third line of rectangular data until the end to be vectors that are added into each field.
I have attached a shortened sample file that I am trying to perform this on to no avail. The actual data file has 172 columns (this can vary depending on the parameters selected) and is ~50k rows long (can also vary). I have tried writing a loop using fgetl and strsplit, which seems to be a usable option, but it is incredibly slow. Textscan seems to be a much faster option, but I am really struggling to figure out how to use its options to make this work.
So far, I don't have much working with textscan.
fid = fopen('sample_text.txt');
C = textscan(fid,'%*s','Delimiter', '\n','CollectOutput', true);
fclose(fid);
Right now, this returns an empty array, and I'm not quite sure what it is actually doing. I just pulled it from the example on using textscan for non-rectangular data. Any help or direction would be much appreciated.
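For reference, the `*` in `%*s` tells textscan to read a field and discard it, which is consistent with the empty result described above. A minimal sketch of the difference, run on a made-up character vector rather than the attached sample file (the variable names `content`, `skipped`, `kept` and `txt` are only illustrative):

```matlab
% Made-up stand-in for the file content (not the attached sample).
content = sprintf('header line 1\nheader line 2\n1 2 3\n4 5 6\n');

% '%*s' reads each field and discards it, so no output is produced.
skipped = textscan(content, '%*s', 'Delimiter', '\n');

% '%s' keeps each line as a character vector.
kept = textscan(content, '%s', 'Delimiter', '\n');
txt  = kept{1};     % cell array with one entry per line
```

The same two calls should behave the same way with a file identifier from fopen in place of `content`.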
  6 Comments
Shawn on 19 Oct 2017
I have used the profiler. It also pops up if you click Run and Time. I don't like it because I find it incredibly hard to read. I have attached the outputs from running version 2 of my code using your method and version 3 using my method. I typically use tic and toc to compare segments of code, as I find that method much easier to parse. The profiler results do show that much of the time is spent in calls to strsplit. I also noticed a good 6-7 seconds of variability in timing between runs, which I find surprising for something that takes on the order of 25 seconds to run.
The comparison is somewhat skewed since you have not presented a method to find the number of header lines. It is just hard-coded, so I had to use my method for that. The comparison simply amounts to replacing
%open the file as read only
fid = fopen(fname,'r');
%read the rest of the file after the headerlines into a cell array as
%floats and then convert to a matrix
valmat=cell2mat(textscan(fid, repmat('%f',1,maxcol),'Headerlines',hlines));
%close the file
fclose(fid);
with
% - Parse file.
fId = fopen(fname,'r') ;
for k = 1 : (hlines-1)
    fgetl(fId) ;
end
vars = lower(regexp(fgetl(fId), '\S+', 'match')) ;
data = reshape(fscanf( fId, '%f', Inf), numel(vars), []).' ;
fclose(fId) ;
I have also attached the profiler output for hardcoding the number of headerlines, which only takes about 6 seconds.
The slowest line of code in finding the headerlines is
cellsize=cell2mat(cellfun(@length,cellfun(@strsplit,celltext{1,1},'UniformOutput',false),'UniformOutput',false));
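One possible way to avoid calling strsplit once per line is to let regexp process the whole cell array in a single call. This is only a sketch on made-up lines standing in for `celltext{1,1}`; the rule used to locate the header (the trailing run of lines with the same token count as the last line) and the name `nHeader` are assumptions, not the method used in the code above, and any speed gain would need to be checked with the profiler or tic/toc.

```matlab
% Made-up lines standing in for celltext{1,1} (not the poster's data).
txt = {'option A on'; 'option B'; 'x x x x'; 'a b c d'; '1 2 3 4'; '5 6 7 8'};

% regexp accepts a cell array and returns one cell of tokens per line,
% so strsplit is not called line by line through cellfun.
tokens   = regexp(txt, '\S+', 'match');
cellsize = cellfun(@numel, tokens);       % tokens per line, numeric column

% Assumption: the rectangular block is the trailing run of lines whose
% token count equals that of the last line.
maxcol  = cellsize(end);
nHeader = find(cellsize ~= maxcol, 1, 'last');   % last line that breaks the pattern
if isempty(nHeader)
    nHeader = 0;                          % no non-rectangular header found
end
```

Here `nHeader` counts only the non-rectangular lines; the unwanted first row and the row of names inside the rectangular block would still have to be added to get the value passed as 'Headerlines'.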
Cedric on 20 Oct 2017
If your source data file is not too confidential, you can send it to me (you received my email address each time I sent you a message indicating that I had posted a comment), and I can quickly see whether I can speed up the processing.


Accepted Answer

Cedric on 10 Oct 2017
Edited: Cedric on 10 Oct 2017
The problem is that your file has discrepancies. If you look at the first row of data, it is missing an e before the +04 in the 5th column.
If it isn't a mistake due to a copy "by hand" for building an example, you could pre-process the content to correct for discrepancies before calling e.g. TEXTSCAN:
content = fileread('sample_text.txt') ;
content = regexprep( content, '(?<=\d)\+', 'e+' ) ;  % insert the missing 'e' where a '+' directly follows a digit
C = textscan( content, '%f%f%f%f%f%f', 'CollectOutput', true, 'HeaderLines', 5 ) ;
Now C should contain the correct data.
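To get from the numeric block to the structure asked for in the question, something along these lines might work. It is a sketch on a made-up character vector, not the attached sample: the layout (two header lines, one unwanted row, one row of names), the value of `hlines`, and the field names are all assumptions for illustration.

```matlab
% Made-up stand-in for the file content (not the poster's data):
% 2 header lines, 1 unwanted row, 1 row of names, then numeric rows.
content = sprintf(['option A on\noption B off\n', ...
                   'x x x\n', ...
                   'Time Temp Press\n', ...
                   '0 1.5e+00 10\n1 2.5e+00 20\n2 3.5e+00 30\n']);
hlines = 3;                                  % lines before the name row (assumed known)

rows  = regexp(content, '\r?\n', 'split');   % split the text into lines
names = regexp(rows{hlines + 1}, '\S+', 'match');   % tokens of the name row
fmt   = repmat('%f', 1, numel(names));       % one %f per column
C     = textscan(content, fmt, 'HeaderLines', hlines + 1, 'CollectOutput', true);
data  = C{1};                                % numeric matrix, one column per name

% Make the names valid, unique field names, then fill the struct.
names = matlab.lang.makeUniqueStrings(matlab.lang.makeValidName(names));
S = struct();
for k = 1:numel(names)
    S.(names{k}) = data(:, k);               % one column vector per field
end
```

With a real file, `content` would come from fileread as above, and the names would be whatever the second row of the rectangular block contains.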
  7 Comments
Shawn on 11 Oct 2017
Thanks for all of the help. This is great. You are correct that it was a silly assignment error and the strrep command was working as you showed. Everything is working now. I will be trying to speed it up and may have additional questions whenever I get around to it, but I think that the asked question has been answered.
