Extract rectangular data from a non-rectangular file with header and convert to a structure of column vectors where field names are the second row of the rectangular data
I am trying to read a text file that has a header of varying length due to some options that can be turned on. Below the header is rectangular data.
The first row of the rectangular data is unimportant to me and can be removed. The second row contains information that corresponds with the columns below it. I would like each of the strings in the second row of rectangular data to become field names for my structure.
Then I would like the corresponding numbers in the columns of data from the third line of rectangular data until the end to be vectors that are added into each field.
I have attached a shortened sample file that I have been trying this on, to no avail. The actual data file has 172 columns (this can vary depending on the parameters selected) and is ~50k rows long (this can also vary). I have tried writing a loop using fgetl and strsplit, which works but is incredibly slow. textscan seems to be a much faster option, but I am really struggling to figure out how to use its options to make this work.
So far, I don't have much working with textscan.
fid = fopen('sample_text.txt');
C = textscan(fid,'%*s','Delimiter', '\n','CollectOutput', true);
Right now, this returns an empty array, and I'm not quite sure what it is actually doing. I just pulled it from the example on using textscan for non-rectangular data. Any help or direction would be much appreciated.
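As a side note on why that call comes back empty: in a textscan format, the asterisk in `%*s` means "read this field and discard it", so a format made only of skipped fields leaves nothing to return. A minimal sketch on a made-up three-line string (the text and variable names here are illustrative, not taken from the attached file):

```matlab
% Illustrative text only; not the content of sample_text.txt.
txt = sprintf('header\nrow 1\nrow 2\n');

% '%*s' reads each line and throws it away, so there is no output column.
skipped = textscan(txt, '%*s', 'Delimiter', '\n');

% '%s' keeps each line as one entry of a cell array.
kept = textscan(txt, '%s', 'Delimiter', '\n');

disp(isempty(skipped))   % nothing was kept
disp(numel(kept{1}))     % one entry per line
```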
Cedric Wannaz on 10 Oct 2017
Edited: Cedric Wannaz on 10 Oct 2017
The problem is that your file has discrepancies. If you look at the first row of data, it is missing an e before the +04 in the 5th column.
If this is not just a mistake made while copying "by hand" to build the example, you could pre-process the content to correct such discrepancies before calling e.g. TEXTSCAN:
content = fileread('sample_text.txt') ;
content = regexprep( content, '(?<=\d)\+', 'e+' ) ;  % insert the missing "e" where a "+" directly follows a digit
C = textscan( content, '%f%f%f%f%f%f', 'CollectOutput', true, 'HeaderLines', 5 ) ;
C should now hold the correct content.
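To go from there to the structure of column vectors asked for in the question, one possible approach is to read the row of names separately, turn the names into valid field names, and assign one column per field. The sketch below runs on a small made-up string rather than the attached file, so the layout, the column names, and the position of the name row (`nameRow`) are all assumptions to adapt. `matlab.lang.makeValidName` needs R2014a or later.

```matlab
% Hypothetical stand-in for the file: two header lines, one unwanted
% row, one row of column names, then numeric data.
content = sprintf([ ...
    'Some header line\n' ...
    'Option A: on\n' ...
    'ignored row\n' ...
    'Time Temp-1 Pressure\n' ...
    '0.0 1.5e+01 2.0+04\n' ...
    '1.0 1.6e+01 2.1e+04\n']);

% Repair exponents that lost their "e" (a "+" directly after a digit).
content = regexprep(content, '(?<=\d)\+', 'e+');

% Split into rows and pick out the row holding the column names.
rows = regexp(content, '\r?\n', 'split');
nameRow = 4;                                  % assumed for this sample
names = strsplit(strtrim(rows{nameRow}));
names = matlab.lang.makeValidName(names);     % 'Temp-1' becomes 'Temp_1'

% One %f per column, skipping everything up to and including the names.
fmt = repmat('%f', 1, numel(names));
C = textscan(content, fmt, 'HeaderLines', nameRow, 'CollectOutput', true);
data = C{1};

% One field per column, using dynamic field names.
S = struct();
for k = 1:numel(names)
    S.(names{k}) = data(:, k);
end
```

Because the format string is built from the number of names, the same code should cope with a varying column count. For a header of varying length, `nameRow` would have to be located first, for example by searching `rows` for a name known to appear in that row.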