Re-synchronizing TEXTSCAN

dpb on 23 Dec 2016
Edited: Kirby Fears on 5 Jan 2017
There are any number of questions posed on the forum about reading irregular or segmented text files. Often the response is "use textscan in a loop", but there are some issues there that also continually arise. A case of my own just now has raised a particular one that prompts the present query...
The file in question is tab-delimited daily weather records with a header line for each day... almost perfect fodder for readtable, except that the header line also contains the month and day over the time columns, which isn't header but data, and that breaks the builtin solution. A sample of the file is
Dec 15 Temperature Dew Point Humidity Wind Speed Gust Pressure Precip. Rate. Precip. Accum.
12:10 AM 12.5 °F 2 °F 64 % ESE 0 mph 3 mph 30.7 in 0 in 0 in
12:25 AM 12.1 °F 3 °F 66 % ESE 0 mph 1 mph 30.69 in 0 in 0 in
...
11:44 PM 27.5 °F 23 °F 84 % ESE 2 mph 9 mph 30 in 0 in 0 in
11:58 PM 27.5 °F 23 °F 84 % ESE 2 mph 7 mph 29.98 in 0 in 0 in
Dec 16 Temperature Dew Point Humidity Wind Speed Gust Pressure Precip. Rate. Precip. Accum.
12:13 AM 27.5 °F 24 °F 85 % SE 1 mph 5 mph 29.98 in 0 in 0 in
12:28 AM 27.8 °F 24 °F 86 % East 3 mph 5 mph 29.96 in 0 in 0 in
...
The following code snippet successfully reads the file and returns the numeric data in an array:
nHdr=0; % fixup for header lines to skip/not...
fid=fopen(fn,'r');
l=fgetl(fid); % get first header line
C=textscan(l,'%s',1,'delimiter','\t'); % get month day string
while ~feof(fid)
  data=textscan(fid,fmt,'collectoutput',1,'headerlines',nHdr,'delimiter','\t'); % read body of data
  L2=size(data{2},1); % how many lines found in this group of numeric data?
  dn=datenum([repmat(['2016' char(C{:})],L2,1) char(data{1}(1:L2))],'yyyymmmdd HH:MM AM'); % convert times
  if nHdr==0, nHdr=1; end % kludge to re-synch the file marker after failure leaves in mid-record
  if size(data{1},1)>size(data{2},1) % another fixup to get rid of extra record of subsequent day
    C=([data{1}(end,:)]); % ok, is another group going to come, get the month/day
  end
  [~,ix]=ismember(data{3},wdirs); % this just converts the alpha wind dir to numeric for convenience
  wdir=360-(ix-1)*22.5;
  dd=[dd;[dn data{2} wdir data{4}]]; % and mush all numeric together in one long array
end
fid=fclose(fid);
As can be seen, a lot of fixup needs to be done, and the above is only as "clean" as it is because the first column is a string variable and the time column is also read as a single string, so the first cell array holds an extra element whenever another block of data follows in the file. If the data were in some other format, this wouldn't work either.
So, with that as preamble, the question is:
Why is there no reliable way to "re-synch" textscan (and friends that use file handles) to the beginning of a record? If there were, the above machinations, and similar ones undertaken for so many of the other special cases we see at Answers, would become trivial: when the textscan operation fails, an instruction to reset the file position indicator to the beginning of the record would let the next loop iteration issue a "clean" repeat of the identical call. The obvious syntax would be something like that of fseek, but counting records (using the system's record terminator) instead of bytes.
In this case one can use fseek to go back some 6 or 8 bytes, but that's empirical because the number of characters in the field isn't consistent, whereas the i/o subsystem should be able to find the record terminator essentially trivially.
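For illustration only, a record-level rewind can be approximated by hand with ftell, fseek and fread: step back one byte at a time until the previous newline (or the start of the file) is found. This is a minimal sketch, not a built-in facility. It assumes records end in LF (which also covers CRLF), and the scratch file contents and the simulated mid-record stop are made up for the demo.

```matlab
% Demo file: two header records around one data record (hypothetical contents).
fn = tempname;
fid = fopen(fn,'w');
fprintf(fid,'Dec 15\tTemperature\n12:10 AM\t12.5\nDec 16\tTemperature\n');
fclose(fid);

fid = fopen(fn,'r');
fgetl(fid);                       % consume the first header record
fgetl(fid);                       % consume the data record
fseek(fid,3,'cof');               % simulate a scan that stopped mid-record

% Walk backward until the byte just before the position is a newline.
pos = ftell(fid);
while pos > 0
    fseek(fid,pos-1,'bof');       % look at the byte preceding pos
    if fread(fid,1,'uint8') == 10 % LF found; the read leaves us back at pos
        break
    end
    pos = pos-1;
end
if pos == 0
    fseek(fid,0,'bof');           % no newline found: record starts the file
end

rec = fgetl(fid);                 % now reads the whole record from its start
fclose(fid);
delete(fn);
```

Reading byte by byte is slow for long records, so this is only reasonable when the scan stops within a few fields of the record start, as it does in the case described above.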

Answers (1)

Kirby Fears on 23 Dec 2016
Edited: dpb on 23 Dec 2016
dpb,
I made a generalized solution to this problem some time ago for the constant questions about parsing delimited text files. It uses textscan to read each row as a string, then splits each row on the given delimiter into a row of strings. If you request numeric or mixed output, the function attempts to convert each entry to a number and leaves unconverted any cell that cannot be interpreted as a number. This means the user does not need to specify a format string or even know the NxM dimensions of the file.
result = delimread('test.txt','\t',{'raw','mixed'});
In your example, none of the degree data is directly convertible to numbers; check out result.mixed when test.txt does contain mixed types. You can use document-specific reasoning to drop the rows with repeated headers afterward. Does this address all of the use cases you had in mind?
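The general idea described above (read rows as strings, split on the delimiter, convert whatever parses as a number) can be sketched roughly as follows. This is not the delimread source; the variable names raw and mixed simply mirror the output names in the call above, and the scratch file contents are invented for the demo.

```matlab
% Demo file with one header row and one data row (hypothetical contents).
fn = tempname;
fid = fopen(fn,'w');
fprintf(fid,'Dec 15\tTemperature\tHumidity\n12:10 AM\t12.5\t64\n');
fclose(fid);

% Read every line as a single string, keeping tabs and spaces intact.
fid = fopen(fn,'r');
rows = textscan(fid,'%s','Delimiter','\n','Whitespace','');
fclose(fid);
delete(fn);
rows = rows{1};

% Split each row on tabs; the cell array grows to the widest row.
raw = cell(numel(rows),0);
for k = 1:numel(rows)
    parts = regexp(rows{k},'\t','split');
    raw(k,1:numel(parts)) = parts;
end

% Convert only the entries that parse as numbers; leave the rest as text.
mixed = raw;
num = str2double(raw);            % NaN wherever the text is not numeric
isNum = ~isnan(num);
mixed(isNum) = num2cell(num(isNum));
```

Note that a field such as '12.5 °F' would stay a string under this scheme, since str2double returns NaN for it; stripping units would have to happen before the conversion step.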
  7 Comments
dpb on 5 Jan 2017
Thanks, Kirby. Over the holidays I did download R2014b which will run (albeit fairly slowly but far better than I had expected) on this old hardware to which I'm currently limited so can 'spearmint from there.
I still think having a way to get textread back on track would be aGoodThing (tm) as a general facility.
Kirby Fears on 5 Jan 2017
Edited: Kirby Fears on 5 Jan 2017
I totally agree that a new utility (like a revamped textread) should be included in future releases. The existing functions don't combine the 3 required aspects very well: (1) an easy API, (2) the ability to read a wide variety of quirky delimited files, and (3) speed.
textscan is fully flexible and fast, but the API is opaque for most users. It also tends to require post-read manipulations. readtable is easy and fast but not as flexible as textscan. Most users I work with choose xlsread to get (1) and (2) at the expense of speed. The delimread function is much faster than xlsread with similar usability and flexibility.
The delimread parsing logic could be used in a C/MEX-based function to make it faster while providing an easy API to read virtually any delimited file. Certainly aGoodThing (tm) for future MATLAB releases.
