Re-synchronizing TEXTSCAN

Question

dpb on 23 Dec 2016

https://nl.mathworks.com/matlabcentral/answers/318004-re-synchronizing-textscan

Edited: Kirby Fears on 5 Jan 2017

There are any number of questions on the forum about reading irregular or segmented text files. Often the response is "use textscan in a loop", but there are some issues there that also continually arise. A case of my own just now has raised a particular one that prompts the present query...

The file in question is tab-delimited, daily weather records with a header line for each day... almost perfect fodder for readtable, except that the header line also contains the month and day over the time columns, which isn't header but data, and that breaks the built-in solution. A sample of the file is

Dec 15   Temperature   Dew Point   Humidity   Wind   Speed   Gust   Pressure   Precip. Rate.   Precip. Accum.
12:10 AM   12.5 °F   2 °F   64 %   ESE   0 mph   3 mph   30.7 in   0 in   0 in
12:25 AM   12.1 °F   3 °F   66 %   ESE   0 mph   1 mph   30.69 in   0 in   0 in
...
11:44 PM   27.5 °F   23 °F   84 %   ESE   2 mph   9 mph   30 in   0 in   0 in
11:58 PM   27.5 °F   23 °F   84 %   ESE   2 mph   7 mph   29.98 in   0 in   0 in
Dec 16   Temperature   Dew Point   Humidity   Wind   Speed   Gust   Pressure   Precip. Rate.   Precip. Accum.
12:13 AM   27.5 °F   24 °F   85 %   SE   1 mph   5 mph   29.98 in   0 in   0 in
12:28 AM   27.8 °F   24 °F   86 %   East   3 mph   5 mph   29.96 in   0 in   0 in
...

The following code snippet successfully reads the file and returns the numeric data in an array:

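% fn (file name), fmt (textscan format string) and wdirs (wind-direction names) are defined elsewhere
dd=[];                                  % accumulator for the combined numeric output
nHdr=0;                                 % fixup for header lines to skip/not...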
fid=fopen(fn,'r');
l=fgetl(fid);                           % get first header line
C=textscan(l,'%s',1,'delimiter','\t');  % get month day string
while ~feof(fid)
  data=textscan(fid,fmt,'collectoutput',1,'headerlines',nHdr,'delimiter','\t');  % read body of data
  L2=size(data{2},1);                   % how many lines found in this group of numeric data?
  dn=datenum([repmat(['2016' char(C{:})],L2,1) char(data{1}(1:L2))],'yyyymmmdd HH:MM AM');  % convert times
  if nHdr==0, nHdr=1; end   % kludge to re-synch the file marker after failure leaves in mid-record
  if size(data{1},1)>size(data{2},1)   % another fixup to get rid of extra record of subsequent day
    C=([data{1}(end,:)]);              % ok, is another group going to come, get the month/day
  end
  [~,ix]=ismember(data{3},wdirs);      % this just converts the alpha wind dir to numeric for convenience
  wdir=360-(ix-1)*22.5;
  dd=[dd;[dn data{2} wdir data{4}]];   % and mush all numeric together in one long array
end
fid=fclose(fid);

As can be seen, a lot of fixup is needed, and the above is only as "clean" as it is because the first column is a string variable and the time column is also read as a single string, so the first cell array holds an extra element whenever another block of data follows in the file. If the data were in some other format, this wouldn't work, either.

So, with that as preamble, the question is--

Why is there no reliable way to "re-synch" textscan (and friends that use file handles) to the beginning of a record? If there were, the above machinations, and similar ones undertaken for so many of the aforementioned special cases we see at Answers, would become trivial: when the textscan operation fails, an instruction to reset the file position indicator to the beginning of the record would let the next loop iteration issue a "clean" repeat of the same, identical call. The obvious syntax would be something like that of fseek but with a system-sensitive number of records instead of bytes.

In this case one can use fseek to go back some 6 or 8 bytes, but that's empirical because the number of characters in the field isn't consistent, whereas the i/o subsystem should be able to find the record terminator essentially trivially.
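A minimal sketch of what such a re-synch could look like with the existing low-level calls (fseek, ftell, fread): step backward one byte at a time until the previous line feed is found. The helper name rewindToLineStart and the demo file contents are made up for illustration, and where textscan actually leaves the file position after a failed conversion is an assumption here, not something the sketch checks. Local functions in a script need R2016b or later; on older releases, save the function in its own file.

```matlab
% Demo: build a small tab-delimited file (made-up content), move into the
% middle of the second record, then re-synch to the start of that record.
fn  = [tempname '.txt'];
fid = fopen(fn, 'w');
fprintf(fid, 'Dec 15\tTemperature\n12:10 AM\t12.5\n');
fclose(fid);

fid = fopen(fn, 'r');
fgetl(fid);                 % consume the header line
fread(fid, 4, 'uint8');     % leave the position in mid-record
rewindToLineStart(fid);     % hypothetical helper defined below
l = fgetl(fid);             % reads the second record from its first character
fclose(fid);
delete(fn);

function rewindToLineStart(fid)
% Hypothetical helper (not a MATLAB builtin): move the file position back to
% the first byte after the preceding line feed (char 10), or to the start of
% the file if there is none. Does nothing if already at the start of a line.
while ftell(fid) > 0
    fseek(fid, -1, 'cof');          % step back one byte
    c = fread(fid, 1, 'uint8');     % read it; the position advances again
    if c == 10                      % previous record terminator found
        return                      % position is now just after it
    end
    fseek(fid, -1, 'cof');          % undo the read and keep scanning backward
end
end
```

The scan is byte-wise, so it is independent of how many characters the partially read field contained, and the line-feed byte never occurs inside a multi-byte UTF-8 character. It reads one byte per call, so it is only sensible for short back-ups like the one described here.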

Answer 1

Kirby Fears on 23 Dec 2016

https://nl.mathworks.com/matlabcentral/answers/318004-re-synchronizing-textscan#answer_248329

Edited: dpb on 23 Dec 2016

dpb,

I made a generalized solution to this problem some time ago for the constant questions about parsing delimited text files. It uses textscan to read each row as a string, then splits each row on the given delimiter into a row of strings. If you request numeric or mixed output, the function attempts to convert each entry to a number and leaves unconverted any cell that cannot be interpreted as a number. This means the user does not need to specify a format string or even know the NxM dimensions of the file.

https://www.mathworks.com/matlabcentral/fileexchange/52423-delimread

result = delimread('test.txt','\t',{'raw','mixed'});

In your example, none of the degree data is directly convertible to numbers; check out result.mixed when test.txt does contain mixed types. You can use document-specific reasoning to drop the rows with repeated headers afterward. Does this address all of the use cases you had in mind?
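The read-rows-then-split idea described above can be sketched roughly as follows. This is an illustration only and not the delimread implementation: the function name splitRowsSketch, its outputs, and the demo file contents are all assumptions made for the example. Local functions in a script need R2016b or later; on older releases, save the function in its own file.

```matlab
% Demo input: a tiny tab-delimited file (made-up content for illustration).
fn  = [tempname '.txt'];
fid = fopen(fn, 'w');
fprintf(fid, 'Dec 15\tTemperature\tGust\n12:10 AM\t12.5\t3 mph\n');
fclose(fid);

[raw, mixed] = splitRowsSketch(fn, '\t');
delete(fn);

function [raw, mixed] = splitRowsSketch(fn, delim)
% Hypothetical helper, NOT the delimread implementation.
% raw{k}   : fields of row k as strings
% mixed{k} : same fields, with plain numbers converted to double
fid = fopen(fn, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n', 'Whitespace', '');  % one string per row
fclose(fid);
rows  = C{1};
raw   = cell(numel(rows), 1);
mixed = cell(numel(rows), 1);
for k = 1:numel(rows)
    fields = regexp(rows{k}, delim, 'split');  % delim is treated as a regular expression
    nums   = str2double(fields);               % NaN where a field is not a plain number
    isNum  = ~isnan(nums);
    conv   = fields;
    if any(isNum)
        conv(isNum) = num2cell(nums(isNum));   % convert only the fields that parsed
    end
    raw{k}   = fields;
    mixed{k} = conv;
end
end
```

Because every row is split independently, rows may have different numbers of fields, and a field such as '3 mph' simply stays a string. A field that literally reads 'NaN' also stays a string in this sketch, since str2double returns NaN for it as well.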

7 Comments

dpb on 5 Jan 2017

Thanks, Kirby. Over the holidays I did download R2014b which will run (albeit fairly slowly but far better than I had expected) on this old hardware to which I'm currently limited so can 'spearmint from there.

I still think having a way to get textread back on track would be aGoodThing (tm) as a general facility.

Kirby Fears on 5 Jan 2017

Edited: Kirby Fears on 5 Jan 2017

I totally agree that a new utility (like a revamped textread) should be included in future releases. The existing functions don't combine the 3 required aspects very well: (1) an easy API, (2) the ability to read a wide variety of quirky delimited files, and (3) speed.

textscan is fully flexible and fast, but the API is opaque for most users. It also tends to require post-read manipulations. readtable is easy and fast but not as flexible as textscan. Most users I work with choose xlsread to get (1) and (2) at the expense of speed. The delimread function is much faster than xlsread with similar usability and flexibility.

The delimread parsing logic could be used in a C/MEX-based function to make it faster while providing an easy API to read virtually any delimited file. Certainly aGoodThing (tm) for future MATLAB releases.
