retrieve data containing date and time from text file

Hi all,
I encountered the following strange problem when trying to retrieve date and time and other data from a text string.
The following runs OK.
>>tmp = '460018898104779 6d0502fb0be52d 2018-09-27 23:59:48';
>>test2 = textscan(tmp, '%f%14c%{yyyy-MM-dd HH:mm:ss}D', 'Delimiter',{'\n'});
But when I append two more doubles to the text string and run
>>tmp = '460018898104779 6d0502fb0be52d 2018-09-27 23:59:48 113.944081 22.504949';
>>test2 = textscan(tmp, '%f%14c%{yyyy-MM-dd HH:mm:ss}D%f%f', 'Delimiter',{'\n'});
It gives the following error message:
>>Error using textscan
>>Unable to read the DATETIME data with the format 'yyyy-MM-dd HH:mm:ss'. If the data is not a time, use %q to get text data.
Could somebody help to clarify what has gone wrong? Thank you very much!
Bayern

 Accepted Answer

You should use Delimiter '\t' rather than \n

8 Comments

That works! Thanks a lot Walter! But could you also explain the reason/the difference between /t and /n delimiter? After all I have done nothing but append two doubles to the string.
\n is newline. Neither of those strings contained newline, so you were effectively setting delimiter to end of string.
%f without a width reads as many characters as it can until what it sees is no longer a valid number, and then if not at end of line backs up to the last place that was the end of a valid number. %f recognizes both d and e as exponents, but only when followed by digits or + or - followed by digits for the exponent; otherwise it backs up. The tab in your string is not a valid digit so it knows to end reading the number there. But if you look at your posting, because tabs do not show up in the display, it looks like your line starts with
4600188981047796d0502fb0be52d
in which case it would read right through the 4600188981047796d0502 and then encounter the 'f', decide that f was not valid because you were already in an exponent (the 'd') and so would parse the number as 4600188981047796e+502 which overflows to infinity. It would then be left pointing to the 'f' in the buffer in that case.
Anyhow, after the %f you are positioned at the tab and are processing %14c . Because you did not explicitly set the Whitespace to exclude tab, the %c operation will skip forward over any whitespace at that point, and will read 14 characters starting from that point, which would leave you positioned at the second tab.
Now you hit the %D specification with an included blank. %D format with an included blank acts a bit oddly with respect to the Delimiter and Whitespace options. In all of the formats except %c or %[], when you encounter any character in the Delimiter list or the Whitespace list, the scanning of the field is to stop. But for %D format, it looks like the rules have been changed (not sure which release) so that scanning can continue if you encounter Whitespace but not if you encounter Delimiter -- and Delimiter normally includes space. So
textscan('2018-09-27 23:59:48', '%{yyyy-MM-dd HH:mm:ss}D')
is an error because the default Delimiter includes space and so scanning stops at the space between the two parts, and the part of the date collected does not match the format. And
textscan('2018-09-27 23:59:48', '%{yyyy-MM-dd HH:mm:ss}D', 'whitespace', '\n')
fails as well, because space is still in Delimiters list. But
textscan('2018-09-27 23:59:48', '%{yyyy-MM-dd HH:mm:ss}D', 'delimiter', '\n')
works because space is no longer in delimiter so somehow it does not terminate scanning of the %D field.
Though to be honest, my hypothesis does not explain why
textscan('2018-09-27 23:59:48', '%{yyyy-MM-dd HH:mm:ss}D', 'whitespace', '')
works.
Anyhow, with Delimiter set to \n, scanning of the %D is not stopping with the tab after the time portion and it is continuing on and pulling in the numeric field and the tab after that and the numeric field after that too, until it hits end of string, and then sending all of that to be matched against the %D
With Delimiter set to \t then it sees the tab after the time portion, stops scanning, passes in the date and space and time to the %D, gets a match, and is left positioned at the tab for processing of further fields.
Woo~What a detailed diagnosis! Thank you so much Walter! I learned much more than what I expected! Hope other guys could come across this thread and learn the many scanning rules!
On the last example, it works because the one time textscan doesn't add the gratuitous blank into the whitespace string is when it is specified as the empty string and there's a format scan field given. Hence, the whitespace in the above does not include a blank and there's only one field so don't need one (a delimiter, that is).
Modify the example a little to have two fields to try to parse as:
s='3.14,2018-09-27 23:59:48';
>> textscan(s, '%f%{yyyy-MM-dd HH:mm:ss}D', 'whitespace', '')
Error using textscan
Unable to read the DATETIME data with the format 'yyyy-MM-dd HH:mm:ss'. If the data is not a time, use %q to
get text data.
fails expectedly because there's no delimiter to separate the two from each other.
textscan(s, '%f%{yyyy-MM-dd HH:mm:ss}D', 'whitespace', ',')
Error using textscan
Unable to read the DATETIME data with the format 'yyyy-MM-dd HH:mm:ss'. If the data is not a time, use %q to
get text data.
This is the conundrum--you need to specify the delimiter but textscan thinks it's smarter than you are so it adds a blank to the comma and then your point above kicks in.
We can show the above is the problem by not having a blank but another delimiter in the string--
s(s==' ')=char(9);
textscan(s, '%f%{yyyy-MM-dd HH:mm:ss}D', 'whitespace', ',')
ans =
1×2 cell array
{[3.1400]} {[2018-09-27 23:59:48]}
>> ans{:}
ans =
3.1400
ans =
datetime
2018-09-27 23:59:48
>>
and voila! now since there isn't a blank the date field is again parsed. I'm a little nonplussed myself here though that didn't have to fix up the %D format string; it still worked with the tab instead of blank once we got past the input parsing to return the full string.
It still would have been so much easier if TMW had kept the Fortran FORMAT model for formatted i/o instead of C's model.
You're welcome...there's lots to explore here!
Actually, I realized the correct way for the last example is
>> textscan(s, '%f%{yyyy-MM-dd HH:mm:ss}D', 'whitespace', '','delimiter',',')
ans =
1×2 cell array
{[3.1400]} {[2018-09-27 23:59:48]}
>>
Use the null whitespace but keep the delimiter.
There are still probably real records that can't be parsed but again, thanks for the spelunking and link; it did make me realize what I'd long overlooked and is a real service to the overall understanding of the intricacies inherent in textscan and particularly when compounded with the datetime object/class.
Many thanks for digging deep into the format dpb! You have made the thread more helpful!
NB: I corrected original mistake of 'whitespace' for 'delimiter' in first paragraph regarding textscan propensity to add the blank even if not user-provided.

Sign in to comment.

More Answers (0)

Categories

Asked:

on 29 Sep 2018

Edited:

dpb
on 4 Oct 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!