Importing file with text and numbers

Hale on 6 Jul 2013
Hi,
I'm trying to load a text file that contains both text and numbers into MATLAB. The first few lines of the text are shown below:
Time = 0.0494352
Patch: waterFlow found on 1/1 processor(s)
Flux at waterFlow = -0.0125m^3/s [-750 l/min]
Patch: airFlowIn found on 1/1 processor(s)
Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]
Patch: outlet found on 1/1 processor(s)
Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]
Time = 0.0496235
Patch: waterFlow found on 1/1 processor(s)
Flux at waterFlow = -0.0125m^3/s [-750 l/min]
Patch: airFlowIn found on 1/1 processor(s)
Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]
Patch: outlet found on 1/1 processor(s)
Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]
Time = 0.0498117
Patch: waterFlow found on 1/1 processor(s)
Flux at waterFlow = -0.0125m^3/s [-750 l/min]
Patch: airFlowIn found on 1/1 processor(s)
Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]
Patch: outlet found on 1/1 processor(s)
Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]
This is a very long file where the data is given at each time step. I need to sort the time and the values for the fluxes. I tried textscan but it was unsuccessful.
I really appreciate any ideas and suggestions.
Thanks, Hale
  2 Comments
dpb on 6 Jul 2013
Is the blank line between data records real or a figment of the cut'n paste operation?
Hale on 7 Jul 2013
Edited: Hale on 7 Jul 2013
There are only blank lines between the previous and the new time step, as you can see in the screenshot above.


Accepted Answer

dpb on 6 Jul 2013
OK, I don't have time to work thru a clever way at the moment but the following scans your sample file ok...
fid=fopen('hale.txt','rt');
t=[]; fw=[]; fa=[]; fo=[];   % time and the three flux columns
while ~feof(fid)
  l=fgetl(fid);
  if isempty(l), continue, end   % skip blank lines between records
  if strfind(l,'Time'),      t =[t; sscanf(l,'Time = %f')]; end
  if strfind(l,'Flux at w'), fw=[fw;sscanf(l,'Flux at waterFlow = %f')]; end
  if strfind(l,'Flux at a'), fa=[fa;sscanf(l,'Flux at airFlowIn = %f')]; end
  if strfind(l,'Flux at o'), fo=[fo;sscanf(l,'Flux at outlet = %f')]; end
end
fid=fclose(fid);
%
dat=[t fw fa fo];   % one row per time step
clear t fw fa fo;
To improve performance on large files, if this is too slow, preallocate a reasonable size for the accumulating arrays and index into them w/ a counter. Either make the size larger than any file you'll want to read and truncate to the final size when done, or check and reallocate whenever you exceed the initial size.
You might also help the above just a little if you were to return the index from strfind() and parse only the string pieces needed...oh! You can do that anyway since it's a fixed format--just count the location and hand sscanf the substring from the proper start point...let's see--as an example for Time it would look like
if strfind(l,'Time'), t=[t;sscanf(l(7:end),'%f')];end   % l(7:end) holds only the number
Looks like for the fluxes you can't count on the same number of digits, so you would need to use the location just past the '=' as the start, then find the 'm' of 'm^3' and use the location one before it as the substring end. That would eliminate the internal error that happens now when the I/O conversion scans until it fails, by giving it a fixed string to convert that is a valid floating-point number. I suspect that would be noticeable on large files.
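A minimal sketch of that substring idea, assuming every flux line has exactly the layout shown in the question (one '=' before the number and 'm^3' right after it). The sample line is copied from the question; the names i1, i2 and flux are only illustrative.

```matlab
% Convert only the numeric substring of a flux line.
l  = 'Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]';
i1 = strfind(l, '=');      % position of the '=' sign
i2 = strfind(l, 'm^3');    % position of the unit that follows the number
% The substring between the two positions is ' -3.45519e-05'
flux = sscanf(l(i1(1)+1 : i2(1)-1), '%f');
```

Searching for 'm^3' rather than a bare 'm' is a precaution in case a patch name ever contains the letter m.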
Salt to suit... :)
regexp() can undoubtedly also be made to work; how it'll be on performance in comparison I don't know, I'm too weak w/ regexp that I'm not even agonna' try.
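For what it's worth, here is a rough regexp sketch. It is not timed against the fgetl loop, and it assumes every record is complete so that all four columns come out the same length. The variable txt stands in for the result of fileread on the real file; the helper getcol and the pattern num are made up for illustration.

```matlab
% Sketch: pull each column out of the whole text with one regexp call per key.
txt = sprintf(['Time = 0.0494352\n' ...
    'Patch: waterFlow found on 1/1 processor(s)\n' ...
    'Flux at waterFlow = -0.0125m^3/s [-750 l/min]\n' ...
    'Patch: airFlowIn found on 1/1 processor(s)\n' ...
    'Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]\n' ...
    'Patch: outlet found on 1/1 processor(s)\n' ...
    'Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]\n']);

% Pattern for one floating-point number, with optional sign and exponent
num = '([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)';

% Hypothetical helper: all numbers that follow "<key> = " in txt
getcol = @(key) cellfun(@(c) str2double(c{1}), ...
                        regexp(txt, [key ' = ' num], 'tokens'));

t  = getcol('Time');
fw = getcol('Flux at waterFlow');
fa = getcol('Flux at airFlowIn');
fo = getcol('Flux at outlet');
dat = [t(:) fw(:) fa(:) fo(:)];   % one row per time step
```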
  2 Comments
Hale on 7 Jul 2013
Edited: Hale on 7 Jul 2013
Thanks a lot for your detailed answer. My file contains about 17000 rows and the first way you suggested works actually very well. It takes about 4 seconds to get the data sorted.
dpb on 7 Jul 2013
Good...yeah, oftentimes one finds that the "deadahead" solution works well enough. I suspect if you were to preallocate you could get it down quite a bit more, but 4 sec, if that's the typical file size you'll be dealing with, is probably acceptable.
But, it's pretty simple to implement...
...
N=20000; % initial alloc size
d=zeros(N,4);
ix=0;
while ~feof(fid)
  l=fgetl(fid);
  if isempty(l),continue, end
  if strfind(l,'Time')   % each Time line starts a new record (row)
    ix=ix+1; if ix>N, d=[d; zeros(N,4)]; N=N+N; end   % double the size when full
    d(ix,1)=sscanf(l,'Time = %f');
  end
  if strfind(l,'Flux at w'),d(ix,2)=sscanf(l,'Flux at waterFlow = %f');end
  ....etc...
end
d(ix+1:end,:)=[]; % clean up empty end...


More Answers (3)

the cyclist on 6 Jul 2013
If you have a relatively recent release of MATLAB, you can use the Import Data tool that is found on the Home tab of the Command Window.
You can read about it (and all kinds of other options for importing data) here:

Miroslav Balda on 6 Jul 2013
The previous answer gives a possible solution; however, the function fgetl is rather slow. An alternative may be to use the function
ffread www.mathworks.com/matlabcentral/fileexchange/9034
The function serves for free-format reading of ascii files. The read lines can be analyzed after the file is read. Good luck.
Mira
  1 Comment
dpb on 7 Jul 2013
I'm not sure what that particular FEX submission actually does, but one can read the whole file in one big slurp (assuming it will all fit in memory) w/ fread() as a character array and then loop thru the records in memory only if desired.
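A small sketch of the slurp-then-loop idea, under the assumption that the file fits in memory. A throwaway temporary file with two shortened records stands in for the real data file; fname, rec, t and n are just names picked for the example.

```matlab
% Write a tiny stand-in file (CRLF line ends) so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Time = 0.0494352\r\nFlux at waterFlow = -0.0125m^3/s [-750 l/min]\r\n');
fprintf(fid, 'Time = 0.0496235\r\nFlux at waterFlow = -0.0125m^3/s [-750 l/min]\r\n');
fclose(fid);

% One read for the whole file, returned as a 1-by-n char row vector.
fid = fopen(fname, 'r');
txt = fread(fid, '*char')';
fclose(fid);
delete(fname);

% Split into lines in memory; the pattern accepts LF or CRLF line ends.
rec = regexp(txt, '\r?\n', 'split');
t = zeros(numel(rec), 1);   % preallocate for the worst case
n = 0;
for k = 1:numel(rec)
    v = sscanf(rec{k}, 'Time = %f');   % empty unless this is a Time line
    if ~isempty(v), n = n + 1; t(n) = v; end
end
t = t(1:n);                 % trim the unused tail
```

The flux columns can be collected the same way w/ their own sscanf formats.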



per isakson on 6 Jul 2013
Edited: per isakson on 6 Jul 2013
If the file fits in memory this is one way to read it.
Maybe '\r\n' needs to be replaced by '\n'. That depends on the source of the file. Or replace '\r\n' by '[\r]*\n' to handle both cases with the same code.
Next step is to decide what data shall be kept and in what data structures.
Replace disp( ca2{jj} ) by code that parses one line at a time. See dpb's answer.
Try
function cssm()
    str = fileread( 'blocks.txt' );
    ca1 = regexp( str, '\r\n(?=Time)', 'split' );
    len = length( ca1 );
    % use len to allocate memory for variables to store data.
    for ii = 1 : length( ca1 )
        ca2 = regexp( ca1{ii}, '\r\n', 'split' );
        for jj = 1 : length( ca2 )
            disp( ca2{jj} )
        end
    end
end
returns
Time = 0.0494352
Patch: waterFlow found on 1/1 processor(s)
Flux at waterFlow = -0.0125m^3/s [-750 l/min]
Patch: airFlowIn found on 1/1 processor(s)
Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]
Patch: outlet found on 1/1 processor(s)
Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]
Time = 0.0496235
Patch: waterFlow found on 1/1 processor(s)
Flux at waterFlow = -0.0125m^3/s [-750 l/min]
Patch: airFlowIn found on 1/1 processor(s)
Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]
Patch: outlet found on 1/1 processor(s)
Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]
Time = 0.0498117
Patch: waterFlow found on 1/1 processor(s)
Flux at waterFlow = -0.0125m^3/s [-750 l/min]
Patch: airFlowIn found on 1/1 processor(s)
Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]
Patch: outlet found on 1/1 processor(s)
Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]
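As one possible next step, here is a sketch that keeps the block split but replaces the inner display loop with a single regexp per block. It assumes each block holds exactly four "= number" items in the order time, waterFlow, airFlowIn, outlet, and that the character right after each flux value is the 'm' of 'm^3'. The string str stands in for the result of fileread; body and dat are names invented for the example.

```matlab
% Two records built in memory in place of fileread( 'blocks.txt' ).
body = ['Patch: waterFlow found on 1/1 processor(s)\r\n' ...
        'Flux at waterFlow = -0.0125m^3/s [-750 l/min]\r\n' ...
        'Patch: airFlowIn found on 1/1 processor(s)\r\n' ...
        'Flux at airFlowIn = 0.0125345m^3/s [752.073 l/min]\r\n' ...
        'Patch: outlet found on 1/1 processor(s)\r\n' ...
        'Flux at outlet = -3.45519e-05m^3/s [-2.07311 l/min]\r\n'];
str = sprintf(['Time = 0.0494352\r\n' body 'Time = 0.0496235\r\n' body]);

ca1 = regexp( str, '[\r]*\n(?=Time)', 'split' );   % one cell per time step
len = length( ca1 );
dat = nan( len, 4 );                                % preallocated from len
for ii = 1 : len
    % Grab whatever follows '= ' up to the next whitespace or 'm'.
    tok = regexp( ca1{ii}, '= ([^\sm]+)', 'tokens' );
    if numel( tok ) == 4
        dat(ii,:) = str2double( [tok{:}] );         % time, water, air, outlet
    end
end
```

Blocks that do not contain exactly four values are left as NaN rows, which makes incomplete records easy to spot afterwards.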
  5 Comments
per isakson on 8 Jul 2013
Edited: per isakson on 8 Jul 2013
The answers to a question will ideally provide a little "smorgasbord". I offer one small dish, without too much thought.
I hope that more than one reader will benefit from the "smorgasbord".
The doc of R2012a says:
[...]To open files in text mode, attach the letter 't' to the permission,
such as 'rt' or 'wt+'. For better performance, do not use text mode.[...]
A long time ago I ceased using the 't' because of the performance penalty. I've kind of forgotten that it exists.
dpb on 8 Jul 2013
Edited: dpb on 9 Jul 2013
Hmmm...R2012b (doc) says
To open files in text mode, attach the letter 't' to the permission, such as 'rt' or 'wt+'.
For better performance, do not use text mode. The following applies on Windows systems, in text mode: ...
This additional processing is unnecessary for most cases. All MATLAB import functions, and most text editors (including Microsoft Word and WordPad), recognize both '\r\n' and '\n' as newline sequences. However, when you create files for use in Microsoft Notepad, end each line with '\r\n'. ...
I have only recently been blessed by TMW w/ an update to 2012b (from R12); the R12 doc doesn't have anything specific about the performance hit and has the warning
... To open in text mode, add "t" to the permission string, for example 'rt' and 'wt+'. (On Unix, text and binary mode are the same so this has no effect. But on PC systems this is critical.)
I'm of the age when it was indeed the case that much Windows software including my favorite programmers' editor didn't deal w/ the non-Windows \n sequence at all gracefully so I just continue to operate in that mode.
I guess I'll have to update my thinking/advice for Matlab specifically and let users run into their own quirks w/ other packages if they still aren't graceful.
I do see that TMW ought then to update the help text for fopen() to be more consistent as it still has the same verbiage as does R12.1 and no real indication of any real performance hit.
From R2012b session...
>> help fopen
fopen Open file.
...
You can open files in binary mode (the default) or in text mode.
In binary mode, no characters get singled out for special treatment.
In text mode on the PC, the carriage return character preceding
a newline character is deleted on input and added before the newline
character on output. To open a file in text mode, append 't' to the
permission string, for example 'rt' and 'w+t'. (On Unix, text and
binary mode are the same, so this has no effect. On PC systems
this is critical.)
So, I'll modify my warnings if TMW will fix help... :)
