MATLAB Answers

What is an efficient way to parse a file without fscanf or textscan?

Asked by Dmitrii Semikin on 16 Jun 2019
Latest activity Edited by Dmitrii Semikin on 17 Jun 2019
Accepted Answer by Jan
Note: the code I show in this question was already quoted in a different question: how to extract data from the fixed-width-field format using fscanf or textscan. There, however, the subject of the question was slightly different.
The problem:
My experiments showed that fscanf seems to be dramatically faster than fgetl. More specifically, I did some very simple profiling with a file that has 35597 lines. The results were very unexpected for me. The code that uses fscanf needed just 0.253614 seconds to complete:
function [data] = test_fscanf_nodes_only_01()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));
    data = fscanf(file_id, '%8d%16f%16f%16f%8f%8f', [6, Inf]);
end
The following code with fgetl, in contrast, needs 6.343209 seconds to complete, even though it does much less:
function [data] = test_fscanf_nodes_only_02()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));
    lines_count = 0;
    while ~feof(file_id)
        current_line = fgetl(file_id);
        lines_count = lines_count + 1;
    end
    data = 1;
    fprintf('Lines count: %d', lines_count);
end
Just for comparison, the following Python code runs in 0.021914958953857422 seconds:
import time

FILE_NAME = 'myfile.txt'

def main():
    lines_count = 0
    with open(FILE_NAME, 'r') as input_file:
        for line in input_file:
            lines_count += 1
    print(lines_count)

if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)
I can accept that it runs 10 times faster than the code with fscanf, because it does no actual parsing. But I cannot understand how it can be about 300 times faster than the MATLAB code that does exactly the same thing.
My further experiments showed that if I read the whole file into memory and just search for the EOL characters (and count lines), it takes only about 0.1 seconds. But as soon as I try to return a string for each line, the time gets close to what I get with fgetl.
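The in-memory line count described above could look roughly like the following. This is only a minimal sketch, not the code actually used for the measurement; the function name count_lines_in_memory is hypothetical, and it assumes lines end with a line feed (byte value 10), optionally preceded by a carriage return.

```matlab
function lines_count = count_lines_in_memory(file_name)
% Sketch (hypothetical helper): count lines by reading the whole file at once.
% Assumes lines are terminated by a line feed (byte value 10).
file_id = fopen(file_name, 'r');                 % binary mode, no text translation
cleanup_obj = onCleanup(@() fclose(file_id));
raw = fread(file_id, Inf, '*uint8');             % all bytes as a uint8 column vector
lines_count = sum(raw == 10);                    % one line feed per terminated line
% Count a final line that is not terminated by a line feed.
if ~isempty(raw) && raw(end) ~= 10
    lines_count = lines_count + 1;
end
end
```

No per-line char array is created here, which is the difference from the fgetl loop.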
My questions:
  • Is it expected that fgetl is so slow (compared to fscanf)?
  • If I cannot do the parsing with fscanf (e.g. because the structure of the strings does not allow it), are there other "fast" ways to do the parsing?
  • Is it correct that it is the string object creation that introduces this high performance penalty in fgetl, or is it something else?
EDIT: fixed the dimensions in the call to fscanf to [6, Inf], according to the comment from Jan.

  2 Comments

Can you provide some code that produces the input file?
This is strange:
data = fscanf(file_id, '%8d%16f%16f%16f%8f%8f', [5, Inf]);
You have 6 format specifiers and read a [5 x inf] matrix?
@Jan: Of course, you are right, the dimensions should be [6, Inf]. But it should not affect the validity of the question.

Release

R2015b

1 Answer

Answer by Jan on 16 Jun 2019
Edited by Jan on 16 Jun 2019
Accepted Answer

First, I created a test file:
n = 35597;  % number of lines; assumed here to match the file in the question
fid = fopen('myfile.txt', 'w');
fprintf(fid,'%.0g %16f %16f %16f %8f %8f\n', rand(6, n));
fclose(fid);
Then I ran your first code with "[6, Inf]" instead of "[5, Inf]". Under Win10, MATLAB R2018b, it needs 0.55 sec on my machine.
The 2nd code with fgetl needs 2.9 sec on my machine. As soon as I open the file in 'r' mode instead of 'rt', the runtime is reduced to 0.4 sec. It looks like fgetl is less efficient in text mode.
The runtime is only 0.35 sec when I omit the feof:
current_line = fgetl(file_id);
while ischar(current_line)  % fgetl returns -1 (not a char) at the end of the file
    lines_count = lines_count + 1;
    current_line = fgetl(file_id);
end
With fgets instead of fgetl it is reduced to 0.24 sec. Look into the M-code of fgetl:
type fgetl
to see the difference.
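To repeat these comparisons on another machine, a small timing helper along the following lines might be useful. It is only a sketch: the name time_line_reader is hypothetical, it is not the code used for the timings above, and a single tic/toc measurement is noisy, so several runs should be compared.

```matlab
function [lines_count, elapsed] = time_line_reader(file_name, permission, reader)
% Sketch (hypothetical helper): time a line-by-line read of a file.
%   permission  e.g. 'r' (binary mode) or 'rt' (text mode)
%   reader      function handle, e.g. @fgetl or @fgets
file_id = fopen(file_name, permission);
cleanup_obj = onCleanup(@() fclose(file_id));
lines_count = 0;
tic;
current_line = reader(file_id);
while ischar(current_line)                       % both readers return -1 at the end of the file
    lines_count = lines_count + 1;
    current_line = reader(file_id);
end
elapsed = toc;
fprintf('%s, mode ''%s'': %d lines in %.3f sec\n', ...
    func2str(reader), permission, lines_count, elapsed);
end
```

It could be called, for example, as time_line_reader('myfile.txt', 'rt', @fgetl), time_line_reader('myfile.txt', 'r', @fgetl) and time_line_reader('myfile.txt', 'r', @fgets) to compare the three variants discussed here.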

  1 Comment

Hello Jan,
Thank you for your answer. I've tried it in my environment and indeed, if the file is opened in "r" mode, fgetl runs about 10 times faster than if I open the file in "rt" mode (and fgets is about 20 times faster).
I now see this in the documentation for fgetl:
'''
Open or create a new file in text mode if you want to write to it in MATLAB and then open it in Microsoft® Notepad, or any text editor that does not recognize '\n' as a newline sequence. When writing to the file, end each line with '\r\n'. For an example, see fprintf. Otherwise, open files in binary mode for better performance.
'''
But honestly speaking, the 10x difference is a bit frustrating for me...