What is an efficient way to parse a file without fscanf or textscan?
Note: the code I show in this question was already quoted in a different question: how to extract data from the fixed-width-field format using fscanf or textscan, but there the matter of the question was slightly different.
The problem:
My experiments showed that the use of fscanf seems to be dramatically faster than fgetl. More specifically, I did some very simple profiling with a file that has 35597 lines. The results were very unexpected for me. The code that uses fscanf needed just 0.253614 seconds to complete:
function [data] = test_fscanf_nodes_only_01()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));
    data = fscanf(file_id, '%8d%16f%16f%16f%8f%8f', [6, Inf]);
end
while the following code with fgetl needs 6.343209 seconds to complete, even though it does much less work:
function [data] = test_fscanf_nodes_only_02()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));
    lines_count = 0;
    while ~feof(file_id)
        current_line = fgetl(file_id);
        lines_count = lines_count + 1;
    end
    data = 1;
    fprintf('Lines count: %d\n', lines_count);
end
Just for comparison, the following code in Python runs in 0.021914958953857422 seconds:
import time

FILE_NAME = 'myfile.txt'

def main():
    lines_count = 0
    with open(FILE_NAME, 'r') as input_file:
        for line in input_file:
            lines_count += 1
    print(lines_count)

if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)
I can accept that it runs 10 times faster than the code with fscanf, because it does not do the actual parsing. But I cannot understand how it can be 300 times faster than the code which does exactly the same thing in MATLAB.
My further experiments showed that if I read the whole file into memory and just search for the EOL characters (and count lines), it takes just about 0.1 seconds. But as soon as I try to return a string for each line, the time gets close to the one I get with fgetl.
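For illustration, the in-memory counting I tried can be sketched like this (assuming the file fits into memory; 'myfile.txt' is the same placeholder name as above):

```matlab
% Read the whole file into memory once and count line endings
% without creating a separate string for each line.
raw = fileread('myfile.txt');       % whole file as one char vector
lines_count = sum(raw == newline);  % count the '\n' characters
fprintf('Lines count: %d\n', lines_count);
```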
My questions:
- Is it expected that fgetl is so slow (compared to fscanf)?
- If I cannot do the parsing with fscanf (e.g. because the structure of the lines does not allow it), are there some other "fast" ways to do the parsing?
- Is it correct that it is string object creation which introduces this high performance penalty into the work of fgetl, or is it something else?
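Regarding the second question, one candidate I considered is splitting the whole file in memory with regexp and parsing the lines afterwards; just a sketch, and I am not sure it is the fastest way (the sscanf call here is only an example of per-line parsing):

```matlab
% Read the file once, split it into lines with regexp, and parse
% individual lines later, e.g. with sscanf.
raw = fileread('myfile.txt');
all_lines = regexp(raw, '\r?\n', 'split');  % cell array, one char vector per line
first_values = sscanf(all_lines{1}, '%d %f %f %f %f %f');
```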
EDIT: fixed the dimension in the call to fscanf to [6, Inf] according to the comment from Jan.
Accepted Answer
Jan on 16 Jun 2019 (edited: 16 Jun 2019)
I've created a test file first:

fid = fopen('myfile.txt', 'w');
n = 35597;  % number of lines, matching the question
fprintf(fid, '%.0g %16f %16f %16f %8f %8f\n', rand(6, n));
fclose(fid);
Now I run your first code with "[6, Inf]" instead of "[5, Inf]". Under Win10, MATLAB R2018b it needs 0.55 sec on my machine.
The 2nd code with fgetl needs 2.9 sec on my machine. As soon as I open the file in 'r' mode instead of 'rt', the runtime is reduced to 0.4 sec. It looks like fgetl is less efficient in text mode.
The runtime is only 0.35 sec when I omit the feof check:

current_line = 'dummy';
while ischar(current_line)
    current_line = fgetl(file_id);
    lines_count = lines_count + 1;
end
With fgets instead of fgetl it is reduced to 0.24 sec. Look into the M-code of fgetl:
type fgetl
to see the difference.
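For reference, the fgets variant is the same loop with only the function name changed (a sketch, reusing file_id and lines_count from the code above):

```matlab
% Same counting loop with fgets: unlike fgetl it does not strip the
% trailing newline, so it skips that extra per-line work.
current_line = 'dummy';
while ischar(current_line)
    current_line = fgets(file_id);
    lines_count = lines_count + 1;
end
```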