Quicker way than a for loop for reading columns from different csv files in the same folder

Hello everyone,
I have a folder containing a large number of files, let's say 10600 csv files (useful_stator_files). Each csv file contains a large number of columns (about 100), and the number of rows varies from 10 to 60. I am using the code below:
for stupid_k = 1:length(useful_source_files_stator)
    final_path_stator{stupid_k} = useful_source_files_stator(stupid_k).name; % take the name of the final path, depending on the csv files I kept
    csv_path = fullfile(source_dir, final_path_stator{stupid_k});
    [SIR_AVG, SIR_MIN, SIR_MAX] = csvimport(csv_path, 'columns', {'RES_INS_STA_LOG_AVG_PRI', 'RES_INS_STA_LOG_MIN_PRI', 'RES_INS_STA_LOG_MAX_PRI'}, 'noHeader', false, 'delimiter', ','); % READ THE 3 COLUMNS FROM EACH USEFUL FILE
    [SPEED_AVG, SPEED_MIN, SPEED_MAX] = csvimport(csv_path, 'columns', {'SPD_ACT_LOG_AVG_PRI', 'SPD_ACT_LOG_MIN_PRI', 'SPD_ACT_LOG_MAX_PRI'}, 'noHeader', false, 'delimiter', ','); % READ THE 3 COLUMNS FROM EACH USEFUL FILE
    Date_Time = csvimport(csv_path, 'columns', {'Date_Time_ms'}, 'noHeader', false, 'delimiter', ','); % READ 1 COLUMN FROM EACH USEFUL FILE
    Big_SIR_AVG{stupid_k} = SIR_AVG;     % update big cell array
    Big_SIR_MIN{stupid_k} = SIR_MIN;     % update big cell array
    Big_SIR_MAX{stupid_k} = SIR_MAX;     % update big cell array
    Big_SPEED_AVG{stupid_k} = SPEED_AVG; % update big cell array
    Big_SPEED_MIN{stupid_k} = SPEED_MIN; % update big cell array
    Big_SPEED_MAX{stupid_k} = SPEED_MAX; % update big cell array
    Big_Date_Time{stupid_k} = Date_Time; % update big cell array
end
I have a stable path (source_dir) and a path that changes (final_path); I get inside each file, take the columns I want, and finally keep them in cell arrays, since they contain double vectors or string vectors. The csvimport function is from the MATLAB File Exchange.
It works, but all this takes a lot of time. I am also reading another 4-5 signals apart from the 7 in the code, but that is the idea.

Answers (1)

Guillaume on 4 Dec 2015
Parsing 10600 text files is always going to be slow, particularly on Windows, which probably struggles with that many files in a single directory. File I/O is probably the major bottleneck in what you're doing, and there's not much you can do about it short of using a more efficient form of storage for your data.
Parsing the same files three times (three calls to csvimport per file) is certainly not going to help. There's no guarantee that the csvimport code has been written optimally either (after a quick look, the file-reading part certainly isn't efficient). You would be much better off calling csvread (which comes with MATLAB) only once per file and doing the splitting into individual columns yourself (assuming that step is even necessary).
Preallocating your Big_* cell arrays would also help marginally.
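Even keeping csvimport, collapsing the three calls into one per file, together with preallocation, would look something like this (an untested sketch that reuses the column names and csvimport options from the question):

nFiles = length(useful_source_files_stator);
% preallocate one 1-by-nFiles cell array per signal
[Big_SIR_AVG, Big_SIR_MIN, Big_SIR_MAX, ...
 Big_SPEED_AVG, Big_SPEED_MIN, Big_SPEED_MAX, Big_Date_Time] = deal(cell(1, nFiles));
for k = 1:nFiles
    csv_path = fullfile(source_dir, useful_source_files_stator(k).name);
    % one parse per file: ask for all seven columns in a single call
    [Big_SIR_AVG{k}, Big_SIR_MIN{k}, Big_SIR_MAX{k}, ...
     Big_SPEED_AVG{k}, Big_SPEED_MIN{k}, Big_SPEED_MAX{k}, Big_Date_Time{k}] = ...
        csvimport(csv_path, 'columns', ...
            {'RES_INS_STA_LOG_AVG_PRI', 'RES_INS_STA_LOG_MIN_PRI', 'RES_INS_STA_LOG_MAX_PRI', ...
             'SPD_ACT_LOG_AVG_PRI', 'SPD_ACT_LOG_MIN_PRI', 'SPD_ACT_LOG_MAX_PRI', ...
             'Date_Time_ms'}, ...
            'noHeader', false, 'delimiter', ',');
end

That alone cuts the number of file parses from three per file to one.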
  2 Comments
Christos Antonakopoulos on 4 Dec 2015
Thank you,
Yes, the preallocation is already done. The csvimport function is needed because there are many string values inside the files; csvread, if I am not wrong, does not work with those. Unfortunately, I cannot change the way the data are stored.
Guillaume on 4 Dec 2015
Whichever function you use, the biggest and simplest speed-up you can make is to read each file once instead of three times. So ask for all your columns at once rather than in three different calls to the reading function.
If csvread does not work, other options are textscan, which requires a bit more work on your part (you have to open and close the file yourself), or readtable, which is dead simple to use but comes with the overhead of tables.
Or you could just parse each file yourself with regexp, as I showed you in one of your previous questions.
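For instance, a readtable version might look like the sketch below (untested against the actual files; it assumes the header row contains exactly the column names used in the question):

wanted = {'RES_INS_STA_LOG_AVG_PRI', 'RES_INS_STA_LOG_MIN_PRI', 'RES_INS_STA_LOG_MAX_PRI', ...
          'SPD_ACT_LOG_AVG_PRI', 'SPD_ACT_LOG_MIN_PRI', 'SPD_ACT_LOG_MAX_PRI', 'Date_Time_ms'};
nFiles = length(useful_source_files_stator);
Big = cell(length(wanted), nFiles);  % one row per signal, one column per file
for k = 1:nFiles
    % one parse per file; readtable detects the column types itself
    t = readtable(fullfile(source_dir, useful_source_files_stator(k).name), 'Delimiter', ',');
    for c = 1:length(wanted)
        Big{c, k} = t.(wanted{c});   % pull out each wanted column by name
    end
end

readtable copes with mixed numeric and string columns, which is why it can work where csvread does not.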
