Stacking more than 200K files (same #columns)

Hello, I have more than 200,000 csv files I would like to put/stack/append/concatenate over each other.
All csv file are in the same folder.
All csv files have the same number of columns (80)
But they may have different number of rows (from 5 to 5000)
I am currently using a loop and readtable for each file
out1=readtable(csvfilename{ii},opts);
out2=[out2;out1];
But it has taken forever, and likely out2 may be too big. Is this the correct way to deal with 200k files? Tall tables?
From the 80 columns in the csv, I only need the same 30 columns, this in case I can directly read those 30 to avoid making the final file too big.

Answers (1)

Cris LaPierre
Cris LaPierre on 13 Oct 2021
Edited: Cris LaPierre on 13 Oct 2021
Consider using a datastore. You can see an example in our Importing Multiple Data Files video from our Practical Data Science with MATLAB specialization on Coursera.
I'm not sure what 'taking forever' means, but it is going to take a while to load 200k files. because the array size hasnt' been preallocated, you are going to encounter memory issues as the array grows, as MATLAB has to keep moving it to larger and larger blocks of continuous membory (see here) If the size gets too big, you may need to look into using a tall array in order to work with the final result. In that case, you may want to look into a TallDatastore.

4 Comments

Thanks.
Following your suggestion I'm using datastore now, seems faster.
However, doing readall prompts an error due to format of var9 (column9)
Error using matlab.io.datastore.TabularDatastore/readall (line 196)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 1, field number 9) ==>
C,9575,2019,01,,1.54,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20190111,NULL\n
Caused by:
Reading the variable name 'var9' using format '%f' from file:
'C:\....csv'
starting at offset 0.
In each different csv, var9 is either always empty or it is always characters like 'ABC'
How can I set var9 to deal w having sometimes empty and sometimes 'ABC'?
Before datastore I was using a loop with readtable and detectimportoptions and had the same problem, when it is empty it called it as NaN,
So in the end the idea is to force it to text (thanks Walter Roberson)
opts = detectImportOptions(names{1});
opts = setvartype(opts,{'var9'},'char');
Then
[out1]=readtable(names{ii},opts);
When you just use readtable without setting any options, it will autodetect datatype. It sounds like you need to explicitly set the datatype for all files since the variable may be empty is some files. You can do this by writing your own import function and using fileDataStore, which allows you to specify a function that reads the datafiles. You can see this explained around the 2-minute mark in the linked video.
As for how to create the import function, you can try to do this manually, but the simplest way is to use the Import Tool and, once you have set things up the way you want, click 'Generate Function'. That same specialization has a How to use the Import Tool video and a Generating and Reusing Code video.
Thanks, to avoid the error I am using
ds1=datastore('C:...*.csv')
ds1.TextscanFormats{9}='%q'
ds2=readall(ds1);
I'm trying in a smaller set of 5 files (vs 200k) and it works
But, is there a way to use readall (or read) to get specific columns only?
specCol={'var1', 'var9','var65'}
However running:
ds1.SelectedVariableNames=specCol;
I get
Error using matlab.io.datastore.TabularTextDatastore/set.SelectedVariableNames (line 714)
SelectedVariableNames must be a unique subset of VariableNames.
nvm, there was a typo in a variable name.

Sign in to comment.

Categories

Asked:

on 13 Oct 2021

Commented:

on 13 Oct 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!