How can I work with the deep learning toolbox when dealing with large datasets that are NOT images?
Ana Guerra Langan
on 18 Nov 2019
Answered: Raunak Gupta
on 22 Nov 2019
Hello,
It's my first time dealing with the Deep Learning Toolbox and with large datasets in MATLAB, so honestly, any help or direction you can give me will be extremely helpful. I'm finding it hard to get started, since most of the documentation is about working with images and I'm working on a different kind of problem.
I have run simulations in Simulink and have over 2000 files with over 50,000 timesteps each (about 100 GB of data). Each of these files has about 60 recorded parameters, of which I want to use 39 as inputs and 2 as targets for my training (the remaining parameters would not be used). I do not want to train the network on a timeseries, so each timestep is a separate/independent datapoint in my training.
So far I know that:
- 100 GB of data is too much to load into memory for training.
- There are some "datastore" functions I can use to read the data lazily, without holding it all in memory. I'm not entirely sure this is correct, because the documentation linking datastores to the Deep Learning Toolbox only covers images, which makes it harder to understand.
And I have divided my task into some steps I think I need to follow:
- Build a custom datastore reading function that reads my files and extracts the 41 variables I need for training. a) I have no idea what this reading function even looks like or how it should be written. b) I do not know what format these training variables need to be in; some are inputs and some are targets... how does the network deal with this, and how do you specify which is which?
- Find a way to get the data into the training function. The examples, again, are focused on images and it's hard to extrapolate them to different problems. In my case, each file holds over 50,000 training samples (times the 41 variables involved); if these were images, each file would be one single image. How does the training function understand this? How do I specify that it must treat the samples this way?
I will emphasise that I have already read the documentation on the Deep Learning Toolbox, training, and datastores: half of it I don't understand, and the rest I'm not able to extrapolate to my problem. I would appreciate it if anyone who has worked on this for longer could give me a hand by explaining some things or pointing me in the right direction.
Thanks!
Accepted Answer
Raunak Gupta
on 22 Nov 2019
Hi,
Since it is mentioned that the timeseries data is to be treated as separate data points, and the targets seem to be regression values rather than classes, I suggest a neural network with enough depth (4-5 layers) and a regression output. Depth matters here because the amount of data is huge (approximately 2000 * 50000 = 100 million datapoints), so a shallow network will underfit; a sufficient number of parameters (weights) is required to learn the patterns in the data effectively. The regression network can be created with fitnet, where hiddenSizes is a row vector giving the number of hidden nodes in each layer. Note that hiddenSizes describes only the hidden layers: the 39 inputs and 2 outputs do not appear in it, because fitnet infers the input and output dimensions from the data when the network is trained.
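A minimal sketch of what this could look like (the hidden-layer sizes and the variable names X and T are illustrative assumptions, not tuned values):
% Four hidden layers; sizes here are placeholders to tune for your data
net = fitnet([128 64 32 16]);
% Scaled conjugate gradient scales better to large data than the default 'trainlm'
net.trainFcn = 'trainscg';
% X is 39-by-N (one column per sample), T is 2-by-N; the 39/2 dimensions
% are inferred from these matrices, not from hiddenSizes
net = train(net, X, T);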
For the data, I assume each file mentioned is a '.mat' file containing 50000 rows and 60 columns. You may create a fileDatastore which holds the location of each file. Each file can then be read as follows.
% Create a datastore listing every .mat file in the training folder
% ('training_location' is a placeholder for the actual path)
training_data = fileDatastore('training_location','ReadFcn',@load,'FileExtensions','.mat');
% Load one file; in practice, loop over every file to complete one epoch
data_struct = load(training_data.Files{1});
% This assumes each .mat file stores its matrix in a variable named 'train',
% with columns 1 to 39 the input features and columns 40-41 the targets;
% change the name and indices to match your files
data = data_struct.train;
features = data(:,1:39);  % named 'features' so the train() function is not shadowed
labels = data(:,40:41);
After that, you may read the files one by one with their labels and call train on each file's data. Note that there are 2000 files, so one epoch is completed after passing all 2000 files to the network. You may initially split the files randomly into training, validation, and testing sets as required, and set up a separate fileDatastore for each.
It is better to save the network to a checkpoint path after some iterations (here, one iteration means passing one file). Using a GPU and a Parallel Pool will also speed up the process. For training, you may loop over the epochs and, within each epoch, over each file, as in the sketch below.
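A hypothetical loop putting these pieces together; net is the fitnet network from the sketch above, and the epoch count, checkpoint interval, and 'checkpoints' folder are assumptions:
net.trainParam.showWindow = false;  % suppress the training GUI on each call
numEpochs = 10;                     % illustrative value
for epoch = 1:numEpochs
    for k = 1:numel(training_data.Files)
        data_struct = load(training_data.Files{k});
        data = data_struct.train;   % assumed variable name, as above
        X = data(:,1:39)';          % transpose: train() expects one column per sample
        T = data(:,40:41)';
        net = train(net, X, T);     % continues from the current weights
        if mod(k,100) == 0          % checkpoint every 100 files (illustrative)
            save(fullfile('checkpoints', sprintf('net_e%d_f%d.mat', epoch, k)), 'net');
        end
    end
end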
Hope this clarifies some things from the question.