Manually subsetting data for training and testing purposes
I have a dataset containing locations, rows for each observation (month) and various climate data in the columns. Something that looks like this:
site_number month datacol1 datacol2 datacol3 etc...
1 Jan data1 data2 data3 ....
1 Feb data1 data2 data3 ...
....
1 Dec data1 data2 data3
...
2 Jan data1 data2 data3 ....
2 Feb data1 data2 data3 ....
.....
2 Dec data1 data2 data3
....
....
etc...
I want to create training and testing data from this dataset, and I want both sets to be made up of whole blocks grouped by site number (each site has 12 rows, one per month).
To put this into context, I have 28 sites altogether. I want to cross-validate using a testing dataset containing 3 sites (36 rows in total) and a training dataset containing the other 25 sites, grouped by site number.
Can anybody advise on how I can do this please?
2 Comments
Chad Greene
on 7 Oct 2017
Hi Roisin,
I'm not sure what you mean by training and testing data, or what you mean by cross validating. These seem like very general terms, but it sounds like they mean something specific to you. (This likely reflects my own ignorance of the methods you are trying to implement.)
If you want to create some array x containing all the data1 values corresponding to site 2, you could do so with
x = datacol1(site_number==2);
Does that answer your question?
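The same logical indexing extends to holding out several whole sites at once. The lines below are only a minimal sketch under stated assumptions: the data are synthetic stand-ins built to match the layout described in the question (28 sites, 12 rows each), the three test sites are chosen at random, and the names testSites, isTest, trainData and testData are hypothetical.

```matlab
% Synthetic stand-in for the real data: 28 sites x 12 months (assumed layout).
numSites = 28;
site_number = repelem((1:numSites)', 12);   % 336-by-1 vector of site labels
datacol1 = rand(numel(site_number), 1);     % placeholder climate values

% Choose 3 whole sites at random for testing (seed fixed only for repeatability).
rng(1);
siteIDs = unique(site_number);
testSites = siteIDs(randperm(numel(siteIDs), 3));

% Logical row mask: every row belonging to a test site goes to the test set.
isTest = ismember(site_number, testSites);
testData = datacol1(isTest);     % 36 rows (3 sites x 12 months)
trainData = datacol1(~isTest);   % 300 rows (25 sites x 12 months)
```

Because the mask is built from site labels rather than row numbers, no site can end up split between the two sets.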
Roisin Loughnane
on 8 Oct 2017
Answers (2)
Kian Azami
on 8 Oct 2017
To make training and test datasets you can use the following commands:
% First, create a cross-validation partition of your data.
% y is a vector containing the class label of each observation.
% 'HoldOut' is an optional property that creates one training set and one test set.
% p is the fraction of the data used to form the test set (for example, 0.3).
c = cvpartition(y,'HoldOut',p)
% Get the logical indices of the training and test sets
trainingIdx = training(c);
testIdx = test(c);
% Extract the training and test data (Your_Data is a table that includes the variable y)
trainingData = Your_Data(trainingIdx,:);
testData = Your_Data(testIdx,:);
% Train a model on the training data
% 'y' is the name of the response variable in the table
Trained_Model = fitcknn(trainingData,'y')
% Predict the test data with the trained model
Pre_Test = predict(Trained_Model,testData)
% Finally, calculate the classification error of the model on the test data
% (loss needs the response variable name as well as the table)
testErr = loss(Trained_Model,testData,'y')
9 Comments
Kian Azami
on 8 Oct 2017
Another way to do this is to use the Classification Learner app in MATLAB.
Roisin Loughnane
on 8 Oct 2017
Kian Azami
on 8 Oct 2017
Edited: Kian Azami
on 8 Oct 2017
If you want to do supervised learning, you need to assign an output to your observations, which is called the response class. As I understood from your question, your response class is the site. You can make the site a categorical variable as follows:
Sites = categorical(sites);
I think it is better to put all of your data in an Excel sheet, then read that sheet with readtable:
T = readtable('YourExcelFile.xlsx');
T.Sites = categorical(T.Sites);
c = cvpartition(T.Sites,'HoldOut',0.3)
Otherwise, you need to define both the 'Observations' and 'Response' vectors in the cvpartition command. I don't have MATLAB at hand right now to tell you how to define them inside cvpartition, but if you look at the syntax in the help you will understand it easily.
These commands partition your data using the site as the grouping variable.
Roisin Loughnane
on 8 Oct 2017
Edited: Roisin Loughnane
on 8 Oct 2017
Kian Azami
on 8 Oct 2017
This is just the random division that cvpartition does. If you set p = 0.3, it puts 30 percent of the total observations in the test set and the rest in the training set. It is no problem to reconstruct the whole dataset again, because you have the indices of the test and training data. So if you want the whole dataset, you can define a new matrix like:
All_Data = [trainingData;testData];
By the way, 'HoldOut' is not an obligatory input. If you look at the help for cvpartition, you may find something more useful for your application. For example:
c = cvpartition(n,'KFold',k)
It will divide your n observations into k different subsamples.
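One way to use the 'KFold' syntax while keeping each site's 12 rows together is to partition the site IDs rather than the rows, then map each fold back to rows. This is a sketch under assumptions, not a built-in grouped option: the data are synthetic, k = 9 is an illustrative choice (28 sites in 9 folds gives 3 or 4 sites per test fold), and the variable names are hypothetical. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic site labels matching the layout in the question (assumed).
numSites = 28;
site_number = repelem((1:numSites)', 12);
siteIDs = unique(site_number);

% Partition the 28 site IDs (not the 336 rows) into k folds.
rng(1);                                  % seed fixed only for repeatability
k = 9;                                   % illustrative: 3 or 4 sites per test fold
c = cvpartition(numel(siteIDs), 'KFold', k);

timesTested = zeros(size(site_number));  % sanity counter: each row should be tested once
for i = 1:c.NumTestSets
    testSites = siteIDs(test(c, i));             % whole sites held out in fold i
    isTest = ismember(site_number, testSites);   % row mask for this fold
    timesTested = timesTested + isTest;
    % Fit on rows where ~isTest and evaluate on rows where isTest here.
    fprintf('Fold %d: %d test sites, %d test rows\n', i, numel(testSites), nnz(isTest));
end
```

Every row lands in exactly one test fold, and no site is ever split between training and test data within a fold.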
Roisin Loughnane
on 8 Oct 2017
Edited: Roisin Loughnane
on 8 Oct 2017
Kian Azami
on 8 Oct 2017
This is the way I do it. I put all the data in Excel and then load it as a table in MATLAB using the readtable command (which I explained previously). You need to create a table of your data in whatever way you can. Then, when you define the response class in cvpartition, it will divide all the data in the table based on that response class.
Roisin Loughnane
on 8 Oct 2017
Kian Azami
on 8 Oct 2017
Roisin, I think you are mixing up two different things, and I do not fully understand what you want.
You want to divide your data into test data and training data. I mean, if you have 100 observations, you keep 30 for testing and 70 for training. But on the other hand, you want to have all the observations together! Maybe you have something else in mind, but you have not explained it clearly.
Maybe you need to create another, larger category (another Excel column) and say that the first 10 sites are group1, the second 10 sites are group2, and so on. Then you do the learning process with those larger groups.
Sorry, but these are all the ideas and suggestions I have to offer ;)
Drew
on 13 Apr 2024
In R2023b, "custom" partition functionality was added to cvpartition. It can be used to create a cvpartition that keeps each site's rows together. See the "Version History" section of the cvpartition documentation page.
R2023b: Create custom cross-validation partitions
The cvpartition function supports the creation of custom cross-validation partitions. Use the CustomPartition name-value argument to specify the test set observations. For example, cvpartition("CustomPartition",testSets) specifies to partition the data based on the test sets in testSets. The IsCustom property of the resulting cvpartition object is set to 1 (true).
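For the site-blocked case in the question, a custom partition might look like the sketch below. It assumes R2023b or later, synthetic site labels, and that testSets is passed as a vector of fold numbers with one entry per row; assigning three consecutive sites to each fold is an illustrative choice, and with 28 sites the last fold contains only site 28.

```matlab
% Synthetic site labels matching the layout in the question (assumed).
numSites = 28;
site_number = repelem((1:numSites)', 12);

% Index each row by its site (1..numSites), then give every 3 consecutive
% sites the same fold number: sites 1-3 -> fold 1, sites 4-6 -> fold 2, ...
[~, ~, siteIdx] = unique(site_number);
sitesPerFold = 3;
testSets = ceil(siteIdx / sitesPerFold);   % fold number for each row (1..10)

% Build the partition from the per-row fold numbers (R2023b or later).
c = cvpartition("CustomPartition", testSets);

% Rows and sites held out in the first fold.
isTest1 = test(c, 1);
sitesInFold1 = unique(site_number(isTest1))';
```

The resulting object can be used like any other cvpartition, for example with training(c,i) and test(c,i) inside a loop over folds.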