Manually subsetting data for training and testing purposes

I have a dataset containing locations, rows for each observation (month) and various climate data in the columns. Something that looks like this:
site_number month datacol1 datacol2 datacol3 etc...
1 Jan data1 data2 data3 ....
1 Feb data1 data2 data3 ...
....
1 Dec data1 data2 data3
...
2 Jan data1 data2 data3 ....
2 Feb data1 data2 data3 ....
.....
2 Dec data1 data2 data3
....
....
etc...
I want to create training and testing data for this dataset, and I want these datasets to be in blocks containing groups based on their individual site number (each site containing 12 observations for months in the rows).
To put this into context, I have 28 sites altogether, and I want to cross-validate using a testing dataset containing 3 sites (a total of 36 rows) and a training dataset containing the other 25 sites, with the rows of each site kept together.
Can anybody advise on how I can do this please?

2 Comments

Hi Roisin,
I'm not sure what you mean by training and testing data, or what you mean by cross validating. These seem like very general terms, but it sounds like they mean something specific to you. (This likely reflects my own ignorance of the methods you are trying to implement.)
If you want to create some array x containing all the data1 values corresponding to site 2, you could do so with
x = datacol1(site_number==2);
Does that answer your question?
Hi Chad,
Thank you for your reply, but that's not exactly what I am looking for. I want to separate the dataset into blocks based on the site numbers (1st column). I want all data to be included in this block, not just datacol1. Help please! I'm sure this is easy but what I've found (unique/accumarray/splitapply) has not helped me so far.
The testing and training data is really just data partitioning/subsetting in order to fit models to a proportion of the dataset, and test on the remaining data for boosted regression trees.
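One way to keep whole site blocks together, with every column included, is to build a logical row mask from the site numbers and index the table with it. The following is only a minimal sketch: the table T and its synthetic contents are placeholders standing in for the real dataset, and the three test sites are drawn at random.

```matlab
% Synthetic stand-in for the real table: 28 sites x 12 months
% (T, month and datacol1 are placeholder names and values)
nSites = 28;
site_number = repelem((1:nSites)', 12);
month = repmat((1:12)', nSites, 1);
datacol1 = rand(nSites*12, 1);
T = table(site_number, month, datacol1);

% Pick 3 whole sites at random for testing
sites = unique(T.site_number);
testSites = sites(randperm(numel(sites), 3));

% Logical row mask: true for every row that belongs to a test site
isTest = ismember(T.site_number, testSites);

testData = T(isTest, :);       % all 12 rows of each test site, all columns
trainingData = T(~isTest, :);  % all rows of the remaining 25 sites
```

Because the mask is built per site rather than per row, a site can never end up split between the two sets.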


Answers (2)

To make training and testing datasets, you can use the following commands:
% First, create a cross-validation partition of your data
% y is a vector which contains the categories of your observations
% 'HoldOut' reserves a fraction p of the observations for the test set
p = 0.3; % fraction of data to form the test set (example value)
c = cvpartition(y,'HoldOut',p)
% Now you can find the indices of your training and test sets
trainingIdx = training(c);
testIdx = test(c);
% Now you can find your training and test data
trainingData = Your_Data(trainingIdx,:);
testData = Your_Data(testIdx,:);
% Then you can learn from your training data
% 'y' is the name of the response variable (Your_Data must be a table)
Trained_Model = fitcknn(trainingData,'y')
% You can predict the test data with the Trained_Model you defined
Pre_Test = predict(Trained_Model,testData)
% Finally, you can calculate the error of your model for this test data
testErr = loss(Trained_Model,testData,'y')

9 Comments

Another way to do these things is to use the Classification Learner app in MATLAB.
Thank you Kian. Can you tell me about the first y vector, of categories. How can I make these categories?
I know 'HoldOut' uses the class information for stratification, so that both the training and test sets have roughly the same class proportions. I don't think this is what I want. I want my dataset to be taken in blocks/groups based on their site number.
If you want to do supervised learning, you need to assign an output to your observations, which is called the response class. As I understood from your question, your response class is the 'Sites'. You can make the site a categorical variable as follows:
Sites = categorical(sites);
I think it is better to put all of your data in an Excel sheet and then read that sheet with:
T = readtable('YourExcelFile.xlsx');
T.Sites = categorical(T.Sites);
c = cvpartition(T.Sites,'HoldOut',0.3)
Otherwise, you need to define both the 'Observations' and 'Response' vectors in the cvpartition command. I don't have MATLAB at hand right now to tell you how to define them inside cvpartition, but if you look at the syntax in the help you will understand easily.
These commands exactly group your data based on the site.
This is very helpful, but with 'HoldOut' the test data takes some observations from each site, leaving the site blocks incomplete, which I don't want. I want to keep all 12 observations of a site together.
This is just the random division that cvpartition does. If you set p = 0.3, it puts 30 percent of the total observations in the test set and the rest in the training set. But it is no problem to reconstruct the whole dataset again, because you have the indices of the test and training data. So if you want the whole dataset, you can define a new matrix like:
All_Data = [trainingData;testData];
By the way, 'HoldOut' is not an obligatory input. If you look at the help of cvpartition, you may find something that is more useful for your application. For example:
c = cvpartition(n,'KFold',k)
It will divide your n observations into k different subsamples.
Yes, but how do I say to take k subsamples but in groups of 12 based on their site? I have looked at the help and have tried to work around using both holdout and kFold. The problem is that I don't know how to say to take the rows as blocks according to their site number.
This is how I do it. I put all the data in Excel and then load it as a table in MATLAB using the readtable command (which I explained previously). (You need to create a table of your data in any way you can.) Now, when you define the response class in cvpartition, it will understand that it should divide all the data in the table based on that response class.
I have all the data in a table. The problem is that it does not divide based on the site class. It takes the given class (site, in my case) and partitions based on this class, putting a proportion of each site in both the training and test data. I want all of a site's observations to stay together, in either the training or the test data.
Roisin, I think you are confusing two different things which I do not understand.
You want to divide your data into test data and training data. I mean if you have 100 observations you keep 30 for test and 70 for training. But on the other hand, you want to have all the observations together! Maybe you have something else in your mind, but you do not explain it correctly.
Maybe you need to create another, larger category (another Excel column) and say that the first 10 sites are group1, the second 10 sites are group2, and so on. Then you do the learning process with those larger groups.
Sorry, but these were the things that I had in my hand, to give you some ideas and suggestions;)
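The blocking asked for above can be approximated in any release by partitioning the list of unique sites instead of the rows, and then mapping each fold back to rows. This is a sketch under assumptions rather than a tested recipe for the original data: site_number is rebuilt synthetically here, and k = 9 is a hypothetical choice that gives folds of three or four sites each for 28 sites.

```matlab
% Synthetic site column: 28 sites x 12 monthly rows (placeholder data)
site_number = repelem((1:28)', 12);

% Partition the SITES, not the rows
sites = unique(site_number);
k = 9;  % hypothetical number of folds
cSites = cvpartition(numel(sites), 'KFold', k);

foldOfSite = zeros(numel(sites), 1);  % records which fold tests each site
for i = 1:cSites.NumTestSets
    inFold = test(cSites, i);              % logical mask over sites
    foldOfSite(inFold) = i;
    testSites = sites(inFold);
    isTest = ismember(site_number, testSites);  % logical mask over rows
    % Fit the model on rows where ~isTest and evaluate on rows where isTest
    fprintf('Fold %d: %d test sites, %d test rows\n', ...
        i, numel(testSites), sum(isTest));
end
```

Each site is tested in exactly one fold, and all 12 of its rows move together because the row mask is derived from the site list.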


In R2023b, a "custom" partition functionality has been added to cvpartition. This functionality can be used to create a cvpartition that groups the sites together. See "Version History" section of cvpartition doc page.
R2023b: Create custom cross-validation partitions
The cvpartition function supports the creation of custom cross-validation partitions. Use the CustomPartition name-value argument to specify the test set observations. For example, cvpartition("CustomPartition",testSets) specifies to partition the data based on the test sets in testSets. The IsCustom property of the resulting cvpartition object is set to 1 (true).
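A minimal sketch of how this could apply to the site problem, assuming R2023b or later: pass a logical vector that is true for every row of the chosen test sites, which, as I understand the documentation, defines a holdout-style partition. The site_number column is synthetic here and the three test sites are an arbitrary, hypothetical choice.

```matlab
% Synthetic site column: 28 sites x 12 monthly rows (placeholder data)
site_number = repelem((1:28)', 12);

% Hypothetical choice of the three sites to hold out
testSites = [3 11 20];

% true for every row belonging to a test site
isTest = ismember(site_number, testSites);

% Custom holdout partition that keeps each site's rows together (R2023b+)
c = cvpartition("CustomPartition", isTest);

trainingIdx = training(c);  % logical index of the training rows
testIdx = test(c);          % logical index of the test rows
```

The resulting indices can then be used to index the data in the same way as trainingIdx and testIdx in the first answer.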

Asked:

on 7 Oct 2017

Answered:

on 13 Apr 2024
