manually subsetting data for training and testing purposes

Question

1 vote

I have a dataset containing locations, rows for each observation (month) and various climate data in the columns. Something that looks like this:

site_number month datacol1 datacol2 datacol3 etc...
1 Jan data1 data2 data3 ....
1 Feb data1 data2 data3 ...
....
1 Dec data1 data2 data3
...
2 Jan data1 data2 data3 ....
2 Feb data1 data2 data3 ....
.....
2 Dec data1 data2 data3
....
....
etc...

I want to create training and testing datasets from this data, split into blocks by site number, so that each site's 12 monthly rows stay together in one dataset or the other.

To put this into context, I have 28 sites altogether. I want to cross-validate using a testing dataset containing 3 sites (36 rows in total) and a training dataset containing the other 25 sites, with rows grouped by site number.

Can anybody advise on how I can do this please?

2 Comments

Chad Greene on 7 Oct 2017

Hi Roisin,

I'm not sure what you mean by training and testing data, or what you mean by cross validating. These seem like very general terms, but it sounds like they mean something specific to you. (This likely reflects my own ignorance of the methods you are trying to implement.)

If you want to create some array x containing all the data1 values corresponding to site 2, you could do so with

x = datacol1(site_number==2);

Does that answer your question?

Roisin Loughnane on 8 Oct 2017

Hi Chad,

Thank you for your reply, but that's not exactly what I am looking for. I want to separate the dataset into blocks based on the site numbers (1st column). I want all data to be included in this block, not just datacol1. Help please! I'm sure this is easy but what I've found (unique/accumarray/splitapply) has not helped me so far.

The testing and training data is really just data partitioning/subsetting in order to fit models to a proportion of the dataset, and test on the remaining data for boosted regression trees.
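One possible way to do the site-wise split described here, using base MATLAB only, is sketched below. The table `T`, the random `datacol1` values, and the fixed seed are hypothetical stand-ins for the real data; only the 28 sites, 12 months, and 3 held-out sites come from the question.

```matlab
% Hypothetical stand-in for the real table: 28 sites x 12 monthly rows
nSites = 28;
nMonths = 12;
site_number = repelem((1:nSites)', nMonths);   % 1,1,...,1,2,2,...,28
month = repmat((1:nMonths)', nSites, 1);       % 1..12 repeated per site
datacol1 = rand(nSites*nMonths, 1);            % placeholder climate data
T = table(site_number, month, datacol1);

rng(1);                                        % assumed seed, for a repeatable split
sites = unique(T.site_number);
testSites = sites(randperm(numel(sites), 3));  % pick 3 whole sites at random

% Every row of a chosen site goes to the test set, all columns included
isTest = ismember(T.site_number, testSites);
testData = T(isTest, :);
trainingData = T(~isTest, :);
```

Because the split is made on the list of unique site numbers rather than on rows, a site can never end up partly in training and partly in testing.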

Answer 1

Drew on 13 Apr 2024

0 votes

In R2023b, a "custom" partition option was added to cvpartition. It can be used to create a cvpartition that keeps each site's rows together. See the "Version History" section of the cvpartition documentation page:

R2023b: Create custom cross-validation partitions

The cvpartition function supports the creation of custom cross-validation partitions. Use the CustomPartition name-value argument to specify the test set observations. For example, cvpartition("CustomPartition",testSets) specifies to partition the data based on the test sets in testSets. The IsCustom property of the resulting cvpartition object is set to 1 (true).
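As a rough sketch of how this could apply to the question, the example below builds a logical vector that marks every row of 3 randomly chosen sites as test data and passes it as `testSets`. It assumes R2023b or later with Statistics and Machine Learning Toolbox; the `site_number` vector and the seed are hypothetical stand-ins for the real data.

```matlab
% Requires Statistics and Machine Learning Toolbox, R2023b or later
nSites = 28;
nMonths = 12;
site_number = repelem((1:nSites)', nMonths);   % hypothetical site column

rng(1);                                        % assumed seed, for repeatability
testSites = randperm(nSites, 3);               % 3 site numbers to hold out

% Logical vector, one element per row: true marks a test observation
testSets = ismember(site_number, testSites);

c = cvpartition("CustomPartition", testSets);
trainingIdx = training(c);                     % logical index of training rows
testIdx = test(c);                             % logical index of test rows
```

A logical vector gives a single holdout split. For a k-fold style scheme, the documentation describes `testSets` as a positive integer vector (or a logical matrix) instead, so each row could be given the fold number of its site.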

Answer 2

Kian Azami on 8 Oct 2017

1 vote

To make training and testing datasets you can use the following commands:

% First, create a cross-validation partition of your data.
% y is a vector containing the class label of each observation.
y = Your_Data.y;
% 'HoldOut' reserves a fraction p of the observations as the test set.
% The split is stratified by y, so each class appears in both sets.
p = 0.3; % example fraction of data to form the test set
c = cvpartition(y,'HoldOut',p)
% Logical indices of the training and test observations
trainingIdx = training(c);
testIdx = test(c);
% Use the indices to extract the training and test data (all columns)
trainingData = Your_Data(trainingIdx,:);
testData = Your_Data(testIdx,:);
% Train a model on the training data.
% 'y' is the name of the response variable in your table.
Trained_Model = fitcknn(trainingData,'y')
% Predict the test data with the trained model
Pre_Test = predict(Trained_Model,testData)
% Finally, calculate the error of the model on the test data.
% loss needs the response variable name as well as the table.
testErr = loss(Trained_Model,testData,'y')

9 Comments

Roisin Loughnane on 8 Oct 2017

I have all the data in a table. The problem is that this does not divide the data by site. It takes the given class (site, in my case) and partitions within each class, putting a proportion of every site in both the training and the test data. I want all of a site's observations to stay together, in either the training data or the test data.
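The problem described here can be checked directly. The sketch below makes a row-level split that ignores site membership and then lists the sites that have rows on both sides of it. The `site_number` vector, the 30% fraction, and the seed are hypothetical; the same `intersect` check could be applied to the `trainingIdx` and `testIdx` from any partition.

```matlab
nSites = 28;
nMonths = 12;
site_number = repelem((1:nSites)', nMonths);   % hypothetical site column

rng(1);                                        % assumed seed, for repeatability
n = numel(site_number);

% Row-level split: about 30% of rows chosen at random, ignoring sites
testIdx = false(n, 1);
testIdx(randperm(n, round(0.3*n))) = true;

% Sites that contribute rows to both the test set and the training set
sharedSites = intersect(site_number(testIdx), site_number(~testIdx));
fprintf('%d of %d sites are split across training and test\n', ...
    numel(sharedSites), nSites);
```

An empty `sharedSites` would mean every site sits entirely on one side of the split, which is the property being asked for.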

Kian Azami on 8 Oct 2017

Roisin, I think you are confusing two different things which I do not understand.

You want to divide your data into test data and training data. For example, if you have 100 observations, you keep 30 for testing and 70 for training. But on the other hand, you want to keep all the observations together! Maybe you have something else in mind, but you have not explained it clearly.

Maybe you need to create another, larger category (another Excel column) and say that the first 10 sites are group1, the second 10 sites are group2, and so on. Then you do the learning process using those larger groups.
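The "larger category" idea can be sketched in base MATLAB by giving every site a group number and then holding out one group at a time. The table `T`, the placeholder data, the seed, and the choice of 7 groups of 4 sites are assumptions made for illustration (28 sites do not divide evenly into groups of 3).

```matlab
% Hypothetical stand-in for the real table: 28 sites x 12 monthly rows
nSites = 28;
nMonths = 12;
site_number = repelem((1:nSites)', nMonths);
datacol1 = rand(nSites*nMonths, 1);            % placeholder climate data
T = table(site_number, datacol1);

rng(1);                                        % assumed seed, for repeatability
k = 7;                                         % assumed: 7 groups of 4 sites

% siteIdx maps each row of T to the position of its site in 'sites'
[sites, ~, siteIdx] = unique(T.site_number);

% Shuffle the sites, then deal them out to groups 1..k in turn
groupOfSite = mod(randperm(numel(sites))' - 1, k) + 1;
T.group = groupOfSite(siteIdx);                % every row inherits its site's group

for g = 1:k
    isTest = (T.group == g);
    trainingData = T(~isTest, :);              % rows from the other k-1 groups
    testData = T(isTest, :);                   % all rows of the held-out group
    % fit a model on trainingData and evaluate it on testData here
end
```

Each pass of the loop tests on 4 complete sites (48 rows) and trains on the remaining 24, so no site is ever shared between the two sets.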

Sorry, but these were the things that I had in my hand, to give you some ideas and suggestions;)
