Is my neural network also using testing data to predict and not only the training and validation data?

I made a neural time series analysis similar to the default one in the nnstart wizard. The only change I made to the code was to use divideblock instead of dividerand for my data to be divided into blocks. I have my time series analysis use the earliest data as training, then later data as validation, and finally the latest data as testing, which is why I used divideblock instead of dividerand. Using dividerand would scatter testing targets throughout the time frame rather than at the very end. I'm more interested in seeing if it can make accurate test outputs after the last training data point rather than between two training points.
Out of the 7000+ input data points, I set my training + validation to the first 99.8% of data and my test to the last 0.2% since I only want to see how well it can predict in the future short term (only 14 data points ahead). The results looked okay. I wanted to test the data manually to make sure it wasn't using the test data to learn even though the tutorial stated the test data was independent. The exact words of the tutorial for test data: "These have no effect on training and so provide an independent measure of network performance during and after training."
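For reference, the division setup described here amounts to something like the following sketch. The network line and the exact train/validation split within the first 99.8% are placeholders; only the divideblock choice and the ~0.2% test block are stated above.

```matlab
% Sketch of the described data division (placeholder network and split;
% only divideblock and the 0.2% test block are taken from this post).
net = narxnet(1:2, 1:2, 10);             % hypothetical delays / hidden size
net.divideFcn = 'divideblock';           % contiguous blocks: oldest -> newest
net.divideParam.trainRatio = 84.8/100;   % earliest data (placeholder split)
net.divideParam.valRatio   = 15/100;     % later data for validation
net.divideParam.testRatio  = 0.2/100;    % final ~14 points held out as test
```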
I set my last 4 data inputs to 0 (normally they would be in the thousands, like almost all prior inputs). So the first 10 values of my test data were normal (in the thousands) and the very last 4 were set to 0. I ran the code, and the resulting model showed the last 4 test outputs now fitted to fall to 0 by the end. I went back, reset them to their normal values (in the thousands), ran it again, and the test outputs were nearly spot on (in the thousands) instead of falling to 0 as in the other runs.
If the test targets were truly independent of the training and validation data, how is this happening? Every time I ran it with the last 4 test targets at 0, the test outputs dropped to 0. When they were reset to their normal values, the outputs were again almost spot on. To me, that means it's using the test set as part of training, or at the very least using the most recent test target to make the next test output (as if to correct a mistake when it was way off on the last data point prediction). Is there something I'm missing? Are the test targets being used incrementally for the next test output when it gets one wrong, or are they being used in the training set somehow? How can I set it to completely ignore test targets when making a test output?
Edit: Screenshot of graph attached.

Answers (2)

Greg Heath
Greg Heath on 25 Jun 2017
Edited: Greg Heath on 12 Jul 2017
1. a. I'm glad that you agree with me that DIVIDEBLOCK should be the default for time series prediction ( search NEWSGROUP and ANSWERS with GREG DIVIDEBLOCK ).
b. The current default of DIVIDERAND is obviously only good for interpolation.
c. I disagree with you w.r.t. the test subset data being involved in training. It definitely should not be. You can prove this by replacing the entire test subset with zeros and setting the RNG to the same initial state. You should get the same weight values.
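That zero-replacement check could be sketched as follows: a hypothetical, untested script using the simplenar_dataset example series from the narnet documentation. The 85% boundary assumes the default 0.7/0.15/0.15 division ratios.

```matlab
% Sketch: train twice from the same RNG state, once with the divideblock
% test block (last 15% of the series) replaced by zeros, then compare weights.
T  = simplenar_dataset;                  % example series from the narnet docs
n  = numel(T);
Tz = T;
Tz(floor(0.85*n)+1 : n) = {0};           % zero out the final test block

rng(0)
net1 = narnet(1:2, 10);                  % 2 feedback delays, 10 hidden neurons
net1.divideFcn = 'divideblock';
[Xs, Xi, Ai, Ts] = preparets(net1, {}, {}, T);
net1 = train(net1, Xs, Ts, Xi, Ai);

rng(0)                                   % identical initial weights
net2 = narnet(1:2, 10);
net2.divideFcn = 'divideblock';
[Xs, Xi, Ai, Ts] = preparets(net2, {}, {}, Tz);
net2 = train(net2, Xs, Ts, Xi, Ai);

% Matching weight vectors would confirm the test subset plays no role
% in training.
maxdiff = max(abs(getwb(net1) - getwb(net2)))
```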
2. You have provided insufficient information. It is very rare that someone can get their point across clearly without posting code with comments.
3. You use the word input. I hope you are using NARNET for which there is no applied input. Otherwise things don't make sense.
4.
[ I N ] = size(input) = ?
[ O N ] = size(target) = ?
ID = input delays = ?
FD = feedback delays = ?
H = No. of hidden nodes = ?
[ Ntrn Nval Ntst]/N = 0.7/0.15/0.15 ?
5. What is the data point number for the 1st test point?
6. Please post your code.
Hope this helps.
Greg
  3 Comments
Greg Heath
Greg Heath on 26 Jun 2017
Edited: Greg Heath on 26 Jun 2017
Thanks for the detail. I will read it ASAP.
Unfortunately my computer was sick and I lost my MATLAB code. So far I haven't been able to load the replacement yet. Hope to get to it this week.
UGH ...
The code produced by nnstart is NEVER recommended to newbies by me. It is too voluminous and doesn't emphasize the important points. For example: first there is a long explanatory paragraph; then there is code that just assigns default values.
The code produced by the help and doc commands is SHORT and ALMOST SWEET.
help narnet
doc narnet
I explain the "ALMOST" in my recent SHORT & SWEET (;>) QUICKIES post
https://www.mathworks.com/matlabcentral/newsreader/view_thread/348883#954773
Later,
Greg
Greg Heath
Greg Heath on 26 Jun 2017
1. YOUR EXPERIMENT IS FAULTY
2. The basic assumption of NN learning is that the training, validation, and test subsets have similar summary statistics:
mean, standard deviation, and correlation function.
3. Obviously your test subset has a different mean.
4. In addition, if the training subset has no variance, the function can be modeled with zero-valued weights: the output can be created solely by the biases.
5. One way to prove that the test set does not affect learning is
a. Use the example data in the documentation
i) help narnet
ii) doc narnet
b. Assign an initial state to the RNG (e.g., rng(0))
c. Design net1 with the documentation code from (a) and DIVIDEBLOCK
d. Reassign the same initial RNG state as in (b)
e. Design net2 with DIVIDEBLOCK and Ntst = 0
f. Compare the two sets of weights and biases.
6. I have not demonstrated this. However, if you have the time ... (;>)
Hope this helps.
Greg



ztune9
ztune9 on 29 Jun 2017
Edited: ztune9 on 29 Jun 2017
Thanks for your reply. The first time I ran it and noticed this bias, my training subset had 7000+ data points with variance (that was my first post, with the first attached image). I created the second test with far fewer data points (only 100) and all training values set to 100, to simplify my question as much as possible, since the bias was reproducible and more obvious that way (the image in my second post). All I'm asking is why the test output follows the test targets when I have trained it not to do so.
I understand this statement you made: basic assumption of NN learning is that the training, validation, and test subsets have similar summary secondary statistics.
I wanted to see whether that statement actually holds in the implementation, and from what I ran, it doesn't appear to. I intentionally trained my data to give an output of 100 every single time. Instead, it used the test targets to adjust its response. I don't see anywhere in MATLAB's documentation how or why it's using my test targets to adjust its test output when the two should be independent.
I think using a zero-variance training set actually puts their claim to the test, because I know EXACTLY what the test output should be when I've trained the network exactly what to say. If I train with a high-variance dataset, I can't tell whether the variable test outputs are biased toward the test targets or not. The way I did it shows that the output is biased and is using the test targets to adjust. I want to know how it's doing that, and I want to remove that behavior so it gives an unbiased output. I may have to contact MATLAB directly, as I think it's something in their function code that I can't change.
  5 Comments
ztune9
ztune9 on 14 Jul 2017
To give an update, I've been going back and forth with MATLAB support on this. My conclusion is that the test targets affected the test outputs when they should not have, so support and I changed the code I posted earlier to set "net.divideParam.testRatio = 15/100;" to 0 and the other two to total 100 (70 and 30). Since the test data was supposed to be "independent" of training, I wanted to make sure it had zero effect by removing it entirely; I will compare test data to predictions offline in Excel instead.
At any rate, the issue I'm having now is running the net after it's trained to generate an output. I'm awaiting a response from MATLAB, but the commands I was provided generated outputs different from what I expected, and I'm not sure they were the right commands for a time series prediction (the new net I trained was a simple linear y = x data set, so time point 1 = 1, 2 = 2, etc., up to 100).
My question is: how can I take the trained network (that used time points 1 to 100) and have it generate an output at time point 101? There is no input, in the sense that there is no data I'm entering to get an output other than a specific time point. I'd like a simple command where I ask the trained net to give me a value at time point 101 after it has been trained and validated on time points 1 to 100.
Hopefully such a simple command exists and it works and I can put this topic to rest.
ztune9
ztune9 on 18 Jul 2017
I got the answer from MATLAB support. After the net is trained, the following commands need to be run. All my questions were answered by support, so I no longer need assistance in this thread. Thanks.
>> test = zeros(1, 5)                      % 5 = number of values forward you want predicted
>> testData = tonndata(test, true, false)  % convert to cell-array time series form
>> testResult = netc(testData, xic, aic)   % netc, xic, aic come from the closed-loop setup
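Those three commands assume a closed-loop network netc and initial states xic and aic that the thread never shows being created. A rough, untested reconstruction of the pipeline they imply, following the narnet documentation pattern (variable names xic, aic, netc match the thread; everything else is an assumption):

```matlab
% Assumed reconstruction of the setup behind the commands above (untested;
% network sizes are placeholders). Trains on the simple y = x series.
T = num2cell(1:100);                     % target series: value k at time point k
net = narnet(1:2, 10);                   % 2 feedback delays, 10 hidden neurons
net.divideFcn = 'divideblock';
net.divideParam.trainRatio = 70/100;     % 70/30/0 split, as described above
net.divideParam.valRatio   = 30/100;
net.divideParam.testRatio  = 0;
[Xs, Xi, Ai, Ts] = preparets(net, {}, {}, T);
net = train(net, Xs, Ts, Xi, Ai);

% Run the open-loop net over the known data to obtain its final delay
% states, then convert those into closed-loop initial conditions so that
% prediction continues from time point 100.
[Y, Xf, Af] = net(Xs, Xi, Ai);
[netc, xic, aic] = closeloop(net, Xf, Af);

% Predict 5 steps beyond the data (time points 101-105). The documentation
% passes an empty 0-by-5 cell here, since the closed-loop net has no
% external input; the zeros-based call in the thread is the support-provided
% alternative.
testResult = netc(cell(0, 5), xic, aic)
```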

