How well can I predict task performance from predictor variables?
Hello,
I have the following research question: I would like to predict the performance in a response time experiment (participants have to respond as fast as possible to a target stimulus) from three neural measures: Amplitude of an EEG signal, speed of a saccade (eye movement), and activity in a specific brain area as measured with fMRI.
What I have is a matrix with 5 columns: participant ID, EEG, saccade, fMRI, response time. The first column is just to identify the participants, columns 2-4 are predictor variables and the fifth column is the to-be-predicted variable.
Here are the actual questions: What would be a good way of testing how well I can predict the task performance? A regression I assume? Which function in MATLAB would you recommend? Does it make sense to segment the participants before running the regression?
Thanks,
Tim
1 Comment
dpb
on 7 May 2021
We have absolutely no way to answer your question having no knowledge of the experiment.
Rightfully, the analysis methods would have been picked first and then the experiment designed and executed so as to be able to estimate the parameters of the model.
See a G. E. Box white paper that outlines some of the possible problems here <Regression Analysis Applied to Happenstance Data>
Given you already have the data and likely can't repeat the experiment, one must do what one can to at least be aware of potential issues unless the data were taken under well-controlled circumstances.
As for the last question specifically: "maybe". Your independent variables lack information such as age, sex, and health status, all of which may be the easily-thought-of confounding variables of which Professor Box speaks, not to mention less obvious but possibly more important factors such as the amount and quality of sleep the night before.
Answers (1)
Scott MacKenzie
on 7 May 2021
Edited: Scott MacKenzie
on 8 May 2021
If you want a prediction equation expressing RT as a linear function of "amplitude of EEG signal", "speed of saccade", and "fMRI brain activity" and you've already collected the data, this is doable. Of course, who's to say the relationship with each of these variables is linear? But that's another story (see dpb's comment).
The following code with fake data for 10 participants demonstrates the mechanics of building such a model. And it should get you thinking about your goals.
eeg = rand(10,1);
saccade = rand(10,1);
fmri = rand(10,1);
rt = rand(10,1);
data = [ones(size(eeg)) eeg saccade fmri];
[b, ~, ~, ~, stats] = regress(rt, data)
Output:
b =
0.86189
-0.16142
-0.58097
0.034387
stats =
0.3729 1.1893 0.39018 0.12236
The prediction equation is
rt = 0.862 - 0.161 x eeg - 0.581 x saccade + 0.034 x fmri
with R^2 = 0.3729.
I suggest you read the documentation on the regress function and study the examples. Good luck.
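For reference, the four values returned in stats by regress are, in order, R^2, the F-statistic, the p-value of that F-test, and an estimate of the error variance. The following is a minimal sketch that unpacks them. The data are made up, and the simulated relationship between rt and eeg is an assumption for illustration only. regress requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical data for 10 participants; not from any real experiment.
rng(1);                                     % reproducible random numbers
eeg     = rand(10,1);
saccade = rand(10,1);
fmri    = rand(10,1);
rt      = 0.5 + 0.3*eeg + 0.1*randn(10,1);  % assumed relationship

X = [ones(size(eeg)) eeg saccade fmri];     % ones-column gives the intercept
[b, bint, ~, ~, stats] = regress(rt, X);

r2     = stats(1);   % R^2
F      = stats(2);   % F-statistic vs. constant model
pval   = stats(3);   % p-value of that F-test
errVar = stats(4);   % estimate of the error variance

fprintf('R^2 = %.3f, F = %.2f, p = %.3g, error variance = %.4f\n', ...
        r2, F, pval, errVar);
disp(bint)           % 95% confidence intervals, one row per coefficient
```

With only 10 observations the confidence intervals in bint will usually be wide, which is worth checking before reading much into the coefficients.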
12 Comments
Toby Feld
on 8 May 2021
Scott MacKenzie
on 8 May 2021
Yes. This is noted in the first entry in the regress documentation:

Excluding the ones-column forces the intercept to zero. The fit won't be quite as good, but there might be a reason for wanting a zero intercept.
As for non-linear models, you can do this by manipulating the variables before passing them into the regress function. For example, if you thought the model should be built using the EEG values "squared", then
data = [ones(size(eeg)) eeg.^2 saccade fmri];
Your final model would be
rt = m + b1 x eeg^2 + b2 x saccade + b3 x fmri
You might get a higher R^2 value and conclude that this is a better model for the empirical data you have collected.
Log and power transformations are common. Just do a google search on "log transformation" or "power transformations" and you'll find a ton of information. Good luck.
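As a rough sketch of the same idea with a log transformation: the predictor is transformed before it goes into the design matrix, and the model stays linear in its coefficients. The data below are simulated, the log relationship is assumed purely for illustration, and eeg is generated strictly positive so that log is defined (real EEG amplitudes may not be).

```matlab
% Simulated data; the log relationship is an assumption for illustration.
rng(2);
n       = 10;
eeg     = 1 + rand(n,1);                 % strictly positive so log() is defined
saccade = rand(n,1);
fmri    = rand(n,1);
rt      = 0.4 + 0.5*log(eeg) + 0.05*randn(n,1);

X_lin = [ones(n,1) eeg      saccade fmri];   % untransformed predictor
X_log = [ones(n,1) log(eeg) saccade fmri];   % log-transformed predictor

[~, ~, ~, ~, s_lin] = regress(rt, X_lin);
[~, ~, ~, ~, s_log] = regress(rt, X_log);

fprintf('R^2 with eeg:      %.4f\n', s_lin(1));
fprintf('R^2 with log(eeg): %.4f\n', s_log(1));
```

A small difference in R^2 between the two fits is not, on its own, evidence for one functional form over the other.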
dpb
on 8 May 2021
I would recommend fitlm as it produces much more in the way of diagnostics information than regress on its own.
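A minimal sketch of how fitlm might be applied to a 5-column matrix like the one described in the question (participant ID, EEG, saccade, fMRI, response time). The matrix M and its values are simulated here, and the column names are assumptions; fitlm requires the Statistics and Machine Learning Toolbox.

```matlab
% Simulated stand-in for the 5-column matrix described in the question.
rng(3);
n = 20;
M = [(1:n)' rand(n,3) zeros(n,1)];                     % ID, EEG, saccade, fMRI, rt
M(:,5) = 400 + 20*M(:,2) + 30*M(:,4) + 10*randn(n,1);  % assumed relationship

% Put columns 2-5 in a table so the output uses variable names, not x1..x3.
T = array2table(M(:,2:5), 'VariableNames', {'EEG','saccade','fMRI','rt'});

mdl = fitlm(T, 'rt ~ EEG + saccade + fMRI');
disp(mdl)   % coefficient table with SE, t-statistics, p-values, R^2, RMSE
```

Naming the variables makes the coefficient table easier to read than the default x1, x2, x3 labels.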
"Which function would you recommend if the relationship is not linear?"
Without visualization and knowledge of what experiment design was first, I'd not "recommend" any function. You can blindly fit any model that is estimable and you can add enough terms to get a high R-sq for virtually any dataset.
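The point about adding terms can be illustrated with pure noise. In this sketch the response y is random and has no relationship to X at all, yet ordinary R^2 cannot decrease as the nested model grows from 'linear' to 'interactions' to 'quadratic' (three of fitlm's built-in model names). All data here are simulated.

```matlab
% Pure-noise demonstration: y is unrelated to X by construction.
rng(4);
n = 20;
X = randn(n,3);
y = randn(n,1);

specs = {'linear', 'interactions', 'quadratic'};   % nested model specifications
r2    = zeros(1, numel(specs));
for k = 1:numel(specs)
    m = fitlm(X, y, specs{k});
    r2(k) = m.Rsquared.Ordinary;
    fprintf('%-13s R^2 = %.3f   adjusted R^2 = %.3f\n', ...
            specs{k}, m.Rsquared.Ordinary, m.Rsquared.Adjusted);
end
```

Adjusted R^2 penalizes the extra terms, so it is the more informative of the two numbers when comparing models of different size.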
Scott MacKenzie
on 8 May 2021
These are good points. Simply fishing around for higher correlations isn't the way to go.
Toby Feld
on 8 May 2021
dpb
on 8 May 2021
" I do indeed not want to fish around for good model fits,..."
Again, without some basis for a model, "correlation does not imply causation" so that really is all you are doing whether it is one term or a hundred.
The idea in general is as noted before; one starts with some hypothesis and tries to design and execute an experiment to prove/disprove the hypothesis.
Simply collecting data and making some sort of empirical fit is only that...we don't even know how many subjects there were, besides the other potential issues I see in the sample population that are unmeasured.
This is not a case where you can set the level of one of the hopeful predictor variables and then measure a response; all of the variables are themselves measured responses. Without knowing something about what kind of range there is in those, it's not even clear there's a reason to fit one or more of the variables.
IMO, there's just too much unknown to us here to be willing to make any recommendations whatever...
Have you done any visualization of the data?
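One possible way to get a first look at such data is a scatter-plot matrix plus a correlation table. This is only a sketch: the variables below are simulated stand-ins with assumed relationships, and the names EEG, saccade, fMRI, and rt are taken from the question.

```matlab
% Simulated stand-ins for the four measured variables.
rng(5);
n       = 45;
EEG     = randn(n,1);
saccade = randn(n,1);
fMRI    = randn(n,1);
rt      = 450 + 20*EEG + 35*fMRI + 15*randn(n,1);   % assumed relationship

vars  = [EEG saccade fMRI rt];
names = {'EEG', 'saccade', 'fMRI', 'rt'};

figure;
plotmatrix(vars);    % pairwise scatter plots, histograms on the diagonal
title('Pairwise scatter plots (order: EEG, saccade, fMRI, rt)');

R = corrcoef(vars);  % pairwise correlation coefficients
disp(array2table(R, 'VariableNames', names, 'RowNames', names));
```

The plots show the range of each variable, outliers, and obvious curvature, none of which a single R^2 value reveals.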
dpb
on 8 May 2021
Can you attach the data or at least a sizable-enough chunk so as to be able to see what it looks like?
Toby Feld
on 10 May 2021
Scott MacKenzie
on 10 May 2021
Sorry to jump in here, and dpb may have additional comments, but here's a quick analysis using fitlm:
% load variables (EEG, fMRI, saccade, rt)
load modellingRT
x = [EEG fMRI saccade]; % predictor variables
y = rt; % response variable
mdl = fitlm(x, y);
mdl.Coefficients.Estimate
mdl.Rsquared
Output:
ans =
464.76
18.336
37.777
0.11507
ans =
struct with fields:
Ordinary: 0.93908
Adjusted: 0.93462
With a bit of rounding of the coefficients, the model is
rt = 465 + 18.3 x EEG + 37.8 x fMRI + 0.115 x saccade
with R^2 = .9391.
That's a pretty good model. A reasonable conclusion is that the model explains 93.9% of the variation in rt in this sample.
There are many other stats available through the mdl structure. Good luck.
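As a sketch of what else the fitted model object offers, the following pulls out the coefficient table and RMSE and predicts rt for a new participant. The data are simulated (the attached modellingRT file is not reproduced here), and the new participant's values are hypothetical.

```matlab
% Simulated data; coefficients loosely mimic the thread but are assumptions.
rng(6);
n       = 45;
EEG     = randn(n,1);
fMRI    = randn(n,1);
saccade = randn(n,1);
rt      = 465 + 18*EEG + 38*fMRI + 18*randn(n,1);

mdl = fitlm([EEG fMRI saccade], rt);

disp(mdl.Coefficients)   % estimates, standard errors, t-statistics, p-values
fprintf('RMSE = %.2f, error degrees of freedom = %d\n', mdl.RMSE, mdl.DFE);

% Predict rt for a hypothetical new participant; the column order must
% match the predictor matrix used in fitlm (EEG, fMRI, saccade).
xNew = [0.5 -0.2 1.0];
[rtHat, rtInt] = predict(mdl, xNew, 'Prediction', 'observation');
fprintf('Predicted rt = %.1f, 95%% prediction interval = [%.1f, %.1f]\n', ...
        rtHat, rtInt(1), rtInt(2));
```

The width of the prediction interval is a more direct answer to "how well can I predict rt" than R^2, since it is expressed in the units of the response.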
Toby Feld
on 10 May 2021
Scott MacKenzie
on 10 May 2021
I'm not sure. Perhaps dpb will have some ideas to offer.
I don't have time to do much right now; I am interested and will try to get back later -- just one observation to emphasize what was said before -- "R-sq isn't the tell-all, end-all" to evaluate a model.
"I also tried nonlinear approaches: lm = fitlm([EEG saccade fMRI],rt,'quadratic') and get an even better R^2 ..."
- A quadratic surface is still a linear model; just higher order;
- Of course you get a higher R-sq; you've added six (6) additional terms and reduced the residual degrees of freedom (DOF) by that many as well.
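The term count can be checked directly. In this sketch, with three predictors and 45 simulated observations (the same count as in the thread, but not the actual data), 'quadratic' adds three pairwise interactions and three squared terms to the four terms of the linear model.

```matlab
% Simulated data with 45 observations and 3 predictors.
rng(7);
n = 45;
X = randn(n,3);
y = 465 + 18*X(:,1) + 38*X(:,3) + 18*randn(n,1);   % assumed relationship

mLin  = fitlm(X, y);                % y ~ 1 + x1 + x2 + x3
mQuad = fitlm(X, y, 'quadratic');   % adds interactions and squared terms

disp(mQuad.CoefficientNames')       % lists every term in the quadratic model
fprintf('linear:    %2d coefficients, error DOF = %d\n', ...
        mLin.NumCoefficients, mLin.DFE);
fprintf('quadratic: %2d coefficients, error DOF = %d\n', ...
        mQuad.NumCoefficients, mQuad.DFE);
```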
You seemingly still haven't looked at the model nor the data itself, though...the "exploratory" part --
>> mdl=fitlm([EEG saccade, fMRI],rt)
mdl =
Linear regression model:
y ~ 1 + x1 + x2 + x3
Estimated Coefficients:
                   Estimate      SE        tStat        pValue
                   ________    ______    ________    __________
    (Intercept)      464.76    8.2759      56.158    2.0735e-40
    x1               18.336    2.3883      7.6777    1.8516e-09
    x2              0.11507    3.0848    0.037303       0.97042
    x3               37.777    3.1301      12.069    4.4618e-15
Number of observations: 45, Error degrees of freedom: 41
Root Mean Squared Error: 18.3
R-squared: 0.939, Adjusted R-Squared: 0.935
F-statistic vs. constant model: 211, p-value = 6.17e-25
>>
NB: the coefficient x2 ~ saccade has an SE (standard error of the coefficient estimate) that is ~27X the magnitude of the coefficient (3.08 vs 0.115), with a p-value of 0.97 -- IOW, it is meaningless, as that says the coefficient is ~0.1 +/- 3 -- or anywhere between [-2.9, 3.1] at one SE.
So, to interpret this model more accurately, it's really the same thing as
>> fitlm([EEG, fMRI],rt)
ans =
Linear regression model:
y ~ 1 + x1 + x2
Estimated Coefficients:
                   Estimate      SE       tStat       pValue
                   ________    ______    ______    __________
    (Intercept)      464.85    7.7969     59.62     3.197e-42
    x1               18.367    2.2196    8.2748    2.3207e-10
    x2               37.852    2.3729    15.951    1.9796e-19
Number of observations: 45, Error degrees of freedom: 42
Root Mean Squared Error: 18.1
R-squared: 0.939, Adjusted R-Squared: 0.936
F-statistic vs. constant model: 324, p-value = 3.02e-26
>>
which actually is just slightly better with fewer terms -- RMSE 18.1 vs 18.3
"Everything should be as simple as possible, but not simpler." -- Einstein
Goes for model-building as well as physics.
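The comparison of the full and reduced models can be sketched as follows. The data are simulated so that saccade has no effect on rt by construction; this is an assumption that mirrors what the fit in the thread suggests, not the actual data. coefCI returns 95% confidence intervals for the coefficients.

```matlab
% Simulated data in which saccade has no effect on rt by construction.
rng(8);
n       = 45;
EEG     = randn(n,1);
saccade = randn(n,1);
fMRI    = randn(n,1);
rt      = 465 + 18*EEG + 38*fMRI + 18*randn(n,1);

full    = fitlm([EEG saccade fMRI], rt);   % all three predictors
reduced = fitlm([EEG fMRI], rt);           % saccade dropped

ci = coefCI(full);   % rows follow the order of full.CoefficientNames
disp(ci)             % an interval spanning zero flags a weak term

fprintf('full:    RMSE = %.2f, adjusted R^2 = %.4f\n', ...
        full.RMSE, full.Rsquared.Adjusted);
fprintf('reduced: RMSE = %.2f, adjusted R^2 = %.4f\n', ...
        reduced.RMSE, reduced.Rsquared.Adjusted);
```

Ordinary R^2 can never be lower for the full model, so RMSE and adjusted R^2 are the fairer numbers to compare.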
This doesn't even start on residuals analyses, etc., etc., etc., ...
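A first pass at residual analysis might look like the sketch below, which plots residuals against fitted values and a normal probability plot using plotResiduals. The data are simulated and the two-predictor model is an assumption for illustration.

```matlab
% Simulated two-predictor data set.
rng(9);
n = 45;
X = randn(n,2);
y = 465 + 18*X(:,1) + 38*X(:,2) + 18*randn(n,1);   % assumed relationship
mdl = fitlm(X, y);

figure;
subplot(1,2,1);
plotResiduals(mdl, 'fitted');        % look for curvature or a funnel shape
subplot(1,2,2);
plotResiduals(mdl, 'probability');   % points near the line suggest normality

res = mdl.Residuals.Raw;             % raw residuals, one per observation
fprintf('Largest absolute residual: %.2f\n', max(abs(res)));
```

Patterns in the first plot suggest a missing term or non-constant variance, and strong departures from the line in the second suggest non-normal errors.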