During time-by-time decoding or regression, should I tune hyper-parameters for every time point individually?

This is more of a general machine-learning question than a specific MNE-Python one, but here goes:

When training and scoring a classifier or regressor time point by time point to evaluate how performance evolves over time, is it advisable (or rather a methodological mistake?) to tune the model's hyper-parameters at every time point separately?

Take for example the case where I want to predict a continuous response variable (e.g., a rating) collected at the end of a trial from electrophysiological data in the few seconds preceding the rating. Assume I want to use ridge regression, whose performance critically depends on a suitable regularization parameter, alpha, commonly found through cross-validation. Would I run this cross-validation (e.g., via scikit-learn's RidgeCV) for every time point separately and then use that alpha for that specific time point, and potentially a completely different alpha for the next time point? Or would I rather average the alphas found across all time points and then use the same (mean) alpha for all time points?
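
For concreteness, here is a rough sketch of what the "separate alpha per time point" option would look like in my case, assuming MNE's SlidingEstimator wrapping a scikit-learn RidgeCV pipeline (data shapes, the alpha grid, and variable names are just placeholders, not my actual analysis):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from mne.decoding import SlidingEstimator, cross_val_multiscore

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 50))   # n_trials x n_channels x n_times (placeholder data)
y = rng.standard_normal(100)             # continuous rating, one per trial

# RidgeCV searches the alpha grid within each training fold; SlidingEstimator
# clones and refits the whole pipeline independently at every time point,
# so alpha is effectively re-tuned per time point.
alphas = np.logspace(-3, 5, 20)
clf = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
time_decoder = SlidingEstimator(clf, scoring="r2", n_jobs=1)

# Outer cross-validation for the reported scores; the alpha selection
# happens inside each training fold, separately for each time point.
scores = cross_val_multiscore(time_decoder, X, y, cv=5, n_jobs=1)
print(scores.mean(axis=0).shape)  # one score per time point
```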

I suppose the answer depends on the exact research question and interpretation of the results, but I couldn’t find any guidance in the literature I consulted. Any pointers would be highly appreciated!

Richard

cautiously tagging @agramfort here

Coming from an experimental neuroimaging background, I don't recall ever seeing averaging. If a separate regression/classification model is fit per time point, I'd think the reasonable default is to allow a separate hyperparameter selection per time point and use time-point-specific alphas.

My naive understanding is that if you perform time-point-specific fits and hyperparameter optimization, but then enforce a single alpha (e.g., an average), you might end up injecting high shrinkage onto model coefficients at time points where low shrinkage was found, and vice versa. So you might be over-fitting at some time points and under-fitting at others?
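
One quick sanity check (just a sketch with synthetic data and made-up shapes, not your actual setup): fit a RidgeCV per time point and look at how much the selected alpha varies across time points. If the chosen alphas span orders of magnitude, a single averaged alpha would necessarily over-regularize some time points and under-regularize others.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 50))   # n_trials x n_channels x n_times (synthetic)
y = rng.standard_normal(100)             # continuous target, one per trial

# Tune alpha independently at every time point and record what was chosen.
alpha_grid = np.logspace(-3, 5, 20)
chosen_alphas = np.array(
    [RidgeCV(alphas=alpha_grid).fit(X[..., t], y).alpha_ for t in range(X.shape[-1])]
)

# A wide spread here suggests a single (averaged) alpha would be a poor
# compromise across time points.
print(chosen_alphas.min(), np.median(chosen_alphas), chosen_alphas.max())
```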

I can’t think of a motivation for averaging alphas. Coefficient interpretability?

Perhaps this can be a pointer: Dupré la Tour et al., 2022, "Feature-space selection with banded ridge regression" (https://doi.org/10.1016/j.neuroimage.2022.119728). They motivate the use of separate hyperparameters per feature space in a jointly fit encoding model; in your decoding setting, I'd think of the individual signal time points as playing the role of separate feature spaces.

Not sure if the above holds water at all. Curious to see if others have more principled reasons for one or the other option.
