During time-by-time decoding or regression, should I tune hyper-parameters for every time point individually?

This is more of a general machine-learning question than a specific MNE-Python one, but here goes:

When training and scoring a classifier or regressor time point by time point to evaluate how performance evolves over time, is it advisable (or rather a methodological mistake?) to tune the model's hyper-parameters at every time point separately?

Take for example the case where I want to predict a continuous response variable (e.g., a rating) collected at the end of a trial from electrophysiological data in the few seconds preceding the rating. Assume I want to use ridge regression, whose performance critically depends on a suitable regularization parameter, alpha, commonly found through cross-validation. Would I run this cross-validation (e.g., via scikit-learn's RidgeCV) for every time point separately and then use that alpha for that specific time point, and potentially a completely different alpha for the next time point? Or would I rather average the alphas found across all time points and then use the same (mean) alpha for all time points?
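
For concreteness, here is a rough sketch of what the "separate alpha per time point" option would look like in my case, assuming MNE's SlidingEstimator wrapping a scikit-learn RidgeCV pipeline (data shapes, the alpha grid, and variable names are just placeholders, not my actual analysis):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from mne.decoding import SlidingEstimator, cross_val_multiscore

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 50))   # n_trials x n_channels x n_times (placeholder data)
y = rng.standard_normal(100)             # continuous rating, one per trial

# RidgeCV searches the alpha grid within each training fold; SlidingEstimator
# clones and refits the whole pipeline independently at every time point,
# so alpha is effectively re-tuned per time point.
alphas = np.logspace(-3, 5, 20)
clf = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
time_decoder = SlidingEstimator(clf, scoring="r2", n_jobs=1)

# Outer cross-validation for the reported scores; the alpha selection
# happens inside each training fold, separately for each time point.
scores = cross_val_multiscore(time_decoder, X, y, cv=5, n_jobs=1)
print(scores.mean(axis=0).shape)  # one score per time point
```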

I suppose the answer depends on the exact research question and interpretation of the results, but I couldn’t find any guidance in the literature I consulted. Any pointers would be highly appreciated!

Richard

cautiously tagging @agramfort here

Coming from an experimental neuroimaging background, I don't recall ever seeing averaging. If a separate regression/classification model is fit per time point, I'd think the reasonable default is to allow a separate hyperparameter selection per time point and use time-point-specific alphas.

My naive understanding is that if you perform time-point-specific fits and hyperparameter optimization, but then enforce a single alpha (e.g., an average), you might end up injecting high shrinkage onto model coefficients at time points where low shrinkage was found, and vice versa. So you might be over-fitting at some time points and under-fitting at others?
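
One quick sanity check (just a sketch with synthetic data and made-up shapes, not your actual setup): fit a RidgeCV per time point and look at how much the selected alpha varies across time points. If the chosen alphas span orders of magnitude, a single averaged alpha would necessarily over-regularize some time points and under-regularize others.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 50))   # n_trials x n_channels x n_times (synthetic)
y = rng.standard_normal(100)             # continuous target, one per trial

# Tune alpha independently at every time point and record what was chosen.
alpha_grid = np.logspace(-3, 5, 20)
chosen_alphas = np.array(
    [RidgeCV(alphas=alpha_grid).fit(X[..., t], y).alpha_ for t in range(X.shape[-1])]
)

# A wide spread here suggests a single (averaged) alpha would be a poor
# compromise across time points.
print(chosen_alphas.min(), np.median(chosen_alphas), chosen_alphas.max())
```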

I can’t think of a motivation for averaging alphas. Coefficient interpretability?

Perhaps this can be a pointer: Dupré la Tour et al., 2022, "Feature-space selection with banded ridge regression" (https://doi.org/10.1016/j.neuroimage.2022.119728). They motivate the use of separate hyperparameters per feature space in a jointly fit encoding model; in your decoding setting, I'd think of the individual signal time points as playing the role of separate feature spaces.

Not sure if the above holds water at all. Curious to see if others have more principled reasons for one or the other option.
