I am trying to perform temporal generalization decoding on data that were binned in time, and I am slightly confused as to what the best way is to format the data matrix.

For my specific analysis, I am extracting the data from two different conditions from my MNE epochs object. These extracted data are then binned in time, in 10 ms bins. Assuming my sampling rate is 1000 Hz, I have 10 samples per bin, and since my epochs are 1000 ms long, I have 100 such bins. My first intuition was to set up the matrix like so:
40 trials * 20 channels * 10 samples * 100 bins
By setting my matrix this way, the last dimension corresponds to the "time dimension" I want to generalize across, and the generalizing estimator should work as expected. Another option, however, would be to vectorize the time bins together with the channels, like so:
40 trials * (20 channels * 10 samples) * 100 bins, effectively resulting in a matrix with the following dimensions:
40 * 200 * 100
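For concreteness, both layouts can be built from the same continuous epochs array with plain NumPy reshapes. This is only a sketch with random data standing in for the real epochs; the array names and shapes are hypothetical but match the dimensions above:

```python
import numpy as np

# Hypothetical data: 40 trials x 20 channels x 1000 samples at 1000 Hz
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 20, 1000))

# Bin into 100 bins of 10 samples, keeping within-bin samples as their
# own axis: (trials, channels, samples_per_bin, bins)
X_4d = X.reshape(40, 20, 100, 10).transpose(0, 1, 3, 2)

# Vectorize channels and within-bin samples into one feature axis:
# (trials, channels * samples_per_bin, bins)
X_3d = X_4d.reshape(40, 200, 100)

# Per (trial, bin), both layouts hold exactly the same values
assert np.array_equal(X_3d[:, :, 0], X_4d[:, :, :, 0].reshape(40, 200))
```

Note the `transpose` after the first reshape: it moves the bin axis to the end so that the last dimension is the one the estimator generalizes across.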

Are the two different ways to set the matrices equivalent for the generalized estimator or is one better than the other?

My sense is that, as far as the generalizing estimator is concerned, this doesn't matter. However, when using the MNE Scaler function, the first option should be favored, as the second dimension then consists only of channels, as expected by that function.

One of the reasons I am worrying about such issues is that I see a lot of above-chance decoding in my baseline period that can hardly be explained by the experimental design. I am worried this is related to how my matrix is set up.

I am not averaging within each time bin. Rather, I want to use 10 ms of data as a feature, as opposed to a single time point. For our specific project, we believe that the encoding of the information is spread out in time rather than being present at single time points. Therefore, by binning (without averaging), my understanding is that we should be able to better capture what we are after than by performing a fully time-resolved analysis.

My understanding of mne.decoding.GeneralizingEstimator is that it takes the last dimension to generalize across, and therefore takes as features for a given trial the data in the remaining dimensions. In other words, when the matrix is as follows:
40 trials * 20 channels * 10 samples * 100 bins
the features for a given trial and time bin will have the dimensions:
20 channels * 10 samples. Alternatively, in the second option, the features will be one-dimensional with 200 data points. My understanding is that these two are fully equivalent, as the data points contained in the features are the same. Can you confirm that this is the case?
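One way to convince yourself of this equivalence without running the full MNE pipeline is to check that a plain scikit-learn classifier, fitted at a single bin, sees identical feature matrices (and therefore learns identical weights) under both layouts. Everything below is illustrative random data, not the real recordings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X4 = rng.standard_normal((40, 20, 10, 100))  # trials x ch x samples x bins
X3 = X4.reshape(40, 200, 100)                # trials x (ch*samples) x bins
y = np.repeat([0, 1], 20)                    # two conditions, 20 trials each

t = 5                                # an arbitrary time bin
feat4 = X4[..., t].reshape(40, -1)   # flatten (channels, samples) per trial
feat3 = X3[..., t]                   # already flat

assert np.array_equal(feat4, feat3)  # identical feature matrices

clf4 = LogisticRegression().fit(feat4, y)
clf3 = LogisticRegression().fit(feat3, y)
assert np.allclose(clf4.coef_, clf3.coef_)  # identical learned weights
```

Since NumPy's C-ordered reshape maps `X4[i, c, s, t]` to `X3[i, c * 10 + s, t]`, the flattening the estimator applies internally produces the same feature ordering either way.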

My apologies if this is unclear, I am rather new to this type of analysis.

Note that I resolved the baseline issue a few minutes ago, and it was completely unrelated to the binning or to any filtering steps. It came from a mistake at an earlier step, where I was randomly sampling trials with replacement, meaning the same trials occurred several times, inflating the decoding across the board by decreasing the variance within decoding target groups. This random resampling is independent of the cross-validation folds and comes from the fact that I am working with iEEG data, for which I need to aggregate all subjects' data into a super-subject to perform the decoding.
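For anyone hitting the same pitfall: the difference comes down to the `replace` flag in e.g. `numpy.random.Generator.choice` (the trial counts here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100

# Sampling WITH replacement duplicates trials; duplicated trials across
# the dataset shrink within-class variance and can inflate decoding scores
idx_bad = rng.choice(n_trials, size=40, replace=True)

# Sampling WITHOUT replacement keeps each trial at most once
idx_good = rng.choice(n_trials, size=40, replace=False)

assert len(np.unique(idx_good)) == 40  # no duplicates
```

Duplicates from `replace=True` are especially dangerous when they can land in both training and test folds of the later cross-validation, which amounts to leakage.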

OK, so mne.decoding.GeneralizingEstimator should indeed work (at least not crash).

Now, one issue I have is that if you use a linear model under the hood, you cannot expect it to deal with random phase or temporal jitter between windows.
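A toy illustration of why a fixed set of linear weights is sensitive to jitter within a window (purely synthetic numbers, not tied to any real recording):

```python
import numpy as np

# A linear readout assigns a fixed weight to each time sample, so the same
# pattern shifted within the 10-sample window projects onto different features
t = np.arange(10)
signal = np.sin(2 * np.pi * t / 10)   # pattern at one latency
jittered = np.roll(signal, 3)         # same pattern, jittered by 3 samples

w = signal / np.dot(signal, signal)   # weights tuned to the aligned pattern
aligned_score = np.dot(w, signal)     # ~1.0: full response when aligned
jittered_score = np.dot(w, jittered)  # ~-0.31: much weaker when jittered
```

If the informative pattern occurs at a random latency within each 10 ms bin rather than at a fixed one, a linear classifier cannot learn a single weight vector that captures it at every offset.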

but maybe it's worth trying still…

if it crashes can you share a code snippet with random data to reproduce the problem?

Thanks a lot for the feedback (apologies for the late reply). I have tried to implement the described binning both ways and it worked the same as far as I can tell.

Regarding the comments you made about using a linear model: is this a general aspect to consider when binning the data as I suggested, or will implementing the binning in the two different ways have an influence? As discussed, the binning we are implementing basically makes it such that a classification feature consists of n time points in a given channel, as opposed to only a single time point. The two different dimension alternatives (40 trials * 20 channels * 10 samples * 100 bins vs. 40 trials * 200 (channels * samples) * 100 bins) should be equivalent with regard to the problem described, correct?