Analysis of EEG Data with Variable-Length Recordings

Dear MNE,

I am currently working on my graduation thesis, and this is my first attempt at EEG preprocessing and analysis. The dataset I have received consists of 3 sessions with 15 participants and 24 recordings per session, each recording with its corresponding label. The recordings involve participants watching videos, with each video corresponding to one of 4 emotions that I aim to recognize using an SVM classifier.

The problem I am encountering, and which I have been unable to solve for a long time, is that each of the 24 recordings has a different length. For example, one participant has 24 recordings of [62 channels] x [N samples] (e.g., 62x33601, 62x19001, 62x9601, etc.).

I am writing this post primarily to ask for suggestions on the correct approach for handling signals of different lengths, as well as for an evaluation of the processing pipeline described below, in case the issue occurs earlier.

  1. Reading the data from a MATLAB file and creating a Raw object for each recording:
info = mne.create_info(ch_names=chan_names, sfreq=1000, ch_types='eeg')
raw = mne.io.RawArray(data=data, info=info)
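(The loading itself is just scipy.io.loadmat; the file name and the key under which the EEG array is stored below are only placeholders, since they are dataset-specific.)
from scipy.io import loadmat

mat = loadmat('subject_01_session_1.mat')  # placeholder file name
data = mat['eeg_1']                        # placeholder key -> array of shape (62, N)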
  2. Downsampling the data from 1000 Hz to 200 Hz:
downsampled_raw = raw.copy().resample(200, npad="auto")
  3. Band-pass filtering (1-75 Hz):
bandpassed_raw = downsampled_raw.copy().filter(l_freq=1, h_freq=75)
  4. Creating epochs of a fixed length (4 seconds, following the dataset authors), since the dataset does not provide any information about events or triggers; each recording only has its emotion label.
epochs = mne.make_fixed_length_epochs(bandpassed_raw, duration=4, preload=True)
cap_montage(epochs)  # custom helper that sets the electrode montage
  5. Detecting EOG-related ICA components and removing them, plus additional epoch rejection with autoreject (called once per recording, as sketched after the function):
import autoreject
import mne
from mne.preprocessing import ICA

def ica_repair(epochs_data, raw_data):
    # fit ICA on the epoched data
    ica = ICA(method='picard', random_state=23, max_iter=10000, verbose=True)
    ica.fit(epochs_data)

    # find components correlated with ocular activity; there are no dedicated
    # EOG channels, so the frontal channels Fp1/Fp2 serve as proxies
    eog_epochs = mne.preprocessing.create_eog_epochs(raw_data, ch_name=["Fp1", "Fp2"])
    eog_indices, eog_scores = ica.find_bads_eog(eog_epochs, ch_name=["Fp1", "Fp2"])

    # remove the EOG components from the epochs
    ica.exclude = eog_indices
    ica.apply(epochs_data, exclude=ica.exclude)

    # estimate a global peak-to-peak rejection threshold and drop bad epochs
    reject_val = autoreject.get_rejection_threshold(epochs_data, cv=4, ch_types='eeg')
    reject = dict(eeg=reject_val['eeg'])
    epochs_data.drop_bad(reject=reject)

    print(epochs_data.drop_log_stats())

    del raw_data, ica
    return epochs_data
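(For context, I call this once per recording, roughly like this; epochs and bandpassed_raw are the objects from steps 3-4:)
epochs = ica_repair(epochs, bandpassed_raw)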
  6. Dividing each channel into 5 frequency bands and attempting to transform the signal and extract features.
    This is where the problem arises, because the recordings have varying numbers of epochs, for example 8x62x800 or 12x62x800.
    I am therefore uncertain about the subsequent transformation and feature-extraction steps. My intention is to apply a wavelet transform and then compute features such as differential entropy.
    But should this be done per epoch? If I extract features per epoch, I will have, for example, 8 arrays of 5 features for one recording and 12 arrays of 5 features for another, resulting in non-uniform inputs for the classifier.
    I also tried computing the TFR from the epochs and then a single feature value, so as to have one value for the whole signal, but the values are very similar to each other and the classifier does not learn.
    I do this in a loop for each recording separately, but the final table for one participant should contain the features for each recording, i.e., 24x310 (5 freq_bands * 62 channels). For all participants that gives 360x310 for one session. (A sketch of how I imagine collapsing the per-epoch features into such rows follows after the code below.)
import numpy as np
from mne.time_frequency import tfr_morlet
from scipy.stats import differential_entropy

def freq_bands(epochs, channel_list):
    iter_freqs = [
        ('Delta', 1, 4),
        ('Theta', 4, 8),
        ('Alpha', 8, 14),
        ('Beta', 14, 31),
        ('Gamma', 31, 50)
    ]
    diff_entrops = []

    for ch in channel_list:
        for band, fmin, fmax in iter_freqs:
            ep_copied = epochs.copy().filter(fmin, fmax, n_jobs=15,  # use more jobs to speed up
                       l_trans_bandwidth=1,  # make sure filter params are the same
                       h_trans_bandwidth=1)  # in each band and skip "auto" option

            ep_copied.pick_channels(ch_names=[ch], ordered=True)

            # Morlet wavelet TFR, kept per epoch (average=False)
            freqs = np.arange(fmin, fmax, 1)
            n_cycles = freqs / 2.0
            wave = tfr_morlet(inst=ep_copied, freqs=freqs, n_cycles=n_cycles, use_fft=True, return_itc=False, n_jobs=None, average=False)
            wave_df = wave.to_data_frame(index='freq')
            ch_wave_value = np.array(wave_df.loc[:, [ch]].T)

            # one differential-entropy value per (channel, band) pair; the
            # append has to happen inside the band loop, otherwise only the
            # last value survives
            diff_entrop = differential_entropy(values=ch_wave_value, axis=1)
            diff_entrops.append(diff_entrop)
    return diff_entrops
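To turn the variable number of epochs into one fixed-length row per recording, the only idea I have so far is to average the per-epoch feature values across epochs, so every recording becomes a single 310-dimensional vector regardless of whether it had 8 or 12 epochs. A minimal sketch of what I mean (per_epoch_features is a hypothetical array holding one 310-dimensional feature vector per surviving epoch):

import numpy as np

def recording_to_row(per_epoch_features):
    # per_epoch_features: shape (n_epochs, 310); n_epochs varies (8, 12, ...)
    return np.asarray(per_epoch_features).mean(axis=0)  # fixed shape (310,)

# stacking 24 such rows would give the 24x310 table for one participant:
# X_participant = np.vstack([recording_to_row(f) for f in all_recording_features])

Is this a reasonable way to get uniform inputs, or does averaging throw away too much information?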
  7. PCA for feature selection.

After running the pipeline, I am only achieving an accuracy of around 0.2, occasionally reaching 0.4. Am I making a mistake in computing the TFR, and is it not the same concept as the wavelet coefficients? Is it not possible to extract features from the TFR? Moreover, if I extract features per epoch, how should I address the varying number of feature rows per recording?

I would greatly appreciate any suggestions, related threads, or articles. I have done extensive research, but I acknowledge the possibility that I might have misunderstood something and made a mistake in my processing approach.

Kind regards.

  • MNE version: 1.4.2
  • operating system: Windows 10

Are you trying to learn on 14 participants and predict on the left-out one, i.e., between-subject classification?
Or are you doing it within-subject?

I would suggest you look at https://mne.tools/mne-features/ for the types of features you mentioned,
and in MNE at our example using CSP for motor imagery classification:

https://mne.tools/stable/auto_examples/decoding/decoding_csp_eeg.html#sphx-glr-auto-examples-decoding-decoding-csp-eeg-py
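With mne-features the band-power features come out with one row per epoch; just a sketch, assuming your epochs are already preprocessed (the band edges below are taken from your list, and pow_freq_bands is only one possible choice of feature):

import numpy as np
from mne_features.feature_extraction import extract_features

X = epochs.get_data()  # shape (n_epochs, n_channels, n_times)
feats = extract_features(X, sfreq=200.,
                         selected_funcs=['pow_freq_bands'],
                         funcs_params={'pow_freq_bands__freq_bands':
                                       np.array([1., 4., 8., 14., 31., 50.])})
# feats: shape (n_epochs, n_channels * n_bands)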

Alex

Good morning,
Exactly right: I intend to train on 14 participants and test on the remaining one, for each session separately.

I have already tested compute_spect_entropy from mne-features, but that did not yield good results either. However, is it possible to extract features from the TFR, or is this the wrong approach?
I think the CSP example is clear to me, because training is done directly on epochs and their labels. My issue is the different number of epochs per recording and how to classify them when I have one label per recording (covering several epochs).

Have you perhaps noticed a preprocessing error that might distort the signal and explain the low classification score? Or am I using the wrong approach for the varying number of epochs per recording? (To be concrete, the evaluation I have in mind is sketched below.)
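Assuming every recording is first reduced to one feature row, X, y, and groups below stand for the hypothetical 360x310 feature table, the 360 emotion labels, and the participant id of each row:

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# leave-one-participant-out: train on 14 participants, test on the 15th
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())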

Thank you for your reply and best regards!