Analysis of EEG Data with Variable-Length Recordings

Dear MNE,

I am currently working on my student graduation paper, and this is my first attempt at EEG preprocessing and analysis. The dataset I received consists of 3 sessions with 15 participants and 24 recordings per session, with 24 corresponding labels per session. The recordings involve participants watching videos, with each video corresponding to one of 4 emotions that I aim to recognize using an SVM classifier.

The problem I am encountering, and which I have had no idea how to solve for a long time, is that each of the 24 recordings has a different length. For example, for one participant, there are 24 recordings of [62 channels] x [N samples] (e.g., 62x33601, 62x19001, 62x9601, etc.).

I am writing this post primarily to ask for suggestions on the correct approach for handling signals of different lengths, as well as for an evaluation of the processing pipeline described below. Maybe the issue occurs earlier in the pipeline.

  1. Reading the data from a MATLAB-type file and creating a Raw object for each recording:
info = mne.create_info(ch_names=chan_names, sfreq=1000, ch_types='eeg')
raw = mne.io.RawArray(data=data, info=info)
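For completeness, the loading step looks roughly like this (the file and variable names are placeholders for my actual .mat structure):

from scipy.io import loadmat

mat = loadmat("participant_01_recording_01.mat")  # placeholder file name
data = mat["eeg_data"]  # expected shape: (62 channels, n_samples)
chan_names = [str(name) for name in mat["chan_names"].ravel()]  # 62 names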
  2. Downsampling the data from 1000 Hz to 200 Hz:
downsampled_raw = raw.copy().resample(200, npad="auto")
  3. Band-pass filtering:
bandpassed_raw = downsampled_raw.copy().filter(l_freq=1, h_freq=75)
  4. Creating epochs of a fixed length (4 seconds, following the authors of the dataset), since the dataset does not provide any information about events or triggers; each recording only has its emotion label.
epochs = mne.make_fixed_length_epochs(bandpassed_raw, duration=4, preload=True)
cap_montage(epochs)  # custom helper that applies the electrode montage
  5. Detecting EOG artifacts and excluding the corresponding ICA components, plus additional epoch rejection with autoreject:
import autoreject
import mne
from mne.preprocessing import ICA

def ica_repair(epochs_data, raw_data):
    # Fit ICA on the epoched data.
    ica = ICA(method='picard', random_state=23, max_iter=10000, verbose=True)
    ica.fit(epochs_data)

    # Find components that correlate with the frontal channels (EOG proxies).
    eog_epochs = mne.preprocessing.create_eog_epochs(raw_data, ch_name=["Fp1", "Fp2"])
    eog_indices, eog_scores = ica.find_bads_eog(eog_epochs, ch_name=["Fp1", "Fp2"])

    # Remove the EOG-related components from the epochs.
    ica.exclude = eog_indices
    ica.apply(epochs_data)

    # Estimate a global peak-to-peak rejection threshold and drop bad epochs.
    reject_val = autoreject.get_rejection_threshold(epochs_data, cv=4, ch_types='eeg')
    epochs_data.drop_bad(reject=dict(eeg=reject_val['eeg']))

    print(epochs_data.drop_log_stats())
    return epochs_data
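I then call this helper once per recording, e.g.:

clean_epochs = ica_repair(epochs, bandpassed_raw)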
  6. Dividing each channel into 5 frequency bands and attempting to transform the signal and extract features.
    This is where the problem arises, because the recordings have varying numbers of epochs, for example 8x62x800 or 12x62x800.
    Therefore, I am uncertain about the subsequent steps involving transformations or feature extraction. My intention is to apply a wavelet transform and then compute features such as differential entropy.
    But should this be done per epoch? If I extract features per epoch, I will have, for example, 8 arrays of 5 features for one recording and 12 arrays of 5 features for another, resulting in non-uniform inputs for the classifier.
    I tried to compute the TFR from the epochs and then a single feature value, so as to have one feature value for the whole signal, but the values are very similar to each other and the classifier does not learn.
    I do this in a loop for each recording separately, but the final table for one participant should contain the features for each recording, i.e. 24x310 (5 freq_bands * 62 channels). For all participants this gives 360x310 per session. A sketch of how I assemble this table follows the function below.
import numpy as np
from mne.time_frequency import tfr_morlet
from scipy.stats import differential_entropy

def freq_bands(epochs, channel_list):
    iter_freqs = [
        ('Delta', 1, 4),
        ('Theta', 4, 8),
        ('Alpha', 8, 14),
        ('Beta', 14, 31),
        ('Gamma', 31, 50)
    ]
    diff_entrops = []

    for ch in channel_list:
        ep = epochs.copy()

        for band, fmin, fmax in iter_freqs:
            # Band-pass to the current band; fixed transition bandwidths keep
            # the filter parameters identical across bands (no "auto" option).
            ep_copied = ep.copy().filter(fmin, fmax, n_jobs=15,
                                         l_trans_bandwidth=1,
                                         h_trans_bandwidth=1)
            ep_copied.pick_channels(ch_names=[ch], ordered=True)

            # Per-epoch Morlet wavelet TFR for this band and channel.
            freqs = np.arange(fmin, fmax, 1)
            n_cycles = freqs / 2.0
            wave = tfr_morlet(inst=ep_copied, freqs=freqs, n_cycles=n_cycles,
                              use_fft=True, return_itc=False, n_jobs=None,
                              average=False)
            wave_df = wave.to_data_frame(index='freq')
            ch_wave_value = np.array(wave_df.loc[:, [ch]].T)

            # Differential entropy of the wavelet power for this band/channel;
            # the append must sit inside the band loop, otherwise only the
            # last band (Gamma) would be kept.
            diff_entrop = differential_entropy(values=ch_wave_value, axis=1)
            diff_entrops.append(diff_entrop)

    return diff_entrops
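The outer loop that assembles the per-participant table looks roughly like this (the list `recordings` of 24 preprocessed Epochs objects and `channel_list` of 62 channel names are placeholders standing in for the steps above):

import numpy as np

feature_rows = []
for rec_epochs in recordings:  # 24 preprocessed recordings of one participant
    feats = freq_bands(rec_epochs, channel_list)  # 62 channels x 5 bands
    feature_rows.append(np.concatenate([np.atleast_1d(f) for f in feats]))

X = np.vstack(feature_rows)  # shape: (24 recordings, 310 features)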
  7. PCA for feature selection.

After running the program, I am only achieving an accuracy of around 0.2, occasionally reaching 0.4. Am I making a mistake in computing the TFR? Is the TFR not the same concept as the wavelet coefficients, and is it not possible to extract features from it? Moreover, if I were to extract features per epoch, how should I address the varying number of feature vectors per recording?

I would greatly appreciate any suggestions, related threads, or articles. I have done extensive research, but I acknowledge the possibility that I might have misunderstood something and made a mistake in my processing approach.

Kind regards.

  • MNE version: 1.4.2
  • operating system: Windows 10

Are you trying to learn on 14 participants and predict on the left-out one, i.e. between-subject classification?
Or are you doing it within subject?

I would suggest you look at https://mne.tools/mne-features/ for the types of features you mentioned,
and in MNE at our example using CSP for motor imagery classification:

https://mne.tools/stable/auto_examples/decoding/decoding_csp_eeg.html#sphx-glr-auto-examples-decoding-decoding-csp-eeg-py

Alex


Good morning,
Exactly right - I intend to train on 14 and test on 1 participant, for each session separately.
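In scikit-learn terms, the split I have in mind looks roughly like this (the data here is random placeholder data, just to show the shapes):

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 360 recordings x 310 features for one session,
# 4 emotion labels, and one subject id per recording (15 x 24).
rng = np.random.default_rng(0)
X = rng.standard_normal((360, 310))
y = rng.integers(0, 4, size=360)
subjects = np.repeat(np.arange(15), 24)

# Train on 14 subjects, test on the held-out one, for every subject in turn.
logo = LeaveOneGroupOut()
clf = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(clf, X, y, groups=subjects, cv=logo)
print(scores.mean())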

I’ve already tested compute_spect_entropy from mne-features, but that didn’t yield good results either. However, is it possible to extract features from the TFR, or is this the wrong approach?
I think the CSP example is understandable to me, because training is done directly on epochs and their labels. My issue is the different number of epochs per recording and how to classify them when I have one label per recording (spanning several epochs).

Have you perhaps noticed a preprocessing error that distorts the signal and could explain the low classification score? Or am I using the wrong approach for the varying number of epochs per recording?

Thank you for your reply and best regards!

Hello,

I am facing the same issue. I have taken EEG recordings of 40 users - pre and post - but they are of varying lengths. Do you think I should truncate some seconds to make them the same length? What did you do? Please can you help?

Hello!

I am happy to help; maybe my approach will be useful to you, because I solved my problem and achieved more than 90% accuracy. The approach seems to be correct, but results may depend on the dataset and the data.

As for the lengths of the recordings, I gave up completely on shortening or padding the recordings to an identical length. Truncation could have resulted in the loss of key data, such as the emotional response at a particular moment, perhaps at the end of a recording; since some of the recordings were very short, shortening the others would have discarded a huge amount of valuable information. Lengthening the shorter recordings, e.g. by averaging or zero-padding, could have artificially influenced the results, since the appended samples are not true recordings. These approaches can still be right choices: if the differences in length are small, truncation may be reasonable. I decided against it because of the significant differences in length in my dataset.

In my case, the solution turned out to be transforming the data out of the time domain into another domain, such as frequency. I recommend reading about the frequency domain, where the signal is analyzed as a spectrum, i.e. a representation of amplitude as a function of frequency. The analysis can also be based on a time-frequency approach.

So, in my study, I used the Fast Fourier Transform (FFT) to move to the frequency domain on 4 s epochs.

Having the epochs, I first separated the bands (delta, theta, alpha, beta, gamma) and calculated PSD values with Welch's method, which uses the Fourier transform to move into the frequency domain. Each PSD value described each frequency, at each epoch, for each channel.

At this point, it was no longer the lengths of the recordings that mattered, but the frequency bands: the point of view was changed from time to frequency, which is exactly what the transformation to the frequency domain provides.
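A minimal sketch of this idea in MNE-Python (assuming `epochs` are the 4 s fixed-length epochs described above, with the band edges from my earlier post):

import numpy as np

bands = [('Delta', 1, 4), ('Theta', 4, 8), ('Alpha', 8, 14),
         ('Beta', 14, 31), ('Gamma', 31, 50)]

band_features = []
for name, fmin, fmax in bands:
    # Welch PSD per epoch and channel, restricted to the current band.
    psd = epochs.compute_psd(method='welch', fmin=fmin, fmax=fmax)
    data = psd.get_data()  # (n_epochs, n_channels, n_freqs)
    # Averaging over epochs and frequencies gives one value per channel,
    # no matter how many epochs the recording produced.
    band_features.append(data.mean(axis=(0, 2)))

features = np.concatenate(band_features)  # (5 bands * 62 channels,)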

Here I described the change of approach for this case, although I also changed other elements from the base post. It is important to experiment with methods on your own dataset until you find the optimal one. I hope that I have helped.

Great answer, thank you for sharing your approach! However, does this not imply that you did shorten all recordings/conditions to the same length of 4 s (and if so, how did you choose which 4 s segment to use)? Or did you use the entire time span, but segmented it into (possibly overlapping) 4 s epochs, which you then averaged after the FFT?

Hello,
I chose this approach after reading a lot of research and getting help on the forums, and I hope it will not cause trouble for anyone.
You are right that the number of epochs still varies due to the different lengths of the recordings. I calculated PSD values for each epoch, channel and frequency band, and averaged them over epochs. The resulting values were then concatenated. After that, I additionally performed feature selection using PCA with a cumulative explained variance above 99% to obtain a subset of the most relevant features; the number of PCA components was passed as a parameter and used to create the final feature set. This gave every recording the same number of features, and that set was finally passed to the classifier. A sketch of the PCA step follows below.
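A minimal sketch of that selection step with scikit-learn (the feature matrix here is random placeholder data, just to show the shapes):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((360, 310))  # placeholder: recordings x features

# Keep as many components as needed to explain at least 99% of the variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))
print(X_reduced.shape, pca.n_components_)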
My research has already been completed and my work accepted, but thank you for your help; maybe someone else is encountering these problems and my example can help. Thank you.
