Is there a way to speed up loading EDF files?

Is there a way to speed up reading EDF files?

Our files are generally about 500MB, sleep studies sampled at 1000Hz over a period of about 8 hours.

Loading these files using mne.io.read_raw takes about 5 minutes, or more. I just tried one that took 11 minutes to open.

I see that there are ways to exclude various parts of the file, but I’m not sure if that would help. I only need the EEG data, and actually just a subset of that: 6 EEG channels, 2 EOG, and 1 EMG.

I see that there is pyEDFlib, but I really don’t want to reinvent the wheel.

We should be fine on hardware: i9-13900K, 128GB RAM, 3x Crucial P5 Plus 2TB in a ZFS stripe.

Thanks for any suggestions!

Hello,

Just to be sure: the files are not on a network drive, where you would be limited by the read speed of your network?

Mathieu

No, they are local, on the ZFS stripe.

You might want to try edfio.

Isn’t that the default in MNE these days?

No, just for exporting. We still use our own reader. I don’t know if edfio is faster than our readers, but maybe @JohnAtl can report back here?

It is stupid fast.
It reads a 489MB file in about 2 seconds.

The following takes a few seconds. It seems to be reading the whole file, as evidenced by the size reported.

❯ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from edfio import read_edf
>>> edf=read_edf("data/raw/00032235-110895_ID00049/00032235-110895[001].edf")
>>> from pympler import asizeof
>>> asizeof.asizeof(edf)
522265160

This takes about 5.5 minutes:

❯ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mne
>>> edf=mne.io.read_raw("data/raw/00032235-110895_ID00049/00032235-110895[001].edf",preload=True)
Extracting EDF parameters from /nvme/work/Neurogram/Sleep/data/raw/00032235-110895_ID00049/00032235-110895[001].edf...
EDF file detected
<stdin>:1: RuntimeWarning: Channel names are not unique, found duplicates for: {'Flow Patient', 'Snore'}. Applying running numbers for duplicates.
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 41067999  =      0.000 ... 41067.999 secs...

Here’s the file:

❯ ll data/raw/00032235-110895_ID00049/00032235-110895\[001\].edf 
-rw-r--r-- 1 john john 499M Dec 11 09:19 'data/raw/00032235-110895_ID00049/00032235-110895[001].edf'

I tried using timeit to be a little more precise, but it didn’t play well.
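For a single slow call like this, a plain wall-clock measurement with `time.perf_counter()` may be easier to work with than `timeit`. A minimal sketch; `timed` is a hypothetical helper, not part of MNE or edfio, and the path in the usage comment is a placeholder:

```python
import time


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage (assumes edfio is installed; the path is a placeholder):
# from edfio import read_edf
# edf, seconds = timed(read_edf, "recording.edf")
# print(f"read_edf took {seconds:.2f} s")
```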

Thanks, this is great to hear! We should probably think about including edfio as an additional backend for our reader then…

Although I’m a bit surprised that our reader is so slow. I regularly load 300MB+ BDF files and never noticed any slowdown (a couple of seconds at most).

I just tried reading a BDF file (406MB) with mne.io.read_raw() (preload=True of course), and it took 969 ms ± 9.61 ms. So I think it must be something on your end. To make sure it is not due to some property of your file, could you share one of your datasets?

I also tested with this 740MB EDF file, loading takes around 5.2s with mne.io.read_raw() and 230ms with edfio.read_edf(). So although edfio is indeed much faster (about 23x for this file), the MNE reader is far from unusable.

Note that mne.io.read_raw_edf() spends most of that time resampling signals to a common sampling frequency, which does not happen in edfio.read_edf(). For a similarly sized file with uniform sampling frequencies, the speedup goes down to ~2.5x.
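To get a feel for why this matters, here is a rough estimate of how much data ends up on the common grid. It is a sketch under stated assumptions: the preloaded array is taken to be 64-bit floats at the highest sampling frequency among the channels read, the duration is the roughly 41068 s shown in the MNE log earlier in this thread, and the helper names are made up for illustration.

```python
def samples(duration_s, sfreq_hz):
    """Number of samples one channel holds at a given sampling frequency."""
    return int(duration_s * sfreq_hz)


def preloaded_bytes(n_channels, duration_s, max_sfreq_hz, bytes_per_sample=8):
    """Size in bytes of an (n_channels, n_times) float64 array at the common rate."""
    return n_channels * samples(duration_s, max_sfreq_hz) * bytes_per_sample


duration_s = 41_068  # recording length from the log in this thread, rounded

native = samples(duration_s, 1)     # a 1 Hz channel as stored in the file
common = samples(duration_s, 1000)  # the same channel on a 1 kHz grid
print(f"1 Hz channel: {native:,} samples stored, {common:,} after upsampling")

per_channel_mb = preloaded_bytes(1, duration_s, 1000) / 1e6
print(f"memory per channel at 1 kHz: {per_channel_mb:.1f} MB")
```

Every low-rate channel is stretched to the same length as the fastest one, so a single 1 kHz signal in the file sets the cost for all the others.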

Sounds like what’s happening here.
I seriously doubt their EEG is sampled at 1kHz, but for the other physiological signals, that’s likely.
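One way to confirm this without loading any data is to read the per-signal sample counts straight from the EDF header. The sketch below uses only the standard library and follows the header layout in the EDF specification (256 fixed bytes, then the per-signal fields). `edf_sampling_frequencies` is a hypothetical helper, it assumes a non-zero data-record duration, and the path in the usage comment is a placeholder.

```python
def edf_sampling_frequencies(path):
    """Return [(label, sampling_frequency_hz), ...] read from an EDF header."""
    with open(path, "rb") as f:
        fixed = f.read(256)  # fixed-size part of the header
        record_duration = float(fixed[244:252].decode("ascii"))  # seconds per data record
        n_signals = int(fixed[252:256].decode("ascii"))

        # Per-signal fields are stored field by field: all labels first, and so on.
        labels = [
            f.read(16).decode("ascii", "replace").strip() for _ in range(n_signals)
        ]
        # Skip transducer (80), physical dimension (8), physical min/max (8 + 8),
        # digital min/max (8 + 8), and prefiltering (80) for every signal.
        f.seek(n_signals * (80 + 8 + 8 + 8 + 8 + 8 + 80), 1)
        samples_per_record = [
            int(f.read(8).decode("ascii")) for _ in range(n_signals)
        ]

    # Sampling frequency = samples per data record / record duration.
    return [
        (label, n / record_duration)
        for label, n in zip(labels, samples_per_record)
    ]


# Usage (path is a placeholder):
# for label, sfreq in edf_sampling_frequencies("recording.edf"):
#     print(f"{label:20s} {sfreq:g} Hz")
```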

As a workaround, maybe I could use edfio to read the file, downsample everything to 100Hz, save to a temp file, then open with mne.

I dropped the channels/signals with extreme differences in sampling frequency (leaving 100Hz and 200Hz), and the edf file loads in about 8 seconds.

Going forward, we will drop the channels we don’t need. If we need, say, the 1kHz ECG channel, we can save it to a separate edf file, read it, and use as-is or downsample as needed.
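Dropping can probably also be done at read time, without writing a new file: `mne.io.read_raw_edf()` takes `include` and `exclude` arguments, and the MNE docs note that excluding channels can avoid unnecessary resampling when sampling rates differ. A minimal sketch, not yet timed on our files; the channel names are placeholders that must match the labels in the EDF header, and `load_subset` is a hypothetical helper:

```python
# Placeholder labels: replace with the names stored in your EDF header.
WANTED_CHANNELS = [
    "F3", "F4", "C3", "C4", "O1", "O2",  # 6 EEG (hypothetical names)
    "EOG L", "EOG R",                    # 2 EOG (hypothetical names)
    "EMG Chin",                          # 1 EMG (hypothetical name)
]


def load_subset(path, channels=WANTED_CHANNELS):
    """Read only the listed channels from an EDF file into memory."""
    import mne  # imported here so the helper can be defined before MNE is loaded

    # `include` restricts reading to these channels; `exclude` must stay empty.
    return mne.io.read_raw_edf(path, include=list(channels), preload=True)


# Usage (path is a placeholder):
# raw = load_subset("recording.edf")
# print(raw.info["sfreq"], raw.ch_names)
```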

Thanks for the help!

from edfio import read_edf

input_file = "data/01402879-111133_ID00097/01402879-111133[001].edf"
output_file = "data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf"
edf = read_edf(input_file)
for signal in edf.signals:
    print(f"{signal.label:20s}: {signal.sampling_frequency} ", end=" ")
    # Drop everything that is not sampled at 100 Hz or 200 Hz.
    if (
        signal.label in ("Snore", "ECG IIHF")  # i.e. 500Hz and 1000Hz
        or signal.sampling_frequency in (1.0, 10.0)
    ):
        try:
            edf.drop_signals(signal.label)
        except ValueError:
            # Duplicate label: an earlier drop already removed every signal with it.
            print("duplicate ", end=" ")
        print("dropped")
    else:
        print("")
# Write the remaining signals (100 Hz and 200 Hz only) to a new file.
edf.write(output_file)
>>> timeit.timeit('edf=mne.io.read_raw("data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf",preload=True)',setup="import mne",number=1)
Extracting EDF parameters from /tank/tmp/Sleep/data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 7404999  =      0.000 ... 37024.995 secs...
8.192410837858915
>>> timeit.timeit('edf=mne.io.read_raw("data/01402879-111133_ID00097/01402879-111133[001].edf",preload=True)',setup="import mne",number=1)
Extracting EDF parameters from /tank/tmp/Sleep/data/01402879-111133_ID00097/01402879-111133[001].edf...
EDF file detected
<timeit-src>:6: RuntimeWarning: Channel names are not unique, found duplicates for: {'Flow Patient', 'Snore'}. Applying running numbers for duplicates.
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 37024999  =      0.000 ... 37024.999 secs...
419.15444047003984