Is there a way to speed up loading EDF files?

JohnAtl · December 14, 2023, 4:48pm

Is there a way to speed up reading EDF files?

Our files are generally about 500MB, sleep studies sampled at 1000Hz over a period of about 8 hours.

Loading these files using mne.io.read_raw takes about 5 minutes, or more. I just tried one that took 11 minutes to open.

I see that there are ways to exclude various parts of the file, but I’m not sure if that would help. I just need the EEG data, and actually only a subset of it: 6 EEG channels, 2 EOG, and 1 EMG.

I see that there is a PyEDFLib, but I really don’t want to reinvent the wheel.

Hardware should not be the bottleneck: i9-13900K, 128GB RAM, 3x Crucial P5 Plus 2TB in a ZFS stripe.

Thanks for any suggestions!

mscheltienne · December 14, 2023, 4:50pm

Hello,

Just to be sure: the files are not on a network drive, where you would be limited by the read speed of your network?

Mathieu

JohnAtl · December 14, 2023, 4:59pm

No, they are local, on the ZFS stripe.

cbrnr · December 14, 2023, 8:04pm

You might want to try edfio.

richard · December 15, 2023, 6:31am

Isn’t that the default in MNE these days?

cbrnr · December 15, 2023, 6:49am

No, just for exporting. We still use our own reader. I don’t know if edfio is faster than our readers, but maybe @JohnAtl can report back here?

JohnAtl · December 15, 2023, 9:02pm

It is stupid fast.
It reads a 489MB file in about 2 seconds.

The following takes a few seconds. It seems to be reading the whole file, as evidenced by the size reported.

❯ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from edfio import read_edf
>>> edf=read_edf("data/raw/00032235-110895_ID00049/00032235-110895[001].edf")
>>> from pympler import asizeof
>>> asizeof.asizeof(edf)
522265160

This takes about 5.5 minutes:

❯ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mne
>>> edf=mne.io.read_raw("data/raw/00032235-110895_ID00049/00032235-110895[001].edf",preload=True)
Extracting EDF parameters from /nvme/work/Neurogram/Sleep/data/raw/00032235-110895_ID00049/00032235-110895[001].edf...
EDF file detected
<stdin>:1: RuntimeWarning: Channel names are not unique, found duplicates for: {'Flow Patient', 'Snore'}. Applying running numbers for duplicates.
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 41067999  =      0.000 ... 41067.999 secs...

Here’s the file:

❯ ll data/raw/00032235-110895_ID00049/00032235-110895\[001\].edf 
-rw-r--r-- 1 john john 499M Dec 11 09:19 'data/raw/00032235-110895_ID00049/00032235-110895[001].edf'

I tried using timeit to be a little more precise, but it didn’t play well.
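For a single wall-clock measurement like the ones above, a small wrapper around `time.perf_counter` may be easier to handle than `timeit`. This is only a sketch: `time_call` is a hypothetical helper name, and the workload below is a stand-in so the snippet runs without any EDF file.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs anywhere. With the real readers the
# calls would look like this (the path is a placeholder):
#   raw, seconds = time_call(mne.io.read_raw, "path/to/file.edf", preload=True)
#   edf, seconds = time_call(read_edf, "path/to/file.edf")
total, seconds = time_call(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

A single run includes disk caching effects, so the first read of a file can be slower than later ones.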

cbrnr · December 16, 2023, 10:30am

Thanks, this is great to hear! We should probably think about including edfio as an additional backend for our reader then…

Although I’m a bit surprised that our reader is so slow. I regularly load 300MB+ BDF files and never noticed any slowdown (a couple of seconds at most).

cbrnr · December 16, 2023, 10:38am

I just tried reading a BDF file (406MB) with mne.io.read_raw() (preload=True of course), and it took 969 ms ± 9.61 ms. So I think it must be something on your end. To make sure it is not due to some property of your file, could you share one of your datasets?

I also tested with this 740MB EDF file, loading takes around 5.2s with mne.io.read_raw() and 230ms with edfio.read_edf(). So although edfio is indeed much faster (about 25x for this file), the MNE reader is far from unusable.

hofaflo · December 16, 2023, 4:04pm

Note that mne.io.read_raw_edf() spends most of that time resampling signals to a common sampling frequency, which does not happen in edfio.read_edf(). For a similarly sized file with uniform sampling frequencies, the speedup goes down to ~2.5x.
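To check whether a file mixes sampling frequencies, one option is to group the signal labels by rate. The sketch below assumes objects with `label` and `sampling_frequency` attributes, as on the signals edfio returns; `group_by_sampling_frequency` is a hypothetical helper, and the list is made-up stand-in data (only the "Snore" and "ECG IIHF" rates come from this thread).

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_sampling_frequency(signals):
    """Map each sampling frequency (Hz) to the labels recorded at that rate."""
    groups = defaultdict(list)
    for signal in signals:
        groups[float(signal.sampling_frequency)].append(signal.label)
    # Sort by frequency so the extremes are easy to spot.
    return dict(sorted(groups.items()))


# Stand-in signals; with a real file this would be read_edf(path).signals.
signals = [
    SimpleNamespace(label="EEG 1", sampling_frequency=200.0),  # hypothetical
    SimpleNamespace(label="EOG 1", sampling_frequency=200.0),  # hypothetical
    SimpleNamespace(label="Snore", sampling_frequency=500.0),
    SimpleNamespace(label="ECG IIHF", sampling_frequency=1000.0),
    SimpleNamespace(label="Position", sampling_frequency=1.0),  # hypothetical
]
groups = group_by_sampling_frequency(signals)
for fs, labels in groups.items():
    print(f"{fs:8.1f} Hz: {', '.join(labels)}")
```

More than one key in the result means the MNE reader has to resample some channels to a common rate.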

JohnAtl · December 18, 2023, 3:19pm

Sounds like what’s happening here.
I seriously doubt their EEG is actually sampled at 1kHz, but for the other physiological signals that’s likely.

As a workaround, maybe I could use edfio to read the file, downsample everything to 100Hz, save to a temp file, then open with mne.

JohnAtl · December 18, 2023, 9:36pm

I dropped the channels/signals with extreme differences in sampling frequency (leaving 100Hz and 200Hz), and the edf file loads in about 8 seconds.

Going forward, we will drop the channels we don’t need. If we need, say, the 1kHz ECG channel, we can save it to a separate edf file, read it, and use as-is or downsample as needed.

Thanks for the help!

from edfio import read_edf

input_file = "data/01402879-111133_ID00097/01402879-111133[001].edf"
output_file = "data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf"
edf = read_edf(input_file)
# Drop the signals whose sampling frequency differs most from the rest,
# leaving only the 100 Hz and 200 Hz signals.
for signal in edf.signals:
    print(f"{signal.label:20s}: {signal.sampling_frequency} ", end=" ")
    # if signal.sampling_frequency != 200:
    if (
        signal.label == "Snore"  # i.e. 500 Hz
        or signal.label == "ECG IIHF"  # i.e. 1000 Hz
        or signal.sampling_frequency == 10.0
        or signal.sampling_frequency == 1.0
    ):
        try:
            edf.drop_signals(signal.label)
        except Exception:
            # Presumably a duplicate label that an earlier call already removed.
            print("duplicate ", end=" ")
        print("dropped")
    else:
        print("")
edf.write(output_file)

>>> timeit.timeit('edf=mne.io.read_raw("data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf",preload=True)',setup="import mne",number=1)
Extracting EDF parameters from /tank/tmp/Sleep/data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 7404999  =      0.000 ... 37024.995 secs...
8.192410837858915
>>> timeit.timeit('edf=mne.io.read_raw("data/01402879-111133_ID00097/01402879-111133[001].edf",preload=True)',setup="import mne",number=1)
Extracting EDF parameters from /tank/tmp/Sleep/data/01402879-111133_ID00097/01402879-111133[001].edf...
EDF file detected
<timeit-src>:6: RuntimeWarning: Channel names are not unique, found duplicates for: {'Flow Patient', 'Snore'}. Applying running numbers for duplicates.
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 37024999  =      0.000 ... 37024.999 secs...
419.15444047003984
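Dropping unneeded channels can also be expressed at read time through the `exclude` parameter of `mne.io.read_raw_edf`. Whether that avoids the resampling cost was not measured in this thread, so treat the following as a sketch only. `labels_to_exclude` is a hypothetical helper, and the EEG/EOG/EMG labels are made up for illustration.

```python
def labels_to_exclude(all_labels, keep):
    """Return every label not in keep, preserving file order without repeats."""
    keep = set(keep)
    missing = keep - set(all_labels)
    if missing:
        raise ValueError(f"Requested channels not in file: {sorted(missing)}")
    # dict.fromkeys removes duplicate labels while keeping their first position.
    return [label for label in dict.fromkeys(all_labels) if label not in keep]


# Hypothetical label list; the duplicated names mirror the warning in the logs.
all_labels = ["C3-M2", "E1-M2", "Chin", "Snore", "Snore", "Flow Patient",
              "Flow Patient", "ECG IIHF"]
exclude = labels_to_exclude(all_labels, keep=["C3-M2", "E1-M2", "Chin"])
print(exclude)

# With MNE installed, the list could then be passed to the reader
# (the path is a placeholder):
#   raw = mne.io.read_raw_edf("path/to/file.edf", exclude=exclude, preload=True)
```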
