Our files are generally about 500MB, sleep studies sampled at 1000Hz over a period of about 8 hours.
Loading these files using mne.io.read_raw takes about 5 minutes, or more. I just tried one that took 11 minutes to open.
I see that there are ways to exclude various parts of the file, but I’m not sure whether that would help. I only need a subset of the channels: 6 EEG, 2 EOG, and 1 EMG.
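For reference, a minimal sketch of reading only the needed channels. It assumes a recent MNE release where `mne.io.read_raw_edf()` accepts the `include` and `exclude` parameters, and the channel labels below are hypothetical placeholders, not the labels in these files. Whether this avoids the slow path depends on the channels that remain after selection.

```python
# Hypothetical labels for 6 EEG, 2 EOG and 1 EMG channel.
# Replace them with the labels found in your own EDF header.
WANTED_CHANNELS = [
    "F3", "F4", "C3", "C4", "O1", "O2",  # EEG
    "EOG L", "EOG R",                     # EOG
    "EMG Chin",                           # EMG
]


def load_subset(path, wanted=WANTED_CHANNELS):
    """Read only the listed channels from an EDF file into memory."""
    import mne  # imported here so the sketch can be defined without MNE installed

    # `include` restricts reading to the named channels;
    # `exclude` is the inverse option and takes names to skip.
    return mne.io.read_raw_edf(path, include=list(wanted), preload=True)
```

Usage would look like `raw = load_subset("path/to/recording.edf")`, after which `raw.ch_names` should list only the requested channels.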
I see that there is pyEDFlib, but I really don’t want to reinvent the wheel.
Hardware should not be the bottleneck: i9-13900K, 128GB RAM, 3x Crucial P5 Plus 2TB in a ZFS stripe.
edfio is stupid fast.
It reads a 489MB file in about 2 seconds.
The following takes a few seconds. It seems to be reading the whole file, as evidenced by the size reported.
❯ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from edfio import read_edf
>>> edf=read_edf("data/raw/00032235-110895_ID00049/00032235-110895[001].edf")
>>> from pympler import asizeof
>>> asizeof.asizeof(edf)
522265160
This takes about 5.5 minutes:
❯ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mne
>>> edf=mne.io.read_raw("data/raw/00032235-110895_ID00049/00032235-110895[001].edf",preload=True)
Extracting EDF parameters from /nvme/work/Neurogram/Sleep/data/raw/00032235-110895_ID00049/00032235-110895[001].edf...
EDF file detected
<stdin>:1: RuntimeWarning: Channel names are not unique, found duplicates for: {'Flow Patient', 'Snore'}. Applying running numbers for duplicates.
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 41067999 = 0.000 ... 41067.999 secs...
Here’s the file:
❯ ll data/raw/00032235-110895_ID00049/00032235-110895\[001\].edf
-rw-r--r-- 1 john john 499M Dec 11 09:19 'data/raw/00032235-110895_ID00049/00032235-110895[001].edf'
I tried using timeit to be a little more precise, but it didn’t play well.
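For a single long-running call, a small wall-clock helper built on `time.perf_counter()` may be easier to handle than `timeit`. This is only a sketch; the helper name `timed` is made up for illustration and the workload is a stand-in for the file reader.

```python
import time


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; swap in the reader call you want to measure.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.3f} s")
```

With MNE this would be called as `raw, seconds = timed(mne.io.read_raw, path, preload=True)`, which also keeps the loaded object around for further use.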
Thanks, this is great to hear! We should probably think about including edfio as an additional backend for our reader then…
Although I’m a bit surprised that our reader is so slow. I regularly load 300MB+ BDF files and never noticed any slowdown (a couple of seconds at most).
I just tried reading a BDF file (406MB) with mne.io.read_raw() (preload=True of course), and it took 969 ms ± 9.61 ms. So I think it must be something on your end. To make sure it is not due to some property of your file, could you share one of your datasets?
I also tested with this 740MB EDF file, loading takes around 5.2s with mne.io.read_raw() and 230ms with edfio.read_edf(). So although edfio is indeed much faster (about 25x for this file), the MNE reader is far from unusable.
Note that mne.io.read_raw_edf() spends most of that time resampling signals to a common sampling frequency, which does not happen in edfio.read_edf(). For a similarly sized file with uniform sampling frequencies, the speedup goes down to ~2.5x.
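A rough back-of-the-envelope sketch of why one high-rate channel matters. It assumes every channel ends up at the highest sampling frequency in the file and is held as float64 (8 bytes per sample). The channel mix is hypothetical, and the sketch only counts samples; it does not model the actual cost of resampling.

```python
def preload_sizes(sfreqs_hz, duration_s, bytes_per_sample=8):
    """Return (native_bytes, upsampled_bytes) for the given channel rates.

    native_bytes: each channel kept at its own sampling frequency.
    upsampled_bytes: every channel brought up to the highest frequency.
    """
    native = sum(sfreqs_hz) * duration_s * bytes_per_sample
    upsampled = len(sfreqs_hz) * max(sfreqs_hz) * duration_s * bytes_per_sample
    return native, upsampled


# Hypothetical mix: nine 200 Hz channels plus a single 1000 Hz channel.
sfreqs = [200] * 9 + [1000]
duration_s = 41068  # recording length reported in the log above

native, upsampled = preload_sizes(sfreqs, duration_s)
print(f"native: {native / 1e9:.2f} GB, upsampled: {upsampled / 1e9:.2f} GB")
```

In this made-up mix the single 1000 Hz channel forces the other nine channels to carry five times as many samples, which is why dropping or separating the high-rate channels can pay off.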
I dropped the channels/signals with extreme differences in sampling frequency (leaving only 100Hz and 200Hz), and the EDF file now loads in about 8 seconds.
Going forward, we will drop the channels we don’t need. If we need, say, the 1kHz ECG channel, we can save it to a separate EDF file, read it, and use it as-is or downsample as needed.
Thanks for the help!
from edfio import read_edf

input_file = "data/01402879-111133_ID00097/01402879-111133[001].edf"
output_file = "data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf"

edf = read_edf(input_file)

# Print every signal with its sampling frequency and drop the outliers.
for signal in edf.signals:
    print(f"{signal.label:20s}: {signal.sampling_frequency} ", end=" ")
    # if signal.sampling_frequency != 200:
    if (
        signal.label == "Snore"  # i.e. 500Hz
        or signal.label == "ECG IIHF"  # i.e. 1000Hz
        or signal.sampling_frequency == 10.0
        or signal.sampling_frequency == 1.0
    ):
        try:
            edf.drop_signals(signal.label)
        except Exception:
            # Presumably a duplicate label that an earlier call already removed.
            print("duplicate ", end=" ")
        print("dropped")
    else:
        print("")

# Write the remaining signals to a new EDF file.
edf.write(output_file)
>>> timeit.timeit('edf=mne.io.read_raw("data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf",preload=True)',setup="import mne",number=1)
Extracting EDF parameters from /tank/tmp/Sleep/data/01402879-111133_ID00097/01402879-111133[001]-resampled.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 7404999 = 0.000 ... 37024.995 secs...
8.192410837858915
>>> timeit.timeit('edf=mne.io.read_raw("data/01402879-111133_ID00097/01402879-111133[001].edf",preload=True)',setup="import mne",number=1)
Extracting EDF parameters from /tank/tmp/Sleep/data/01402879-111133_ID00097/01402879-111133[001].edf...
EDF file detected
<timeit-src>:6: RuntimeWarning: Channel names are not unique, found duplicates for: {'Flow Patient', 'Snore'}. Applying running numbers for duplicates.
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 37024999 = 0.000 ... 37024.999 secs...
419.15444047003984