Dear all, I have two raw data objects that I concatenate and then save. Later on, I load this data again, filter it, set a new reference, apply an ICA solution, interpolate bad channels, and save it once more.
After the second save, the file is twice as big, although technically not much has changed, has it? (It's still the same number of channels and time points in the data.)
Luckily, I could reproduce this issue with the MNE sample data: concat_raw.fif.gz is around 425 MB, but concat_filt_raw.fif.gz is around 930 MB.
Why is the second file twice as big?
EDIT: When I change from compressed saving (.fif.gz) to standard saving (.fif), both files are 504.8 MB, so in one of the cases the compression actually makes things worse?
MWE:
import os.path as op
import mne
from pathlib import Path
# example data
sample_dir = Path(mne.datasets.sample.data_path())
sample_fname = sample_dir / 'MEG' / 'sample' / 'sample_audvis_raw.fif'
raw = mne.io.read_raw_fif(sample_fname, preload=True)
# save as concatenated
raw = mne.concatenate_raws([raw, raw])
fname = Path(op.expanduser("~")) / "Desktop" / "concat_raw.fif.gz"
raw.save(fname, overwrite=True)
# filter, then save again
n_jobs = 4
raw = raw.filter(l_freq=0.1, h_freq=None, n_jobs=n_jobs)
raw = raw.filter(l_freq=None, h_freq=40, n_jobs=n_jobs)
raw = raw.interpolate_bads()
raw = raw.set_eeg_reference(
    ref_channels="average", projection=False, ch_type="eeg"
)
fname = Path(op.expanduser("~")) / "Desktop" / "concat_filt_raw.fif.gz"
raw.save(fname, overwrite=True)
info on my “real” data:
- each file is about 250 MB in short (16-bit) format
- I concatenate two files → 500 MB
- I save in single (32-bit) format → that should be 1000 MB, but I use compression (.fif.gz) → I get 500 MB (see the quick size arithmetic sketched below)
- I later load that FIF data again, process it, and save it again in the same way → I get 1000 MB instead of the expected 500 MB
- in my own pipeline I read short data and save it → it becomes single … then I read it again (double while in memory), filter/process it … and save it again as single. Yet the initially saved single data and the newly saved single data differ in size by a factor of 2
- in the example above I load the data once and save it twice: (a) before filtering etc., (b) after filtering etc. → version (b) is twice as large!
- my example from above furthermore only "works" (i.e., does unexpected things) when I use gzip compression (.fif.gz) … with plain .fif, both files (a) and (b) are about the same size, as expected
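To make the expectation explicit, here is the size arithmetic I have in mind as a rough sketch (the numbers are the approximate ones from my real data, and the 2x gzip factor is just what I observe before filtering, not a general rule):
short_mb_per_file = 250                       # one raw file on disk, 16-bit short format
concat_short_mb = 2 * short_mb_per_file       # 500 MB after concatenating two files
concat_single_mb = 2 * concat_short_mb        # 1000 MB when written as 32-bit single
expected_gz_mb = concat_single_mb / 2         # ~500 MB if gzip roughly halves it
print(concat_short_mb, concat_single_mb, expected_gz_mb)  # 500 1000 500.0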
EDIT: I guess that, in the case of my real data, it could just be that the compression is very effective before filtering and the other processing steps (reducing the file size by 2x), and that after these steps, when saving again, it hardly compresses at all. Could that be a possibility? However, that still would not explain how, in the MWE, compression increases the file size.
Can you try to manually zip (and/or gzip) both concat_raw.fif and concat_filt_raw.fif (which are about the same size)? When compressed, does the second file get much bigger? If not, maybe it's something we do when creating/compressing the .fif.gz file?
If our compression is extremely inefficient, this could be a quick fix, provided the library we're using supports other algorithms.
The unfiltered file is still less than half the size of the filtered data. I wonder if it is because the original file is stored at much lower precision (half or single), so when storing it as e.g. double there are far fewer significant digits than the format can hold, and it can therefore be compressed a lot. On the other hand, when you filter that data, you get values that fill all digits of a double-precision number, so compression will not work as well. Does this make sense?
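To put this hypothesis to a quick test, here is a minimal sketch that is independent of MNE (the quantization range and the Butterworth filter are arbitrary choices for illustration): it gzip-compresses a quantized array and a low-pass-filtered copy of it and prints the compression ratios.
import gzip
import numpy as np
from scipy.signal import butter, sosfilt

rng = np.random.default_rng(0)
# simulate data that originally only took 16-bit-like integer values
quantized = rng.integers(-2**12, 2**12, size=1_000_000).astype(np.float32)
# filtering produces values whose low mantissa bits look essentially random
sos = butter(4, 0.1, output="sos")
filtered = sosfilt(sos, quantized).astype(np.float32)
for name, arr in [("quantized", quantized), ("filtered", filtered)]:
    data = arr.tobytes()
    print(f"{name}: gzip ratio {len(gzip.compress(data)) / len(data):.2f}")
If this explanation is right, the quantized array should compress to a fraction of its size while the filtered copy should barely shrink, which would match the file sizes reported above.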
I don’t know how I arrived at that file size … I cannot reproduce it anymore.
I have improved the MWE, and I’d be grateful if someone else could run it to verify my results below. You only need a recent MNE installation, and gzip available as a command line tool.
# %%
import os.path as op
import subprocess
from pathlib import Path
import mne
# example data, concatenate 2 raws
sample_dir = Path(mne.datasets.sample.data_path())
sample_fname = sample_dir / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_fname, preload=True)
raw = mne.concatenate_raws([raw, raw])
# save without any more processing
basepath = Path(op.expanduser("~")) / "Desktop"
for ext in [".fif.gz", ".fif"]:
fname = basepath / f"concat_raw{ext}"
raw.save(fname, overwrite=True)
# use OS installed gzip to zip fif -> fif.gz (use _gz in name to distinguish)
fname_gz = basepath / "concat_raw_os-gz.fif"
raw.save(fname_gz, overwrite=True)
cmd = ["gzip", f"{fname_gz}"]
subprocess.run(cmd)
# process the data by a bit, then save again
# file size SHOULD be same as non-processed file above
raw = raw.filter(l_freq=0.1, h_freq=None, n_jobs=4)
raw = raw.filter(l_freq=None, h_freq=40, n_jobs=4)
raw = raw.interpolate_bads()
raw = raw.set_eeg_reference(ref_channels="average", projection=False, ch_type="eeg")
for ext in [".fif.gz", ".fif"]:
fname = basepath / f"concat_filt_raw{ext}"
raw.save(fname, overwrite=True)
# use OS installed gzip to zip fif -> fif.gz (use _gz in name to distinguish)
fname_gz = basepath / "concat_filt_raw_os-gz.fif"
raw.save(fname_gz, overwrite=True)
cmd = ["gzip", f"{fname_gz}"]
subprocess.run(cmd)
After running this, I get the following file sizes (via ls -l *.fif* --block-size=M):
482M Dec 14 09:48 concat_filt_raw.fif
444M Dec 14 09:48 concat_filt_raw.fif.gz
443M Dec 14 09:48 concat_filt_raw_os-gz.fif.gz
482M Dec 14 09:47 concat_raw.fif
204M Dec 14 09:47 concat_raw.fif.gz
181M Dec 14 09:47 concat_raw_os-gz.fif.gz
I think my explanation in my previous comment still applies – did you miss it?
I thought that when MNE reads any data, the data are in double format in memory anyway, regardless of whether or not I filter or otherwise process it. When printing raw._data.dtype, it says float64 directly after loading.
So intuitively I think that you have a point and for some reason the processed data cannot be compressed as well as the unprocessed data … but I don’t get the reason yet.
The reason could be that if you store single-precision data as double-precision data, you do not add any precision: half of the bits remain unused, which is why such data can be compressed a lot. If you apply a filter, the full number of bits (double precision) is now used, because the data is processed in double precision. Hence it cannot be compressed as much anymore.
Ah, now I see what you mean, but I don't think it applies here, because we are using raw.save with the default parameters in our example → and that means fmt='single'. So even if, after filtering, the "information" in the data is now of double type, it will still be saved as single.
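For reference, the on-disk format can also be set explicitly via the fmt argument of raw.save; a minimal sketch continuing from the MWE above (the file names are placeholders I made up):
# default is fmt="single" (32-bit float on disk); "short" (16-bit) and
# "double" (64-bit) are also available
raw.save(basepath / "concat_filt_single_raw.fif", fmt="single", overwrite=True)
# writing the filtered data back as 16-bit shorts would likely shrink the
# (compressed) file again, at the cost of rounding away the extra precision
# that filtering introduced
raw.save(basepath / "concat_filt_short_raw.fif", fmt="short", overwrite=True)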
In my initial (now crossed-out) reply above, I somehow assumed that when converting single data to double, the "added zeros after the decimal point" would automatically count as added precision - but I guess you are right that these are easily compressed.
So I think what you say makes sense - thanks a lot. That sufficiently resolves the puzzle for me.