Dear all, I have two raw data objects that I concatenate and then save. Later on, I load this data again, filter it, set a new reference, apply an ICA solution, interpolate bad channels, and save it once more.
After the second save, the file is twice as big, although technically not much has changed, has it? (It's still the same number of channels and time points in the data.)
Luckily, I could reproduce this issue with the MNE sample data: concat_raw.fif.gz is around 425 MB, but concat_filt_raw.fif.gz is around 930 MB.
Why is the second file twice as big?
EDIT: When I change from compressed saving (.fif.gz) to standard saving (.fif), both files are 504.8 MB, so in one of the cases the compression actually makes things worse?
MWE:
import os.path as op
import mne
from pathlib import Path
# example data
sample_dir = Path(mne.datasets.sample.data_path())
sample_fname = sample_dir / 'MEG' / 'sample' / 'sample_audvis_raw.fif'
raw = mne.io.read_raw_fif(sample_fname, preload=True)
# save as concatenated
raw = mne.concatenate_raws([raw, raw])
fname = Path(op.expanduser("~")) / "Desktop" / "concat_raw.fif.gz"
raw.save(fname, overwrite=True)
# filter, then save again
n_jobs = 4
raw = raw.filter(l_freq=0.1, h_freq=None, n_jobs=n_jobs)
raw = raw.filter(l_freq=None, h_freq=40, n_jobs=n_jobs)
raw = raw.interpolate_bads()
raw = raw.set_eeg_reference(
    ref_channels="average", projection=False, ch_type="eeg"
)
fname = Path(op.expanduser("~")) / "Desktop" / "concat_filt_raw.fif.gz"
raw.save(fname, overwrite=True)
info on my “real” data:
- each file is about 250 MB in short (16-bit) format
- I concatenate two files → 500 MB
- I save in single (32-bit) format → that should be 1000 MB, but I use compression (.fif.gz) → I get 500 MB (see the quick size arithmetic sketched below)
- I later load that FIF data again, process it, and save it again in the same way → I get 1000 MB instead of the expected 500 MB
- in my own pipeline I read short data and save it → it becomes single … then I read it again (double while in memory), filter/process it … and save it again as single. Yet the initially saved single data and the newly saved single data differ in size by a factor of 2
- in the example above I load the data once and save it twice: (a) before filtering etc., (b) after filtering etc. → version (b) is twice as large!
- my example from above furthermore only "works" (i.e., does unexpected things) when I use gzip compression (.fif.gz) … with plain .fif, both files (a) and (b) are about the same size, as expected
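To make the expectation explicit, here is the size arithmetic I have in mind as a rough sketch (the numbers are the approximate ones from my real data, and the 2x gzip factor is just what I observe before filtering, not a general rule):
short_mb_per_file = 250                       # one raw file on disk, 16-bit short format
concat_short_mb = 2 * short_mb_per_file       # 500 MB after concatenating two files
concat_single_mb = 2 * concat_short_mb        # 1000 MB when written as 32-bit single
expected_gz_mb = concat_single_mb / 2         # ~500 MB if gzip roughly halves it
print(concat_short_mb, concat_single_mb, expected_gz_mb)  # 500 1000 500.0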
EDIT: I guess that, in the case of my real data, it could just be that the compression is very effective before filtering and the other processing steps (reducing the file size by 2x), and that after these steps, when saving again, it hardly compresses at all. Could that be a possibility? However, that still would not explain how, in the MWE, compression increases the file size.
Can you try to manually zip (and/or gzip) both concat_raw.fif and concat_filt_raw.fif (which are about the same size)? When compressed, does the second file get much bigger? If not, maybe it's something we do when creating/compressing the .fif.gz file?
If our compression is extremely inefficient, this could be a quick fix, provided the library we're using supports other algorithms.
The unfiltered file is still less than half the size of the filtered data. I wonder if it is because the original file is stored at much lower precision (half or single), so when storing it as e.g. double there are far fewer significant digits than the format can hold, and it can therefore be compressed a lot. On the other hand, when you filter that data, you get values that fill all digits of a double-precision number, so compression will not work as well. Does this make sense?
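To put this hypothesis to a quick test, here is a minimal sketch that is independent of MNE (the quantization range and the Butterworth filter are arbitrary choices for illustration): it gzip-compresses a quantized array and a low-pass-filtered copy of it and prints the compression ratios.
import gzip
import numpy as np
from scipy.signal import butter, sosfilt

rng = np.random.default_rng(0)
# simulate data that originally only took 16-bit-like integer values
quantized = rng.integers(-2**12, 2**12, size=1_000_000).astype(np.float32)
# filtering produces values whose low mantissa bits look essentially random
sos = butter(4, 0.1, output="sos")
filtered = sosfilt(sos, quantized).astype(np.float32)
for name, arr in [("quantized", quantized), ("filtered", filtered)]:
    data = arr.tobytes()
    print(f"{name}: gzip ratio {len(gzip.compress(data)) / len(data):.2f}")
If this explanation is right, the quantized array should compress to a fraction of its size while the filtered copy should barely shrink, which would match the file sizes reported above.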
I don’t know how I arrived at that file size … I cannot reproduce it anymore.
I have improved the MWE, and I’d be grateful if someone else could run it to verify my results below. You only need a recent MNE installation, and gzip available as a command line tool.
# %%
import os.path as op
import subprocess
from pathlib import Path
import mne
# example data, concatenate 2 raws
sample_dir = Path(mne.datasets.sample.data_path())
sample_fname = sample_dir / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_fname, preload=True)
raw = mne.concatenate_raws([raw, raw])
# save without any more processing
basepath = Path(op.expanduser("~")) / "Desktop"
for ext in [".fif.gz", ".fif"]:
fname = basepath / f"concat_raw{ext}"
raw.save(fname, overwrite=True)
# use OS installed gzip to zip fif -> fif.gz (use _gz in name to distinguish)
fname_gz = basepath / "concat_raw_os-gz.fif"
raw.save(fname_gz, overwrite=True)
cmd = ["gzip", f"{fname_gz}"]
subprocess.run(cmd)
# process the data by a bit, then save again
# file size SHOULD be same as non-processed file above
raw = raw.filter(l_freq=0.1, h_freq=None, n_jobs=4)
raw = raw.filter(l_freq=None, h_freq=40, n_jobs=4)
raw = raw.interpolate_bads()
raw = raw.set_eeg_reference(ref_channels="average", projection=False, ch_type="eeg")
for ext in [".fif.gz", ".fif"]:
fname = basepath / f"concat_filt_raw{ext}"
raw.save(fname, overwrite=True)
# use OS installed gzip to zip fif -> fif.gz (use _gz in name to distinguish)
fname_gz = basepath / "concat_filt_raw_os-gz.fif"
raw.save(fname_gz, overwrite=True)
cmd = ["gzip", f"{fname_gz}"]
subprocess.run(cmd)
After running this, I get the following file sizes (via ls -l *.fif* --block-size=M):
482M Dec 14 09:48 concat_filt_raw.fif
444M Dec 14 09:48 concat_filt_raw.fif.gz
443M Dec 14 09:48 concat_filt_raw_os-gz.fif.gz
482M Dec 14 09:47 concat_raw.fif
204M Dec 14 09:47 concat_raw.fif.gz
181M Dec 14 09:47 concat_raw_os-gz.fif.gz
I think my explanation in my previous comment still applies – did you miss it?
I thought that when MNE reads any data, the data are in double format in memory anyway, regardless of whether or not I filter or otherwise process it. When printing raw._data.dtype, it says float64 directly after loading.
So intuitively I think that you have a point and for some reason the processed data cannot be compressed as well as the unprocessed data … but I don’t get the reason yet.
The reason could be that if you store single-precision data as double-precision data, you do not add any precision: half of the bits remain unused, which is why such data can be compressed a lot. If you apply a filter, the full number of bits (double precision) is now used, because the data is processed in double precision. Hence it cannot be compressed as much anymore.
Ah, now I see what you mean, but I don't think it applies here, because we are using raw.save with the default parameters in our example → and that means fmt='single'. So even if, after filtering, the "information" in the data is now of double type, it will still be saved as single.
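For reference, the on-disk format can also be set explicitly via the fmt argument of raw.save; a minimal sketch continuing from the MWE above (the file names are placeholders I made up):
# default is fmt="single" (32-bit float on disk); "short" (16-bit) and
# "double" (64-bit) are also available
raw.save(basepath / "concat_filt_single_raw.fif", fmt="single", overwrite=True)
# writing the filtered data back as 16-bit shorts would likely shrink the
# (compressed) file again, at the cost of rounding away the extra precision
# that filtering introduced
raw.save(basepath / "concat_filt_short_raw.fif", fmt="short", overwrite=True)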
In my initial (now crossed-out) reply above, I somehow assumed that when converting single data to double, the "added zeros after the decimal point" would automatically count as added precision - but I guess you are right that these are easily compressed.
So I think what you say makes sense - thanks a lot. That sufficiently resolves the puzzle for me.