Indexing a matrix of XXX elements would incur an in integer overflow in LAPACK

Hi!

I have a pre-processing pipeline file I use for MEG. I recently created a new virtual environment to work on new data:

  • Python 3.12.4
  • MNE 1.8.0
  • Scipy 1.13.1
  • Numpy 1.26.4

While trying to fit an ICA to my data, I got an error when executing the last line of the following:

ica = mne.preprocessing.ICA(
    n_components=20,
    random_state=97,
    method="picard",
    fit_params=dict(ortho=True, extended=True),
)
ica.fit(raw_for_ICA, picks="meg")

The error is:

File <decorator-gen-470>:10, in fit(self, inst, picks, start, stop, decim, reject, flat, tstep, reject_by_annotation, verbose)
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/mne/preprocessing/ica.py:736, in ICA.fit(self, inst, picks, start, stop, decim, reject, flat, tstep, reject_by_annotation, verbose)
    733 self.ch_names = self.info["ch_names"]
    735 if isinstance(inst, BaseRaw):
--> 736     self._fit_raw(
    737         inst,
    738         picks,
    739         start,
    740         stop,
    741         decim,
    742         reject,
    743         flat,
    744         tstep,
    745         reject_by_annotation,
    746         verbose,
    747     )
    748 else:
    749     assert isinstance(inst, BaseEpochs)
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/mne/preprocessing/ica.py:813, in ICA._fit_raw(self, raw, picks, start, stop, decim, reject, flat, tstep, reject_by_annotation, verbose)
    810     self.reject_ = None
    812 self.n_samples_ = data.shape[1]
--> 813 self._fit(data, "raw")
    815 return self
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/mne/preprocessing/ica.py:898, in ICA._fit(self, data, fit_type)
    895 data = self._pre_whiten(data)
    897 pca = _PCA(n_components=self._max_pca_components, whiten=True)
--> 898 data = pca.fit_transform(data.T)
    899 use_ev = pca.explained_variance_ratio_
    900 n_pca = self.n_pca_components
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/mne/utils/numerics.py:849, in _PCA.fit_transform(self, X, y)
    847 def fit_transform(self, X, y=None):
    848     X = X.copy()
--> 849     U, S, _ = self._fit(X)
    850     U = U[:, : self.n_components_]
    852     if self.whiten:
    853         # X_new = X * V / S * sqrt(n_samples) = U * sqrt(n_samples)
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/mne/utils/numerics.py:890, in _PCA._fit(self, X)
    887 self.mean_ = np.mean(X, axis=0)
    888 X -= self.mean_
--> 890 U, S, V = _safe_svd(X, full_matrices=False)
    891 # flip eigenvectors' sign to enforce deterministic output
    892 U, V = svd_flip(U, V)
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/mne/fixes.py:94, in _safe_svd(A, **kwargs)
     92     raise ValueError("Cannot set overwrite_a=True with this function")
     93 try:
---> 94     return linalg.svd(A, **kwargs)
     95 except np.linalg.LinAlgError as exp:
     96     from .utils import warn
 
File ~/.conda/envs/bodylingual/lib/python3.12/site-packages/scipy/linalg/_decomp_svd.py:130, in svd(a, full_matrices, compute_uv, overwrite_a, check_finite, lapack_driver)
    128         sz = max(m * min_mn, n * min_mn)
    129         if max(m * min_mn, n * min_mn) > numpy.iinfo(numpy.int32).max:
--> 130             raise ValueError(f"Indexing a matrix of {sz} elements would "
    131                               "incur an in integer overflow in LAPACK.")
    133 funcs = (lapack_driver, lapack_driver + '_lwork')
    134 gesXd, gesXd_lwork = get_lapack_funcs(funcs, (a1,), ilp64='preferred')
 
ValueError: Indexing a matrix of 2212380000 elements would incur an in integer overflow in LAPACK.

From my understanding, I get a matrix with 2,212,380,000 elements, which exceeds the limit of a 32-bit integer index. Is there any way to use 64-bit integers to prevent this from happening?

Best,

That’s really interesting, I’ve never encountered this issue. What are the dimensions of your data?

Aside from that, yes, you can use ILP64 to get 64-bit integers for indexing. There are several ways to go about this, most of which involve compiling the library yourself. However, since you are using Anaconda, you could try switching to MKL (which I think only works on Intel CPUs). MKL seems to support ILP64, though I'm not sure how to enable it for SciPy's SVD.

PS: Maybe this thread is related and/or contains useful information for your issue? Pycharm scipy SVD returning error code without message · Issue #14001 · scipy/scipy · GitHub

Hi again!

My raw_for_ICA data consists of 7,230,000 time points for 306 MEG channels (hence 2,212,380,000 elements). The data is sampled at 1000 Hz.
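For illustration, the element count can be checked against the 32-bit limit directly (the dimensions below are the ones from this thread):

```python
import numpy as np

n_channels, n_times = 306, 7_230_000  # dimensions from this thread
n_elements = n_channels * n_times

print(n_elements)                           # 2212380000
print(n_elements > np.iinfo(np.int32).max)  # True: exceeds 2147483647
```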

I haven’t tested it personally, but a colleague who had a very similar error ran his code again in another environment with MNE 1.5.1, SciPy 1.11.3, and NumPy 1.26.4, and it worked.

Digging a bit, I noticed that the specific ValueError message I get didn’t exist before SciPy v1.13 (source).

I will probably just downgrade to get this working, though I am afraid that might just provoke a silent issue that would end up corrupting the data later on.

Two hours of high-density MEG is a lot of data. Do you really need the sampling frequency of 1000 Hz? If not, I recommend downsampling to e.g. 256 Hz (or possibly even 128 Hz), since this significantly reduces the number of data points. Depending on the purpose of the ICA, you might also want to remove segments of data that do not contain information of interest (e.g. breaks, inter-trial periods, etc.).

That’s a good point, we didn’t really think of that in our pipeline (we wanted to keep the data as untouched as possible) and were looking for ways to make it faster.

I get why you are suggesting 256 Hz, as it is a power of two - that being said, it might be even faster to go for 250 Hz (as we can keep 1 out of 4 time points). Is there any advantage to going for 256 rather than 250 in that case?

Coming back to the problem itself, do you think it might be worth investigating? We have used the same pipeline for previous pre-processing runs, and I am wondering whether a silent integer overflow could have translated into erroneous data (typically, positive indices wrapping around to negative values, which in the case of matrices could overwrite some of the trailing data).

Thank you !

No, 256 Hz was just an arbitrary (but very common) frequency that I use quite often, mainly because my amplifier records at 512 Hz (or 1024 Hz). If you already have data at 1000 Hz, it makes more sense to downsample to 250 Hz or 125 Hz, although I’d say it doesn’t really matter at the end of the day. However, do not just pick every fourth (or eighth) sample; you absolutely need to properly downsample your data, which involves applying an anti-aliasing filter before picking samples.
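As a minimal sketch of the difference, using scipy.signal.decimate as a stand-in (MNE's raw.resample handles the filtering for you):

```python
import numpy as np
from scipy.signal import decimate

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # stand-in for one channel sampled at 1000 Hz

x_naive = x[::4]         # every 4th sample: content above 125 Hz aliases
x_down = decimate(x, 4)  # anti-aliasing low-pass filter, then subsample

print(x_naive.shape, x_down.shape)  # both (2500,)
```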

Yes, if you have used this pipeline previously and didn’t get any errors, I would double-check that there weren’t any silent overflows. However, I don’t really know how to best debug this, but it might be worth asking the SciPy devs. @larsoner, do you have any ideas? It might be interesting to see how EEGLAB folks have implemented their ICA and if it can handle such a large data set (I’m thinking maybe you can do it block-wise or something).


After running it, I can confirm that downgrading SciPy to v1.12 makes the error disappear and the code executes without any problem.

Because I now have doubts about the integrity of the pre-processed data, I will, as you suggest, open an issue on the SciPy GitHub, and in parallel I may try to rerun the pipeline with downsampled data (250 Hz would be enough). It makes me wonder if we missed something in previous experiments, though.

Thanks a lot for your help on this.


No problem. It would be great if you shared the link to the discussion so we can follow along.

Of course! Here it is.

In many preprocessing steps, the aliasing that occurs without proper low-pass filtering is a concern, but in the case of ICA, IIRC the folks who implemented fit(..., decim=...) intentionally chose not to low-pass filter before picking every Nth sample, because of how the temporal information is (not) used. At least that’s my recollection…

But in any case yes you should also be able to safely raw.resample assuming you’re careful about artifacts potentially spreading from the resampling process itself. But if you’re trying to remove those anyway you maybe don’t care if they spread. This does interact with the idea of removing segments you’re not interested in, though, as well as [WIP] ENH: resampling with annotations by jasmainak · Pull Request #11408 · mne-tools/mne-python · GitHub.

@cbrnr brought up the idea on SciPy about potentially using the SVD of the data covariance rather than the SVD of the entire data array… could work IIRC assuming we remove the (channel-wise) mean of the data before computing the SVD, which we might? Could be worth investigating because although there is some precision loss it would probably be much faster.

If you can find it, I would be very interested, because although ICA has no notion of temporal information (all samples are just bunched together), I don’t think this means you can just pick every Nth sample and that’s it.

Should we open an issue to collect ideas? Now that I think of it, maybe I did run into this issue but didn’t notice, because there was no error message. Zero-meaning channels should be no big deal, since that’s a basic property of EEG (and MEG) signals anyway. Furthermore, ICA needs a high-pass filter to work, so channels should already be zero mean (but it would be good to check). Not only would computing eigenvalues on the covariance matrix be much faster, but it would also solve the memory issue.

I’m not sure about the precision loss you’re mentioning; it seems like SVD could be numerically more stable in some situations, but otherwise these algorithms should return identical results.
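A quick numerical check of that equivalence (a toy sketch, not MNE's implementation): on mean-centered data, the squared singular values divided by n - 1 match the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))  # samples x channels (toy data)
X -= X.mean(axis=0)                 # channel-wise mean removal

# Route 1: SVD of the full data matrix
_, S, _ = np.linalg.svd(X, full_matrices=False)
var_svd = S**2 / (X.shape[0] - 1)

# Route 2: eigendecomposition of the (channels x channels) covariance
C = X.T @ X / (X.shape[0] - 1)
evals = np.linalg.eigvalsh(C)[::-1]  # eigvalsh returns ascending order

print(np.allclose(var_svd, evals))  # True (up to floating-point precision)
```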

Another thought: we could (and probably should) just outsource PCA to a package like scikit-learn. For example, sklearn.decomposition.PCA offers a number of solvers, among them svd_solver="covariance_eigh", which is exactly what we’ve been discussing here.

In any case, if that’s something worth looking into, I suggest discussing in a separate issue (I really think this would be much better for typical EEG/MEG signals).

Sure, we could discuss this on the issue tracker. I have no reference offhand for the every-Nth-sample thing for ICA; someone would have to dig through the code and probably ask around based on blame from a long time ago.