ICA slow on Apple Silicon ARM64 Mac?

I recently got a new machine, a MacBook Pro with a 12 core M2 Pro CPU (Apple Silicon, ARM64 architecture) and 32GB RAM. This should be a pretty fast processor according to almost any benchmark I’ve seen, yet the time it takes to fit ICA is surprisingly (frustratingly) long.

I’ve tested this by running the ICA comparison example, which reports the runtime of the various ICA algorithms (in the figure titles):

from time import time
import mne
from mne.preprocessing import ICA
from mne.datasets import sample

data_path = sample.data_path()
meg_path = data_path / 'MEG' / 'sample'
raw_fname = meg_path / 'sample_audvis_filt-0-40_raw.fif'

# load 60 s of the sample MEG data to keep the benchmark short
raw = mne.io.read_raw_fif(raw_fname).crop(0, 60).pick('meg').load_data()

# peak-to-peak rejection thresholds applied during ICA fitting
reject = dict(mag=5e-12, grad=4000e-13)
# band-pass filter (1-30 Hz) before fitting ICA
raw.filter(1, 30)


def run_ica(method, fit_params=None):
    ica = ICA(n_components=20, method=method, fit_params=fit_params,
              max_iter='auto', random_state=0)
    t0 = time()
    ica.fit(raw, reject=reject)
    fit_time = time() - t0
    title = f'{method} (took {fit_time:.1f}s)'
    ica.plot_components(title=title)


run_ica('fastica')
run_ica('picard')
run_ica('infomax')
run_ica('infomax', fit_params=dict(extended=True))
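
If you only care about the timings and not the component plots, a small variation (my own tweak, not part of the example) prints the fit time instead of putting it in a figure title:

def run_ica_timed(method, fit_params=None):
    # identical ICA setup to run_ica() above, but print the runtime and skip plotting
    ica = ICA(n_components=20, method=method, fit_params=fit_params,
              max_iter='auto', random_state=0)
    t0 = time()
    ica.fit(raw, reject=reject)
    fit_time = time() - t0
    print(f'{method}: {fit_time:.1f}s')
    return fit_time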

The machine that generates the docs reports the following runtimes:

  • FastICA: 1.1s
  • Picard: 3.4s
  • Infomax: 2.6s
  • Extended Infomax: 4.4s

I’m assuming this is some cloud instance with limited resources, so these times should not be hard to beat. However, here are the times I get on my Mac (yes, these are all native ARM64 binaries):

  • FastICA: 6.3s
  • Picard: 26.3s
  • Infomax: 0.8s
  • Extended Infomax: 1.4s

The Infomax results are OK-ish (about 3x faster than the docs machine), but FastICA (6x slower) and especially Picard (8x slower) are not what I expected. On my previous Intel-based Mac, Picard was always faster than Infomax, which is why I really liked that algorithm. Here are the numbers from my old mid-2014 MacBook Pro (using OpenBLAS, but MKL times are very similar):

  • FastICA: 6.3s
  • Picard: 1.2s
  • Infomax: 2.2s
  • Extended Infomax: 3.4s

Does anyone have similar experiences? Or am I doing something wrong?

Platform             macOS-13.2.1-arm64-arm-64bit
Python               3.10.10 (v3.10.10:aad5f6a891, Feb  7 2023, 08:47:40) [Clang 13.0.0 (clang-1300.0.29.30)]
Executable           /Users/clemens/Projects/mne-python/.direnv/python-3.10.10/bin/python3
CPU                  arm (12 cores)
Memory               32.0 GB

Core
├☑ mne               1.4.0.dev74+g6384a8901.d20230317
├☑ numpy             1.24.2 (OpenBLAS 0.3.21 with 12 threads)
├☑ scipy             1.10.1
├☑ matplotlib        3.7.1 (backend=MacOSX)
├☑ pooch             1.7.0
└☑ jinja2            3.1.2

Numerical (optional)
├☑ sklearn           1.2.2
├☑ nibabel           5.0.1
└☐ unavailable       numba, nilearn, dipy, openmeeg, cupy, pandas

Visualization (optional)
├☑ qtpy              2.3.0 (PySide6=6.4.3)
├☑ pyqtgraph         0.13.2
├☑ mne-qt-browser    0.4.0
└☐ unavailable       pyvista, pyvistaqt, ipyvtklink, vtk, ipympl

Ecosystem (optional)
└☐ unavailable       mne-bids, mne-nirs, mne-features, mne-connectivity, mne-icalabel
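
(For reference, the report above is just the output of mne.sys_info(); the numpy line is the relevant one here, because it shows which BLAS the installation is linked against and how many threads it uses.)

import mne

# prints the same system report as above; check the "numpy" line for the BLAS backend
mne.sys_info()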

I should have searched a bit more thoroughly first. It seems like the default BLAS backend (OpenBLAS) is very, very slow on ARM64 Macs (see "Why is numpy native on M1 Max greatly slower than on old Intel i5?" on Stack Overflow). I guess I’ll try to switch to a different BLAS library; I just need to find a way to do that without conda.

It looks like it is absolutely necessary to use the Apple-provided Accelerate framework for decent NumPy performance. Unfortunately, the wheel installed by pip install numpy uses the slow OpenBLAS backend, but the following commands build and install NumPy from source with the Accelerate backend enabled:

pip install cython pybind11
pip install --no-binary :all: --no-use-pep517 numpy
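
To verify which BLAS the rebuilt NumPy actually links against, a quick check is to print the build configuration (the exact section names in the output depend on the NumPy version):

import numpy as np

np.show_config()  # should mention Accelerate/vecLib rather than OpenBLAS after the rebuild

# optional cross-check if you have threadpoolctl installed:
# from threadpoolctl import threadpool_info
# print(threadpool_info())  # no OpenBLAS entry should be listed anymore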

The rebuilt NumPy results in much faster runtimes:

  • FastICA: 0.4s
  • Picard: 0.6s
  • Infomax: 1.0s
  • Extended Infomax: 1.3s

I ran the same test on my 2023 14" M2 MBP with 10 cores (6 performance and 4 efficiency).

❯ mamba create -n mne-ica-test mne python=3.10
❯ mamba run -n mne-ica-test python ica-test.py
  • FastICA: 1.5s
  • Picard: 4.2s
  • Infomax: 1.5s
  • Extended Infomax: 5.8s

Then I switched to Accelerate:

❯ mamba install -n mne-ica-test "libblas=*=*accelerate"
❯ mamba run -n mne-ica-test python ica-test.py
  • FastICA: 0.3s
  • Picard: 0.5s
  • Infomax: 0.9s
  • Extended Infomax: 1.2s

Switching the BLAS implementation on conda-forge is described in the conda-forge documentation.

I do remember, though, that when we first tried to get MNE-Python to run on Apple Silicon, Accelerate yielded incorrect results for some operations (i.e., linear algebra tests failed). I don’t remember which tests we ran, but they were part of the MNE-Python test suite. @larsoner could you provide any insights into what we need to check to be sure that we’re not getting incorrect results with Accelerate?

Best,
Richard

cc @agramfort

One good option would be to change CirrusCI to use Accelerate. I don’t think we want to add another run, because Cirrus does not run jobs in parallel for us, and I think we can trust NumPy and SciPy to test their own OpenBLAS-based computations.

But if you want to test locally, I would just run pytest -m "not ultraslowtest" mne/ and 1) check that it passes with OpenBLAS (it should), then 2) check whether it also passes with Accelerate.
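
If you want an even quicker smoke test before running the full suite, something like the following (just a sketch, not a substitute for the test suite) exercises a few BLAS/LAPACK routines and checks the results numerically:

import numpy as np
from numpy.testing import assert_allclose

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 100))

# SVD should reconstruct the input matrix
u, s, vt = np.linalg.svd(a, full_matrices=False)
assert_allclose(u @ np.diag(s) @ vt, a, atol=1e-10)

# symmetric eigendecomposition should reconstruct a symmetric matrix
c = a.T @ a
w, v = np.linalg.eigh(c)
assert_allclose(v @ np.diag(w) @ v.T, c, atol=1e-8)

# solving a well-conditioned linear system should satisfy the original equations
m = c + 100 * np.eye(100)
b = rng.standard_normal(100)
x = np.linalg.solve(m, b)
assert_allclose(m @ x, b, atol=1e-8)

print('quick BLAS/LAPACK sanity check passed')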

So the strategy would be:

  1. Use Accelerate on Cirrus
  2. Switch Apple Silicon installers to use Accelerate

I’m not sure about the “default” conda-forge package. We may want to simply amend our install instructions to describe how to manually switch to Accelerate. – Installing MNE is becoming more and more complex again :frowning:

It’s also worth noting that your timings with the default OpenBLAS are not that terrible. Maybe it’s because conda-forge links against something more optimized, or maybe it’s my setup (although that is as “standard” as it can get).

I’ve asked about this on the NumPy mailing list. I’ll post any answers here.


NumPy and SciPy will distribute wheels that link to Accelerate by default once macOS 14 is out (sometime this fall/winter). Until then, pip users will have to build from source (with the commands I mentioned above). People using conda have always been able to dynamically switch BLAS backends, so they should also do this switch manually. Note that this manual switching (for both pip and conda) requires macOS 13.3 for SciPy, which is not yet released.

https://mail.python.org/archives/list/numpy-discussion@python.org/message/W74HI226VJSOU7B6ZFKHLMVUFD4E7ITD/
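
(If you want to check whether a given machine already meets that requirement, a minimal sketch:)

import platform

# macOS version string, e.g. '13.2.1'; needs to be >= 13.3 before switching SciPy to Accelerate
print(platform.mac_ver()[0])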
