How to make good use of CUDA

Hello,

Our lab got new toys: 3 nice servers, each running 4 NVIDIA A30 GPUs. So, for the first time, and since the dataset for my ongoing study is growing larger and larger, I am trying to set up CUDA and use these new resources as efficiently as possible.

I enabled CUDA with mne.utils.set_config('MNE_USE_CUDA', 'true'); it created the JSON configuration file, and the test pytest mne/tests/test_filter.py -k cuda passes. But the test does not seem to be much faster than the CPU-based computation.
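In other words, something like this (the get_config call is just an extra check to verify the setting was stored):

import mne

# Store the CUDA flag in MNE's JSON config file (by default ~/.mne/mne-python.json)
mne.utils.set_config('MNE_USE_CUDA', 'true')
print(mne.get_config('MNE_USE_CUDA'))  # should print 'true'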


What is the best way to use the available resources?

I have a dataset with 1k short raw recordings (4 minutes each), sampled at 512 Hz or 1 kHz, to which I apply the following steps (a rough sketch follows the list):

  • Resampling to 512 Hz if sampled at 1 kHz
  • Bandpass filters
  • Re-referencing
  • ICA decomposition
  • Bad channel interpolation
  • PSD with Welch's method
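For reference, the pipeline looks roughly like this (a sketch, not my exact code; the PSD call in particular depends on the MNE version):

import mne
from mne.preprocessing import ICA

fname = '...'  # placeholder: path to one recording
raw = mne.io.read_raw_fif(fname, preload=True)
if raw.info['sfreq'] == 1000:
    raw.resample(512, n_jobs='cuda')  # resampling accepts n_jobs='cuda'
raw.filter(l_freq=1., h_freq=40., n_jobs='cuda')  # FIR filtering does too
raw.set_eeg_reference('average')
ica = ICA(n_components=15)  # arbitrary number of components for this sketch
ica.fit(raw)  # runs on CPU only
raw.interpolate_bads()
psd = raw.compute_psd(method='welch')  # recent MNE; older versions use mne.time_frequency.psd_welch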

For now, my approach is to spawn e.g. 40 worker processes and give them the files to process one by one; that way I at least process 40 files at a time.
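Something like this, with process_file being a hypothetical wrapper around the steps above:

from concurrent.futures import ProcessPoolExecutor

def process_file(fname):
    """Hypothetical wrapper: run the full pipeline on one file."""
    ...

fnames = []  # list of the ~1000 file paths

with ProcessPoolExecutor(max_workers=40) as pool:
    list(pool.map(process_file, fnames))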

Now, with CUDA, if I use it for the operations above that support it (I guess only resampling and bandpass filtering; is there a way to make any of the others use it, especially the ICA decomposition?), does it really make a difference, considering:

  • the low sampling rate and the short duration?
  • the number of files to resample is very small, so the only step that could really benefit from CUDA is the bandpass filtering.

Moreover, I guess the CUDA session must be initialized for every new process (possibly for every job?), and it seems like this operation takes a significant amount of time.

And a final point: I guess that for each new process spawned, I should also assign it a different GPU to work on, via mne-python/cuda.py at 091da8f01aeeecd7d583ba596cf5a85cd649f192 · mne-tools/mne-python · GitHub.
There is no shortcut to distribute the load between different CUDA compatible GPUs, right?
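Something like this is what I had in mind (untested sketch; the round-robin assignment by PID is just the simplest scheme I could think of):

import os
from concurrent.futures import ProcessPoolExecutor

N_GPUS = 4  # A30s per server

def init_worker():
    # Must run before the first CUDA call in the process.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(os.getpid() % N_GPUS)
    import mne
    mne.cuda.init_cuda()  # pay the slow initialization once per worker

with ProcessPoolExecutor(max_workers=40, initializer=init_worker) as pool:
    ...  # pool.map(process_file, fnames) as before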


I'm very new to CUDA, any tips on how to properly benefit from it would be appreciated 🙂


Whether or not CUDA speeds things up might be very system-dependent. I would first try the simplest use case (e.g., a single worker, comparing n_jobs=1 to n_jobs='cuda') to see if it helps in the first place. It used to make a big difference a couple of years ago when NumPy and SciPy used fftpack as their FFT backend, but now that they use pocketfft under the hood, there is probably much less benefit.
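For example, with a quick check along these lines (untested sketch; fname is a placeholder for one of your files):

import time
import mne

fname = '...'  # placeholder: path to one of your recordings
raw = mne.io.read_raw_fif(fname, preload=True)
for n_jobs in (1, 'cuda'):
    tic = time.perf_counter()
    raw.copy().filter(l_freq=1., h_freq=40., n_jobs=n_jobs)
    print(n_jobs, f'{time.perf_counter() - tic:.3f} s')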

For what it's worth, locally at some point I didn't observe much if any benefit anymore, so I stopped bothering to use PyCUDA.


Am I correct that only the filtering and resampling steps could benefit from CUDA among the steps I listed?
Now my problem is that I am not familiar with CUDA, and I don't know what to expect, or what is 'normal'.

I ran this small function, with fname set to one of my EEG files (67 channels, 4 minutes @ 512 Hz), 100 times with different n_jobs:

import mne

def f(n_jobs):
    """Filter one raw file, to compare CPU and CUDA timings."""
    fname = r''  # path to the EEG file (redacted)
    raw = mne.io.read_raw_fif(fname, preload=True)
    raw.filter(
        l_freq=1.,
        h_freq=40.,
        picks=['eeg', 'eog', 'ecg'],
        method="fir",
        phase="zero-double",
        fir_window="hamming",
        fir_design="firwin",
        pad="edge",
        n_jobs=n_jobs,
    )

Turns out…

----------
n_jobs = 1
Mean: 0.41 s
STD: 0.108 s
----------
n_jobs = 2
Mean: 0.78 s
STD: 0.362 s
----------
-- CUDA --
Mean: 3.25 s
STD: 29.26 s
----------

How weird is that?

The CUDA mean and STD are very large because of the first call, which took more than 5 minutes to complete (341 seconds…). Subsequent calls are way faster, at 0.294 +/- 0.1386 s (mean +/- STD).

I also have the impression that CUDA is not worth it for basic applications like FFT and ICA, because there are pretty efficient algorithms/libraries available for CPU. But I haven't tried this in a while, so I'm interested in what you find. Unfortunately, since the GPUs are not in your local machine, you can't even use them for gaming in case there are no convincing speed-ups for your code 😄.
