Patch Motion Correction - RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY

eMKiso · November 10, 2023, 4:19pm

Hi all,

we just upgraded our CS from 4.3 to 4.4.
I tried to run Extensive validation, but the 3rd job (Patch Motion Correction) already fails with:

RuntimeError: Could not allocate GPU array: CUDA_ERROR_OUT_OF_MEMORY

I cloned this job and selected ‘Low-memory mode’, but got the same result. I can see that during job execution the GPU RAM usage rises above 99%.

This was not happening before the update.

Basic info:

single workstation
Current cryoSPARC version: v4.4.0
uname -a && free -g

Linux thorin-d11 4.15.0-128-generic #131-Ubuntu SMP Wed Dec 9 06:57:35 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            187           5         115           1          66         175
Swap:             0           0           0

nvidia-smi

Fri Nov 10 17:04:48 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080         Off| 00000000:18:00.0 Off |                  N/A |
| 33%   36C    P8                5W / 200W|      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080         Off| 00000000:3B:00.0 Off |                  N/A |
| 33%   30C    P8               10W / 200W|      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce GTX 1080         Off| 00000000:86:00.0 Off |                  N/A |
| 33%   27C    P8                6W / 200W|      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce GTX 1080         Off| 00000000:AF:00.0 Off |                  N/A |
| 33%   28C    P8                9W / 200W|      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

No other tasks were running on the GPUs at the time.

One perhaps interesting detail (I am not sure how it looked before the update): CS reports the correct model of the GPU card, but the GPU RAM is rounded up. See the screenshot:

From the nvidia-smi output above you can see that each card has 8192 MiB of RAM.
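
Note: one possible, unconfirmed reason for a figure that looks "rounded up" is a difference in units. nvidia-smi prints MiB, and the same capacity expressed in decimal GB is a larger number. Whether the CS interface actually converts this way is an assumption; the sketch below only shows the arithmetic, and `mib_to_units` is a hypothetical helper name.

```python
MIB = 1024 ** 2  # bytes in one mebibyte, the unit nvidia-smi prints


def mib_to_units(mib):
    """Return (GiB, decimal GB) for a memory size given in MiB."""
    total_bytes = mib * MIB
    return total_bytes / 1024 ** 3, total_bytes / 1000 ** 3


# 8192 MiB is the per-card value from the nvidia-smi output above.
gib, gb = mib_to_units(8192)
print(f"8192 MiB = {gib:.2f} GiB = {gb:.2f} GB")
```

So an 8192 MiB card is exactly 8 GiB but about 8.59 GB in decimal units, which may explain a larger number in a UI that uses decimal units.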

The workstation and the cryosparc have both been restarted since the update.
For the record, the sequence was: I first updated CS to 4.4, which reported an error about an unsupported version; I then updated the Nvidia drivers, restarted the workstation, and ended up with the error I am reporting here.

Any ideas?
Thanks!

wuhucryoem · November 16, 2023, 8:20am

I’m having the same issue. It seems that the GPU memory is not released after the previous movie has been processed, because the first one is processed fine. Maybe the staff can help, @mmclean.

eMKiso · November 16, 2023, 8:33am

Hi @wuhucryoem ,

in our case the first movie already fails, and so do all the following ones.

Best!

wuhucryoem · November 16, 2023, 8:49am

Or you can roll back to v4.3.1 with cryosparcm update --version=v4.3.1

eMKiso · November 16, 2023, 9:06am

Thanks!

We tried Full-frame motion correction and that one is working.

wuhucryoem · November 16, 2023, 10:07am

Once it is fixed, though, I think patch motion correction is much better than full-frame motion correction.

eMKiso · November 16, 2023, 10:23am

Sure, I agree!
We almost always use Local Motion Correction in later stages, so Full-frame motion correction should be fine until the issue is resolved. I already got a tip from the CS team that the issue is probably GPU VRAM, since the minimum required amount for CS is 11 GB. We have 8 GB VRAM.

So if nothing else works, we will probably need to upgrade the hardware.

wuhucryoem · November 17, 2023, 1:43am

But it’s weird that it runs well in v4.3.1 and lower versions. @mmclean

wtempel · November 17, 2023, 8:40pm

Have you tried:

1. cloning a Patch Motion Correction job that succeeded in v4.3.1, and that clone failed in CryoSPARC v4.4?
2. building from scratch in CryoSPARC v4.4 a Patch Motion Correction job that exactly replicates the inputs and parameters of the v4.3.1 job from (1.), and that newly created job failed in v4.4?

If so, please can you post, for the v4.3.1 job, what is shown under Metadata|Data for the instance_information.CUDA_version item?

wuhucryoem · November 18, 2023, 7:03am

Yeah, this is the job cloned from v4.3.1 and finished in v4.4. It seems that the CUDA versions don’t match. @wtempel

wuhucryoem · November 18, 2023, 7:09am

And this is the job running in v4.3.1. @wtempel

wtempel · November 20, 2023, 12:26am

Thanks for posting this information.

Please can you describe again the issue to which you referred in your earlier post?
It seems that the v4.4 job for which you posted instance_information did complete, whereas this topic is about CUDA_ERROR_OUT_OF_MEMORY, in which case I would expect the job not to complete.

driver_version and CUDA_version in this case do not need to match, but must be compatible.

wuhucryoem · November 20, 2023, 1:30am

My apologies, I had already rolled back the software by that time. I will post it again once the software is upgraded to v4.4.

eMKiso · November 20, 2023, 8:13am

Hi @wtempel ,

I tried to clone an older Patch Motion Correction job from CS 4.0.2 and here I got a different error:

Importing job module for job type patch_motion_correction_multi...

Job ready to run

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 54, in cryosparc_master.cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
  File "/...../cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 657, in load_input_group
    input_group = com.query(job['input_slot_groups'], lambda g : g['name'] == input_group_name, error='No match for %s in job %s' % (input_group_name, job['uid']))
  File "/....../cryosparc2_worker/cryosparc_compute/jobs/common.py", line 714, in query
    assert res != default, error
AssertionError: No match for doseweights in job J286

This means, in our case, a cloned Patch Motion Correction job also doesn’t work in CS 4.4.
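
For readers wondering why the traceback above ends in an AssertionError: the failing line looks up an input group by name and asserts that a match was found. A minimal sketch of that pattern follows. It is a hypothetical re-creation based only on the two lines visible in the traceback, not CryoSPARC's actual code, and the `movies` group name and the job dictionary are made up for illustration.

```python
def query(items, predicate, default=None, error=None):
    """Return the first item matching predicate; assert with `error` if none does."""
    res = next((item for item in items if predicate(item)), default)
    if error is not None:
        assert res != default, error
    return res


def load_input_group(job, input_group_name):
    """Look up one input slot group of a job by name (hypothetical helper)."""
    return query(
        job["input_slot_groups"],
        lambda g: g["name"] == input_group_name,
        error="No match for %s in job %s" % (input_group_name, job["uid"]),
    )


# Hypothetical job document cloned from an older version: it has no
# 'doseweights' input group, so the lookup fails as in the traceback.
old_job = {"uid": "J286", "input_slot_groups": [{"name": "movies"}]}
try:
    load_input_group(old_job, "doseweights")
except AssertionError as exc:
    print("AssertionError:", exc)
```

Under this reading, a job cloned from an older version would carry the old set of input groups, which would be consistent with the advice to build a new job instead of cloning.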

Best!

hsnyder · November 20, 2023, 10:23pm

Hi @eMKiso, the patch motion error from the cloned job is a bug. Thanks for bringing it to our attention. You can work around that issue in the meantime by creating a new job instead of cloning an old one.
– Harris

hsnyder · November 20, 2023, 10:29pm

While we didn’t make a change to patch motion correction in v4.4 that we’d expect to increase the memory demand from an algorithmic perspective, we did change the way we interface with CUDA, and there may have been a (small) change in the amount of memory overhead associated with that. I will also note, however, that the CryoSPARC minimum system requirements include a GPU with 11 GB of VRAM. If you’re experiencing this issue, please check how much VRAM your GPU(s) have. While we make some effort to support 8 GB cards as well, we don’t guarantee compatibility and don’t test on them. I’m sorry for the inconvenience in this regard.
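
To check the VRAM of each GPU against the 11 GB minimum mentioned above, something like the following sketch could be used. It relies on the nvidia-smi query options `--query-gpu` and `--format=csv,noheader,nounits`. Treating 11 GB as 11 × 1024 MiB is an assumption, and the function names are hypothetical.

```python
import subprocess

# Assumption: interpret the stated 11 GB minimum as 11 * 1024 MiB.
MIN_VRAM_MIB = 11 * 1024

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, name, memory.total' rows into (index, name, total_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, total = rest.rsplit(",", 1)
        gpus.append((int(index), name.strip(), int(total)))
    return gpus


def below_minimum(gpus, min_mib=MIN_VRAM_MIB):
    """Return the GPUs whose total memory is below the given minimum."""
    return [gpu for gpu in gpus if gpu[2] < min_mib]


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Example with the values from the nvidia-smi output earlier in this topic:
sample = "0, NVIDIA GeForce GTX 1080, 8192\n1, NVIDIA GeForce GTX 1080, 8192\n"
for index, name, total in below_minimum(parse_gpu_memory(sample)):
    print(f"GPU {index} ({name}): {total} MiB is below {MIN_VRAM_MIB} MiB")
```

On a live system, `below_minimum(query_gpus())` would list the cards that fall short; for the GTX 1080 cards in this topic, every card would be listed.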

eMKiso · November 21, 2023, 10:07am

Hi @hsnyder ,
thank you for bringing the 11 GB limit to my attention.

What about the ‘Low-memory mode’ switch that is available in Patch Motion Correction? Shouldn’t that help in this kind of situation?

Best!

hsnyder · November 21, 2023, 8:12pm

It will reduce the memory requirements, yes, but whether or not it will be enough depends on the data. Definitely give it a try!

eMKiso · November 21, 2023, 9:32pm

Hi, as mentioned in the original post of this topic, ‘Low-memory mode’ has no positive effect on our system.

wuhucryoem · November 22, 2023, 3:07am

Same here. I tried it in v4.4 and it doesn’t work.