Strange Patch Motion error with RTX 6000 Pro upgrade

Hi CryoSPARC team,

We upgraded one of our boxes to a pair of RTX 6000 Pro cards, and during a test run, micrographs from a previously fine dataset randomly started failing to process. They read fine via RELION, and on another system, and the error ("CUDA_ERROR_LAUNCH_TIMEOUT") points to a problem with CUDA rather than with the micrographs themselves. Network activity is consistent with the micrograph being read, after which nothing is actually done with it.

Error example:

[CPU:  565.7 MB  Avail:1316.99 GB]

Error occurred while processing J1/imported/007341578134129898814_FoilHole_28879811_Data_28856365_46_20250728_223630_EER.eer
Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
    return self.process(item)
           ^^^^^^^^^^^^^^^^^^
  File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/jobs/motion_correction/patchmotion.py", line 332, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/gpu/gpucore.py", line 308, in compute.gpu.gpucore.EngineBaseThread.__init__
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
           ^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/api.py", line 330, in stream
    return current_context().create_stream()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1480, in create_stream
    handle = driver.cuStreamCreate(flags)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [702] Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT

Marking J1/imported/007341578134129898814_FoilHole_28879811_Data_28856365_46_20250728_223630_EER.eer as incomplete and continuing...

Section of the job log where errors are occurring:

ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:07,300 core                 heartbeat        INFO   | ========= Updating heartbeat
2026-03-26 02:01:13,317 numba.cuda.cudadrv.d _check_cuda_pyth ERROR  | Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:17,320 core                 heartbeat        INFO   | ========= Updating heartbeat
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:25,217 numba.cuda.cudadrv.d _check_cuda_pyth ERROR  | Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
2026-03-26 02:01:27,339 core                 heartbeat        INFO   | ========= Updating heartbeat
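One way to tell whether these failures cluster in time (suggesting a driver or hardware hiccup) rather than following particular micrographs is to tally the CUDA errors in the job log. A quick sketch; the timestamp and message format are assumed from the excerpt above:

```python
import re
from collections import Counter
from datetime import datetime

# Matches log lines like:
# 2026-03-26 02:01:13,317 numba.cuda.cudadrv.d _check_cuda_pyth ERROR | ... CUDA_ERROR_LAUNCH_TIMEOUT
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*ERROR.*(CUDA_ERROR_\w+)"
)

def tally_cuda_errors(log_text):
    """Return (Counter of CUDA error names, list of timestamps) from a job log."""
    counts = Counter()
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            counts[m.group(2)] += 1
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return counts, times
```

If the timestamps bunch up after a specific moment, that points at something going wrong on the box at that time rather than at the input data.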

There is a GPU-related error in dmesg, but it occurred days earlier, and more than 2,000 micrographs ran successfully after it.

[  687.296703] [drm:__nv_drm_nvkms_gem_obj_init [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008600] Failed to get memory pages for NvKmsKapiMemory 0x00000000576429c7
[  687.299550] [drm:__nv_drm_nvkms_gem_obj_init [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008600] Failed to get memory pages for NvKmsKapiMemory 0x00000000576429c7

Output of nvidia-smi:

❯ nvidia-smi
Thu Mar 26 09:23:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.45.04              Driver Version: 595.45.04      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:86:00.0 Off |                  Off |
| 30%   54C    P1             69W /  300W |    3457MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:AF:00.0 Off |                  Off |
| 30%   58C    P1             77W /  300W |    4351MiB /  97887MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           17843      G   /usr/lib/xorg/Xorg                      217MiB |
|    0   N/A  N/A          275468      C   .../.pixi/envs/worker/bin/python       3188MiB |
|    1   N/A  N/A           17843      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A          275469      C   .../.pixi/envs/worker/bin/python       4328MiB |
+-----------------------------------------------------------------------------------------+

I understand that one of the 595-series releases had a few bugs, so I'm going to check for updates and reboot, but if you've got any ideas, I'd be glad to read them.

Thanks in advance.

After updating to 595.58.03, no CUDA_ERROR_LAUNCH_TIMEOUT errors yet… but it's processing more slowly than the previous cards did: ~2,000 micrographs in ~18 hours (one GPU). :scream:

Will try another dataset as a sanity check.
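In case the slowdown is the cards throttling rather than anything in CryoSPARC, the throttle reasons and clocks can be checked while a job is running. A sketch (field names per nvidia-smi's query interface; requires the NVIDIA driver tools on the box):

```shell
# Current SM clock and any active throttle reasons, per GPU
nvidia-smi --query-gpu=index,clocks.sm,clocks_throttle_reasons.active --format=csv

# Fuller breakdown (thermal, power, sync boost, etc.)
nvidia-smi -q -d PERFORMANCE
```

If a throttle reason other than idle shows up under load, that would explain slower-than-expected throughput.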

Hm. Thought the driver update had fixed it. Maybe not.

Maybe some sort of CUDA memory leak? The run looked OK until about two hours ago; now every micrograph fails with this error:

Error occurred while processing J1/imported/008629463230463898891_FoilHole_28898038_Data_28856375_14_20250729_081649_EER.eer
Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
    return allocator()
           ^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
    return self.process(item)
           ^^^^^^^^^^^^^^^^^^
  File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/jobs/motion_correction/patchmotion.py", line 482, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/gpu/gpucore.py", line 399, in compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/cryosparcer/bin/cryosparc_worker/compute/gpu/gpuarray.py", line 377, in empty
    return device_array(shape, dtype, stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/compute/gpu/gpuarray.py", line 333, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/compute/gpu/gpuarray.py", line 122, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
    return self.memory_manager.memalloc(bytesize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
    ptr = self._attempt_allocation(allocator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
    return allocator()
           ^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

Marking J1/imported/008629463230463898891_FoilHole_28898038_Data_28856375_14_20250729_081649_EER.eer as incomplete and continuing...

I don't think 96 GB GPUs will be running out of VRAM processing micrographs that can be motion corrected on a 16 GB card…
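For what it's worth, a back-of-envelope check supports that. Even under the deliberately pessimistic assumption that all 756 raw frames of an EER movie were held simultaneously as uncompressed float32 at the sensor resolution of 4096 x 4096 (in practice frames are grouped into far fewer fractions before alignment), the stack would still fit comfortably in 96 GB:

```python
# Worst-case VRAM estimate for one EER movie: every raw frame resident
# at once as float32 at full sensor resolution.
frames = 756             # from the job log above
width = height = 4096    # EER sensor resolution
bytes_per_pixel = 4      # float32

total_bytes = frames * width * height * bytes_per_pixel
total_gib = total_bytes / 1024**3
print(f"{total_gib:.1f} GiB")  # ~47 GiB, well under 96 GB
```

So a genuine out-of-VRAM condition on these cards would imply something is not being freed, not that the workload is too big.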

I cut down the number of EER fractions, wondering if that might be the cause. No luck; it's still throwing the CUDA_ERROR_LAUNCH_TIMEOUT error regularly:

Error occurred while processing J29/imported/008708589046453274863_FoilHole_28884487_Data_28856370_52_20250729_000925_EER.eer
Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
    return self.process(item)
           ^^^^^^^^^^^^^^^^^^
  File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/jobs/motion_correction/patchmotion.py", line 332, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/gpu/gpucore.py", line 308, in compute.gpu.gpucore.EngineBaseThread.__init__
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
           ^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/api.py", line 330, in stream
    return current_context().create_stream()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1480, in create_stream
    handle = driver.cuStreamCreate(flags)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [702] Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT

Marking J29/imported/008708589046453274863_FoilHole_28884487_Data_28856370_52_20250729_000925_EER.eer as incomplete and continuing...

At this point I'm basically out of ideas. I might try the cards in another system, but that will have to wait at least a month.
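One thing that might help narrow it down before moving hardware: a minimal reproducer for the exact call that's failing (cuStreamCreate via numba), run outside of CryoSPARC in the worker's Python environment. A sketch, guarded so it degrades gracefully on a machine without numba or a GPU:

```python
# Minimal reproducer for the failing call (cuStreamCreate), independent of
# CryoSPARC. Run inside the cryosparc_worker Python environment.
try:
    from numba import cuda
    from numba.cuda.cudadrv.driver import CudaAPIError
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

def stream_smoke_test(n=1000):
    """Create n CUDA streams in a row; return a status string."""
    if not HAVE_NUMBA:
        return "numba not installed"
    if not cuda.is_available():
        return "CUDA not available on this machine"
    try:
        for _ in range(n):
            cuda.stream()  # numba issues cuStreamCreate under the hood
    except CudaAPIError as exc:
        return f"failed: {exc}"
    return f"created {n} streams OK"

if __name__ == "__main__":
    print(stream_smoke_test())
```

If this fails on its own with the same [702] error, the problem is in the driver/hardware layer and CryoSPARC can be ruled out.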

Are these new cards significantly higher power draw than the ones they replaced? I've seen weird errors when cards push PSU limits: a PSU running close to its overall rated capacity, or to the capacity of its 12V lines (if it has multiple rails). I also once forgot to plug in all the supplemental power connectors on a motherboard (it had three, not two, as I'd assumed); I had no problems for years, then started seeing sporadic failures until I plugged in the extra connector.

You can try artificially limiting the power ceiling using nvidia-smi and see if that helps. I often do this anyway to keep the cards cooler and quieter, since they hit diminishing returns as they approach their power ceiling (though I haven't had the chance to use the RTX Pro 6000s).
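To second that, a sketch of how to cap the limit with nvidia-smi (the 250 W figure is just an example; the valid range is reported per card, the commands need root, and the setting resets on reboot unless persistence mode is on):

```shell
# Show the current and allowed power limits for each GPU
nvidia-smi --query-gpu=index,name,power.limit,power.min_limit,power.max_limit --format=csv

# Keep the driver loaded so the setting sticks between jobs
sudo nvidia-smi -pm 1

# Cap GPU 0 at 250 W (example value; must be within the reported min/max)
sudo nvidia-smi -i 0 -pl 250
```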

Good idea, but sadly they have the same power limit as the cards that were in there previously: 300 W. I avoid cards that draw more than 300 W; I really don't trust 600 W through those small 12VHPWR pins after all the meltdown reports online.

After a few false starts, all micrographs are now motion corrected. The first Patch Motion job managed 8,400 mics; marking it as complete and starting another Patch Motion job with the incomplete exposures allowed another 4,220 to complete successfully, and in a third round of Patch Motion the rest completed OK.

Now all in Patch CTF, which (so far) seems to be behaving.