Strange Patch Motion error with RTX 6000 Pro upgrade

Hi CryoSPARC team,

We upgraded one of our boxes to a pair of RTX 6000 Pro cards, and during a test run, micrographs from a previously fine dataset randomly started failing to process. They read fine via RELION, and on another system, and the error ("CUDA_ERROR_LAUNCH_TIMEOUT") points to a problem with CUDA rather than with the micrographs themselves. Network activity is consistent with the micrograph being read, after which nothing is actually done with it.

Error example:

[CPU:  565.7 MB  Avail:1316.99 GB]

Error occurred while processing J1/imported/007341578134129898814_FoilHole_28879811_Data_28856365_46_20250728_223630_EER.eer
Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
    return self.process(item)
           ^^^^^^^^^^^^^^^^^^
  File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/jobs/motion_correction/patchmotion.py", line 332, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/gpu/gpucore.py", line 308, in compute.gpu.gpucore.EngineBaseThread.__init__
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
           ^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/api.py", line 330, in stream
    return current_context().create_stream()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1480, in create_stream
    handle = driver.cuStreamCreate(flags)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [702] Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT

Marking J1/imported/007341578134129898814_FoilHole_28879811_Data_28856365_46_20250728_223630_EER.eer as incomplete and continuing...

Section of the job log where errors are occurring:

ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:07,300 core                 heartbeat        INFO   | ========= Updating heartbeat
2026-03-26 02:01:13,317 numba.cuda.cudadrv.d _check_cuda_pyth ERROR  | Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:17,320 core                 heartbeat        INFO   | ========= Updating heartbeat
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:25,217 numba.cuda.cudadrv.d _check_cuda_pyth ERROR  | Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
2026-03-26 02:01:27,339 core                 heartbeat        INFO   | ========= Updating heartbeat
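One way to tell whether these failures cluster in time (suggesting a driver or hardware hiccup) rather than following particular micrographs is to tally the CUDA errors in the job log. A quick sketch; the timestamp and message format are assumed from the excerpt above:

```python
import re
from collections import Counter
from datetime import datetime

# Matches log lines like:
# 2026-03-26 02:01:13,317 numba.cuda.cudadrv.d _check_cuda_pyth ERROR | ... CUDA_ERROR_LAUNCH_TIMEOUT
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*ERROR.*(CUDA_ERROR_\w+)"
)

def tally_cuda_errors(log_text):
    """Return (Counter of CUDA error names, list of timestamps) from a job log."""
    counts = Counter()
    times = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            counts[m.group(2)] += 1
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return counts, times
```

If the timestamps bunch up after a specific moment, that points at something going wrong on the box at that time rather than at the input data.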

There is a GPU-related error in dmesg, but it occurred days earlier, and more than 2,000 micrographs ran successfully after it.

[  687.296703] [drm:__nv_drm_nvkms_gem_obj_init [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008600] Failed to get memory pages for NvKmsKapiMemory 0x00000000576429c7
[  687.299550] [drm:__nv_drm_nvkms_gem_obj_init [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008600] Failed to get memory pages for NvKmsKapiMemory 0x00000000576429c7

Output of nvidia-smi:

❯ nvidia-smi
Thu Mar 26 09:23:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.45.04              Driver Version: 595.45.04      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:86:00.0 Off |                  Off |
| 30%   54C    P1             69W /  300W |    3457MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:AF:00.0 Off |                  Off |
| 30%   58C    P1             77W /  300W |    4351MiB /  97887MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           17843      G   /usr/lib/xorg/Xorg                      217MiB |
|    0   N/A  N/A          275468      C   .../.pixi/envs/worker/bin/python       3188MiB |
|    1   N/A  N/A           17843      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A          275469      C   .../.pixi/envs/worker/bin/python       4328MiB |
+-----------------------------------------------------------------------------------------+

I understand that one of the 595-series releases had a few bugs, so I'm going to check for updates and reboot, but if you've got any ideas, I'd be glad to read them.

Thanks in advance.

After updating to 595.58.03, no CUDA_ERROR_LAUNCH_TIMEOUT errors yet… but it's processing more slowly than the previous cards did: ~2,000 micrographs in ~18 hours (one GPU). :scream:

Will try another dataset as a sanity check.
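In case the slowdown is the cards throttling rather than anything in CryoSPARC, the throttle reasons and clocks can be checked while a job is running. A sketch (field names per nvidia-smi's query interface; requires the NVIDIA driver tools on the box):

```shell
# Current SM clock and any active throttle reasons, per GPU
nvidia-smi --query-gpu=index,clocks.sm,clocks_throttle_reasons.active --format=csv

# Fuller breakdown (thermal, power, sync boost, etc.)
nvidia-smi -q -d PERFORMANCE
```

If a throttle reason other than idle shows up under load, that would explain slower-than-expected throughput.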

Hm. Thought the driver update had fixed it. Maybe not.

Maybe some sort of CUDA memory leak? The run looked OK until about two hours ago; now every micrograph fails with this error:

Error occurred while processing J1/imported/008629463230463898891_FoilHole_28898038_Data_28856375_14_20250729_081649_EER.eer
Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 851, in _attempt_allocation
    return allocator()
           ^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
    return self.process(item)
           ^^^^^^^^^^^^^^^^^^
  File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/jobs/motion_correction/patchmotion.py", line 482, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/gpu/gpucore.py", line 399, in compute.gpu.gpucore.EngineBaseThread.ensure_allocated
  File "/home/cryosparcer/bin/cryosparc_worker/compute/gpu/gpuarray.py", line 377, in empty
    return device_array(shape, dtype, stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/compute/gpu/gpuarray.py", line 333, in device_array
    arr = GPUArray(shape=shape, strides=strides, dtype=dtype, stream=stream)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/compute/gpu/gpuarray.py", line 122, in __init__
    super().__init__(shape, strides, dtype, stream, gpu_data)
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1372, in memalloc
    return self.memory_manager.memalloc(bytesize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1056, in memalloc
    ptr = self._attempt_allocation(allocator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 863, in _attempt_allocation
    return allocator()
           ^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1054, in allocator
    return driver.cuMemAlloc(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

Marking J1/imported/008629463230463898891_FoilHole_28898038_Data_28856375_14_20250729_081649_EER.eer as incomplete and continuing...

I don't think 96 GB GPUs will be running out of VRAM processing micrographs that can be motion corrected on a 16 GB card…
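For what it's worth, a back-of-envelope check supports that. Even under the deliberately pessimistic assumption that all 756 raw frames of an EER movie were held simultaneously as uncompressed float32 at the sensor resolution of 4096 x 4096 (in practice frames are grouped into far fewer fractions before alignment), the stack would still fit comfortably in 96 GB:

```python
# Worst-case VRAM estimate for one EER movie: every raw frame resident
# at once as float32 at full sensor resolution.
frames = 756             # from the job log above
width = height = 4096    # EER sensor resolution
bytes_per_pixel = 4      # float32

total_bytes = frames * width * height * bytes_per_pixel
total_gib = total_bytes / 1024**3
print(f"{total_gib:.1f} GiB")  # ~47 GiB, well under 96 GB
```

So a genuine out-of-VRAM condition on these cards would imply something is not being freed, not that the workload is too big.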

I cut down the number of EER fractions, wondering if that might be the cause. No luck; it's still throwing the CUDA_ERROR_LAUNCH_TIMEOUT error regularly:

Error occurred while processing J29/imported/008708589046453274863_FoilHole_28884487_Data_28856370_52_20250729_000925_EER.eer
Traceback (most recent call last):
  File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
    return self.process(item)
           ^^^^^^^^^^^^^^^^^^
  File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/jobs/motion_correction/patchmotion.py", line 332, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
  File "compute/gpu/gpucore.py", line 308, in compute.gpu.gpucore.EngineBaseThread.__init__
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
    return fn(*args, **kws)
           ^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/api.py", line 330, in stream
    return current_context().create_stream()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1480, in create_stream
    handle = driver.cuStreamCreate(flags)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [702] Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT

Marking J29/imported/008708589046453274863_FoilHole_28884487_Data_28856370_52_20250729_000925_EER.eer as incomplete and continuing...

At this point I'm basically out of ideas. I might try the cards in another system, but that will have to wait at least a month.
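One thing that might help narrow it down before moving hardware: a minimal reproducer for the exact call that's failing (cuStreamCreate via numba), run outside of CryoSPARC in the worker's Python environment. A sketch, guarded so it degrades gracefully on a machine without numba or a GPU:

```python
# Minimal reproducer for the failing call (cuStreamCreate), independent of
# CryoSPARC. Run inside the cryosparc_worker Python environment.
try:
    from numba import cuda
    from numba.cuda.cudadrv.driver import CudaAPIError
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

def stream_smoke_test(n=1000):
    """Create n CUDA streams in a row; return a status string."""
    if not HAVE_NUMBA:
        return "numba not installed"
    if not cuda.is_available():
        return "CUDA not available on this machine"
    try:
        for _ in range(n):
            cuda.stream()  # numba issues cuStreamCreate under the hood
    except CudaAPIError as exc:
        return f"failed: {exc}"
    return f"created {n} streams OK"

if __name__ == "__main__":
    print(stream_smoke_test())
```

If this fails on its own with the same [702] error, the problem is in the driver/hardware layer and CryoSPARC can be ruled out.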

Are these new cards significantly higher power draw than the ones they replaced? I've seen weird errors when cards push PSU limits: a PSU running close to its overall rated capacity, or to the capacity of its 12V lines (if it has multiple rails). I also once forgot to plug in all the supplemental power connectors on a motherboard (it had three, not two, as I'd assumed); I had no problems for years, then started seeing sporadic failures until I plugged in the extra connector.

You can try artificially limiting the power ceiling using nvidia-smi and see if that helps. I often do this anyway to keep the cards cooler and quieter, since they hit diminishing returns as they approach their power ceiling (though I haven't had the chance to use the RTX Pro 6000s).
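To second that, a sketch of how to cap the limit with nvidia-smi (the 250 W figure is just an example; the valid range is reported per card, the commands need root, and the setting resets on reboot unless persistence mode is on):

```shell
# Show the current and allowed power limits for each GPU
nvidia-smi --query-gpu=index,name,power.limit,power.min_limit,power.max_limit --format=csv

# Keep the driver loaded so the setting sticks between jobs
sudo nvidia-smi -pm 1

# Cap GPU 0 at 250 W (example value; must be within the reported min/max)
sudo nvidia-smi -i 0 -pl 250
```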

Good idea, but sadly they have the same power limit as the cards that were in there previously: 300 W. I avoid cards that draw more than 300 W; I really don't trust 600 W through those small 12VHPWR pins after all the meltdown reports online.

After a few false starts, all micrographs are now motion corrected. The first Patch Motion job managed 8,400 mics; marking it as complete and starting another Patch Motion job with the incomplete exposures allowed another 4,220 to complete successfully, and in a third round of Patch Motion the rest completed OK.

Now all in Patch CTF, which (so far) seems to be behaving.