Hi CryoSPARC team,
We upgraded one of our boxes to a pair of RTX 6000 Pro cards, and during a test run, micrographs from a previously OK dataset randomly started failing to process. They read fine via RELION, and on another system, and the error points to CUDA (“CUDA_ERROR_LAUNCH_TIMEOUT”) rather than to the micrographs themselves. Network activity is consistent with the micrograph being read, after which nothing further happens with it.
Error example:
[CPU: 565.7 MB Avail:1316.99 GB]
Error occurred while processing J1/imported/007341578134129898814_FoilHole_28879811_Data_28856365_46_20250728_223630_EER.eer
Traceback (most recent call last):
File "/home/cryosparcer/bin/cryosparc_worker/compute/pipeline.py", line 69, in exec
return self.process(item)
^^^^^^^^^^^^^^^^^^
File "compute/jobs/motion_correction/run_patch.py", line 225, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
File "compute/jobs/motion_correction/run_patch.py", line 228, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
File "compute/jobs/motion_correction/run_patch.py", line 270, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
File "compute/jobs/motion_correction/run_patch.py", line 235, in compute.jobs.motion_correction.run_patch.run_patch_motion_correction_multi.motionworker.process
File "compute/jobs/motion_correction/patchmotion.py", line 330, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
File "compute/jobs/motion_correction/patchmotion.py", line 332, in compute.jobs.motion_correction.patchmotion.unbend_motion_correction
File "compute/gpu/gpucore.py", line 308, in compute.gpu.gpucore.EngineBaseThread.__init__
File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context
return fn(*args, **kws)
^^^^^^^^^^^^^^^^
File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/api.py", line 330, in stream
return current_context().create_stream()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 1480, in create_stream
handle = driver.cuStreamCreate(flags)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 348, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cryosparcer/bin/cryosparc_worker/.pixi/envs/worker/lib/python3.12/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [702] Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
Marking J1/imported/007341578134129898814_FoilHole_28879811_Data_28856365_46_20250728_223630_EER.eer as incomplete and continuing...
Section of the job log where errors are occurring:
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:07,300 core heartbeat INFO | ========= Updating heartbeat
2026-03-26 02:01:13,317 numba.cuda.cudadrv.d _check_cuda_pyth ERROR | Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:17,320 core heartbeat INFO | ========= Updating heartbeat
ElectronCountedFramesDecompressor: reading using TIFF-EER mode.
ElectronCountedFramesDecompressor::prepareRead: found 756 frames in EER-TIFF file.
2026-03-26 02:01:25,217 numba.cuda.cudadrv.d _check_cuda_pyth ERROR | Call to cuStreamCreate results in CUDA_ERROR_LAUNCH_TIMEOUT
2026-03-26 02:01:27,339 core heartbeat INFO | ========= Updating heartbeat
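In case it helps narrow things down, the call that is failing is numba's cuda.stream() (which wraps cuStreamCreate, per the traceback above), so it can be probed outside CryoSPARC. This is just a minimal sketch I put together, assuming numba is importable from the worker's pixi environment; the probe function name is my own:

```python
def stream_create_probe():
    """Mimic the failing call path (numba cuda.stream -> cuStreamCreate)
    outside CryoSPARC, to separate a driver/GPU problem from anything
    in the micrograph pipeline."""
    try:
        from numba import cuda
    except ImportError:
        return "numba not available"
    if not cuda.is_available():
        return "no CUDA device visible"
    try:
        # Same driver call that raises error 702 in the job log.
        cuda.stream()
        return "ok"
    except Exception as e:
        return f"failed: {e}"


if __name__ == "__main__":
    print(stream_create_probe())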
There is an error in dmesg regarding the GPU, but it is from days earlier, and more than 2,000 micrographs processed successfully after it appeared.
[ 687.296703] [drm:__nv_drm_nvkms_gem_obj_init [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008600] Failed to get memory pages for NvKmsKapiMemory 0x00000000576429c7
[ 687.299550] [drm:__nv_drm_nvkms_gem_obj_init [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008600] Failed to get memory pages for NvKmsKapiMemory 0x00000000576429c7
Output of nvidia-smi:
❯ nvidia-smi
Thu Mar 26 09:23:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.45.04 Driver Version: 595.45.04 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... On | 00000000:86:00.0 Off | Off |
| 30% 54C P1 69W / 300W | 3457MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX PRO 6000 Blac... On | 00000000:AF:00.0 Off | Off |
| 30% 58C P1 77W / 300W | 4351MiB / 97887MiB | 12% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 17843 G /usr/lib/xorg/Xorg 217MiB |
| 0 N/A N/A 275468 C .../.pixi/envs/worker/bin/python 3188MiB |
| 1 N/A N/A 17843 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 275469 C .../.pixi/envs/worker/bin/python 4328MiB |
+-----------------------------------------------------------------------------------------+
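One thing I notice above is that Xorg holds a context on both compute GPUs. My understanding is that error 702 is normally the kernel-launch watchdog, which can abort long-running kernels on a GPU that is also driving a display. A quick way to check whether the watchdog is active (a sketch, assuming numba is available on the worker; the helper name is mine, and I believe KERNEL_EXEC_TIMEOUT is the standard numba device-attribute name):

```python
def watchdog_enabled(device_id=0):
    """Return True if the kernel-launch watchdog is enabled on the given
    GPU, False if not, or None if no CUDA stack is reachable."""
    try:
        from numba import cuda
    except ImportError:
        return None
    if not cuda.is_available():
        return None
    cuda.select_device(device_id)
    dev = cuda.get_current_device()
    # Maps to CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT:
    # 1 means the watchdog can kill long-running kernels on this GPU.
    return bool(dev.KERNEL_EXEC_TIMEOUT)


if __name__ == "__main__":
    for gpu in (0, 1):
        print(gpu, watchdog_enabled(gpu))
```

If it returns True, moving the display off the compute cards (or running headless) might be worth trying.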
I understand that one of the 595-series driver releases had a few bugs, so I’m going to check for updates and reboot, but if you’ve got any ideas, I’d be glad to hear them.
Thanks in advance.