pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered

Hello All,
I am running into a strange issue while running the Patch Motion Correction job. Midway through the job, after successfully processing a number of images, it fails with a CUDA error that I do not fully understand.

Here is a part of the event log.

##################################
 [CPU: 3.78 GB]

-- 0.0: processing 42 of 1000: J11/imported/009991550500362512406_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00009enn.frames.tif
        loading /home/uwm/tmalla/Data/tmalla/Work/NYSBC/2022/NOV/cryosparc-jobs/CS-phytochrome-pcm/J11/imported/009991550500362512406_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00009enn.frames.tif
        Loading raw movie data from J11/imported/009991550500362512406_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00009enn.frames.tif ...
        Done in 14.45s
        Loading gain data from J11/imported/m22nov25a_25123045_01_8184x11520_norm_0.mrc ...
        Done in 0.00s
        Processing ...
[CPU: 4.14 GB]

-- 0.0: processing 43 of 1000: J11/imported/003194348880911734508_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00010enn.frames.tif
        loading /home/2022/NOV/cryosparc-jobs/CS-pcm/J11/imported/003194348880911734508_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00010enn.frames.tif
        Loading raw movie data from J11/imported/003194348880911734508_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00010enn.frames.tif ...
        Done in 16.83s
        Loading gain data from J11/imported/m22nov25a_25123045_01_8184x11520_norm_0.mrc ...
        Done in 0.00s
        Processing ...
[CPU: 367.1 MB]

Error occurred while processing J11/imported/009991550500362512406_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00009enn.frames.tif
Traceback (most recent call last):
  File "/tank/data/Programs/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
    return self.process(item)
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 177, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 180, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 182, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 255, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 669, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 313, in cryosparc_compute.engine.cuda_core.EngineBaseThread.toc
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 309, in cryosparc_compute.engine.cuda_core.EngineBaseThread.wait
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered

Marking J11/imported/009991550500362512406_m22nov25a_g_00013gr_00065sq940_v01_00005hl_00009enn.frames.tif as incomplete and continuing...
#############################################################
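
From what I can gather, the error is only reported when the worker waits on its CUDA stream, so the actual illegal access presumably happens in an earlier asynchronous GPU operation rather than in the cuda_core.py call shown in the traceback. The toy pycuda snippet below (written only for illustration, deliberately broken, and completely unrelated to CryoSPARC's own code) shows the same symptom on most setups:

# Toy example: a kernel that reads far outside its buffer.  The launch itself
# returns without complaint; the illegal memory access is only reported when
# the stream is synchronized, matching the cuStreamSynchronize failure above.
import numpy as np
import pycuda.autoinit                    # creates a CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void bad_read(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i + 100000000];   // far past the 256-float buffer
}
""")
bad_read = mod.get_function("bad_read")

n = 256
src = cuda.mem_alloc(n * 4)               # 256 floats
dst = cuda.mem_alloc(n * 4)
stream = cuda.Stream()

# The kernel launch is asynchronous and usually reports no error here ...
bad_read(src, dst, np.int32(n), block=(n, 1, 1), grid=(1, 1), stream=stream)

# ... the illegal access typically only surfaces here, at synchronization:
# pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access ...
stream.synchronize()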

I have separated the raw images into blocks of 1000 each and submit the jobs in parallel. Some jobs run to completion, while others fail.
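
For reference, something along these lines is enough to reproduce the split into blocks (the paths here are placeholders, not my actual directories):

# Hypothetical helper, just to illustrate the block layout: symlink the raw
# .tif movies into block_0000, block_0001, ... directories of 1000 files each,
# so that each block can be imported and motion-corrected as a separate job.
from pathlib import Path

src_dir = Path("/path/to/raw_movies")     # placeholder: directory with all movies
dst_root = Path("/path/to/blocks")        # placeholder: where the blocks go
block_size = 1000

movies = sorted(src_dir.glob("*.frames.tif"))
for i, movie in enumerate(movies):
    block_dir = dst_root / f"block_{i // block_size:04d}"
    block_dir.mkdir(parents=True, exist_ok=True)
    link = block_dir / movie.name
    if not link.exists():
        link.symlink_to(movie.resolve())  # link instead of copying the raw data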

Are there any other workloads competing with CryoSPARC for GPU resources?
What is the movie format?
Please can you provide, for the worker in question, the output of the command uname -a && free -g && nvidia-smi?

All of these jobs are sent to the same GPU node, and nothing other than my CryoSPARC jobs is using that node.
The movies are in *.tif format. I had to turn on the “Skip header check” option to proceed.
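
If it is useful, I can also run a quick sanity check over the imported movies, assuming the tifffile Python package is available (the path is a placeholder), with something like this:

# Rough sanity check: open every imported movie, report frame count and
# frame shape, and flag any file whose TIFF structure cannot be read at all.
from pathlib import Path
import tifffile

movie_dir = Path("J11/imported")          # placeholder: adjust to the job path
for tif_path in sorted(movie_dir.glob("*.frames.tif")):
    try:
        with tifffile.TiffFile(tif_path) as tif:
            n_frames = len(tif.pages)
            frame_shape = tif.pages[0].shape
        print(f"{tif_path.name}: {n_frames} frames, shape {frame_shape}")
    except Exception as exc:
        print(f"{tif_path.name}: FAILED to read ({exc})")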

Here is the output of the command:

[user@execute-3000 ~]$ uname -a && free -g && nvidia-smi
Linux execute-3000.mortimer.hpc 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            251          17         185           0          48         231
Swap:             1           0           1
Thu Dec  8 14:04:33 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   48C    P0   262W / 250W |  14626MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   40C    P0    65W / 250W |    463MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    545186      C   python                           7311MiB |
|    0   N/A  N/A    545187      C   python                           7311MiB |
|    1   N/A  N/A    545384      C   python                            461MiB |
+-----------------------------------------------------------------------------+
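
If it helps, I can also log GPU memory and utilization while one of the failing blocks runs, using a small polling script along these lines (my own quick script, not a CryoSPARC tool):

# Quick-and-dirty GPU monitor: poll nvidia-smi every 10 seconds and append
# the numbers to a CSV, so a failure can later be lined up against GPU
# memory use at the time it happened.
import csv, subprocess, time

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]

with open("gpu_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu", "mem_used_MiB", "mem_total_MiB", "util_pct"])
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for line in out.strip().splitlines():
            writer.writerow([field.strip() for field in line.split(",")])
        f.flush()
        time.sleep(10)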


@73km May I also ask:

  1. In a job where some movies fail, do all movies fail?
  2. Do movies fail randomly, or do movies that failed once also fail on subsequent attempts (for example, after splitting the data set into “blocks” of images), while movies that were processed successfully once are also processed successfully in subsequent runs?
  3. If failures do not happen randomly but occur consistently for specific movies, would you be willing to share a “failing” movie with us?
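
For (1) and (2), assuming you keep a text copy of each run's event log, a short script like the one below would list the movies marked incomplete in each run, making it easy to see whether the same movies fail every time (the pattern is taken from the “Marking … as incomplete” line in your excerpt above):

# Sketch: given text copies of the event logs from several runs, list which
# movies were marked incomplete in each.
import re
import sys
from pathlib import Path

PATTERN = re.compile(r"Marking (\S+) as incomplete")

for log_path in map(Path, sys.argv[1:]):
    failed = sorted(set(PATTERN.findall(log_path.read_text(errors="replace"))))
    print(f"{log_path}: {len(failed)} incomplete movies")
    for movie in failed:
        print("    ", movie)

Invoked as python list_incomplete.py run1_events.txt run2_events.txt, it prints one summary per log file, which can then be compared across repeated runs of the same block.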