Hello cryoSPARC team,
For some time now we have been getting the following error during Patch Motion Correction:
Child process with PID XXXXXX terminated unexpectedly with exit code -11.
We have tried rolling back to CS 4.4, and our IT person has troubleshot a lot of other things. The error happens on both of our workstations (which have different GPUs) and on different datasets. The crashes always occur at a random movie: the GPUs crash sequentially (the remaining ones keep running for several minutes after the first one fails), and a job will sometimes run for 5 minutes and sometimes for 15 minutes before crashing.
Surprisingly, our cluster setup is fine and doesn’t show this error. I was hoping you could shed some light on this issue.
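As far as I understand, exit code -11 means the child process was killed by signal 11 (SIGSEGV, i.e. a segmentation fault). This is not cryoSPARC code, just a generic snippet we used to confirm that mapping of negative return codes to signals:

```python
# Generic check (not cryoSPARC code): a negative return code from a Python
# child process is the negative of the signal number that killed it.
import signal
import subprocess

proc = subprocess.run(["bash", "-c", "kill -SEGV $$"])  # child kills itself with SIGSEGV
print(proc.returncode)                                  # -11
print(signal.Signals(-proc.returncode).name)            # SIGSEGV
```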
You can find below one of the logs:
/mnt/tesla/data/cryosparc/4.5.3/worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:
kernel(18): warning #177-D: variable "sd" was declared but never referenced
kernel(18): warning #177-D: variable "o" was declared but never referenced
warnings.warn(msg)
/mnt/tesla/data/cryosparc/4.5.3/worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 12 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
/mnt/tesla/data/cryosparc/4.5.3/worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 12 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
gpufft: creating new cufft plan (plan id 4 pid 441338)
gpu_id 0
ndims 2
dims 5832 5832 0
inembed 5832 2917 0
istride 1
idist 17011944
onembed 5832 5834 0
ostride 1
odist 34023888
batch 1
type C2R
wkspc manual
Python traceback:
gpufft: creating new cufft plan (plan id 4 pid 441339)
gpu_id 1
ndims 2
dims 5832 5832 0
inembed 5832 2917 0
istride 1
idist 17011944
onembed 5832 5834 0
ostride 1
odist 34023888
batch 1
type C2R
wkspc manual
Python traceback:
/mnt/tesla/data/cryosparc/4.5.3/worker/cryosparc_compute/jobs/pipeline.py:59: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
return self.process(item)
========= sending heartbeat at 2024-06-18 13:18:33.381849
/mnt/tesla/data/cryosparc/4.5.3/worker/cryosparc_compute/jobs/pipeline.py:59: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
return self.process(item)
========= sending heartbeat at 2024-06-18 13:18:43.399320
========= sending heartbeat at 2024-06-18 13:18:53.419293
========= sending heartbeat at 2024-06-18 13:19:03.439589
<string>:1: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
========= sending heartbeat at 2024-06-18 13:19:13.459194
========= sending heartbeat at 2024-06-18 13:19:23.479220
========= sending heartbeat at 2024-06-18 13:19:33.500499
========= sending heartbeat at 2024-06-18 13:19:43.521875
========= sending heartbeat at 2024-06-18 13:19:53.534256
========= sending heartbeat at 2024-06-18 13:20:03.551332
========= sending heartbeat at 2024-06-18 13:20:13.571479
========= sending heartbeat at 2024-06-18 13:20:23.592214
========= sending heartbeat at 2024-06-18 13:20:33.612311
========= sending heartbeat at 2024-06-18 13:20:43.633280
========= sending heartbeat at 2024-06-18 13:20:53.653677
========= sending heartbeat at 2024-06-18 13:21:03.673853
========= sending heartbeat at 2024-06-18 13:21:13.694523
========= sending heartbeat at 2024-06-18 13:21:23.707346
========= sending heartbeat at 2024-06-18 13:21:33.719378
========= sending heartbeat at 2024-06-18 13:21:43.739718
========= sending heartbeat at 2024-06-18 13:21:53.759719
========= sending heartbeat at 2024-06-18 13:22:03.779323
========= sending heartbeat at 2024-06-18 13:22:13.799345
========= sending heartbeat at 2024-06-18 13:22:23.818677
========= sending heartbeat at 2024-06-18 13:22:33.839321
========= sending heartbeat at 2024-06-18 13:22:43.859212
========= sending heartbeat at 2024-06-18 13:22:53.879008
========= sending heartbeat at 2024-06-18 13:23:03.900074
========= sending heartbeat at 2024-06-18 13:23:13.920176
========= sending heartbeat at 2024-06-18 13:23:23.939329
========= heartbeat failed at 2024-06-18 13:23:23.969495:
========= sending heartbeat at 2024-06-18 13:23:33.979649
========= heartbeat failed at 2024-06-18 13:23:33.989203:
========= sending heartbeat at 2024-06-18 13:23:43.999350
========= heartbeat failed at 2024-06-18 13:23:44.011054:
************* Connection to cryosparc command lost. Heartbeat failed 3 consecutive times at 2024-06-18 13:23:44.011102.
/mnt/tesla/data/cryosparc/4.5.3/worker/bin/cryosparcw: line 150: 441301 Killed python -c "import cryosparc_compute.run as run; run.run()" "$@"
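In case it is useful, I also checked the cuFFT plan parameters in the log myself, and they look self-consistent for a 5832 x 5832 C2R transform (assuming the usual half-spectrum layout), so the plan arguments at least do not look malformed:

```python
# My own sanity check on the plan parameters above (assuming the standard
# cuFFT half-spectrum layout for a 5832 x 5832 C2R transform).
N = 5832
half_cols = N // 2 + 1       # 2917 complex columns per row, matches inembed
print(N * half_cols)         # 17011944 -> matches idist
print(N * (2 * half_cols))   # 34023888 -> matches odist; 2 * 2917 = 5834 matches onembed
```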
Let me know if you require any more information.
Thank you very much and best regards.