Child process terminated unexpectedly with exit code -11 in patch motion correction

Hello cryoSPARC team,

For some time now, we have been getting this error during patch motion correction:
Child process with PID XXXXXX terminated unexpectedly with exit code -11.

We have tried rolling back to CS 4.4, and our IT person has troubleshot a lot of other things. This error happens on both of our workstations, with different GPUs and on different datasets. The crashes always happen at a random movie; the GPUs crash sequentially (the remaining ones can keep running for several minutes after the first one crashes), and sometimes the job runs for 5 minutes, sometimes for 15.
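
In case it helps, my understanding is that a negative exit code like this is the number of the signal that killed the child process, so -11 should correspond to signal 11, i.e. a segmentation fault:

kill -l 11
# prints: SEGV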

Surprisingly, our cluster setup is fine and doesn’t show this error. I was hoping you could shed some light on this issue.
You can find one of the logs below:

/mnt/tesla/data/cryosparc/4.5.3/worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/cudadrv/nvrtc.py:257: UserWarning: NVRTC log messages whilst compiling kernel:

kernel(18): warning #177-D: variable "sd" was declared but never referenced

kernel(18): warning #177-D: variable "o" was declared but never referenced


  warnings.warn(msg)
/mnt/tesla/data/cryosparc/4.5.3/worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 12 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
/mnt/tesla/data/cryosparc/4.5.3/worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 12 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
gpufft: creating new cufft plan (plan id 4   pid 441338) 
	gpu_id  0 
	ndims   2 
	dims    5832 5832 0 
	inembed 5832 2917 0 
	istride 1 
	idist   17011944 
	onembed 5832 5834 0 
	ostride 1 
	odist   34023888 
	batch   1 
	type    C2R 
	wkspc   manual 
	Python traceback:

gpufft: creating new cufft plan (plan id 4   pid 441339) 
	gpu_id  1 
	ndims   2 
	dims    5832 5832 0 
	inembed 5832 2917 0 
	istride 1 
	idist   17011944 
	onembed 5832 5834 0 
	ostride 1 
	odist   34023888 
	batch   1 
	type    C2R 
	wkspc   manual 
	Python traceback:

/mnt/tesla/data/cryosparc/4.5.3/worker/cryosparc_compute/jobs/pipeline.py:59: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
  return self.process(item)
========= sending heartbeat at 2024-06-18 13:18:33.381849
/mnt/tesla/data/cryosparc/4.5.3/worker/cryosparc_compute/jobs/pipeline.py:59: UserWarning: Cannot manually free CUDA array; will be freed when garbage collected
  return self.process(item)
========= sending heartbeat at 2024-06-18 13:18:43.399320
========= sending heartbeat at 2024-06-18 13:18:53.419293
========= sending heartbeat at 2024-06-18 13:19:03.439589
<string>:1: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
========= sending heartbeat at 2024-06-18 13:19:13.459194
========= sending heartbeat at 2024-06-18 13:19:23.479220
========= sending heartbeat at 2024-06-18 13:19:33.500499
========= sending heartbeat at 2024-06-18 13:19:43.521875
========= sending heartbeat at 2024-06-18 13:19:53.534256
========= sending heartbeat at 2024-06-18 13:20:03.551332
========= sending heartbeat at 2024-06-18 13:20:13.571479
========= sending heartbeat at 2024-06-18 13:20:23.592214
========= sending heartbeat at 2024-06-18 13:20:33.612311
========= sending heartbeat at 2024-06-18 13:20:43.633280
========= sending heartbeat at 2024-06-18 13:20:53.653677
========= sending heartbeat at 2024-06-18 13:21:03.673853
========= sending heartbeat at 2024-06-18 13:21:13.694523
========= sending heartbeat at 2024-06-18 13:21:23.707346
========= sending heartbeat at 2024-06-18 13:21:33.719378
========= sending heartbeat at 2024-06-18 13:21:43.739718
========= sending heartbeat at 2024-06-18 13:21:53.759719
========= sending heartbeat at 2024-06-18 13:22:03.779323
========= sending heartbeat at 2024-06-18 13:22:13.799345
========= sending heartbeat at 2024-06-18 13:22:23.818677
========= sending heartbeat at 2024-06-18 13:22:33.839321
========= sending heartbeat at 2024-06-18 13:22:43.859212
========= sending heartbeat at 2024-06-18 13:22:53.879008
========= sending heartbeat at 2024-06-18 13:23:03.900074
========= sending heartbeat at 2024-06-18 13:23:13.920176
========= sending heartbeat at 2024-06-18 13:23:23.939329
  ========= heartbeat failed at 2024-06-18 13:23:23.969495: 
========= sending heartbeat at 2024-06-18 13:23:33.979649
  ========= heartbeat failed at 2024-06-18 13:23:33.989203: 
========= sending heartbeat at 2024-06-18 13:23:43.999350
  ========= heartbeat failed at 2024-06-18 13:23:44.011054: 
 ************* Connection to cryosparc command lost. Heartbeat failed 3 consecutive times at 2024-06-18 13:23:44.011102.
/mnt/tesla/data/cryosparc/4.5.3/worker/bin/cryosparcw: line 150: 441301 Killed                  python -c "import cryosparc_compute.run as run; run.run()" "$@"

Let me know if you require any more information.

Thank you very much and best regards.

Welcome to the forum @Arpind.
Please can you describe the setup on which this error occurred:

  • is the GPU computer separate from the CryoSPARC master computer?
  • what are the outputs of these commands on the CryoSPARC master computer
    cryosparcm status | grep HOST
    free -h
    cat /sys/kernel/mm/transparent_hugepage/enabled
    

Thank you for your reply, @wtempel.

This is a workstation setup, so the GPUs and the CryoSPARC master are all on the same computer.

Here are the outputs of the commands.

cryosparc@tesla:~$ ~/4.5.3/master/bin/cryosparcm status | grep HOST
export CRYOSPARC_MASTER_HOSTNAME="tesla.campus.mcgill.ca"

cryosparc@tesla:~$ free -h
                     total        used        free      shared  buff/cache   available
Mem:           503Gi       5.6Gi       185Gi       7.0Mi       312Gi       494Gi
Swap:          6.0Gi       229Mi       5.8Gi

cryosparc@tesla:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

cryosparc@tesla:~$ 

Thanks for the information @Arpind.
What are the outputs of these commands (run on tesla):

host tesla.campus.mcgill.ca
cryosparcm status | grep PORT
ps -eo pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo

No problem, @wtempel.

Here are the outputs:

cryosparc@tesla:~$ host tesla.campus.mcgill.ca
tesla.campus.mcgill.ca has address 132.206.28.252

cryosparc@tesla:~$ cryosparcm status | grep PORT
export CRYOSPARC_BASE_PORT=61000

cryosparc@tesla:~$ ps -eo pid,ppid,start,rsz,vsz,cmd | grep -e cryosparc_ -e mongo
 421461       1   Jun 18 20840  41124 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /mnt/tesla/data/cryosparc/4.5.3/master/supervisord.conf
 421571  421461   Jun 18 829316 2308820 mongod --auth --dbpath /mnt/tesla/data/cryosparc/3.3.0/cryosparc_database --port 61001 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
 421678  421461   Jun 18 92664 149280 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:61002 cryosparc_command.command_core:start() -c /mnt/tesla/data/cryosparc/4.5.3/master/gunicorn.conf.py
 421679  421678   Jun 18 119700 915664 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn -n command_core -b 0.0.0.0:61002 cryosparc_command.command_core:start() -c /mnt/tesla/data/cryosparc/4.5.3/master/gunicorn.conf.py
 422222  421461   Jun 18 92464 149380 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:61003 -c /mnt/tesla/data/cryosparc/4.5.3/master/gunicorn.conf.py
 422233  422222   Jun 18 271032 1304232 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_vis:app -n command_vis -b 0.0.0.0:61003 -c /mnt/tesla/data/cryosparc/4.5.3/master/gunicorn.conf.py
 422246  421461   Jun 18 92728 149380 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:61005 -c /mnt/tesla/data/cryosparc/4.5.3/master/gunicorn.conf.py
 422247  422246   Jun 18 227392 1028904 python /mnt/tesla/data/cryosparc/4.5.3/master/deps/anaconda/envs/cryosparc_master_env/bin/gunicorn cryosparc_command.command_rtp:start() -n command_rtp -b 0.0.0.0:61005 -c /mnt/tesla/data/cryosparc/4.5.3/master/gunicorn.conf.py
 422280  421461   Jun 18 144820 1152764 /mnt/tesla/data/cryosparc/4.5.3/master/cryosparc_app/nodejs/bin/node ./bundle/main.js
2137526 2057350 13:45:06  2360   9076 grep --color=auto -e cryosparc_ -e mongo

cryosparc@tesla:~$ 

Thank you for the support.

@Arpind Please can you identify the command_core-related log files that contain logs from the time window when the heartbeat failed by running the following command on tesla:

grep -l "^2024-06-18 1" /mnt/tesla/data/cryosparc/4.5.3/master/run/command_core*

then make copies of the identified log file(s) and send us the compressed copy or copies as an email attachment. I will send you a private message with our email address.
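
For example, something along these lines should work for packaging them (the file name below is only a placeholder; use whichever files the grep command above actually lists):

# copy the identified command_core log file(s) into a temporary directory
mkdir -p ~/command_core_logs_20240618
cp /mnt/tesla/data/cryosparc/4.5.3/master/run/command_core.log ~/command_core_logs_20240618/
# compress the copies into a single archive to attach to the email
tar -czf ~/command_core_logs_20240618.tar.gz -C ~ command_core_logs_20240618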