"Child process with PID has terminated unexpectedly" during patch motion correction


I’ve been using cryoSPARC for a while and never encountered this problem before. After I finished importing movies, the Patch Motion Correction (Multi) job failed:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 6062 has terminated unexpectedly!

At first, I thought one of my original movie files was broken, so I re-transferred that file (for example, file no. 1000) from the database and tried again. This time the job continued past “processing 1000 of xxxx”, but then failed while processing other files. I tried many times, and if broken movies really were the cause, my transfer process (I used rsync) would have an error rate of nearly 1%, which is basically impossible.

Output of: uname -a && free -g && lscpu && nvidia-smi:

Linux jade 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:             31          12           2           0          15          17
Swap:            15           0          14
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 165
Model name: Intel(R) Core™ i9-10850K CPU @ 3.60GHz
Stepping: 5
CPU MHz: 4900.122
CPU max MHz: 5300.0000
CPU min MHz: 800.0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
Tue Apr  5 09:46:30 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …    Off  | 00000000:01:00.0  On |                  N/A |
| 30%   57C    P2   123W / 350W |   1216MiB / 24265MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2337      G   /usr/lib64/firefox/firefox          4MiB |
|    0   N/A  N/A      2604      G   /usr/bin/X                        515MiB |
|    0   N/A  N/A      3199      G   /usr/bin/gnome-shell              112MiB |
|    0   N/A  N/A      4231      G   /usr/lib64/firefox/firefox          4MiB |
|    0   N/A  N/A      4326      G   /usr/lib64/firefox/firefox          4MiB |
|    0   N/A  N/A      5281      G   …ra64-1.15rc/bin/python2.7        297MiB |
|    0   N/A  N/A      7765      G   /usr/lib64/firefox/firefox          4MiB |
|    0   N/A  N/A      8904      G   /usr/lib64/firefox/firefox          4MiB |
|    0   N/A  N/A     10358      C   python                            259MiB |
|    0   N/A  N/A     23744      G   /usr/lib64/firefox/firefox          4MiB |
+-----------------------------------------------------------------------------+

The job log:

cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/chenyuzhou/Programs/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/chenyuzhou/Programs/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
  File "/home/chenyuzhou/Programs/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
========= sending heartbeat
========= sending heartbeat
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

Thank you,

Welcome to the forum @czhou.
What version of cryoSPARC are you running? You can find that information on the GUI’s dashboard.
Did patch motion correction work on this particular cryoSPARC instance in the past? Have there been any changes in the software (incl. CUDA toolkit) since?
What are the versions of the CUDA toolkit and the pycuda driver?

eval $(/home/chenyuzhou/Programs/cryosparc/cryosparc_worker/bin/cryosparcw env)
${CRYOSPARC_CUDA_PATH}/bin/nvcc --version
python -c "import pycuda.driver; print(pycuda.driver.get_version())"

Hi, thanks for the reply.

The current cryoSPARC version is 3.3.1, and this never happened before while using the same cryoSPARC version. Nothing has changed.

And this is the output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Aug_15_21:14:11_PDT_2021
Cuda compilation tools, release 11.4, V11.4.120
Build cuda_11.4.r11.4/compiler.30300941_0

(11, 4, 0)

Thanks for your help.

@czhou We currently recommend CUDA toolkit version 11.2 or lower. Please can you test whether the job completes after configuring cryoSPARC with CUDA version 11.2?
You may be able to install the toolkit alone in a custom path, as a non-root user, alongside the existing version 11.4 by adapting these instructions.
Be sure to run:
cryosparcw newcuda <path-to-cuda>
after installation of CUDA toolkit 11.2.
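For example, a non-root toolkit installation could look like the following sketch. The installer filename and target path are illustrative (pick the 11.2.x runfile that matches your system from NVIDIA's download archive):

```shell
# Download the standalone CUDA 11.2.2 runfile installer (example version).
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
# Install only the toolkit (no driver) into a user-writable path:
sh cuda_11.2.2_460.32.03_linux.run --silent --toolkit --toolkitpath=$HOME/cuda-11.2
# Point the cryoSPARC worker at the new toolkit:
/home/chenyuzhou/Programs/cryosparc/cryosparc_worker/bin/cryosparcw newcuda $HOME/cuda-11.2
```

The existing 11.4 installation is left untouched; `cryosparcw newcuda` only changes which toolkit cryoSPARC compiles against.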

Did this resolve? I'm observing a similar phenomenon but running CUDA 11.2, latest cryoSPARC version with the patch.

Please can you paste any error from the job’s Overview tab and the joblog.

After reverting to CUDA 11.2, the same error occurred. However, after a system reboot the error no longer occurs. Thanks for the quick response.