Patch motion correction fails: child process terminated unexpectedly

Hi,

We recently installed a new instance of cryoSPARC v3.0.1. I imported TIFF movies without any issues, but when I run Patch Motion Correction I get the error below. I am using three RTX 2080 Ti GPUs for this job when it happens. If I use only one GPU it runs normally, but it takes a very long time (6 h for 200 movies).

What might the issue be, and how can I correct it?

[CPU: 238.3 MB]  Traceback (most recent call last):

File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 17822 has terminated unexpectedly!

[CPU: 1.37 GB] Traceback (most recent call last):
File "/home/nxt193/software/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1722, in run_with_except_hook
run_old(*args, **kw)
File "/home/nxt193/software/cryoSPARC/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/nxt193/software/cryoSPARC/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 164, in thread_work
work = processor.process(item)
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 337, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
AssertionError: Job is not in running state - worker thread with PID 17823 terminating self.

Hi there,

It sounds like it might be a driver/CUDA version issue, but I can't be sure just from the error message. Could you please post the output of the following command?
uname -a && free -g && lscpu && nvidia-smi
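If it is more convenient to capture everything in one go on the worker node, a small wrapper along these lines writes it all to a single file (just a convenience sketch; running the one-liner above and pasting the output is equally fine):

# gather_sysinfo.py - convenience sketch only; equivalent to running the commands by hand
import subprocess

commands = ["uname -a", "free -g", "lscpu", "nvidia-smi"]

with open("sysinfo.txt", "w") as out:
    for cmd in commands:
        out.write("===== " + cmd + " =====\n")
        # run through the shell so each command behaves exactly as typed on the node
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        out.write(result.stdout)
        if result.stderr:
            out.write(result.stderr)
        out.write("\n")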

Thanks,
Harris

Hi Harris,
Today we got access to the HPC again after maintenance, so my reply comes a bit late. Here is the output:

Linux cgput002 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
              total        used        free      shared  buff/cache   available
Mem:            125           1         117           0           6         123
Swap:             7           0           7
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 3099.792
CPU max MHz: 3100.0000
CPU min MHz: 1200.0000
BogoMIPS: 4399.96
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9
NUMA node1 CPU(s): 10-19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp
Fri Jan 15 16:58:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:02:00.0 Off |                  N/A |
| 27%   21C    P8     7W / 250W |      0MiB / 11019MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208…    On   | 00000000:03:00.0 Off |                  N/A |
| 27%   21C    P8    21W / 250W |      0MiB / 11019MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208…    On   | 00000000:81:00.0 Off |                  N/A |
| 27%   22C    P8    22W / 250W |      0MiB / 11019MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208…    On   | 00000000:82:00.0 Off |                  N/A |
| 27%   23C    P8    18W / 250W |      0MiB / 11019MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Also, in the same situation, I am getting these errors:

License is valid.

Launching job on lane cgput002 target cgput002 …

Running job on master node hostname cgput002

[CPU: 82.8 MB] Project P1 Job J27 Started

[CPU: 82.8 MB] Master running v3.0.1, worker running v3.0.1

[CPU: 83.0 MB] Running on lane cgput002

[CPU: 83.0 MB] Resources allocated:

[CPU: 83.0 MB] Worker: cgput002

[CPU: 83.0 MB] CPU : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

[CPU: 83.0 MB] GPU : [0, 1, 2]

[CPU: 83.0 MB] RAM : [0, 1, 2, 3, 4, 5]

[CPU: 83.0 MB] SSD : False

[CPU: 83.0 MB] --------------------------------------------------------------

[CPU: 83.0 MB] Importing job module for job type patch_motion_correction_multi…

[CPU: 223.1 MB] Job ready to run

[CPU: 223.1 MB] ***************************************************************

[CPU: 225.2 MB] Job will process this many movies: 4067

[CPU: 225.3 MB] parent process is 25636

[CPU: 184.6 MB] Calling CUDA init from 25667

[CPU: 184.6 MB] Calling CUDA init from 25668

[CPU: 184.6 MB] Calling CUDA init from 25669

[CPU: 238.3 MB] Outputting partial results now…

[CPU: 237.8 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 402, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi
AssertionError: Child process with PID 25667 has terminated unexpectedly!

Hi @N.Bogdanovic,

Could you please send the job log for this job?
e.g. cryosparcm joblog P1 J1
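(If it is easier to attach the log, redirecting that command's output to a file, e.g. cryosparcm joblog P1 J27 > J27_joblog.txt, should also work.)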

Hi Stephan,

We have the same issue with v3.0.1. Patch Motion Correction, CTF Estimation and Extract From Micrographs jobs are failing with similar errors. The behavior is sporadic: the same job may fail after reading 100 movies, or it may process a few thousand movies and then die. The error always seems to be the same:

Traceback (most recent call last):
File "/programs/linux64/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1722, in run_with_except_hook
run_old(*args, **kw)
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/programs/linux64/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 64, in stage_target
work = processor.process(item)
File "/programs/linux64/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 266, in process
update_alignments3D=update_alignments3D)
File "/programs/linux64/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 149, in do_extract_particles_single_mic_gpu
stream=stream)
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/fft.py", line 127, in __init__
onembed, ostride, odist, self.fft_type, self.batch)
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
cufftCheckStatus(status)
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
raise e
skcuda.cufft.cufftAllocFailed

The job.log file has the following error:

**custom thread exception hook caught something
**** handle exception rc
set status to failed
Traceback (most recent call last):
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/programs/linux64/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
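
Since cufftAllocFailed points to a failed GPU memory allocation when the FFT plan is created, I can also log GPU memory while one of these jobs runs to see whether it is being exhausted. A rough monitoring sketch (it assumes the pynvml package is installed, which is not part of the cryoSPARC worker environment):

# watch_gpu_mem.py - rough sketch; assumes the separate pynvml package is installed
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        readings = []
        for i, h in enumerate(handles):
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            readings.append("GPU%d: %d / %d MiB" % (i, mem.used // 2**20, mem.total // 2**20))
        # one line per sample so it is easy to correlate with the job log
        print(time.strftime("%H:%M:%S"), "  ".join(readings), flush=True)
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()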

I would appreciate your help.

Thank you,
Sergei