Motion correction terminates unexpectedly with exit code -11

I installed CryoSPARC v4.1.0 on a new Rocky 8 workstation. The T20S extensive workflow ran fine, but on my own data I cannot get past the patch motion correction jobs. They initialize normally and typically process ~150 movies, then the GPUs fall off the job: one process yields the error
"Child process with PID 1XXXX terminated unexpectedly with exit code -11."
before the other encounters the same error and terminates the job several movies later. (Exit code -11 means the child process was killed by signal 11, SIGSEGV, i.e. a segmentation fault.) Updating to v4.1.1 has not resolved the issue. The issue also occurs with Full-frame motion correction, and whether the jobs are run with one or both GPUs.
From event log:
(Some information redacted to avoid identifying the project.)
Initialization

License is valid.

Launching job on lane default target jason-bourne …

Running job on master node hostname jason-bourne

[CPU: 196.6 MB]
Job J46 Started

[CPU: 196.6 MB]
Master running v4.1.1, worker running v4.1.1

[CPU: 196.6 MB]
Working in directory: /mnt/KRABBY-11/20221212_***/CS-20221212-***s/J46

[CPU: 196.6 MB]
Running on lane default

[CPU: 196.6 MB]
Resources allocated:

[CPU: 196.6 MB]
Worker: jason-bourne

[CPU: 196.6 MB]
CPU : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

[CPU: 196.6 MB]
GPU : [0, 1]

[CPU: 196.6 MB]
RAM : [0, 1, 2, 3]

[CPU: 196.6 MB]
SSD : False

[CPU: 196.6 MB]

[CPU: 196.6 MB]
Importing job module for job type patch_motion_correction_multi…

[CPU: 235.0 MB]
Job ready to run

[CPU: 235.0 MB]


[CPU: 237.4 MB]
Job will process this many movies: 3846

[CPU: 237.4 MB]
parent process is 15503

[CPU: 179.4 MB]
Calling CUDA init from 15550

[CPU: 179.4 MB]
Calling CUDA init from 15551

The error:

[CPU: 368.5 MB]

[CPU: 368.5 MB]
Processed 50 of 3846 movies in 339.78s

[CPU: 368.9 MB]
Child process with PID 15550 terminated unexpectedly with exit code -11.

[CPU: 364.4 MB]

[CPU: 364.4 MB]
Compiling job outputs…

[CPU: 364.4 MB]
Passing through outputs for output group micrographs from input group movies

[CPU: 364.4 MB]
This job outputted results ['micrograph_blob_non_dw', 'micrograph_thumbnail_blob_1x', 'micrograph_thumbnail_blob_2x', 'micrograph_blob', 'background_blob', 'rigid_motion', 'spline_motion']

[CPU: 364.4 MB]
Loaded output dset with 51 items

[CPU: 364.4 MB]
Passthrough results ['movie_blob', 'gain_ref_blob', 'mscope_params']

[CPU: 364.4 MB]
Loaded passthrough dset with 3846 items

[CPU: 364.4 MB]
Intersection of output and passthrough has 51 items

[CPU: 364.4 MB]
Passing through outputs for output group micrographs_incomplete from input group movies

[CPU: 364.4 MB]
This job outputted results ['micrograph_blob']

[CPU: 364.4 MB]
Loaded output dset with 3795 items

[CPU: 364.4 MB]
Passthrough results ['movie_blob', 'gain_ref_blob', 'mscope_params']

[CPU: 364.4 MB]
Loaded passthrough dset with 3846 items

[CPU: 364.4 MB]
Intersection of output and passthrough has 3795 items

[CPU: 364.4 MB]
Checking outputs for output group micrographs

[CPU: 364.4 MB]
Checking outputs for output group micrographs_incomplete

[CPU: 364.5 MB]
Updating job size…

[CPU: 364.5 MB]
Exporting job and creating csg files…

[CPU: 364.5 MB]


[CPU: 364.5 MB]
Job complete. Total time 370.13s

Log file output:

The command "uname -a && free -g && lscpu && nvidia-smi" yields the following:

System specs are as follows: CPU: AMD Ryzen 9 5950X; GPUs: 2x RTX 3090 Ti (limited to 320 W each); MOBO: X570 Taichi; RAM: 128 GB DDR4; PSU: 1200 W.

The workstation is new, so I have no comparison with previous cryoSPARC versions.
I am not sure what could cause the GPUs to initialize normally but subsequently fall off. The issue occurs when running both GPUs or just one. There are no obvious issues with system or GPU stability.

Any suggestions would be much appreciated!

A long time ago I had a problem with GPUs "falling off the bus" (as dmesg put it) with kernels 4.15 and 4.18… it might be that that bug has manifested again in CentOS instead of Ubuntu. What does the dmesg output say immediately after cryoSPARC fails?

Seems to happen more with AMD chipsets than Intel ones.
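One quick way to check is to grep the recent kernel messages for segfaults, NVRM complaints, or the "fallen off the bus" PCIe error. A minimal sketch with a hypothetical helper (reading the kernel log typically needs root, or kernel.dmesg_restrict=0):

```python
import subprocess

def recent_kernel_messages(n=40):
    # `dmesg --ctime` prints human-readable timestamps.
    out = subprocess.run(["dmesg", "--ctime"],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()[-n:]

if __name__ == "__main__":
    for line in recent_kernel_messages():
        if any(k in line for k in ("segfault", "NVRM", "fallen off the bus")):
            print(line)
```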

dmesg yields the following immediately after failure:

"[76743.004240] IPv6: ADDRCONF(NETDEV_UP): wlp5s0: link is not ready
[76771.355512] python[39968]: segfault at 854 ip 00007f66deb53b65 sp 00007f66beffb1b0 error 4 in blobio_native.so[7f66deb4a000+2f000]
[76771.355533] Code: 00 00 0f 29 9c 24 90 00 00 00 0f 29 a4 24 a0 00 00 00 0f 29 ac 24 b0 00 00 00 0f 29 b4 24 c0 00 00 00 0f 29 bc 24 d0 00 00 00 <48> 63 bb 54 08 00 00 4c 8d 63 54 48 8d 84 24 10 01 00 00 c7 04 24"

There is one such message for each GPU worker that dies. I believe the IPv6 line is related to the wifi connection.

Hi @awsteven,

Thanks for reporting this issue. What format is your data in? Have you noticed it always failing on the same movie, or is it unpredictable? Somewhat ironically, it's crashing while trying to print an error message related to reading a movie, but unfortunately it doesn't get as far as actually telling us what the underlying error was. This may well be a bug, and I'll investigate it, but in the meantime it's possible that some of the movies are corrupt, in which case a workaround might be to delete them. (If you do find that it always fails on the same movies, please keep those "bad" movies around, perhaps in a separate folder, as they may help to replicate the issue and find a potential bug.)
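If it helps with pre-screening, here is a minimal sketch for checking movies outside cryoSPARC; it assumes MRC-format movies and the third-party mrcfile package, and the glob pattern is a placeholder:

```python
import glob
import io
import mrcfile  # pip install mrcfile

def find_bad_movies(pattern="/path/to/movies/*.mrc"):
    bad = []
    for path in sorted(glob.glob(pattern)):
        log = io.StringIO()
        try:
            # mrcfile.validate() checks the header and data block against
            # the MRC2014 spec and returns False for malformed files.
            ok = mrcfile.validate(path, print_file=log)
        except Exception:
            ok = False
        if not ok:
            bad.append(path)
            print(f"suspect movie: {path}\n{log.getvalue()}")
    return bad

if __name__ == "__main__":
    find_bad_movies()
```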

–Harris

The wlp5s0 output is old enough that it’s unrelated.

Doesn't look like the AMD PCIe bus bug; you'd know it if you saw it.

The blobio error is one I see when I have a corrupt micrograph or two… EER seems particularly vulnerable to this because it contains so many frames.

I think CryoSPARC Live throws errors in a way that makes it much more obvious which micrographs are failing… out of curiosity, I'd try feeding the mics to CS Live and see where it complains.
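For EER data specifically, one rough pre-screen is to count the frames in each movie, since truncated files tend to show up as short frame counts. A sketch assuming EER movies (which are TIFF-style containers) and the third-party tifffile package; the pattern and expected_frames are placeholders:

```python
import glob
import tifffile  # pip install tifffile

def check_eer(pattern="/path/to/movies/*.eer", expected_frames=None):
    for path in sorted(glob.glob(pattern)):
        try:
            with tifffile.TiffFile(path) as tf:
                n = len(tf.pages)  # each EER frame is stored as one page
        except Exception as exc:
            print(f"unreadable: {path} ({exc})")
            continue
        if expected_frames is not None and n != expected_frames:
            print(f"odd frame count: {path} has {n} frames")

if __name__ == "__main__":
    check_eer()
```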

Hi everyone, please be advised that this issue has been resolved in the recently released CryoSPARC v4.1.2.

Take care!