Job failed with "core dumped" error

73km · September 28, 2022, 2:48pm

Hi,
My jobs are failing with the error below. The jobs are submitted through the Slurm scheduler, and this is a fresh CryoSPARC installation. Can anyone advise?

tail job.log
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "cryosparc_worker/cryosparc_compute/run.py", line 173, in cryosparc_compute.run.run
File "/raid-18/LS/Programs/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1961, in get_gpu_info
import pycuda.driver as cudrv
File "/raid-18/LS/Programs/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/driver.py", line 62, in <module>
from pycuda._driver import * # noqa
ImportError: /raid-18/LS/Programs/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/_driver.cpython-37m-x86_64-linux-gnu.so: undefined symbol: cuDevicePrimaryCtxRelease_v2
libgcc_s.so.1 must be installed for pthread_cancel to work
/raid-18/LS/Programs/cryosparc/cryosparc_worker/bin/cryosparcw: line 120: 9720 Aborted (core dumped) python -c "import cryosparc_compute.run as run; run.run()" "$@"

Here is the submission command from the web interface.

==========================================================================

-------- Submission command:
sbatch /home/Data/EM/practice/T20S/P2/J29/queue_sub_script.sh

-------- Cluster Job ID:
3512959

-------- Queued on cluster at 2022-09-26 12:02:33.689251

-------- Job status at 2022-09-26 12:02:33.731622
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3512959 gpu cryospar xxd PD 0:00 1 (None)

[CPU: 69.5 MB] Project P2 Job J29 Started

[CPU: 69.5 MB] Master running v3.3.2, worker running v3.3.2

[CPU: 69.6 MB] Working in directory: /home/Data/EM/practice/T20S/P2/J29

[CPU: 69.6 MB] Running on lane Mortimer-GPU

[CPU: 69.6 MB] Resources allocated:

[CPU: 69.6 MB] Worker: Mortimer-GPU

[CPU: 69.6 MB] CPU : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

[CPU: 69.6 MB] GPU : [0, 1]

[CPU: 69.6 MB] RAM : [0, 1, 2, 3]

[CPU: 69.6 MB] SSD : False

[CPU: 69.6 MB] --------------------------------------------------------------
[CPU: 69.6 MB] Importing job module for job type patch_motion_correction_multi…

wtempel · September 28, 2022, 10:00pm

If this CryoSPARC task has previously run successfully on the cluster (using the same CryoSPARC installation), have any external dependencies, such as OS version, CUDA toolkit, changed?
Otherwise, can you confirm that:

No errors occurred during software installation or updates.
The compute node environment matches the environment in which CryoSPARC and its dependencies were installed? For example, were CryoSPARC and its dependencies actually installed in an environment comparable to the compute node’s, as opposed to being installed in some arbitrary (more modern?) environment and then shared with the compute node(s) via NFS?
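
One quick way to test for such a mismatch is to ask the NVIDIA driver library on the compute node whether it exports the symbol named in the ImportError. The following is a minimal sketch, not part of CryoSPARC. It assumes the driver library can be loaded as libcuda.so.1, and the helper names has_symbol and check_driver are made up for illustration. It should be run on the compute node itself (for example inside a Slurm job), not on the login node.

```python
import ctypes


def has_symbol(lib, name):
    """Return True if the loaded library object exposes `name`."""
    try:
        getattr(lib, name)  # ctypes raises AttributeError for a missing symbol
        return True
    except AttributeError:
        return False


def check_driver(symbol="cuDevicePrimaryCtxRelease_v2", libname="libcuda.so.1"):
    """Report whether the driver library exports `symbol`.

    The default libname is an assumption; adjust it for your system.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError as exc:
        return f"could not load {libname}: {exc}"
    if has_symbol(lib, symbol):
        return f"{libname} exports {symbol}"
    return f"{libname} does NOT export {symbol}"


if __name__ == "__main__":
    print(check_driver())
```

If the symbol is reported as missing, the driver on that node is probably older than the CUDA toolkit that pycuda was compiled against, which would be consistent with the undefined symbol error in job.log.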

73km · September 28, 2022, 10:16pm

Yes, the program worked a few months ago. Then our cluster was upgraded, and this error appeared. I even tried reinstalling everything from scratch, but I am still getting the error.

wtempel · September 29, 2022, 12:58am

I suggest, if that hasn’t been done already, installing, under the Linux account “owning” the CryoSPARC instance, on a (prototypical) compute node:

the CUDA toolkit from a “runfile” with the --toolkit, --toolkitpath= and --defaultroot= options
the cryosparc_worker package

and carefully watching for any installation errors.
If there are no errors, I suppose these installations may be shared between “similar” compute nodes with matching shared library and nvidia driver versions.
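
To judge whether two compute nodes are "similar" in this sense, one could collect a small fingerprint on each node and compare the results. This is only a rough sketch under stated assumptions: the function names node_fingerprint and diff_fingerprints are hypothetical, the driver version is read from /proc/driver/nvidia/version (which is normally present only while the NVIDIA kernel module is loaded), and the three fields are an illustrative subset rather than a complete compatibility check.

```python
import ctypes.util
import json
import platform
from pathlib import Path


def _first_line(path):
    """Return the first line of a text file, or None if it is missing or empty."""
    p = Path(path)
    if not p.exists():
        return None
    lines = p.read_text().splitlines()
    return lines[0] if lines else None


def node_fingerprint():
    """Collect a few facts that should match between nodes sharing an install."""
    return {
        "libc": " ".join(platform.libc_ver()),
        # libgcc_s is the library named in the pthread_cancel message in job.log
        "libgcc_s": ctypes.util.find_library("gcc_s"),
        "nvidia_driver": _first_line("/proc/driver/nvidia/version"),
    }


def diff_fingerprints(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    print(json.dumps(node_fingerprint(), indent=2))
```

Running the script on the node where the installation was performed and on a node where jobs fail, then passing the two dictionaries to diff_fingerprints, would show which of these values differ.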

73km · October 5, 2022, 3:28pm

Hi,
Thank you for the suggestion. The problem occurred when they replaced the GPU with one of an older architecture.