Slurm node cannot find master

Hello,

Currently, our jobs submitted to the Slurm queue populate the PXX_JXXXX_slurm.err file. The header looks like this:

slurmstepd: error: couldn't chdir to `/mnt//cs_cpusrv129_v4.7.0/cryosparc_master': No such file or directory: going to /tmp instead
ERROR: ld.so: object '/mnt/cs_cpusrv129_v4.7.0/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

The second line is then repeated several times.

This error started coming up several weeks ago, likely after an update of the CS installation to 4.7.0 (but I can't say for sure whether the update and the first appearance of the error fell on the same day). A restart of CS did not help.

Jobs do finish despite this error, but it points to a problem with the installation or a (possibly temporary) loss of connection to the master, no? Since I'm also occasionally seeing the no-heartbeat error (even though the limit was raised to 3600s), I'm wondering whether that could be related (but it might not be).
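
For reference, the limit was raised in cryosparc_master/config.sh (assuming the variable below is the one CryoSPARC reads for the heartbeat timeout), followed by a cryosparcm restart:

export CRYOSPARC_HEARTBEAT_SECONDS=3600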

Any ideas?

I forgot to mention: we have been running v4.7.1 for a week now; the upgrade from 4.7.0 did not change the behaviour.

Would you like to test whether these messages can be avoided by including these lines among the #SBATCH options inside the cluster script template?

#SBATCH --chdir={{ job_dir_abs }}
#SBATCH --export=NONE
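
For context, here is a minimal sketch of where these two lines could sit in a cluster_script.sh template. The partition name and exact resource directives are illustrative placeholders; the Jinja variables shown ({{ project_uid }}, {{ job_uid }}, {{ num_cpu }}, {{ num_gpu }}, {{ ram_gb }}, {{ job_dir_abs }}, {{ run_cmd }}) are the usual CryoSPARC cluster template variables, but please check them against your existing template rather than copying this verbatim.

#!/usr/bin/env bash
# Illustrative sketch only; adapt the partition and resources to your site
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=YOUR_PARTITION
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.out
#SBATCH --error={{ job_dir_abs }}/{{ project_uid }}_{{ job_uid }}_slurm.err
# The two suggested additions:
#SBATCH --chdir={{ job_dir_abs }}
#SBATCH --export=NONE

{{ run_cmd }}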

There are different types of heartbeat errors and several potential causes. For example, the cluster workload manager may have terminated CryoSPARC worker-related processes on the compute node. In those cases, the job.log file inside the job directory or the Slurm task .out or .err files may provide additional information. You might want to create a separate forum topic and post the specific heartbeat-related error messages that you observed (and where you observed them).
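
As a sketch of that kind of check (the job ID and the job directory path below are placeholders), one could ask Slurm how the task ended and scan the job log:

sacct -j 1234567 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
grep -iE 'error|kill|cancel|oom' /path/to/project/PXX/JXXXX/job.log

A State of CANCELLED, FAILED or OUT_OF_MEMORY there would point to the workload manager rather than to CryoSPARC itself.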

Thank you very much! The addition of the SBATCH options helped!

Best,
Stefan.