Hello,
currently, the jobs we submit to the Slurm queue write errors to the PXX_JXXXX_slurm.err file. The file begins like this:
slurmstepd: error: couldn’t chdir to `/mnt//cs_cpusrv129_v4.7.0/cryosparc_master’: No such file or directory: going to /tmp instead
ERROR: ld.so: object ‘/mnt/cs_cpusrv129_v4.7.0/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so’ from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
The second line is then repeated several times.
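In case it helps with diagnosis, here is a minimal sketch (my own helper, not a CryoSPARC or Slurm tool) that pulls the quoted paths out of the .err file and checks whether they exist on whichever machine it is run on. The function names `extract_paths` and `report` are hypothetical, and the quote handling is an assumption based on the two messages above.

```python
import os
import re
import sys

# Absolute paths wrapped in backticks or straight/typographic quotes,
# as they appear in the slurmstepd and ld.so messages.
PATH_RE = re.compile(r"[`'‘’\"](/[^`'‘’\"\s]+)[`'‘’\"]")


def extract_paths(log_text):
    """Return the unique quoted absolute paths in the log, in order of appearance."""
    seen = []
    for match in PATH_RE.finditer(log_text):
        # normpath collapses the doubled slash seen in the chdir message.
        path = os.path.normpath(match.group(1))
        if path not in seen:
            seen.append(path)
    return seen


def report(log_text):
    """Map each quoted path to whether it exists on the current machine."""
    return {path: os.path.exists(path) for path in extract_paths(log_text)}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_paths.py PXX_JXXXX_slurm.err
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for path, exists in report(handle.read()).items():
            print(("found  " if exists else "MISSING"), path)
```

Running this once on the master and once on a compute node would only show whether the master directory is visible from each machine; it does not say why.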
This error first appeared several weeks ago, likely after I updated the CS installation to 4.7.0 (though I can’t say for sure whether the update happened on the same day the error first appeared). Restarting CS did not help.
Jobs do finish despite this error, but doesn’t it point to a problem with the installation or a (possibly temporary) loss of connection to the master? Since I also sometimes see the no-heartbeat error (even though the limit was raised to 3600 s), I’m wondering whether the two could be related (though they might not be).
Any ideas?