After upgrading to v5, our LSF scripts stopped working on our HPC:
[CPU: 165.3 MB]
License is valid.
[CPU: 165.3 MB]
Launching job on lane Minerva_h100nvl-40Hr target Minerva_h100nvl-40Hr ...
[CPU: 165.3 MB]
Launching job on cluster Minerva_h100nvl-40Hr
[CPU: 165.3 MB]
template args: {
"project_uid": "P28",
"job_uid": "J673",
"job_creator": "bajicg01",
"cryosparc_username": "goran.bajic@mssm.edu",
"project_dir_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045",
"job_dir_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673",
"job_log_path_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log",
"job_type": "class_3D",
"worker_bin_path": "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw",
"num_gpu": 1,
"num_cpu": 4,
"ram_gb": 24,
"run_cmd": "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw run --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth >> /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log 2>&1 ",
"run_args": "--project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth",
"script_path_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh",
"cluster_job_id": null,
"ram_gb_multiplier": "1"
}
[CPU: 165.3 MB]
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
## Available variables:
## /sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw run --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth >> /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 24 - the amount of RAM needed in GB
## /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673 - absolute path to the job directory
## /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045 - absolute path to the project dir
## /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log - absolute path to the log file for the job
## /sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth - arguments to be passed to cryosparcw run
## P28 - uid of the project
## J673 - uid of the job
## bajicg01 - name of the user that created the job (may contain spaces)
## goran.bajic@mssm.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple LSF script:
#BSUB -J cryosparc_P28_J673
#BSUB -n 1
#BSUB -R affinity[core(4)]
#BSUB -q gpu
#BSUB -W 40:00
#BSUB -P acc_glycoprotein
#BSUB -E "mkdir /ssd/glycoprotein_$LSB_JOBID"
#BSUB -Ep "rm -rf /ssd/glycoprotein_$LSB_JOBID"
#BSUB -gpu num=1:aff=no
##BSUB -R rusage[ngpus_excl_p=1]
##BSUB -R rusage[mem=24000]
#BSUB -R rusage[mem=24GB]
#BSUB -R h100nvl
#BSUB -o /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/%J.out
#BSUB -e /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/%J.err
export CRYOSPARC_SSD_PATH=/ssd/glycoprotein_$LSB_JOBID
#ml cuda/11.1
/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw run --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth >> /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log 2>&1
==========================================================================
==========================================================================
[CPU: 165.3 MB]
-------- Submission command:
/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh
[CPU: 165.3 MB]
Cluster script submission for P28-J673 failed: /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub '<' /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh: Command failed (code 255)
Output: Can't load '/hpc/packages/minerva-centos7/CPAN/5.32.1/lib64/perl5/5.32/auto/LSF/Base/Base.so' for module LSF::Base: liblsf.so: cannot open shared object file: No such file or directory at /usr/lib64/perl5/DynaLoader.pm line 193.
at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Compilation failed in require at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
BEGIN failed--compilation aborted at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Empty job. Job not submitted.
Error:
[CPU: 165.3 MB]
Traceback (most recent call last):
File "core/job_scheduling.py", line 47, in core.job_scheduling.schedule_jobs
File "core/job_scheduling.py", line 281, in core.job_scheduling.schedule_job
File "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_master/core/jobs.py", line 567, in launch_job
return launch_job_on_cluster(job, target)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_master/core/jobs.py", line 718, in launch_job_on_cluster
res = processing.check_output(cmd, combine_stderr=True, shell=True, env=cluster.get_cluster_env()).decode()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_master/core/processing.py", line 327, in check_output
raise ExecError("Command failed", cmd=[program, *args], code=code, output=output, error=error)
core.processing.ExecError: /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub '<' /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh: Command failed (code 255)
Output: Can't load '/hpc/packages/minerva-centos7/CPAN/5.32.1/lib64/perl5/5.32/auto/LSF/Base/Base.so' for module LSF::Base: liblsf.so: cannot open shared object file: No such file or directory at /usr/lib64/perl5/DynaLoader.pm line 193.
at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Compilation failed in require at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
BEGIN failed--compilation aborted at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Empty job. Job not submitted.
Basically, our bsub command needs libraries from LD_LIBRARY_PATH, but CryoSPARC appears to unset that environment variable before running the submission command. Setting the variable in config.sh for both the master and the worker did not help. Submitting the generated script directly from the node with bsub works fine.
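For reference, this is roughly what we added to both config.sh files (a sketch; the LSF lib directory is a guess derived from our bsub path, `/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub`, so adjust for your site):

```shell
# Appended to cryosparc_master/config.sh and cryosparc_worker/config.sh.
# The library directory containing liblsf.so is assumed to sit next to
# the bsub binary's install tree; this did NOT fix the submission error.
export LD_LIBRARY_PATH="/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib:${LD_LIBRARY_PATH}"
```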
Any advice?