After upgrading to v5, our LSF scripts stopped working on our HPC:
[CPU: 165.3 MB]
License is valid.
[CPU: 165.3 MB]
Launching job on lane Minerva_h100nvl-40Hr target Minerva_h100nvl-40Hr ...
[CPU: 165.3 MB]
Launching job on cluster Minerva_h100nvl-40Hr
[CPU: 165.3 MB]
template args: {
"project_uid": "P28",
"job_uid": "J673",
"job_creator": "bajicg01",
"cryosparc_username": "goran.bajic@mssm.edu",
"project_dir_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045",
"job_dir_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673",
"job_log_path_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log",
"job_type": "class_3D",
"worker_bin_path": "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw",
"num_gpu": 1,
"num_cpu": 4,
"ram_gb": 24,
"run_cmd": "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw run --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth >> /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log 2>&1 ",
"run_args": "--project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth",
"script_path_abs": "/sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh",
"cluster_job_id": null,
"ram_gb_multiplier": "1"
}
[CPU: 165.3 MB]
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
## Available variables:
## /sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw run --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth >> /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 24 - the amount of RAM needed in GB
## /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673 - absolute path to the job directory
## /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045 - absolute path to the project dir
## /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log - absolute path to the log file for the job
## /sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth - arguments to be passed to cryosparcw run
## P28 - uid of the project
## J673 - uid of the job
## bajicg01 - name of the user that created the job (may contain spaces)
## goran.bajic@mssm.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple LSF script:
#BSUB -J cryosparc_P28_J673
#BSUB -n 1
#BSUB -R affinity[core(4)]
#BSUB -q gpu
#BSUB -W 40:00
#BSUB -P acc_glycoprotein
#BSUB -E "mkdir /ssd/glycoprotein_$LSB_JOBID"
#BSUB -Ep "rm -rf /ssd/glycoprotein_$LSB_JOBID"
#BSUB -gpu num=1:aff=no
##BSUB -R rusage[ngpus_excl_p=1]
##BSUB -R rusage[mem=24000]
#BSUB -R rusage[mem=24GB]
#BSUB -R h100nvl
#BSUB -o /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/%J.out
#BSUB -e /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/%J.err
export CRYOSPARC_SSD_PATH=/ssd/glycoprotein_$LSB_JOBID
#ml cuda/11.1
/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_worker/bin/cryosparcw run --project P28 --job J673 --master lg03a12.chimera.hpc.mssm.edu --port 43000 --timeout 20000 --auth >> /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/job.log 2>&1
==========================================================================
==========================================================================
[CPU: 165.3 MB]
-------- Submission command:
/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh
[CPU: 165.3 MB]
Cluster script submission for P28-J673 failed: /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub '<' /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh: Command failed (code 255)
Output: Can't load '/hpc/packages/minerva-centos7/CPAN/5.32.1/lib64/perl5/5.32/auto/LSF/Base/Base.so' for module LSF::Base: liblsf.so: cannot open shared object file: No such file or directory at /usr/lib64/perl5/DynaLoader.pm line 193.
at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Compilation failed in require at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
BEGIN failed--compilation aborted at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Empty job. Job not submitted.
Error:
[CPU: 165.3 MB]
Traceback (most recent call last):
File "core/job_scheduling.py", line 47, in core.job_scheduling.schedule_jobs
File "core/job_scheduling.py", line 281, in core.job_scheduling.schedule_job
File "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_master/core/jobs.py", line 567, in launch_job
return launch_job_on_cluster(job, target)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_master/core/jobs.py", line 718, in launch_job_on_cluster
res = processing.check_output(cmd, combine_stderr=True, shell=True, env=cluster.get_cluster_env()).decode()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/projects/glycoprotein/cryosparc/software/cryosparc_master/core/processing.py", line 327, in check_output
raise ExecError("Command failed", cmd=[program, *args], code=code, output=output, error=error)
core.processing.ExecError: /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub '<' /sc/arion/projects/glycoprotein/sandro/CS-flu-fda-045/J673/queue_sub_script.sh: Command failed (code 255)
Output: Can't load '/hpc/packages/minerva-centos7/CPAN/5.32.1/lib64/perl5/5.32/auto/LSF/Base/Base.so' for module LSF::Base: liblsf.so: cannot open shared object file: No such file or directory at /usr/lib64/perl5/DynaLoader.pm line 193.
at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Compilation failed in require at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
BEGIN failed--compilation aborted at /hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/etc/esub.sinai line 30.
Empty job. Job not submitted.
Basically, our bsub command needs libraries from LD_LIBRARY_PATH, but CryoSPARC appears to unset that environment variable before running the submission command. Setting the variable in config.sh for both the master and the worker did not help. Submitting the generated script directly from the node with bsub works fine.
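For reference, this is roughly what we added to both config.sh files (a sketch; the LSF lib directory is a guess derived from our bsub path, `/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub`, so adjust for your site):

```shell
# Appended to cryosparc_master/config.sh and cryosparc_worker/config.sh.
# The library directory containing liblsf.so is assumed to sit next to
# the bsub binary's install tree; this did NOT fix the submission error.
export LD_LIBRARY_PATH="/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/lib:${LD_LIBRARY_PATH}"
```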
Any advice?