Illegal job id "submitted." when trying to get status

A little while back I "force killed" a boatload of jobs that were basically zombies. I just noticed that I am now getting repeated "Illegal job id" errors where the job id is "submitted." Our instance is currently completely idle (except for the admin), i.e., no users have jobs pending or are working on anything. I'm not sure how to clear this error up. The traceback is below. Thanks for any help.
Gene
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | submitted.: Illegal job ID.
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | Traceback (most recent call last):
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | File "/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2697, in update_cluster_job_status
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | cluster_job_status = cluster.get_cluster_job_status(target, cluster_job_id)
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | File "/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_master/cryosparc_compute/cluster.py", line 168, in get_cluster_job_status
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | File "/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | **kwargs).stdout
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | File "/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | output=stdout, stderr=stderr)
2022-11-07 10:54:48,961 COMMAND.SCHEDULER update_cluster_job_status ERROR | subprocess.CalledProcessError: Command '['/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs', '-l', 'submitted.']' returned non-zero exit status 255.

Please can you provide additional details:

  • How did you “force kill” those jobs?
  • What is the output of the following command?
    cryosparcm cli "get_scheduler_targets()"

The kill button in the GUI didn't work, so at your suggestion I ran about 30 commands of the form:
cryosparcm cli "set_job_status('<project_uid>', '<job_uid>', 'killed')"

Results of the get_scheduler_targets() command:

(base) [wackerlab-cryoadmin@lc03g01 ~]$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/ssd/wcryosparc_74004454', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'Minerva_a100-2000GB_10hrs', 'lane': 'Minerva_a100-2000GB_10hrs', 'name': 'Minerva_a100-2000GB_10hrs', 'qdel_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bkill {{ cluster_job_id }}', 'qinfo_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bqueues', 'qstat_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}', 'qsub_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub <  {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple LSF script:\n\n#BSUB -J cryosparc_{{ project_uid }}_{{ job_uid }}\n#BSUB -n 1\n#BSUB -R affinity[core({{ num_cpu }})]\n#BSUB -q gpu\n#BSUB -W 10:00\n#BSUB -P acc_WL\n#BSUB -E "mkdir /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -Ep "rm -rf /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -R rusage[ngpus_excl_p={{ num_gpu }}]\n#BSUB -R rusage[mem={{ (ram_gb*2000)|int }}]\n#BSUB -R a100  \n#BSUB -o {{ job_dir_abs }}/%J.out\n#BSUB -e {{ job_dir_abs }}/%J.err\n\n#available_devs=""\n#for devidx in $(seq 1 16);\n#do\n#    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n#        i [[ -z "$available_devs" ]] ; then\n#            available_devs=$devidx\n#        else\n#            available_devs=$available_devs,$devidx\n#        fi\n#    fi\n#done\n#export CUDA_VISIBLE_DEVICES=$available_devs\nexport CRYOSPARC_SSD_PATH=/ssd/wcryosparc_$LSB_JOBID\n\nml cuda/11.1\n\n{{ run_cmd }}\n\n\n\n\n\n\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'Minerva_a100 10Hours Wall Clock', 'type': 'cluster', 'worker_bin_path': '/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'Minerva_a100-1000_20hrs', 'lane': 'Minerva_a100-1000_20hrs', 'name': 'Minerva_a100-1000_20hrs', 'qdel_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bkill {{ cluster_job_id }}', 'qinfo_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bqueues', 'qstat_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}', 'qsub_cmd_tpl': 
'/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub <  {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. \n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple LSF script:\n\n#BSUB -J cryosparc_{{ project_uid }}_{{ job_uid }}\n#BSUB -n 1\n#BSUB -R affinity[core({{ num_cpu }})]\n#BSUB -q gpu\n#BSUB -W 20:00\n#BSUB -P acc_WL\n#BSUB -E "mkdir /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -Ep "rm -rf /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -R rusage[ngpus_excl_p={{ num_gpu }}]\n#BSUB -R rusage[mem={{ (ram_gb*1000)|int }}]\n#BSUB -R a100  \n#BSUB -o {{ job_dir_abs }}/%J.out\n#BSUB -e {{ job_dir_abs }}/%J.err\n\n#available_devs=""\n#for devidx in $(seq 1 16);\n#do\n#    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n#        i [[ -z "$available_devs" ]] ; then\n#            available_devs=$devidx\n#        else\n#            available_devs=$available_devs,$devidx\n#        fi\n#    fi\n#done\n#export CUDA_VISIBLE_DEVICES=$available_devs\nexport CRYOSPARC_SSD_PATH=/ssd/wcryosparc_$LSB_JOBID\n\nml cuda/11.1\n\n{{ run_cmd }}\n\n\n\n\n\n\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'Minerva_a100 20Hours Wall Clock', 'type': 'cluster', 'worker_bin_path': '/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'Minerva_a100g-1000_20hrs', 'lane': 'Minerva_a100g-1000_20hrs', 'name': 'Minerva_a100g-1000_20hrs', 'qdel_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bkill {{ cluster_job_id }}', 'qinfo_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bqueues', 'qstat_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}', 'qsub_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub <  {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple LSF script:\n\n#BSUB -J cryosparc_{{ project_uid }}_{{ job_uid }}\n#BSUB -n 1\n#BSUB -R affinity[core({{ num_cpu }})]\n#BSUB -q gpu\n#BSUB -W 20:00\n#BSUB -P acc_WL\n#BSUB -E "mkdir /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -Ep "rm -rf /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -R rusage[ngpus_excl_p={{ num_gpu }}]\n#BSUB -R rusage[mem={{ (ram_gb*1000)|int }}]\n#BSUB -R a10080g  \n#BSUB -o {{ job_dir_abs }}/%J.out\n#BSUB -e {{ job_dir_abs }}/%J.err\n\n#available_devs=""\n#for devidx in $(seq 1 16);\n#do\n#    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n#        i [[ -z "$available_devs" ]] ; then\n#            available_devs=$devidx\n#        else\n#            available_devs=$available_devs,$devidx\n#        fi\n#    fi\n#done\n#export CUDA_VISIBLE_DEVICES=$available_devs\nexport CRYOSPARC_SSD_PATH=/ssd/wcryosparc_$LSB_JOBID\n\nml cuda/11.1\n\n{{ run_cmd }}\n\n\n\n\n\n\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'Minerva_a100g 20Hours Wall Clock', 'type': 'cluster', 'worker_bin_path': '/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'Minerva_a100g-1000_1hr', 'lane': 'Minerva_a100g-1000_1hr', 'name': 'Minerva_a100g-1000_1hr', 'qdel_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bkill {{ cluster_job_id }}', 'qinfo_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bqueues', 'qstat_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}', 'qsub_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub <  {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple LSF script:\n\n#BSUB -J cryosparc_{{ project_uid }}_{{ job_uid }}\n#BSUB -n 1\n#BSUB -R affinity[core({{ num_cpu }})]\n#BSUB -q gpu\n#BSUB -W 1:00\n#BSUB -P acc_WL\n#BSUB -E "mkdir /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -Ep "rm -rf /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -R rusage[ngpus_excl_p={{ num_gpu }}]\n#BSUB -R rusage[mem={{ (ram_gb*1000)|int }}]\n#BSUB -R a10080g  \n#BSUB -o {{ job_dir_abs }}/%J.out\n#BSUB -e {{ job_dir_abs }}/%J.err\n\n#available_devs=""\n#for devidx in $(seq 1 16);\n#do\n#    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n#        i [[ -z "$available_devs" ]] ; then\n#            available_devs=$devidx\n#        else\n#            available_devs=$available_devs,$devidx\n#        fi\n#    fi\n#done\n#export CUDA_VISIBLE_DEVICES=$available_devs\nexport CRYOSPARC_SSD_PATH=/ssd/wcryosparc_$LSB_JOBID\n\nml cuda/11.1\n\n{{ run_cmd }}\n\n\n\n\n\n\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'Minerva_a100g 1hour Wall Clock', 'type': 'cluster', 'worker_bin_path': '/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}, {'cache_path': '', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'hostname': 'Minerva_a1008g-1000_1hr-no-memory', 'lane': 'Minerva_a1008g-1000_1hr-no-memory', 'name': 'Minerva_a1008g-1000_1hr-no-memory', 'qdel_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bkill {{ cluster_job_id }}', 'qinfo_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bqueues', 'qstat_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}', 'qsub_cmd_tpl': '/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub <  {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n## Available variables:\n## {{ run_cmd }}            - the complete command string to run the job\n## {{ num_cpu }}            - the number of CPUs needed\n## {{ num_gpu }}            - the number of GPUs needed. 
\n##                            Note: the code will use this many GPUs starting from dev id 0\n##                                  the cluster scheduler or this script have the responsibility\n##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n##                                  using the correct cluster-allocated GPUs.\n## {{ ram_gb }}             - the amount of RAM needed in GB\n## {{ job_dir_abs }}        - absolute path to the job directory\n## {{ project_dir_abs }}    - absolute path to the project dir\n## {{ job_log_path_abs }}   - absolute path to the log file for the job\n## {{ worker_bin_path }}    - absolute path to the cryosparc worker command\n## {{ run_args }}           - arguments to be passed to cryosparcw run\n## {{ project_uid }}        - uid of the project\n## {{ job_uid }}            - uid of the job\n## {{ job_creator }}        - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple LSF script:\n\n#BSUB -J cryosparc_{{ project_uid }}_{{ job_uid }}\n#BSUB -n 1\n#BSUB -R affinity[core({{ num_cpu }})]\n#BSUB -q gpu\n#BSUB -W 1:00\n#BSUB -P acc_WL\n#BSUB -E "mkdir /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -Ep "rm -rf /ssd/wcryosparc_$LSB_JOBID"\n#BSUB -R rusage [ngpus_excl_p=1]\n##BSUB -R rusage[ngpus_excl_p={{ num_gpu }}]\n##BSUB -R rusage[mem={{ (ram_gb*1000)|int }}]\n#BSUB -R a10080g  \n#BSUB -o {{ job_dir_abs }}/%J.out\n#BSUB -e {{ job_dir_abs }}/%J.err\n\n#available_devs=""\n#for devidx in $(seq 1 16);\n#do\n#    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n#        i [[ -z "$available_devs" ]] ; then\n#            available_devs=$devidx\n#        else\n#            available_devs=$available_devs,$devidx\n#        fi\n#    fi\n#done\n#export CUDA_VISIBLE_DEVICES=$available_devs\nexport CRYOSPARC_SSD_PATH=/ssd/wcryosparc_$LSB_JOBID\n\nml cuda/11.1\n\n{{ run_cmd }}\n\n\n\n\n\n\n\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'Minerva_a100g 1hour-noMem Wall Clock', 'type': 'cluster', 'worker_bin_path': '/sc/arion/projects/WL/cryosparc/software/cryosparc/cryosparc_worker/bin/cryosparcw'}]
(base) [wackerlab-cryoadmin@lc03g01 ~]$

A valid cluster job id is required for CryoSPARC to interact with the cluster job after submission (to obtain its status, to terminate it, etc.).
CryoSPARC infers the cluster job id from the output of the (interpreted) "qsub_cmd_tpl" command.
In the case of bsub, this inference apparently fails.
A fix may be possible if you pipe the bsub output into some_filter:

"qsub_cmd_tpl": "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub <  {{ script_path_abs }} | some_filter"

such that some_filter outputs a valid job id for your cluster. CryoSPARC should then be able to capture that job id and interact with the job after submission.
The implementation of some_filter depends on the format of the "unfiltered" bsub output on your cluster.
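For example (just a sketch, not something CryoSPARC ships: it assumes bsub on your cluster prints the standard LSF confirmation line of the form "Job <12345> is submitted to queue <gpu>." and that only the numeric id is needed), a pipeline that keeps the first number in that line could serve as some_filter:

"qsub_cmd_tpl": "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bsub < {{ script_path_abs }} | grep -oE '[0-9]+' | head -n 1"

Here grep -oE '[0-9]+' prints each run of digits on its own line and head -n 1 keeps only the first one, i.e. the job id inside "Job <…>". Please verify this against the actual bsub output on your cluster before relying on it.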
Do the COMMAND.SCHEDULER errors continue to be written to command_core.log long after you ran set_job_status()?
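If helpful, you can view those log entries with:
cryosparcm log command_core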

I'll put a filter in, but right now there are no jobs running or queued in the system and the status queries for job id "submitted." are still continuing; the COMMAND.SCHEDULER errors keep appearing. I did a list_projects and see there are still quite a few jobs "building", but nothing "queued" or "running". Any ideas on how to stop it?
One other thing: one of the students left, and a few of the folders he used to run CryoSPARC were deleted. Could that be causing the problem?

Hi @GeneF,

Deleting CryoSPARC files can definitely cause issues with the system and lead to errors that need to be fixed manually.

Could you please try the following:

  1. Manually set each project whose project folder was deleted to deleted:
cryosparcm icli
db.projects.update_one({'uid': 'P1'}, {'$set': {'deleted': True}})
  2. Manually set each job whose cluster job id was recorded as "submitted." to killed status:
cryosparcm icli
# do this for each problematic job
db.jobs.update_one({'project_uid': 'P1', 'uid': 'J1'}, {'$set': {'status': 'killed'}})
# or update all of them at once
db.jobs.update_many({'cluster_job_id': 'submitted.'}, {'$set': {'status': 'killed'}})
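
If useful, you could first list the affected jobs from the same icli session before changing anything (a sketch; db here is the pymongo database handle that cryosparcm icli already provides):

# show project/job uids and current status of jobs whose cluster job id was recorded as "submitted."
for job in db.jobs.find({'cluster_job_id': 'submitted.'}, {'project_uid': 1, 'uid': 1, 'status': 1}):
    print(job.get('project_uid'), job.get('uid'), job.get('status'))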

For future use, projects in CryoSPARC should be deleted in the webapp or with cryosparcm cli "delete_project()" before the project folder is deleted.
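For example (assuming delete_project() takes the project uid as its argument; please confirm against the CryoSPARC cli reference before running it):
cryosparcm cli "delete_project('P1')"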

Thanks to all. I think I finally got the database under control.