After upgrade to 4.0.1 System in confused state

GeneF · October 11, 2022, 7:00pm

Updated (reinstalled) to version 4.0.1. Jobs die immediately after being dispatched to a node on the cluster. CryoSPARC seems very confused. There are 9 jobs in the “launched” state that are not running, and 4 in the “queued” state. I cannot kill them (bkill returns an error), so I cannot clear or delete them. cryosparcm jobstatus claims 2 jobs running and 7 jobs queued.
Also, I get the following messages about every second in command_core.log:

2022-10-11 14:55:43,903 COMMAND.BG_WORKER background_worker ERROR | raise child_exception_type(errno_num, err_msg, err_filename)
2022-10-11 14:55:43,903 COMMAND.BG_WORKER background_worker ERROR | FileNotFoundError: [Errno 2] No such file or directory: 'bstat': 'bstat'
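
To see which executable the background worker fails to find, and how often, log lines like the ones above can be tallied with a short script. This is only a minimal sketch, assuming the log format shown in this post; the function name `missing_executables` and the regular expression are illustrative and not part of CryoSPARC.

```python
import re
from collections import Counter

# Matches the FileNotFoundError line format shown in this post.
# Both straight and typographic quotes are accepted, because forum
# software often converts one into the other.
MISSING_RE = re.compile(
    r"FileNotFoundError: \[Errno 2\] No such file or directory: "
    r"['\"‘’]([^'\"‘’]+)['\"‘’]"
)


def missing_executables(log_lines):
    """Count how often each missing file or command name is reported."""
    counts = Counter()
    for line in log_lines:
        match = MISSING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, `with open("command_core.log") as fh: print(missing_executables(fh))`, where the file name is a placeholder for wherever the instance keeps that log.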

Multiple restarts do not seem to have any effect.
Any hints on what I should try?
Gene

wtempel · October 11, 2022, 9:26pm

Does your cluster_info.json define "qstat_cmd_tpl": and its “siblings” using full paths, or is the path otherwise set for the Linux user under which the CryoSPARC instance runs?
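
One way to check this is to test whether the first word of each command template resolves to an executable for the user running CryoSPARC. The following is a minimal sketch, assuming the templates start with the scheduler command itself; the function name `unresolved_commands` is made up for illustration, and `send_cmd_tpl` is left out because it usually starts with a template variable rather than a command.

```python
import shutil

# Template keys from cluster_info.json that normally begin with a
# scheduler executable (submit, status, delete, info).
TEMPLATE_KEYS = ("qsub_cmd_tpl", "qstat_cmd_tpl", "qdel_cmd_tpl", "qinfo_cmd_tpl")


def unresolved_commands(cluster_info):
    """Return {template key: executable} for commands that cannot be found.

    shutil.which() searches PATH for bare names and checks absolute
    paths directly, so both styles of configuration are covered.
    """
    missing = {}
    for key in TEMPLATE_KEYS:
        tokens = cluster_info.get(key, "").split()
        # Skip empty templates and ones that start with a {{ variable }}.
        if not tokens or tokens[0].startswith("{{"):
            continue
        if shutil.which(tokens[0]) is None:
            missing[key] = tokens[0]
    return missing
```

Run it as the same Linux user that owns the CryoSPARC instance, for example with `unresolved_commands(json.load(open("cluster_info.json")))` after `import json`; an empty result means every checked command was found.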

GeneF · October 12, 2022, 11:19am

Full path:
"qstat_cmd_tpl": "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}"

This worked in v3.
The folks who did the upgrade messed it up a bit by not upgrading both the master and the worker. I have a backup of the DB from the day before the upgrade. I’ll try to restore that and see if the problem goes away. That constant background worker error in the command_core log bothers me, though.
Gene

wtempel · October 12, 2022, 3:32pm

Is bstat mentioned anywhere in your cluster configuration(s)?
cryosparcm cli "get_scheduler_targets()"

GeneF · October 12, 2022, 4:36pm

Restored the last backup before the initial upgrade attempt. Seems to have cleared up the problem with the launched jobs and it seems the folks are back in business.
However, that looping ‘bstat’ error is still there, and when I check “cryosparcm jobstatus” it reports 2 running jobs (they aren’t).
Ran the command you sent and grepped for bstat. Nothing:
(base) [wackerlab-cryoadmin@lc03g01 backups]$ cryosparcm cli "get_scheduler_targets()" | grep -i bstat
(base) [wackerlab-cryoadmin@lc03g01 backups]$

I remember that when we first installed the code, bstat was used for the qstat command until we recognized our error. Could the system still be looking for the status of those 2 nonexistent jobs using the wrong command? How can I clean it up?
Gene

wtempel · October 19, 2022, 3:19pm

If you have ensured the jobs are in fact no longer running, you may try to kill them through the web UI (guide) or, if this doesn’t work, run:
cryosparcm cli "set_job_status('<project_uid>', '<job_uid>', 'killed')"

Has CryoSPARC been restarted after you corrected the error?