Updated (reinstalled) to version 4.0.1. Jobs die immediately after being dispatched to a node on the cluster. CryoSPARC seems very confused. There are 9 jobs in the “launched” state that are not running and 4 in the “queued” state. I cannot kill them (I get an error on bkill), so I cannot clear or delete them. cryosparcm jobstatus claims 2 jobs running and 7 jobs queued.
Also, I get the following messages about every second in command_core.log:
2022-10-11 14:55:43,903 COMMAND.BG_WORKER background_worker ERROR | raise child_exception_type(errno_num, err_msg, err_filename)
2022-10-11 14:55:43,903 COMMAND.BG_WORKER background_worker ERROR | FileNotFoundError: [Errno 2] No such file or directory: 'bstat': 'bstat'
Multiple restarts do not seem to have any effect.
Any hints on what I should try?
Gene
Does your cluster_info.json define "qstat_cmd_tpl" and its “siblings” using full paths, or is the PATH otherwise set for the Linux user under which the CryoSPARC instance runs?
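One rough way to check this is the Python sketch below. It takes the parsed contents of cluster_info.json, looks at the first word of every command template, and reports whether that executable can be found. The function name check_cluster_commands is made up for illustration, and treating every key that ends in "_cmd_tpl" as a command template is an assumption based on the "qstat_cmd_tpl" naming. It would need to be run as the same Linux user that runs CryoSPARC for the PATH lookup to be meaningful.

```python
import shutil


def check_cluster_commands(cluster_info):
    """Return {template_key: resolved_executable_path_or_None}.

    Assumption: every key ending in "_cmd_tpl" holds a command line
    whose first whitespace-separated word is the executable.
    """
    results = {}
    for key, template in cluster_info.items():
        if not key.endswith("_cmd_tpl") or not isinstance(template, str):
            continue  # skip non-command entries such as "name"
        words = template.split()
        # shutil.which handles both bare names (searched on PATH)
        # and full paths (checked for existence and execute permission).
        results[key] = shutil.which(words[0]) if words else None
    return results


# Example usage (hypothetical file location):
#   import json
#   with open("cluster_info.json") as fh:
#       for key, path in check_cluster_commands(json.load(fh)).items():
#           print(key, "->", path or "NOT FOUND")
```

Any entry reported as not found would produce the same kind of FileNotFoundError as the one in the log when the command is launched.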
Full path:
"qstat_cmd_tpl": "/hpc/lsf/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -l {{ cluster_job_id }}"
This worked in v3.
The folks who did the upgrade messed it up a bit by not upgrading both the master and the worker. I have a backup of the DB from the day before the upgrade. I’ll try to restore that and see if the problem goes away. That constant background worker error in command_core.log bothers me, though.
Gene
Restored the last backup before the initial upgrade attempt. Seems to have cleared up the problem with the launched jobs and it seems the folks are back in business.
However, that looping ‘bstat’ error is still there, and when I check “cryosparcm jobstatus” it reports 2 running jobs (they aren’t).
Ran the command you sent and grepped for bstat. Nothing:
(base) [wackerlab-cryoadmin@lc03g01 backups]$ cryosparcm cli "get_scheduler_targets()" | grep -i bstat
(base) [wackerlab-cryoadmin@lc03g01 backups]$
I remember that when we first installed the code, bstat was used for the qstat command until we recognized our error. Could the system still be looking for the status of those 2 nonexistent jobs using the wrong command? How can I clean it up?
Gene