Greetings.
We have a shared system running SLURM, and we installed cryoSPARC with the cluster_info.json and cluster_script.sh files so that it submits jobs to the SLURM queue. The SLURM queue works fine for non-cryoSPARC jobs, so that is not the problem.
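For context, our cluster_script.sh follows the template format from the cryoSPARC cluster-integration docs. A minimal sketch of that kind of script is below; the partition name and resource directives are illustrative placeholders rather than our exact configuration, and the double-brace variables are filled in by cryoSPARC at submission time:

```bash
#!/usr/bin/env bash
# Minimal cryoSPARC cluster_script.sh sketch for SLURM.
# NOTE: partition name and resource values are placeholders, not our
# actual settings; {{ ... }} variables are rendered by cryoSPARC per job.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_log_path_abs }}

# cryoSPARC substitutes the full worker command here.
{{ run_cmd }}
```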
When we start a job in the cryoSPARC GUI, it quickly crashes:
-------- Submission command: sbatch /executor/cryoem/userlab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J15/queue_sub_script.sh
Failed to launch! 1
We tried with 0, 1, and 2 GPUs. We did capture this error message before it disappeared:
ServerError: Traceback (most recent call last):
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2309, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sbatch /executor/cryoem/conwaylab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J16/queue_sub_script.sh' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1861, in scheduler_run
    scheduler_run_core(do_run)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2079, in scheduler_run_core
    run_job(job['project_uid'], job['uid'])  # takes care of the cluster case and the node case
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 157, in wrapper
    raise ServerError(s.getvalue(), code=400) from e
flask_jsonrpc.exceptions.ServerError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 5121, in enqueue_job
    scheduler_run()
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 157, in wrapper
    raise ServerError(s.getvalue(), code=400) from e
flask_jsonrpc.exceptions.ServerError
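If it helps with diagnosis: since sbatch itself returns exit status 1, one thing we can try is re-running the generated submission script by hand to capture the scheduler's own error message. A sketch of that, using the job path from the log above:

```bash
# Re-run the script cryoSPARC generated, outside the GUI, so sbatch's
# error is printed to the terminal (path taken from the log above).
cd /executor/cryoem/userlab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J15
cat queue_sub_script.sh     # inspect the rendered #SBATCH directives
sbatch queue_sub_script.sh  # a non-zero exit should print the rejection reason
```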
Thanks for your assistance with this issue.