Jobs crashing on start

Greetings.

We have a shared system (SLURM) and installed CryoSPARC using the cluster_info.json and cluster_script.sh files so that it submits jobs to the SLURM queue. The SLURM queue works fine for non-CryoSPARC jobs, so the queue itself is not the problem.
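For context, the cluster integration is driven by those two files: cluster_info.json tells CryoSPARC how to talk to the scheduler, and cluster_script.sh is the template that gets rendered into each job's queue_sub_script.sh. A minimal cluster_info.json for SLURM looks roughly like the sketch below (the values are illustrative placeholders, not our actual config, and the exact field names should be checked against the CryoSPARC cluster documentation for your version):

{
    "name": "slurm-lane",
    "worker_bin_path": "/path/to/cryosparc_worker/bin/cryosparcw",
    "cache_path": "/path/to/ssd/cache",
    "send_cmd_tpl": "{{ command }}",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl": "sinfo"
}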

When we start a job in the cryosparc GUI, it quickly crashes.


-------- Submission command: sbatch /executor/cryoem/userlab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J15/queue_sub_script.sh
Failed to launch! 1

We tried with 0, 1, and 2 GPUs. We did capture this error message before it disappeared:

ServerError: Traceback (most recent call last):
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2309, in run_job
    res = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True).decode()
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sbatch /executor/cryoem/conwaylab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J16/queue_sub_script.sh' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 1861, in scheduler_run
    scheduler_run_core(do_run)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 2079, in scheduler_run_core
    run_job(job['project_uid'], job['uid'])  # takes care of the cluster case and the node case
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 157, in wrapper
    raise ServerError(s.getvalue(), code=400) from e
flask_jsonrpc.exceptions.ServerError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 150, in wrapper
    res = func(*args, **kwargs)
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 5121, in enqueue_job
    scheduler_run()
  File "/executor/opt/cryoem/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 157, in wrapper
    raise ServerError(s.getvalue(), code=400) from e
flask_jsonrpc.exceptions.ServerError
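For what it is worth, the failing step is just the sbatch call from the submission command above; CryoSPARC only reports that it exited with status 1. If the generated script were still on disk, re-running that call by hand (as the account CryoSPARC runs under) should print SLURM's actual error rather than the bare exit status. The path below is copied from the event log line above:

sbatch /executor/cryoem/userlab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J15/queue_sub_script.sh
echo $?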

Thanks for your assistance with this issue.

Please can you post the content of
/executor/cryoem/userlab/2022-08-18_UA-FapC-QF1_KrF4ecC250np96kxOA100-60eA2/P2/J15/queue_sub_script.sh

If the #SBATCH --output= and #SBATCH --error= options are defined therein, do the respective files contain useful information?
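For reference, in the example SLURM template from the CryoSPARC cluster setup guide those directives look roughly like the lines below; the {{ job_dir_abs }} template variable is assumed from that example, and the file names here are purely illustrative, so a customised cluster_script.sh may differ:

#SBATCH --output={{ job_dir_abs }}/slurm-%j.out
#SBATCH --error={{ job_dir_abs }}/slurm-%j.err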

Greetings.

I do not see a queue_sub_script.sh in this folder. Here are the contents:

[root@executor J15]# ls
events.bson gridfs_data job.json

Any ideas on this? I’d really like to get this system running again.

Thanks!

Is this error occurring on the same cluster as mentioned in another topic?

Hello.

No, this is on a workstation.

Thanks!

We would like to understand why the queue_sub_script.sh file is apparently not created.

To help us find out, please can you try submission of a clone of that job and inspect the command_core log for relevant entries:
cryosparcm log command_core
Please can you also post the output of this command:
cryosparcm cli "get_scheduler_targets()"
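If streaming the log is awkward, the same information should be available in the command_core log file under the master installation's run directory. The path below is an assumption based on the install prefix visible in your traceback, so adjust it to your layout:

grep -iE -B2 -A10 "queue_sub_script|J15" /executor/opt/cryoem/cryosparc/cryosparc_master/run/command_core.log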


Hello.

I resolved this by reinstalling CryoSPARC on the workstation. All appears well now.

Thanks for the advice and support!
