Cluster script submission for {{ job }} failed with exit code 127

Hello cryosparc support team,

What does CryoSPARC do after logging into the cluster?

  • Does it keep an ssh shell connected to the server?
    • For example, when a user executes “ssh user@server.com”, the user logs into the server, and all commands executed in that shell then run on the server (see the sketch below).
    • On my installation, it seems that the command is not being sent to the cluster; the sbatch command appears to be executed locally on the master node.
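To illustrate what I mean, a minimal sketch (user@server.com and the script path are just placeholders):

# Interactive login: commands typed afterwards run on the server.
ssh user@server.com

# One-shot remote execution: the quoted command runs on the server,
# then the connection closes.
ssh user@server.com "/usr/bin/sbatch /path/to/script.sh"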

How does CryoSPARC send qsub_cmd_tpl to the cluster?

  • Again, on my installation, the command does not seem to be sent to the cluster; the sbatch command appears to be executed locally on the master node.
    • Am I missing something here? I really need to get this working. Please help.

Error message:

Cluster script submission for {{ job }} failed with exit code 127.
/bin/sh: 1: /usr/bin/sbatch: not found

Below are my cluster_script and cluster_info files.

cluster_script
#!/bin/bash

#SBATCH --job-name cryo_{{ project_uid }}_{{ job_uid }}
#SBATCH -n {{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}MB
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}

{{ run_cmd }}

cluster_info

{
    "name" : "some name",
    "worker_bin_path" : "/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "/cryosparc",
    "cache_reserve_mb" : 1000,
    "cache_quota_mb": 100000,
    "send_cmd_tpl" : "ssh user@server.com",
    "qsub_cmd_tpl" : "/usr/bin/sbatch -p gpu {{ script_path_abs }}",
    "send_cmd_tpl" : "{{ command }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} user@server.com:{{ dest_path }}"
}

Thank you for answering.

Edited:
Added more information.

  1. Our cluster doesn’t support sending a command and logging into the cluster at the same time.
    For example, “send_cmd_tpl” : “ssh loginnode {{ command }}” won’t work.
    So I split the command as I posted above.

Because the correct configuration for CryoSPARC cluster integration depends on the circumstances, here are a few assumptions that I derived from the details you provided. Please let us know if any of these assumptions are incorrect:

  1. The computer where CryoSPARC master processes (UI, database, etc.) are running is not part of the slurm cluster.
  2. user
    • is the Linux account under which CryoSPARC master processes are running
    • is a shared identity between cluster nodes and the CryoSPARC master computer
  3. server.com is part of the slurm cluster
  4. user can connect from the CryoSPARC master host to server.com using ssh user@server.com without typing a password and without any additional confirmation prompts

Under these assumptions, you should specify:
"send_cmd_tpl" : "ssh user@server.com {{ command }}"
It may be better to think of this definition in terms of "execute command on server.com" rather than in terms of “logging in”.
I expect that “splitting” the "send_cmd_tpl": definition would not have the intended effect, but would instead create conflicting definitions of "send_cmd_tpl":.
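For illustration only, your posted cluster_info might then look like this, with the two "send_cmd_tpl": entries merged into one (all other values are copied unchanged from your post; adjust them for your site):

{
    "name" : "some name",
    "worker_bin_path" : "/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "/cryosparc",
    "cache_reserve_mb" : 1000,
    "cache_quota_mb": 100000,
    "send_cmd_tpl" : "ssh user@server.com {{ command }}",
    "qsub_cmd_tpl" : "/usr/bin/sbatch -p gpu {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo",
    "transfer_cmd_tpl" : "scp {{ src_path }} user@server.com:{{ dest_path }}"
}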

Thank you for your reply. Yes, your assumptions are correct.

More details below:

  1. It is correct. The CryoSPARC master is not on the cluster.
    Our setup is:
  • We created a user account that has the same user id and group id on the slurm cluster and the CryoSPARC master instance.

  • We confirmed the account can (1) log into the cluster, (2) submit jobs via sbatch, and (3) access “cache_path” : “/cryosparc”.

  • “cache_path” : “/cryosparc” is a directory on the slurm cluster. We mounted it on the CryoSPARC master instance via NFS.

  2. user is the account on the CryoSPARC master instance that runs cryosparcm start, and user can submit jobs to the cluster via sbatch.
  3. server.com is part of the slurm cluster. It is the login node. After logging in, a user can request an interactive shell via srun or submit a job to the cluster via sbatch.

  4. Yes, user has password-less ssh access to the slurm cluster (a quick check from the master is sketched below).
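For example, a check along these lines from the CryoSPARC master instance confirms that the non-interactive path works (hostname and sbatch path as in the templates above):

# run a remote command non-interactively on the login node
ssh user@server.com hostname

# confirm sbatch is reachable over the same non-interactive path
ssh user@server.com /usr/bin/sbatch --version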

I have a question about script_path_abs: is there a way to change this path?

Another concern that I have is about “transfer_cmd_tpl” : “scp {{ src_path }} user@server.com:{{ dest_path }}”.

I believe this will be executed on the master node instance.

Is there a way to change src_path and dest_path, and could you please provide some examples?

For example, given /cryosparc/CS-t1/J66/queue_sub_script.sh, I would like to use just the file name, queue_sub_script.sh, and let the admin type the path in cluster_info.

Thank you.

Updated:
Issue has been resolved. Thanks.

Previous post below:

After updating to your recommended cluster_info.json, I received a new error:

Cluster script submission for {{ job }} failed with exit code 1
sbatch: error: Unable to open file /cryosparc/CS-t1/J66/queue_sub_script.sh

Resolved with:
My workaround was to create the same directory structure on the master node instance and mount the shared filesystem there, so the paths are exactly the same on the cluster and on the master node (see the sketch below).
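For illustration, assuming the cluster exports the project filesystem over NFS from a host I will call nfs.server.com (a hypothetical name), the mount on the master instance looks roughly like this:

# on the CryoSPARC master instance: mount the cluster's export at the
# same absolute path it has on the cluster nodes
sudo mkdir -p /cryosparc
sudo mount -t nfs nfs.server.com:/cryosparc /cryosparc

# the submitted script is now visible under an identical path on both sides
ls /cryosparc/CS-t1/J66/queue_sub_script.sh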

Thank you.

"transfer_cmd_tpl": is optional and is not used in the management of cluster jobs. I recommend not using this variable in your cluster files.
You mention the /cryosparc path both in the context of "cache_path": and as a component of a project/job directory, as in /cryosparc/CS-t1/J66/queue_sub_script.sh. This dual-purpose use is suboptimal.
"cache_path": typically points to a directory on a fast, local disk on the worker node. While this path needs to be the same on all worker nodes, the underlying filesystem does not need to be shared with any other nodes. The purpose of "cache_path": is particle caching, not the persistent storage of project data. If your cluster nodes do not all have fast, host-attached SSDs, you may want to omit the "cache_path": definition from the configuration.
Project directories (and the subordinate job directories), on the other hand, must be shared between the master and worker nodes.
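For example, the two roles could be kept separate along these lines (the /scratch/cryosparc_cache path is hypothetical and stands in for a node-local SSD directory that exists at the same path on every worker node, while project directories remain on the shared export):

"cache_path" : "/scratch/cryosparc_cache"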

Thanks. I have dropped the transfer_cmd_tpl.

/cryosparc is an SSD on the HPC system. I was a bit mixed up about cache_path and thought the master node required access to it.

Would you please explain the reason for using {{ num_cpu }} to request ntasks?
The {{ num_cpu }} variable name is a bit confusing, as it is num_cpu, not num_task.

Is there a way to specify how many tasks or CPUs to request in the web interface?

Thanks.

num_cpu is a variable whose value is determined internally by CryoSPARC, depending on the job type and, potentially, on user-specified job parameters. The base value of num_cpu for a given job type was chosen based on brief profiling during software development, and it may include a multiplier when multiple GPUs have been assigned to a job. We chose the name num_cpu for this variable. The administrator(s) of the CryoSPARC instance and of the compute cluster must decide, based on specific local circumstances such as the choice of cluster management software, which cluster manager parameter num_cpu is mapped to inside cluster_script.sh.
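For example, with slurm an administrator might map num_cpu onto either of the following directives in cluster_script.sh, depending on how jobs should be scheduled at their site (both lines are illustrative, not a recommendation):

# request num_cpu tasks (matches the -n used in the script above)
#SBATCH --ntasks={{ num_cpu }}

# or, alternatively, a single task with num_cpu CPUs
#SBATCH --cpus-per-task={{ num_cpu }}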