cryoSPARC_worker install onto login node for cluster setup

Hi, must I install cryoSPARC_worker on the physical GPU nodes in a cluster CryoSPARC setup? Usually, software is not allowed to be installed on the compute (GPU) nodes. Is it possible to install cryoSPARC_worker only on the login node and run it on the GPU nodes through SLURM, like other typical software?

Welcome to the forum @macstein. As of v4.3.1, the cryosparc_worker directory and the CUDA toolkit need to be “available” on the GPU node. This availability can be achieved with storage that is shared, for example via NFS, between the login node and the GPU node(s).
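
For example, a quick sanity check along these lines can confirm that a GPU node sees both the shared worker directory and the CUDA toolkit (a sketch; the partition name and paths below are placeholders, not taken from this setup):

# Sketch: run from the login node; adjust partition, worker path and toolkit path
srun --partition=gpu --gres=gpu:1 \
    ls /shared/cryosparc_worker/bin/cryosparcw /opt/cuda-11.8/bin/nvcc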

Thank you very much!

I am a little confused about “availability”. I have set up cluster CryoSPARC by providing cluster_info.json and cluster_script.sh in a SLURM environment. Do you mean that, before running a job, the “cryosparc_worker” directory must be on the GPU node itself (not on the login node or in a shared space such as Lustre)?

I set “worker_bin_path” : “/lustre/it_css/users/3008/cryoSPARC/cryosparc_worker/bin/cryosparcw” in cluster_info.json, because my cryosparc_worker folder is on the shared Lustre filesystem. Could this be wrong?
Must “worker_bin_path” point to the cryosparc_worker directory on the GPU node?

Thank you again.

Correct. If you submit a CryoSPARC job to the cluster and the cluster job is allocated a GPU node, it must be ensured that the command

/lustre/it_css/users/3008/cryoSPARC/cryosparc_worker/bin/cryosparcw

(with some additional subcommands and options) can be executed on the allocated GPU node.
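
One way to verify this ahead of time (a sketch; the partition name is a placeholder, and the gpulist subcommand is used here simply as a convenient way to exercise the worker installation) is to run the worker binary interactively on an allocated GPU node:

# Sketch: request a GPU node and confirm cryosparcw runs there and sees the GPUs
srun --partition=gpu --gres=gpu:1 --pty \
    /lustre/it_css/users/3008/cryoSPARC/cryosparc_worker/bin/cryosparcw gpulist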

To be sure, could you please confirm two things?

i)
Let’s say there are two GPU nodes, GPU_node1 and GPU_node2.
We must then install cryosparc_worker on a disk (/sdd) on each of the GPU nodes.
ex)
/sdd/cryosparc_worker on GPU_node1
/sdd/cryosparc_worker on GPU_node2

Then, in cluster_info.json, “worker_bin_path” : “/sdd/cryosparc_worker/bin/cryosparcw”.

Is this a correct configuration?

ii)
We just install cryosparc_worker on the shared disk /lustre.
We don’t install cryosparc_worker on the GPU nodes.
Then, in cluster_info.json, “worker_bin_path” : “/lustre/cryosparc_worker/bin/cryosparcw”.

Can you confirm that this configuration is not feasible?

If the same version of the CUDA toolkit (11.0 ≤ version ≤ 11.8) is installed on node1 and node2 under the same path, say /opt/cuda-11.8, and is compatible with those nodes’ GPUs, then it would suffice to install cryosparc_worker once on /lustre with

cryosparc_worker/install.sh --cudapath /opt/cuda-11.8 --license ...

and share /lustre with node1 and node2.
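
A quick way to check that assumption (a sketch, using the placeholder node names node1 and node2 from above) is to confirm that both nodes report the same toolkit version at the same path:

# Sketch: the toolkit path and version should match on both GPU nodes
srun -w node1 /opt/cuda-11.8/bin/nvcc --version
srun -w node2 /opt/cuda-11.8/bin/nvcc --version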

That is what I did. But when I run a job with: cryosparcm cli “enqueue_job(project_uid=‘P1’, job_uid=‘J1’)”, it just prints
“failed” without running the job. The same job works in standalone CryoSPARC on my laptop.

I used “cryosparcm cluster connect” with:
cluster_info.json:

{
    "name" : "slurmcluster",
    "worker_bin_path" : "/lustre/it_css/users/3008/cryoSPARC/cryosparc_worker/bin/cryosparcw",
    "cache_path" : "/tmp",
    "send_cmd_tpl" : "{{ command }}",
    "qsub_cmd_tpl" : "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl" : "squeue -j {{ cluster_job_id }}",
    "qdel_cmd_tpl" : "scancel {{ cluster_job_id }}",
    "qinfo_cmd_tpl" : "sinfo"
}

and a cluster_script.sh.
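
(For reference, a minimal SLURM cluster_script.sh might look like the sketch below. This is not the actual script used here; the SBATCH options are placeholders built from the standard CryoSPARC cluster script template variables.)

#!/usr/bin/env bash
# Sketch only; adjust the SBATCH options to your site
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M

{{ run_cmd }}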

I just see “failed”. Is there any log or error message that I can use to track this down?

I believe I have an answer regarding the feasibility of installing the worker on /lustre. I will investigate further and open another discussion if needed. Thank you very much!

The command does not specify lane= (guide). Is there a "default" lane and was enqueuing to "default" intended?
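
If the job is meant to go to the cluster, a sketch of the same call with an explicit lane (assuming the lane registered above is named slurmcluster) would be:

cryosparcm cli "enqueue_job(project_uid='P1', job_uid='J1', lane='slurmcluster')"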

I think the job I imported is meant to run on the master CPU. My default lane is for slurmcluster, which could be the source of the issue.

When I ran the cluster validation test with:

cryosparcm cluster validate slurmcluster --projects_dir /lustre/it_css/users/3008/cryoSPARC/work/validate_test

the results seemed alright except for one error message at the end:

...
✓ Successfully deleted cluster job
...

! Error - unable to read cluster submission output file at 
 /lustre/it_css/users/3008/cryoSPARC/work/validate_test/test_cluster_qsub.txt, 
 the cluster may have an issue writing to that path 


Cluster validation completed with errors

The GPU worker nodes can access the /lustre directory. What do you think could be causing this issue?

Please can you confirm that the GPU worker nodes can create, read and write files inside

/lustre/it_css/users/3008/cryoSPARC/work/validate_test

under the Linux account that runs the CryoSPARC master and worker processes.
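
A minimal check along these lines (a sketch, using the validate_test path from this thread; the file name permcheck.txt is just an example) could be run on a GPU node under that account:

# Sketch: verify create/write/read/delete in the shared job directory
cd /lustre/it_css/users/3008/cryoSPARC/work/validate_test
echo ok > permcheck.txt && cat permcheck.txt && rm permcheck.txt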

Yes, I can create and write files in the “…/validate_test” directory on a GPU node when I log in with my account.

I think it is fine apart from that error message. The job ran on a GPU node and created the following files:

/lustre/it_css/users/3008/cryoSPARC/work/validate_test~>ls
rendered_cluster_script_qdel.sh  slurm-4607060.out  
rendered_cluster_script_qsub.sh  test_cluster_qsub.txt
: cat slurm-4607060.out 
Adding package `cuda/11.3.1-465.19.01` to your environment
: cat test_cluster_qsub.txt 
Process started 2023-09-29 19:45:33.246763
NODE : r1t01
PID  : 8034
Executing for 5 secs
Process finished 2023-09-29 19:45:38.251854

Validation success!

Are the cryosparc_master processes running under that same account, i.e. your account?

Yes, they are running under my account as well.

Under that account, can the file

/lustre/it_css/users/3008/cryoSPARC/work/validate_test/test_cluster_qsub.txt

be read on the CryoSPARC master host?
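
A direct check (a sketch) would be, on the CryoSPARC master host and under that account:

cat /lustre/it_css/users/3008/cryoSPARC/work/validate_test/test_cluster_qsub.txt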