I followed the installation instructions (https://cryosparc.com/docs/reference/install/) and installed cryoSPARC v2.2 on a cluster. Everything seems to be fine, and I can see the name of the cluster under Resource Manager -> Compute Configuration in the cryoSPARC GUI. However, when I create a new job (e.g., T20S Tutorial), I see the following information under the Metadata tab:
"job_dir": "J2",
"errors_run": [],
"queued_to_lane": "my_cluster",
"run_on_master_direct": true,
"version": "v2.2.0",
and it runs on the master node. It does not attempt to submit a cluster job. Could you please help me understand what I am doing wrong? I found a similar post (cryoSPARC v2 cluster install), but it does not provide a solution. I tried reinstalling the software, and that did not help.
I received this reply from @apunjani, which I think should resolve my issue. I will try a more complicated job to see whether it actually starts a SLURM job.
I’ve been working on the cryoSPARC SLURM submission script and I wanted to share some insights about it.
I noticed that the CTF calculation doesn't use GPUs, so for this kind of job the SLURM script shouldn't include "--gres=gpu:0", because that directive implicitly tells SLURM to interpret every other value on a "per node" basis. Say you want to run the CTF calculation with 100 CPUs: having "--gres=gpu:0" in the script will make SLURM search for a single node with 100 CPUs.
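To make that concrete, here is roughly what the rendered header of such a job would look like (the 100-CPU figure is taken from the example above; the exact directives are only an assumption about how the template gets filled in):

```
## Hypothetical rendered header for a 100-CPU CTF job:
#SBATCH --ntasks=100
#SBATCH --gres=gpu:0   ## with this line present, SLURM looks for a single
                       ## node that has all 100 CPUs (per the observation above)
```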
Thanks to the jinja2 template engine, it's possible to split CPU jobs from GPU jobs. Also, if your cluster has the same number of GPUs on every node, you can add a rule that calculates the number of nodes and the number of GPUs per node needed to satisfy "num_gpu".
Here's a simplified version of my script; I haven't launched every possible kind of job with it, so it might be incomplete:
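(The script itself is not reproduced in this excerpt. As an illustrative sketch only, not the original poster's script, a jinja2-templated cluster_script.sh that separates CPU-only jobs from GPU jobs might look like the block below. The template variables {{ num_cpu }}, {{ num_gpu }}, {{ ram_gb }}, {{ run_cmd }}, {{ job_dir_abs }}, {{ project_uid }} and {{ job_uid }} come from cryoSPARC's cluster integration; the partition names and the 4-GPUs-per-node figure are assumptions to adapt to your own cluster.)

```
#!/usr/bin/env bash
## Illustrative jinja2 cluster_script.sh for SLURM -- a sketch, not the
## original poster's script. Partition names and the 4-GPUs-per-node
## figure are assumptions.
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_dir_abs }}/slurm-%j.out
#SBATCH --error={{ job_dir_abs }}/slurm-%j.err
#SBATCH --mem={{ (ram_gb*1000)|int }}M
{%- if num_gpu == 0 %}
## CPU-only job (e.g. CTF estimation): omit --gres entirely, so SLURM
## is free to spread the {{ num_cpu }} tasks over several nodes.
#SBATCH --partition=cpu
#SBATCH --ntasks={{ num_cpu }}
{%- else %}
## GPU job: assuming every node has 4 GPUs, work out how many nodes are
## needed and how many GPUs to request on each of them.
{%- set gpus_per_node = 4 %}
{%- set nodes = ((num_gpu / gpus_per_node) | round(0, 'ceil')) | int %}
#SBATCH --partition=gpu
#SBATCH --nodes={{ nodes }}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --gres=gpu:{{ ((num_gpu / nodes) | round(0, 'ceil')) | int }}
{%- endif %}

{{ run_cmd }}
```

With a split like this, a CPU-only CTF job never carries a --gres line, while GPU jobs are packed according to the per-node GPU count.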
If, like me, your nodes don't have 4 CPUs per GPU (which is what cryoSPARC requests for 3D refinements, 3D classifications, etc.), then you can add a rule that reduces the number of requested CPUs based on the value of "num_cpu/num_gpu" (sketched below).
In my real-world tests, not once did a job hit 200% CPU, even when it was allocated more than 1 CPU per GPU.
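For instance, such a CPU-capping rule could replace the plain --ntasks line in the GPU branch of the template above. This is only a sketch; the 2-CPUs-per-GPU cap is an assumed value to adjust to your hardware:

```
{%- if num_gpu > 0 and (num_cpu / num_gpu) > 2 %}
## The nodes here don't have 4 CPUs per GPU, so cap the request at an
## assumed 2 CPUs per GPU instead of the {{ num_cpu }} CPUs cryoSPARC asks for.
#SBATCH --ntasks={{ (num_gpu * 2) | int }}
{%- else %}
#SBATCH --ntasks={{ num_cpu }}
{%- endif %}
```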
I also have a problem queueing cryoSPARC v2 jobs through SLURM. My jobs always request too many CPUs, and sbatch reports the error "Batch job submission failed: Requested node configuration is not available". On my cluster, queueing RELION jobs through SLURM works fine. Did you run into a problem like this?
Thank you. The CUDA version we used is CUDA 10.0. We are trying to run a MotionCor2 job. I requested 10 GPUs (my cluster has 8 GPUs per workstation), but it reported an error: "Batch job submission failed: Requested node configuration is not available." However, when I requested 4 GPUs, it reported the same error. After I removed "#SBATCH --gres:gpu:{{ num_gpu }}" from my cluster_scripts.sh, my job could launch.