Hi,
What kind of programs are cryoSPARC jobs? Are they serial, OpenMP, or MPI? I also see that they are mostly Python. Is that correct?
Can you give a hint on how to configure lanes for the SLURM workload manager, or provide a link to where this is described?
I see that cryoSPARC automatically sets the number of CPUs, GPUs, and RAM when it builds jobs.
When I test with the lane below (full configuration at the end of this post), a cryoSPARC job that requires 2 GPUs and 12 CPUs runs as 12 copies of the same process:
Now trying to schedule J131
Need slots : {u'GPU': 2, u'RAM': 4, u'CPU': 12}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on test
Launchable: True
Alloc slots : {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
...
Changed job P40.J131 status launched
---------- Scheduler done ------------------
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status completed
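As far as I understand SLURM, that is just how srun behaves: with --ntasks=N it launches N copies of the given command, one per task. A minimal illustration, nothing cryoSPARC-specific, just to show the behavior I mean:

#!/bin/bash
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=1
# srun runs one copy of the command per task, so this prints the hostname 12 times
srun hostname

So with --ntasks={{ num_cpu }} and srun {{ run_cmd }} in the template, SLURM starts num_cpu copies of the cryoSPARC command.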
OK, I saw that it is Python code and probably uses threads, since it asks for 12 CPUs. So I reconfigured the lane to use --ntasks=1 and --cpus-per-task={{ num_cpu }} (the changed lines are sketched after the log below), and then only one copy of the code runs and the job completes fine:
Now trying to schedule J153
Need slots : {u'GPU': 2, u'RAM': 4, u'CPU': 12}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on test
Launchable: True
Alloc slots : {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
...
Changed job P40.J153 status launched
---------- Scheduler done ------------------
Changed job P40.J153 status started
Changed job P40.J153 status running
Changed job P40.J153 status completed
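For reference, with that change the lane script differs from the full one at the end of this post only in the --ntasks and --cpus-per-task lines (a sketch of my working lane, not an official template):

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --ntasks=1
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
srun {{ run_cmd }}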
Alternatively, if I keep --ntasks={{ num_cpu }} and --cpus-per-task=1 but use {{ run_cmd }} instead of srun {{ run_cmd }} (sketched below), I also get one process instead of 12. But if we are using SLURM, we should launch the code with srun to be sure the resources are used and accounted for correctly.
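Sketched out, that variant keeps the original header and only drops srun from the launch line:

#SBATCH --ntasks={{ num_cpu }}
#SBATCH --cpus-per-task=1
# without srun the batch step executes the command once, instead of once per task
{{ run_cmd }}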
Also, when I check GPU usage for this test (with 2 GPUs), I see:
| 0 GeForce GTX 108... On | 00000000:11:00.0 Off | N/A |
| 43% 75C P2 88W / 250W | 526MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 00000000:AE:00.0 Off | N/A |
| 45% 78C P2 216W / 250W | 7574MiB / 11178MiB | 96% Default |
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
+-----------------------------------------------------------------------------+
| 0 173632 C python 340MiB |
| 0 173633 C python 173MiB |
| 1 173633 C python 7563MiB |
One process (PID 173633) is using both GPUs. Is this the correct behavior?
As an enhancement: it would be good if the job builder also let the user choose how many CPUs to use, the same way the number of GPUs can already be chosen for some job types.
Also, what would be your recommendation for cache_quota_mb and cache_reserve_mb when the SSD is 2 TB? Are these parameters applied globally per disk, per job, or per lane?
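For context, this is roughly what I have in mind in cluster_info.json for a 2 TB SSD, assuming cache_quota_mb caps the total cache cryoSPARC may use on the target and cache_reserve_mb is the space it keeps free (the path and numbers are just my own guesses, not recommendations):

{
    "cache_path": "/scratch/cryosparc_cache",
    "cache_reserve_mb": 10000,
    "cache_quota_mb": 1800000
}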
Thanx.
The cryoSPARC version is 2.9.0.
[cryosparc@login home]$ cryosparcm cli "get_scheduler_target_cluster_info('test')"
{
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "worker_bin_path": "/home/cryosparc/cryosparc2_worker/bin",
    "title": "test",
    "cache_path": "",
    "qinfo_cmd_tpl": "sinfo",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "send_cmd_tpl": "{{ command }}",
    "name": "test"
}
[cryosparc@login home]$ cryosparcm cli "get_scheduler_target_cluster_script('test')"
#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
srun {{ run_cmd }}