Recommendation for lane configuration

closed

(Nikolaos) #1

Hi,

What kind of jobs does cryoSPARC run: serial, OpenMP, or MPI? I also see that they are mostly Python. Is that correct?

Can you give a hint on how to configure lanes for the Slurm workload manager, or provide a link where this is described?
I see that cryoSPARC automatically sets the number of CPUs, GPUs, and RAM when it builds jobs.
When I test with the lane below, a job that requires 2 GPUs and 12 CPUs runs 12 copies of the same process:

Now trying to schedule J131
  Need slots :  {u'GPU': 2, u'RAM': 4, u'CPU': 12}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on test
    Launchable:  True
    Alloc slots :  {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
...
Changed job P40.J131 status launched
---------- Scheduler done ------------------
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status running
 Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
 Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status completed

OK. I saw that it was Python code, and it probably uses threads since it asked for 12 CPUs. So I reconfigured the lane with --ntasks=1 and --cpus-per-task={{ num_cpu }}; then only one copy of the code runs and the job completes fine:

Now trying to schedule J153
  Need slots :  {u'GPU': 2, u'RAM': 4, u'CPU': 12}
  Need fixed :  {u'SSD': False}
  Need licen :  True
  Master direct :  False
   Trying to schedule on test
    Launchable:  True
    Alloc slots :  {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
    Alloc fixed :  {u'SSD': False}
    Alloc licen :  True
     -- Launchable! -- Launching.
...
Changed job P40.J153 status launched
---------- Scheduler done ------------------
Changed job P40.J153 status started
Changed job P40.J153 status running
Changed job P40.J153 status completed
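For reference, here is a sketch of the relevant part of the template after this change (the other #SBATCH lines stay as in the full script at the end of this post):

```shell
#!/bin/bash
# One Slurm task that owns all requested CPUs, so srun launches a
# single copy of the multi-threaded cryoSPARC process instead of
# {{ num_cpu }} copies.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}

srun {{ run_cmd }}
```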

Alternatively, if I use {{ run_cmd }} instead of srun {{ run_cmd }}, together with --ntasks={{ num_cpu }} and --cpus-per-task=1, I also get one process instead of 12. But if we are using Slurm, we should launch the code with srun to be sure resources are used and accounted for correctly.
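This matches srun's semantics: srun launches one copy of the command per task, so --ntasks controls the process count while --cpus-per-task controls threads per process. A quick illustration (hypothetical commands, run inside an interactive allocation):

```shell
# srun starts one copy of the command per task:
srun --ntasks=12 --cpus-per-task=1 hostname   # 12 copies -> 12 lines of output
srun --ntasks=1  --cpus-per-task=12 hostname  # 1 copy    -> 1 line of output
```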

Also, when I check GPU usage for this test (with 2 GPUs), I see:

|   0  GeForce GTX 108...  On   | 00000000:11:00.0 Off |                  N/A |
| 43%   75C    P2    88W / 250W |    526MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:AE:00.0 Off |                  N/A |
| 45%   78C    P2   216W / 250W |   7574MiB / 11178MiB |     96%      Default |
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-----------------------------------------------------------------------------+
|    0    173632      C   python                                       340MiB |
|    0    173633      C   python                                       173MiB |
|    1    173633      C   python                                      7563MiB |

One process uses both GPUs. Is that the correct behavior?

As an enhancement: it would be good if the job builder also let the user choose how many CPUs to use, the way the number of GPUs can already be chosen for some jobs.

Also, what would your recommendation be for cache_quota_mb and cache_reserve_mb when the SSD is 2 TB?

Are these parameters applied globally per disk, per job, or per lane?

Thanx.

cryoSPARC is 2.9.0

[cryosparc@login home]$ cryosparcm cli "get_scheduler_target_cluster_info('test')"
{
"qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
"worker_bin_path": "/home/cryosparc/cryosparc2_worker/bin",
"title": "test",
"cache_path": "",
"qinfo_cmd_tpl": "sinfo",
"qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
"qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
"send_cmd_tpl": "{{ command }}",
"name": "test"
}
[cryosparc@login home]$ cryosparcm cli "get_scheduler_target_cluster_script('test')"
#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding

srun {{ run_cmd }}

(Stephan Arulthasan) #2

Hi @turnik,

You are correct, cryoSPARC "jobs" are just Python processes.
Our cluster configuration documentation can be found in our installation guide.

CryoSPARC jobs have been profiled and tuned to allocate the optimal amount of resources so that jobs do not fail due to insufficient resources.

In terms of the SSD, cache_quota_mb can be set to whatever you like (under 2 TB in your case). This parameter ensures cryoSPARC does not use more than the set amount, in case other programs/users also use the SSD. You can leave it at "None" to allow cryoSPARC to use the entire drive's space. These parameters apply to that node only.
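As a sketch (the path and quota values here are only illustrative, not a prescription), for a 2 TB drive you might set something like this in your cluster configuration, leaving headroom between the quota and the physical capacity:

```json
{
  "cache_path": "/path/to/ssd/cache",
  "cache_quota_mb": 1800000,
  "cache_reserve_mb": 10000
}
```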


(Nikolaos) #3

Hi Stephan,

Which job can I use to verify that SSD caching is being used? For instance, does the Motion Correction job from the T20S tutorial use the SSD? I see in the logs that it does not: Need fixed : {u'SSD': False} and Alloc fixed : {u'SSD': False}
The lane is configured as:

"cache_path": "/prj/cryosparc/cache",
"cache_quota_mb": "None",
"cache_reserve_mb": 10000,

Thanx.


(Stephan Arulthasan) #4

Hi @turnik,

You can use any job that takes particles as input to test the SSD cache (ab initio reconstruction, 2D classification, homogeneous refinement, etc.)