Hi,
What kind of programs are cryoSPARC jobs? Are they serial, OpenMP, or MPI? I also see that they are mostly Python. Is that correct?
Can you give a hint on how to configure lanes for the SLURM workload manager, or provide a link to where this is described?
I see that cryoSPARC automatically sets the number of CPUs, GPUs, and RAM when it builds jobs.
When I test with the lane below (full configuration at the end of this post), a cryoSPARC job that requires 2 GPUs and 12 CPUs runs as 12 copies of the same process:
Now trying to schedule J131
Need slots : {u'GPU': 2, u'RAM': 4, u'CPU': 12}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on test
Launchable: True
Alloc slots : {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
...
Changed job P40.J131 status launched
---------- Scheduler done ------------------
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status started
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status running
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status failed
Changed job P40.J131 status completed
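As far as I understand SLURM, that is just how srun behaves: with --ntasks=N it launches N copies of the given command, one per task. A minimal illustration, nothing cryoSPARC-specific, just to show the behavior I mean:

#!/bin/bash
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=1
# srun runs one copy of the command per task, so this prints the hostname 12 times
srun hostname

So with --ntasks={{ num_cpu }} and srun {{ run_cmd }} in the template, SLURM starts num_cpu copies of the cryoSPARC command.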
OK, I saw that it is Python code and probably uses threads, since it asks for 12 CPUs. So I reconfigured the lane to use --ntasks=1 and --cpus-per-task={{ num_cpu }} (the changed lines are sketched after the log below), and then only one copy of the code runs and the job completes fine:
Now trying to schedule J153
Need slots : {u'GPU': 2, u'RAM': 4, u'CPU': 12}
Need fixed : {u'SSD': False}
Need licen : True
Master direct : False
Trying to schedule on test
Launchable: True
Alloc slots : {u'GPU': [0, 1], u'RAM': [0, 1, 2, 3], u'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
Alloc fixed : {u'SSD': False}
Alloc licen : True
-- Launchable! -- Launching.
...
Changed job P40.J153 status launched
---------- Scheduler done ------------------
Changed job P40.J153 status started
Changed job P40.J153 status running
Changed job P40.J153 status completed
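For reference, with that change the lane script differs from the full one at the end of this post only in the --ntasks and --cpus-per-task lines (a sketch of my working lane, not an official template):

#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --ntasks=1
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --cpus-per-task={{ num_cpu }}
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
srun {{ run_cmd }}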
Alternatively, if I keep --ntasks={{ num_cpu }} and --cpus-per-task=1 but use {{ run_cmd }} instead of srun {{ run_cmd }} (sketched below), I also get one process instead of 12. But if we are using SLURM, we should launch the code with srun to be sure the resources are used and accounted for correctly.
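Sketched out, that variant keeps the original header and only drops srun from the launch line:

#SBATCH --ntasks={{ num_cpu }}
#SBATCH --cpus-per-task=1
# without srun the batch step executes the command once, instead of once per task
{{ run_cmd }}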
Also, when I check GPU usage for this test (with 2 GPUs), I see:
| 0 GeForce GTX 108... On | 00000000:11:00.0 Off | N/A |
| 43% 75C P2 88W / 250W | 526MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 00000000:AE:00.0 Off | N/A |
| 45% 78C P2 216W / 250W | 7574MiB / 11178MiB | 96% Default |
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
+-----------------------------------------------------------------------------+
| 0 173632 C python 340MiB |
| 0 173633 C python 173MiB |
| 1 173633 C python 7563MiB |
One process (PID 173633) is using both GPUs. Is this the correct behavior?
As an enhancement: it would be good if the job builder also let the user choose how many CPUs to use, the same way the number of GPUs can already be chosen for some job types.
Also, what would be your recommendation for cache_quota_mb and cache_reserve_mb when the SSD is 2 TB? Are these parameters applied globally per disk, per job, or per lane?
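For context, this is roughly what I have in mind in cluster_info.json for a 2 TB SSD, assuming cache_quota_mb caps the total cache cryoSPARC may use on the target and cache_reserve_mb is the space it keeps free (the path and numbers are just my own guesses, not recommendations):

{
    "cache_path": "/scratch/cryosparc_cache",
    "cache_reserve_mb": 10000,
    "cache_quota_mb": 1800000
}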
Thanx.
The cryoSPARC version is 2.9.0.
[cryosparc@login home]$ cryosparcm cli "get_scheduler_target_cluster_info('test')"
{
    "qdel_cmd_tpl": "scancel {{ cluster_job_id }}",
    "worker_bin_path": "/home/cryosparc/cryosparc2_worker/bin",
    "title": "test",
    "cache_path": "",
    "qinfo_cmd_tpl": "sinfo",
    "qsub_cmd_tpl": "sbatch {{ script_path_abs }}",
    "qstat_cmd_tpl": "squeue -j {{ cluster_job_id }}",
    "send_cmd_tpl": "{{ command }}",
    "name": "test"
}
[cryosparc@login home]$ cryosparcm cli "get_scheduler_target_cluster_script('test')"
#!/bin/bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
#SBATCH --ntasks={{ num_cpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --gres-flags=enforce-binding
srun {{ run_cmd }}