I would like to use our computational resources more efficiently. We have several GPU and CPU nodes on a cluster; the GPU nodes have 4 GPUs, 20 CPUs and 512/1024 GB of RAM. When I submit a job with 4 GPUs, some job types request more than 20 CPUs for those 4 GPUs, so the job ends up spanning two nodes. I understand that you may have tuned the CPU:GPU ratio for each job type, but if there is some room to change it, I would like a 4-GPU job to fit on a single node instead of spilling onto a second one just for the 4 extra cores. I saw another topic similar to this one where you mentioned the cryosparcm connect command and referred to the installation documentation, but I could not find the details of how to do this.
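If it helps, my best guess at the workflow (pieced together from the cluster installation page, so please correct me if this is not the right procedure) is something like:

```
# dump the stock SLURM example files (cluster_info.json + cluster_script.sh)
cryosparcm cluster example slurm

# edit the Jinja2 submission template, e.g. the #SBATCH lines that use
# {{ num_cpu }}, {{ num_gpu }} and {{ ram_gb }}
vim cluster_script.sh

# re-register the lane so the edited template takes effect
# (run from the directory containing the two files above)
cryosparcm cluster connect
```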
I have also noticed in the submission scripts that most job types ask for a very small amount of RAM. Is that really all a particular job needs, or would there be an efficiency gain if the job requested more RAM?
I would also like to know: is there a way to cache on RAM instead of the SSD?
I don't think there is much information about how resources are allocated for each job type and whether we have any control over it. We can manually change the submission script template, but I would like to know what the ideal CPU/GPU/RAM configuration would be for each type of job.
Strange, what kind of job requires more than 4 CPUs per GPU? What cluster scheduler are you using?
Well, I haven't been able to launch a multi-node GPU job; it crashes, so now I stick to a single GPU node. I also had almost the same problem as you (not enough CPUs per GPU) for some job types, so I added some Jinja2 logic to the template file to cap the number of CPUs in those cases.
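For what it's worth, the relevant part of our cluster_script.sh looks roughly like this (trimmed; the 20-core cap is specific to our nodes, so adjust as needed, and double-check the template variable names against the example your CryoSPARC version generates):

```
#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_{{ project_uid }}_{{ job_uid }}
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:{{ num_gpu }}
#SBATCH --mem={{ (ram_gb*1000)|int }}M
#SBATCH --output={{ job_log_path_abs }}
#SBATCH --error={{ job_log_path_abs }}
{# Cap the CPU request so a 4-GPU job still fits on a single 20-core node #}
{%- if num_cpu > 20 %}
#SBATCH --ntasks=20
{%- else %}
#SBATCH --ntasks={{ num_cpu }}
{%- endif %}

srun {{ run_cmd }}
```

The job still launches the number of worker threads CryoSPARC asked for, they just get oversubscribed onto the 20 allocated cores, which has been fine for the job types we run.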
I use SLURM, but I guess the type of scheduler is irrelevant, since the CPU:GPU ratio is determined by CryoSPARC.
I thought it was a 3D refinement job, but sorry, it was actually a Motion Correction job, which requests 6 CPUs per GPU.
I can run multi-GPU jobs without problems for most job types. Right now I have a problem with 4 million particles in a 2D classification, but I guess that one is due to something unrelated.