Allocating multiple GPUs for a job

DavidF · May 13, 2019, 9:29am

Hello all,

We have encountered an issue where only one GPU [0] is allocated to jobs in cryoSPARC, which is not optimal for Heterogeneous Refinement because it is very memory dependent. We have two identical GPUs (0 and 1), and both are confirmed to be enabled, following the commands we found on this page.

So, is there any way of allocating both cards for a single job, considering both are identical, recognized and enabled?

Thank you very much in advance.

sbliven · May 13, 2019, 11:21am

Did you define CUDA_VISIBLE_DEVICES? The default cluster submission script (cluster_script.sh) contains the following setup code:

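# Build a comma-separated list of GPUs that have no compute processes running.
available_devs=""
for devidx in $(seq 0 15); do
    # An empty process list from nvidia-smi means the device is idle.
    if [[ -z $(nvidia-smi -i "$devidx" --query-compute-apps=pid --format=csv,noheader) ]]; then
        if [[ -z "$available_devs" ]]; then
            available_devs=$devidx
        else
            available_devs=$available_devs,$devidx
        fi
    fi
done
# Restrict the job to the idle GPUs found above.
export CUDA_VISIBLE_DEVICES=$available_devs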
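
One possible variation on the loop above, offered only as a sketch: instead of probing the fixed range 0 to 15, ask nvidia-smi which device indices exist and test only those. The helper names join_by_comma and idle_gpus are made up for this illustration and are not part of cryoSPARC.

```shell
# Join all arguments with commas, e.g. join_by_comma 0 1 3 prints "0,1,3".
join_by_comma() {
    local IFS=,
    echo "$*"
}

# Print the index of every GPU that currently has no compute processes.
# Hypothetical helper; it relies on nvidia-smi being on the PATH.
idle_gpus() {
    local idx
    for idx in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
        if [[ -z $(nvidia-smi -i "$idx" --query-compute-apps=pid --format=csv,noheader) ]]; then
            echo "$idx"
        fi
    done
}

# Only touch CUDA_VISIBLE_DEVICES on machines that actually have nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
    export CUDA_VISIBLE_DEVICES=$(join_by_comma $(idle_gpus))
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
fi
```

As with the original loop, two jobs that start at almost the same moment could both see the same GPU as idle, so this is a convenience rather than a guarantee.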

donghuachen · March 19, 2020, 11:28pm

Hi all, how can I tell whether the default script above (cluster_script.sh) for defining CUDA_VISIBLE_DEVICES works?
I installed CryoSPARC v2.13.2 on a GPU node (with 4 GPUs) of a cluster. I requested an allocation of that GPU node, logged into the node, and ran:
echo $CUDA_VISIBLE_DEVICES
The echo command printed nothing. Should I just define CUDA_VISIBLE_DEVICES as follows in my cluster_script.sh?
export CUDA_VISIBLE_DEVICES=0,1,2,3
Thanks so much!

apunjani · March 20, 2020, 2:28pm

Hi @donghuachen, you can definitely hardcode the CUDA_VISIBLE_DEVICES as you wrote, in the template script. The loop in the example tries to figure it out on its own.
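
A minimal sketch of what hardcoding could look like in the template script, assuming a node with four GPUs indexed 0 to 3 as in the question. The function name set_default_gpus is invented for this example. It keeps any value that is already set (for instance by the scheduler) and only falls back to the fixed list otherwise, then echoes the result so it shows up in the job's output. An export made inside the submitted script presumably does not carry over to a separate interactive login shell, which would explain an empty echo there.

```shell
# Keep CUDA_VISIBLE_DEVICES if it is already set and non-empty;
# otherwise fall back to the list given as the first argument.
set_default_gpus() {
    export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-$1}"
}

# Assumption: the node has four GPUs, indexed 0 to 3.
set_default_gpus 0,1,2,3

# Record the value in the job output so it can be checked afterwards.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```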

donghuachen · March 21, 2020, 4:03am

Hi All,
If I have not specified CUDA_VISIBLE_DEVICES (echo $CUDA_VISIBLE_DEVICES shows nothing), what will CryoSPARC do?
Currently I have two NU Refinement jobs running at the same time, but both log files showed GPU [0]. Does this mean my two NU Refinements are using the same GPU [0]? Thanks.

sbliven · March 27, 2020, 4:50pm

I think that does mean that both jobs are sharing GPU [0]. You could double check by running nvidia-smi while they are running.
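
For example, something along these lines could show how many compute processes sit on each GPU while the jobs run. It is a rough sketch: count_procs_per_gpu is a made-up helper, and it assumes nvidia-smi's CSV output separates fields with a comma and a space.

```shell
# Count compute processes per GPU from "gpu_uuid, pid" CSV lines on stdin.
# Prints one "uuid count" line per GPU that has at least one process.
count_procs_per_gpu() {
    awk -F', ' 'NF >= 2 { n[$1]++ } END { for (g in n) print g, n[g] }' | sort
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Map each GPU index to its UUID.
    nvidia-smi --query-gpu=index,uuid --format=csv,noheader
    # List running compute processes and tally them per GPU UUID.
    nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader | count_procs_per_gpu
fi
```

If both refinement jobs are on the same card, a single UUID should appear with a count of 2 or more.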

Rather than running two jobs, I think what you want is to run a single job with two GPUs. Then cryoSPARC should assign both of them.