GPUs shown by the gpulist command are not available

Dear all,

I am struggling with the GPU setup in CryoSPARC. I managed to add the three GPUs I have in my cluster, so when I run the command cryosparcw gpulist they are all there (photo below).

Then I restart CryoSPARC, and when I go to queue a job I see the following, where the entry with only two GPUs is the server of the cluster:

If I run several jobs that require GPUs, I hit a limit of two at the same time (meaning it is only using 1 and 1 GPUs).

Besides, I can't queue jobs; they just fail whenever there are two jobs already running.

Could you help me?
Thanks in advance

Hi @baratachencho .
Please can you post the output of the command

cryosparcm cli "get_scheduler_targets()"

Hi,
yes sure

Thanks for posting the configuration.

Please can you describe the configuration as you expect it:

  1. Is there one computer that acts as both master and worker?
  2. Is there an additional worker node?
  3. What are the outputs of these commands on each node?
    hostname | cut -b1-2
    free -h
    grep -m 1 "model name" /proc/cpuinfo
    nvidia-smi -L
    
  4. Have there been any hardware changes (such as GPU or RAM upgrades) on any of the nodes?

The mention of two different user names in the respective ssh_str fields of the two workers could indicate a misconfiguration. Please review the relevant prerequisites:

  1. CryoSPARC commands, jobs and other processes should be run by a shared, non-privileged (“designated”) Linux user.
  2. The cryosparcw connect command should be run on the applicable (to be connected) worker node by the “designated” Linux user.
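
As a sketch only (the hostnames, port, and SSD path below are illustrative placeholders, not values from this thread), the connect step run by the "designated" user on the worker node might look like:

```shell
# Run on the worker node as the shared, non-privileged "designated" Linux user.
# All hostnames, the port, and the cache path here are placeholder values.
/opt/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker worker01.example.org \
    --master master.example.org \
    --port 39000 \
    --ssdpath /scratch/cryosparc_cache
```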

Hi,
Yes, I am working on a cluster, and there is only one computer acting as both master and worker. I installed CryoSPARC myself, and I created an account/user with my own user.
The outputs of the commands are the following:

Answering question 4: yes, since the first installation I installed a new GPU, so I had to update the GPU list. I used the connect command.

Thanks for your help

Under that circumstance, one might use the
cryosparcw connect --update option.
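
For reference, a sketch of the update form (with placeholder hostname and port values; it assumes the existing target's hostname has not changed):

```shell
# Re-run on the worker as the designated user. With --update, cryosparcw
# connect refreshes the existing target's details (e.g. its GPU inventory)
# instead of registering a second target. Hostnames and port are placeholders.
/opt/cryosparc/cryosparc_worker/bin/cryosparcw connect \
    --worker worker01.example.org \
    --master master.example.org \
    --port 39000 \
    --update
```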

In this case, the presence of two targets in the get_scheduler_targets() output might indicate a misconfiguration.
To help me propose a meaningful reconfiguration suggestion, please can you confirm that you are always using the same computer on the cluster to run CryoSPARC and that the hostname does not change?

Hello,

When I try that command, cryosparcw connect --update, it asks me for the --worker and --master arguments and I am not sure of them. Are they simply localhost?

Yes, I am using the same computer to run CryoSPARC. When I run cryosparcm start I can access it with these two links. I do not know how relevant it is, so I attach a picture:

May I suggest

  1. determining carefully the non-privileged Linux account that should own the CryoSPARC installation and run CryoSPARC-related Linux processes:
    ls -l /opt/cryosparc/cryosparc_master/bin/cryosparcm
    ps -eo user,cmd | grep cryosparc_
    
  2. under that Linux account, running these commands
    cryosparcm cli "remove_scheduler_lane('default')"
    /opt/cryosparc/cryosparc_worker/bin/cryosparcw connect --worker sie --master sie --port 39000 --ssdpath /scratch/cryosparc_cache
    

Does this help?

Hi,

Thanks for the help. Yes, by running those first commands I've realized that all processes are running as root, but the CryoSPARC installation is under a user that's not even me.

Should I run the second command from that user account?

Thanks

All CryoSPARC commands should be run under the shared, non-privileged Linux account.

Running CryoSPARC processes under the root account is risky and will introduce inconsistent file ownership.

For recovery, you may want to

  1. stop CryoSPARC
  2. fix inconsistent file ownership (CryoSPARC result, log and database files should typically be owned by the aforementioned non-privileged Linux account)
  3. start CryoSPARC under the aforementioned non-privileged Linux account
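
A hedged sketch of that recovery sequence (the installation path, project path, and the account name "cryosparc_user" are placeholders, not values from this thread):

```shell
# 1. Stop CryoSPARC as whichever account currently runs it (here, root):
sudo /opt/cryosparc/cryosparc_master/bin/cryosparcm stop

# 2. Hand ownership of the installation and of the project/database files
#    back to the designated non-privileged account (placeholder paths/name):
sudo chown -R cryosparc_user:cryosparc_user /opt/cryosparc
sudo chown -R cryosparc_user:cryosparc_user /path/to/cryosparc_projects

# 3. Start CryoSPARC under the designated account:
sudo -u cryosparc_user /opt/cryosparc/cryosparc_master/bin/cryosparcm start
```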