On our CryoSPARC instance (master-worker setup, 2 identical workers with 8 GPUs each), I had originally configured both workers in the same lane (“default”). After a few weeks we decided to put each worker in its own lane, so there are now two lanes, “worker1” and “worker2”. I deleted the default lane and reconnected the workers with:
cryosparcw connect --worker cryosparc-worker1 --master cryosparc-master --port 39000 --ssdpath /home/sparcuser/ssd-cache --newlane --lane worker1
and
cryosparcw connect --worker cryosparc-worker2 --master cryosparc-master --port 39000 --ssdpath /home/sparcuser/ssd-cache --newlane --lane worker2
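For reference, the old default lane had been removed beforehand via the cli; from memory it was the standard lane-removal call, roughly:
cryosparcm cli "remove_scheduler_lane('default')"   # from memory; the exact call may have differed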
Now I am facing the odd situation that worker1 works without problems, but when I try to queue a job (e.g. a GPU test job) to lane worker2, I get
GPU not available
but when I queue the job to a specific GPU of worker2, it runs fine. (I tested each available GPU on worker2.)
Additional information:
sparcuser@cryosparc-master:~$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/sparcuser/cryosparc/cryosparc_master
Current cryoSPARC version: v4.6.0
----------------------------------------------------------------------------
CryoSPARC process status:
app RUNNING pid 4179, uptime 2:37:24
app_api RUNNING pid 4191, uptime 2:37:23
app_api_dev STOPPED Not started
command_core RUNNING pid 4130, uptime 2:37:42
command_rtp RUNNING pid 4156, uptime 2:37:33
command_vis RUNNING pid 4152, uptime 2:37:35
database RUNNING pid 4026, uptime 2:37:46
----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------
global config variables:
export CRYOSPARC_LICENSE_ID="....."
export CRYOSPARC_MASTER_HOSTNAME="cryosparc-master"
export CRYOSPARC_DB_PATH="/home/sparcuser/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
...
sparcuser@cryosparc-master:~$ cryosparcm log command_core
2024-10-14 11:31:43,987 scheduler_run_core INFO | Running...
2024-10-14 11:31:43,987 scheduler_run_core INFO | Jobs Queued: [('P1', 'J53')]
2024-10-14 11:31:43,989 scheduler_run_core INFO | Licenses currently active : 9
2024-10-14 11:31:43,989 scheduler_run_core INFO | Now trying to schedule J53
2024-10-14 11:31:43,989 scheduler_run_core INFO | Queue status waiting_resources
2024-10-14 11:31:43,989 scheduler_run_core INFO | Queue message GPU not available
2024-10-14 11:31:43,990 scheduler_run_core INFO | Finished
sparcuser@cryosparc-worker2:~$ cryosparcw gpulist
Detected 8 CUDA devices.
id pci-bus name
---------------------------------------------------------------
0 1 NVIDIA RTX 6000 Ada Generation
1 33 NVIDIA RTX 6000 Ada Generation
2 65 NVIDIA RTX 6000 Ada Generation
3 97 NVIDIA RTX 6000 Ada Generation
4 129 NVIDIA RTX 6000 Ada Generation
5 161 NVIDIA RTX 6000 Ada Generation
6 193 NVIDIA RTX 6000 Ada Generation
7 225 NVIDIA RTX 6000 Ada Generation
---------------------------------------------------------------
sparcuser@cryosparc-worker2:~$
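If it helps, I can also post the registered scheduler targets for both workers, i.e. the output of
cryosparcm cli "get_scheduler_targets()"
run on the master, in case the worker2 target entry is missing its GPU resource slots.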