GPU not available

On our CryoSPARC instance (master-worker setup with 2 identical workers of 8 GPUs each), I had configured both workers in the same lane (“default”). After some weeks, we decided to put each worker in a separate lane, so now there are the lanes “worker1” and “worker2”. I deleted the default lane and reconnected the workers with:

cryosparcw  connect --worker cryosparc-worker1  --master cryosparc-master --port 39000 --ssdpath /home/sparcuser/ssd-cache --newlane --lane worker1

and

cryosparcw  connect --worker cryosparc-worker2  --master cryosparc-master --port 39000 --ssdpath /home/sparcuser/ssd-cache --newlane --lane worker2
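
To double-check that both workers ended up registered in their new lanes, the target list can be queried again from the master (the full output of this command is posted further down in this thread):

cryosparcm cli "get_scheduler_targets()"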

Now I face the weird situation that worker1 works without problems, but when I try to queue a job (e.g. a GPU test job) to lane worker2, I get

GPU not available

but when I queue it to a specific GPU on worker2, the job runs fine. (And I tested each of the available GPUs on worker2.)

Additional information:

sparcuser@cryosparc-master:~$ cryosparcm status
----------------------------------------------------------------------------
CryoSPARC System master node installed at
/home/sparcuser/cryosparc/cryosparc_master
Current cryoSPARC version: v4.6.0
----------------------------------------------------------------------------

CryoSPARC process status:

app                              RUNNING   pid 4179, uptime 2:37:24
app_api                          RUNNING   pid 4191, uptime 2:37:23
app_api_dev                      STOPPED   Not started
command_core                     RUNNING   pid 4130, uptime 2:37:42
command_rtp                      RUNNING   pid 4156, uptime 2:37:33
command_vis                      RUNNING   pid 4152, uptime 2:37:35
database                         RUNNING   pid 4026, uptime 2:37:46

----------------------------------------------------------------------------
License is valid
----------------------------------------------------------------------------

global config variables:
export CRYOSPARC_LICENSE_ID="....."
export CRYOSPARC_MASTER_HOSTNAME="cryosparc-master"
export CRYOSPARC_DB_PATH="/home/sparcuser/cryosparc/cryosparc_database"
export CRYOSPARC_BASE_PORT=39000
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true
...
sparcuser@cryosparc-master:~$ cryosparcm log command_core
2024-10-14 11:31:43,987 scheduler_run_core   INFO     | Running...
2024-10-14 11:31:43,987 scheduler_run_core   INFO     | Jobs Queued: [('P1', 'J53')]
2024-10-14 11:31:43,989 scheduler_run_core   INFO     | Licenses currently active : 9
2024-10-14 11:31:43,989 scheduler_run_core   INFO     | Now trying to schedule J53
2024-10-14 11:31:43,989 scheduler_run_core   INFO     |     Queue status waiting_resources
2024-10-14 11:31:43,989 scheduler_run_core   INFO     |     Queue message GPU not available
2024-10-14 11:31:43,990 scheduler_run_core   INFO     | Finished

sparcuser@cryosparc-worker2:~$ cryosparcw gpulist
  Detected 8 CUDA devices.

   id           pci-bus  name
   ---------------------------------------------------------------
       0                 1  NVIDIA RTX 6000 Ada Generation                                                                
       1                33  NVIDIA RTX 6000 Ada Generation                                                                
       2                65  NVIDIA RTX 6000 Ada Generation                                                                
       3                97  NVIDIA RTX 6000 Ada Generation                                                                
       4               129  NVIDIA RTX 6000 Ada Generation                                                                
       5               161  NVIDIA RTX 6000 Ada Generation                                                                
       6               193  NVIDIA RTX 6000 Ada Generation                                                                
       7               225  NVIDIA RTX 6000 Ada Generation                                                                
   ---------------------------------------------------------------
sparcuser@cryosparc-worker2:~$ 

Welcome to the forum @widu, and thanks for posting information relevant to your question. Please can you additionally post the output of the command (on cryosparc-master):

cryosparcm cli "get_scheduler_targets()"

Thanks for looking into my problem @wtempel.

sparcuser@cryosparc-master:/root$ cryosparcm cli "get_scheduler_targets()"
[{'cache_path': '/home/sparcuser/ssd-cache',
  'cache_quota_mb': None, 
  'cache_reserve_mb': 10000, 
  'desc': None, 
  'gpus': [
    {'id': 0, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 1, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 2, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 3, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 4, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 5, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 6, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 7, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}], 
  'hostname': 'cryosparc-worker1', 
  'lane': 'worker1', 
  'monitor_port': None, 
  'name': 'cryosparc-worker1', 
  'resource_fixed': {'SSD': True}, 
  'resource_slots': {
    'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 
    'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 
    'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192]}, 
  'ssh_str': 'sparcuser@cryosparc-worker1', 
  'title': 'Worker node cryosparc-worker1', 
  'type': 'node', 
  'worker_bin_path': '/home/sparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'},
 {'cache_path': '/home/sparcuser/ssd-cache', 
  'cache_quota_mb': None, 
  'cache_reserve_mb': 10000, 
  'desc': None, 
  'gpus': [
    {'id': 0, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 1, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 2, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 3, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 4, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 5, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 6, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}, 
    {'id': 7, 'mem': 51002867712, 'name': 'NVIDIA RTX 6000 Ada Generation'}], 
  'hostname': 'cryosparc-worker2', 
  'lane': 'worker2', 
  'monitor_port': None, 
  'name': 'cryosparc-worker2', 
  'resource_fixed': {'SSD': True}, 
  'resource_slots': {
    'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], 
    'GPU': [0, 1, 2, 3, 4, 5, 6, 7], 
    'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192]}, 
  'ssh_str': 'sparcuser@cryosparc-worker2', 
  'title': 'Worker node cryosparc-worker2', 
  'type': 'node', 
  'worker_bin_path': '/home/sparcuser/cryosparc/cryosparc_worker/bin/cryosparcw'}]
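
In case it is useful, the two target records can also be compared programmatically, to rule out any subtle difference beyond the host-specific fields. A rough sketch of my own (not a CryoSPARC tool): the output pasted above looks like a Python literal, so ast.literal_eval should be able to parse it.

cryosparcm cli "get_scheduler_targets()" > /tmp/targets.txt

python3 - <<'EOF'
import ast

# parse the Python-literal output of get_scheduler_targets()
targets = ast.literal_eval(open('/tmp/targets.txt').read())
worker1, worker2 = targets[0], targets[1]

# fields expected to differ between the two nodes
ignore = {'hostname', 'name', 'ssh_str', 'title', 'lane'}

for key in sorted(set(worker1) | set(worker2)):
    if key in ignore:
        continue
    if worker1.get(key) != worker2.get(key):
        print('difference in', key)
        print('  worker1:', worker1.get(key))
        print('  worker2:', worker2.get(key))
EOF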

Thanks for posting the target info.

  1. Does the problem persist after running
    cryosparcm restart? (Caution: Run the command when no CryoSPARC jobs are running, as the restart would disrupt running CryoSPARC jobs).
  2. Do you get GPU not available for other job types also?
  3. Please can you post the output of the following commands
    csprojectid=P99 # replace with actual project ID
    csjobid=J199 # replace with id of failed GPU test job
    cryosparcm eventlog $csprojectid $csjobid
    cryosparcm joblog $csprojectid $csjobid | tail -n 20
    ssh sparcuser@cryosparc-worker2 "hostname && nvidia-smi"
    

Does the problem persist after running cryosparcm restart?

yes.

Do you get GPU not available for other job types also?

yes

a) when queueing to a specific GPU:

"
sparcuser@cryosparc-master:~$ cryosparcm eventlog $csprojectid $csjobid
[Mon, 21 Oct 2024 14:08:36 GMT]  License is valid.
[Mon, 21 Oct 2024 14:08:36 GMT]  Launching job on lane worker2 target cryosparc-worker2 ...
[Mon, 21 Oct 2024 14:08:36 GMT]  Running job on remote worker node hostname cryosparc-worker2
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] Job J53 Started
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] Master running v4.6.0, worker running v4.6.0
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] Working in directory: /home/sparcuser/homes/widu/CS-widustests/J53
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] Running on lane worker2
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] Resources allocated:
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB]   Worker:  cryosparc-worker2
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB]   CPU   :  [48]
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB]   GPU   :  [0]
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB]   RAM   :  [16]
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB]   SSD   :  True
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] --------------------------------------------------------------
[Mon, 21 Oct 2024 14:08:45 GMT] [CPU RAM used: 88 MB] Importing job module for job type worker_gpu_test...
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 225 MB] Job ready to run
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 225 MB] ***************************************************************
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] Obtaining GPU info via `nvidia-smi`...
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:01:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :33
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:21:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :32
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:41:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :33
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:61:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :31
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:81:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :33
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:A1:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :32
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:C1:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :33
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB] NVIDIA RTX 6000 Ada Generation @ 00000000:E1:00.0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     driver_version                :550.90.07
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     persistence_mode              :Enabled
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     power_limit                   :300.00
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     sw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     hw_power_limit                :Not Active
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     compute_mode                  :Default
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     max_pcie_link_gen             :4
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     current_pcie_link_gen         :1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     temperature                   :31
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     gpu_utilization               :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 256 MB]     memory_utilization            :0
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 347 MB] Starting GPU test on: NVIDIA RTX 6000 Ada Generation @ 1
[Mon, 21 Oct 2024 14:08:52 GMT] [CPU RAM used: 347 MB]     With CUDA Toolkit version: 11.8
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] Finished GPU test in 0.784s
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] Tensorflow test skipped.
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] PyTorch test skipped.
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] --------------------------------------------------------------
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] Compiling job outputs...
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] Updating job size...
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] Exporting job and creating csg files...
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] ***************************************************************
[Mon, 21 Oct 2024 14:08:53 GMT] [CPU RAM used: 397 MB] Job complete. Total time 1.82s

sparcuser@cryosparc-master:~$ cryosparcm joblog $csprojectid $csjobid | tail -n 20
instance_testing.run cryosparc_compute.jobs.jobregister
/home/sparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/sparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/sparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/sparcuser/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
MONITOR PROCESS PID 4059
========= monitor process now waiting for main process
========= sending heartbeat at 2024-10-21 14:08:48.587815
***************************************************************
***************************************************************
========= main process now complete at 2024-10-21 14:08:53.853621
Total: 2.278s
  MAIN THREAD:

========= main process now complete at 2024-10-21 14:08:58.604742.
========= monitor process now complete at 2024-10-21 14:08:58.610132.

sparcuser@cryosparc-master:~$ ssh sparcuser@cryosparc-worker2 "hostname && nvidia-smi"
cryosparc-worker2
Mon Oct 21 14:13:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   34C    P8             22W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:21:00.0 Off |                  Off |
| 30%   32C    P8             23W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:41:00.0 Off |                  Off |
| 30%   33C    P8             21W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:61:00.0 Off |                  Off |
| 30%   32C    P8             22W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:81:00.0 Off |                  Off |
| 30%   34C    P8             21W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:A1:00.0 Off |                  Off |
| 30%   33C    P8             34W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:C1:00.0 Off |                  Off |
| 30%   33C    P8             15W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:E1:00.0 Off |                  Off |
| 30%   31C    P8             17W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

b) when queueing to lane worker2:

sparcuser@cryosparc-master:~$ cryosparcm eventlog $csprojectid $csjobid
sparcuser@cryosparc-master:~$ cryosparcm joblog $csprojectid $csjobid | tail -n 20
/home/sparcuser/homes/widu/CS-widustests/J53/job.log: No such file or directory
sparcuser@cryosparc-master:~$ ssh sparcuser@cryosparc-worker2 "hostname && nvidia-smi" -> see above

What I notice when queueing the job to lane worker2:

  • the message “GPU not available” appears instantly
  • on worker2, there’s nothing in /var/log/auth.log - no sign of any login from the master (a way to watch for this while re-queuing is sketched below)
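
If useful: a simple way to watch for an incoming connection attempt from the master while re-queuing the job, using standard tools (assuming read access to the log):

sudo tail -f /var/log/auth.log | grep -i sshd    # run on cryosparc-worker2 while re-queuing the job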

Thanks @widu for posting this information. Please can you email us the tgz file that is created when you run the command
cryosparcm snaplogs. I will let you know our email address via a direct message.

I have just sent the data. Many thanks for your help.

Please can you try the following sequence of commands and actions (on the CryoSPARC master server) and post the outputs:

  1. run the commands
    cryosparcm icli # enter the cryosparc interactive cli
    import datetime
    list(db.jobs.find({'status': {'$in': ['launched','started','running', 'waiting']}, 'deleted': False}, {'project_uid': 1, 'uid': 1, 'resources_allocated': 1}))
    datetime.datetime.now()
    # leave icli open during next step
    
  2. Queue a GPU-accelerated job to worker2
  3. inside the interactive cli from the first step, run (after replacing P99, J199 with the actual project and job IDs, respectively; a consolidated copy-paste version of these queries is sketched after this list):
    datetime.datetime.now()
    cli.get_job('P99', 'J199', 'instance_information', 'params_spec', 'job_type', 'status', 'version')
    list(db.jobs.find({'status': {'$in': ['launched','started','running', 'waiting']}, 'deleted': False}, {'project_uid': 1, 'uid': 1, 'resources_allocated': 1}))
    # record outputs, then
    exit()
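
For convenience, the queries from steps 1 and 3 as one copy-paste block (to be run inside cryosparcm icli, where the db and cli objects are already available; replace P99/J199 with the actual project and job IDs):

import datetime

# jobs the scheduler currently considers active, with their resource allocations
query = {'status': {'$in': ['launched', 'started', 'running', 'waiting']}, 'deleted': False}
fields = {'project_uid': 1, 'uid': 1, 'resources_allocated': 1}

print(list(db.jobs.find(query, fields)))
print(datetime.datetime.now())

# ... queue the GPU-accelerated job to lane worker2, then:

print(datetime.datetime.now())
print(cli.get_job('P99', 'J199', 'instance_information', 'params_spec', 'job_type', 'status', 'version'))
print(list(db.jobs.find(query, fields)))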