Job attempts to run on GPU not in CUDA_VISIBLE_DEVICES

Hello,

Because of insufficient RAM, I tried to limit the number of GPUs available to CryoSPARC:
echo $CUDA_VISIBLE_DEVICES
0,1
However, if I submit a third job that requires a GPU, it starts on GPU 2 and then fails:
GPU : [2]
[CPU: 1.51 GB Avail: 364.38 GB]
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 331, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
File "/home/sparc/cryosparc_worker/cryosparc_compute/alignment.py", line 113, in align_symmetry
cuda_core.initialize([cuda_dev])
File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 34, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal
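
For reference, a minimal sketch that reproduces this outside CryoSPARC, assuming pycuda is importable in the shell's python (e.g. the cryosparc_worker environment): with CUDA_VISIBLE_DEVICES=0,1 the driver renumbers the two visible GPUs as ordinals 0 and 1, so ordinal 2 does not exist.

CUDA_VISIBLE_DEVICES=0,1 python -c '
import pycuda.driver as cuda
cuda.init()
print(cuda.Device.count())  # prints 2: only two ordinals exist now
cuda.Device(2)              # raises pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal
'
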
Please advise.
Thank you,
Yehuda

Please can you:

  • describe how and where you set CUDA_VISIBLE_DEVICES
  • post the output of the command
    cryosparcm cli "get_scheduler_targets()"
    

export CUDA_VISIBLE_DEVICES=0,1
restarted CryoSPARC
cryosparcm cli "get_scheduler_targets()"

[
    {
        "cache_path": "/scratch/cryosparc_cache",
        "cache_quota_mb": None,
        "cache_reserve_mb": 10000,
        "desc": None,
        "gpus": [
            {"id": 0, "mem": 11546394624, "name": "NVIDIA GeForce RTX 2080 Ti"},
            {"id": 1, "mem": 11546394624, "name": "NVIDIA GeForce RTX 2080 Ti"}
        ],
        "hostname": "grizzly.mskcc.org",
        "lane": "default",
        "monitor_port": None,
        "name": "grizzly.mskcc.org",
        "resource_fixed": {"SSD": True},
        "resource_slots": {
            "CPU": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55],
            "GPU": [0, 1, 2, 3, 4, 5, 6, 7],
            "RAM": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
        },
        "ssh_str": "sparc@grizzly.mskcc.org",
        "title": "Worker node grizzly.mskcc.org",
        "type": "node",
        "worker_bin_path": "/home/sparc/cryosparc_worker/bin/cryosparcw"
    }
]

To which file did you add this statement?
What is the output of the command
nvidia-smi -L?

nvidia-smi -L

GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-11d542b8-e90f-0d88-5441-32d85172ee20)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-07cbf185-4454-5490-b0f4-c2e6375dd487)
GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-be7229fd-47e4-0fbd-11ac-f7d653574f73)
GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-8e8f595e-b195-029f-5aef-470b09eef5b6)
GPU 4: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-677dd919-2e78-e1b0-a0ce-9a309a8fbef4)
GPU 5: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-88d45ede-5298-f72b-e052-d0863d15a5dd)
GPU 6: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-1c1cec2b-c045-1a91-9dec-2692151ed7c4)
GPU 7: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-af4bff55-57e2-931e-4f6e-a8f071b9bebc)

export CUDA_VISIBLE_DEVICES=0,1 was entered on the command line. CryoSPARC was started and restarted from the same terminal window.

CryoSPARC may try to run a job on any GPU in the "GPU" list of the scheduler target, and the job may subsequently fail because the selected GPU is not a "visible" CUDA device. Instead of specifying CUDA_VISIBLE_DEVICES, you may want to shorten the list of GPU devices in the target configuration. You can do this with

cryosparc_worker/bin/cryosparcw connect --master $master_hostname --port $base_port --worker $(hostname -f) --ssdpath /scratch/cryosparc_cache --gpus 0,1 --update

(docs). The (properly modified) command needs to be run on the GPU worker, in a shell where CUDA_VISIBLE_DEVICES is not defined. Please substitute the actual values for $master_hostname ($CRYOSPARC_MASTER_HOSTNAME in cryosparc_master/config.sh) and $base_port, and confirm that I did not miss anything while transcribing information from your target configuration.
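
For example, a sketch with the values from your posted target configuration filled in; <master_hostname> and <base_port> below are placeholders to be replaced with the actual values from cryosparc_master/config.sh:

# run on grizzly.mskcc.org, in a shell where CUDA_VISIBLE_DEVICES is not set
unset CUDA_VISIBLE_DEVICES
/home/sparc/cryosparc_worker/bin/cryosparcw connect \
    --master <master_hostname> \
    --port <base_port> \
    --worker grizzly.mskcc.org \
    --ssdpath /scratch/cryosparc_cache \
    --gpus 0,1 \
    --update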


After reconnecting the worker I got:
Final configuration for grizzly.mskcc.org
cache_path : /scratch/cryosparc_cache
cache_quota_mb : None
cache_reserve_mb : 10000
desc : None
gpus : [{'id': 0, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 4, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 5, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 6, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 7, 'mem': 11546394624, 'name': 'NVIDIA GeForce RTX 2080 Ti'}]
hostname : grizzly.mskcc.org
lane : default
monitor_port : None
name : grizzly.mskcc.org
resource_fixed : {'SSD': True}
resource_slots : {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], 'GPU': [0, 1], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
ssh_str : sparc@grizzly.mskcc.org
title : Worker node grizzly.mskcc.org
type : node
worker_bin_path : /home/sparc/cryosparc_worker/bin/cryosparcw

Is this correct?

Also, all 8 GPUs are listed in the job builder.

An update:
After reconnecting the worker, two GPUs are in use. A third job was queued until one of the running jobs was done. However, a job cloned from one of the completed jobs is running much slower.
J90: Total time 3175.85s
J94: [2023-10-19 9:36:51.18] Launching job on lane default target grizzly.mskcc.org
[2023-10-19 11:14:57.77] [CPU: 6.04 GB Avail: 366.76 GB]
No output masks, groups or particles after two hours. What shall I look at?
Thank you,

Yehuda

Please can you post a screenshot, and updated output from the command

cryosparcm cli "get_scheduler_targets()"

Was this not the intended behavior?

Please compare the timings of the completed subroutines between the two jobs and post any discrepancies, along with details like job type and, if applicable, cache settings.
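
If it helps, one way to collect those timings (assuming the cryosparcm eventlog command is available in your CryoSPARC version; PX below is a placeholder for the actual project ID) is to dump the event log of each job and compare the timestamps of the corresponding stages:

cryosparcm eventlog PX J90 > J90_events.txt
cryosparcm eventlog PX J94 > J94_events.txt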

Yes, this was the intended behavior. I opened a new topic. Thank you.