Exclusive mode GPU issue

Hello all, a new question about an old error: we get pycuda._driver.LogicError: cuDevicePrimaryCtxRetain failed: invalid device ordinal when running GPUs in exclusive mode. We would love to discuss our setup and what we could do to overcome this issue. This happens when queuing to a lane, but not when running on a specific GPU.

Hi @CryoEM2,

When queueing to a lane, CryoSPARC doesn’t queue jobs to GPUs that already have other jobs running on them. Does this error happen when you have a non-CryoSPARC application running on the GPU before the CryoSPARC job is run?

They are otherwise idle and dedicated to CryoSPARC.

Can I cut out the (me)ddleman and take it offline to redacted@domain with my support team?

Hey @CryoEM2,

To ensure other users who may be encountering the same issue can search this thread in the future, it would be best to have your team post here. Thanks!

Thanks, I think I can explain a bit:

We have 4 GPUs in a lane (Amazon G5).
We were seeing multiple processes piled onto one GPU at the same time while the others sat unused (I have a picture), so we put the GPUs into “exclusive” mode. This resolved the problem, and there seemed to be no issues for a few months: 4 CS jobs could run on this lane. Recently, we’ve been getting the above-mentioned error for some jobs and/or job types (we haven’t tracked down which), and they can’t run on this lane, I think only when another job is already running. Requeuing the failed job to a different lane works fine, so it is just this lane that is the problem. As I understand it, the only change we’ve made is CryoSPARC updates.
We are now going to remove exclusive mode and see if the jobs are distributed across available GPUs as intended. Stay tuned, I will update with results.
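
For anyone wanting to confirm which GPUs on a worker are currently in exclusive mode, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and that its compute_mode query field prints values such as "Default" or "Exclusive_Process"; the exact strings may differ between driver versions, so the check only looks for the word "exclusive". The function names and the sample text are hypothetical and not part of CryoSPARC.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' lines into a {gpu_index: mode} dict."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def exclusive_gpus(modes):
    """Return the sorted GPU indices whose mode mentions 'exclusive'."""
    return sorted(i for i, m in modes.items() if "exclusive" in m.lower())


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of every GPU on this host."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_modes(result.stdout)


# Hypothetical sample text, used so the sketch runs without a GPU.
# On a real worker, call query_compute_modes() instead.
sample = "0, Exclusive_Process\n1, Exclusive_Process\n2, Default\n3, Default\n"
modes = parse_compute_modes(sample)
print("GPUs in exclusive mode:", exclusive_gpus(modes))
```

If any GPUs are listed, the compute mode can usually be reset with nvidia-smi -c DEFAULT (optionally with -i to select one GPU), which typically requires root privileges and may not persist across reboots.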

What are the output of
cryosparcm cli "get_scheduler_targets()"
on this CryoSPARC instance,
and the name of the lane where cuDevicePrimaryCtxRetain errors occur?

It’s g5-worker-2, the third of the three lanes listed below.

[
    {
        "cache_path": "/scratch",
        "cache_quota_mb": null,
        "cache_reserve_mb": 100,
        "desc": null,
        "gpus": [
            {
                "id": 0,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 1,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 2,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 3,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 4,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 5,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 6,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 7,
                "mem": 15843721216,
                "name": "Tesla T4"
            }
        ],
        "hostname": "",
        "lane": "default",
        "monitor_port": null,
        "name": "",
        "resource_fixed": {
            "SSD": true
        },
        "resource_slots": {
            "CPU": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23,
                24,
                25,
                26,
                27,
                28,
                29,
                30,
                31,
                32,
                33,
                34,
                35,
                36,
                37,
                38,
                39,
                40,
                41,
                42,
                43,
                44,
                45,
                46,
                47,
                48,
                49,
                50,
                51,
                52,
                53,
                54,
                55,
                56,
                57,
                58,
                59,
                60,
                61,
                62,
                63,
                64,
                65,
                66,
                67,
                68,
                69,
                70,
                71,
                72,
                73,
                74,
                75,
                76,
                77,
                78,
                79,
                80,
                81,
                82,
                83,
                84,
                85,
                86,
                87,
                88,
                89,
                90,
                91,
                92,
                93,
                94,
                95
            ],
            "GPU": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7
            ],
            "RAM": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23,
                24,
                25,
                26,
                27,
                28,
                29,
                30,
                31,
                32,
                33,
                34,
                35,
                36,
                37,
                38,
                39,
                40,
                41,
                42,
                43,
                44,
                45,
                46,
                47
            ]
        },
        "ssh_str": "",
        "title": "",
        "type": "node",
        "worker_bin_path": "/cryosparc/cryosparc_worker/bin/cryosparcw"
    },
    {
        "cache_path": null,
        "cache_quota_mb": null,
        "cache_reserve_mb": 10000,
        "desc": null,
        "gpus": [
            {
                "id": 0,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 1,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 2,
                "mem": 15843721216,
                "name": "Tesla T4"
            },
            {
                "id": 3,
                "mem": 15843721216,
                "name": "Tesla T4"
            }
        ],
        "hostname": "",
        "lane": "g5-worker-1",
        "monitor_port": null,
        "name": "",
        "resource_fixed": {
            "SSD": false
        },
        "resource_slots": {
            "CPU": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23,
                24,
                25,
                26,
                27,
                28,
                29,
                30,
                31,
                32,
                33,
                34,
                35,
                36,
                37,
                38,
                39,
                40,
                41,
                42,
                43,
                44,
                45,
                46,
                47
            ],
            "GPU": [
                0,
                1,
                2,
                3
            ],
            "RAM": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23
            ]
        },
        "ssh_str": "",
        "title": "",
        "type": "node",
        "worker_bin_path": "/cryosparc/cryosparc_worker/bin/cryosparcw"
    },
    {
        "cache_path": null,
        "cache_quota_mb": null,
        "cache_reserve_mb": 10000,
        "desc": null,
        "gpus": [
            {
                "id": 0,
                "mem": 23836098560,
                "name": "A10G"
            },
            {
                "id": 1,
                "mem": 23836098560,
                "name": "A10G"
            },
            {
                "id": 2,
                "mem": 23836098560,
                "name": "A10G"
            },
            {
                "id": 3,
                "mem": 23836098560,
                "name": "A10G"
            }
        ],
        "hostname": "",
        "lane": "g5-worker-2",
        "monitor_port": null,
        "name": "",
        "resource_fixed": {
            "SSD": false
        },
        "resource_slots": {
            "CPU": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23,
                24,
                25,
                26,
                27,
                28,
                29,
                30,
                31,
                32,
                33,
                34,
                35,
                36,
                37,
                38,
                39,
                40,
                41,
                42,
                43,
                44,
                45,
                46,
                47,
                48,
                49,
                50,
                51,
                52,
                53,
                54,
                55,
                56,
                57,
                58,
                59,
                60,
                61,
                62,
                63,
                64,
                65,
                66,
                67,
                68,
                69,
                70,
                71,
                72,
                73,
                74,
                75,
                76,
                77,
                78,
                79,
                80,
                81,
                82,
                83,
                84,
                85,
                86,
                87,
                88,
                89,
                90,
                91,
                92,
                93,
                94,
                95
            ],
            "GPU": [
                0,
                1,
                2,
                3
            ],
            "RAM": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18,
                19,
                20,
                21,
                22,
                23,
                24,
                25,
                26,
                27,
                28,
                29,
                30,
                31,
                32,
                33,
                34,
                35,
                36,
                37,
                38,
                39,
                40,
                41,
                42,
                43,
                44,
                45,
                46,
                47
            ]
        },
        "ssh_str": "",
        "title": "",
        "type": "node",
        "worker_bin_path": "/cryosparc/cryosparc_worker/bin/cryosparcw"
    }
]
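
Output this long is easier to compare lane by lane once it is condensed. The following is a rough Python sketch that summarizes each target's GPUs and flags targets whose "gpus" ids do not match their "resource_slots" GPU list. It assumes the output has been saved as valid JSON, as formatted above; summarize_targets and mismatched are hypothetical helper names, and the sample is an abbreviated, illustrative version of the third entry. It is a sanity check on the configuration only and does not by itself explain the invalid device ordinal error.

```python
import json


def summarize_targets(targets):
    """Condense each scheduler target to its lane name and GPU details."""
    summary = []
    for target in targets:
        gpus = target.get("gpus") or []
        slots = (target.get("resource_slots") or {}).get("GPU", [])
        summary.append({
            "lane": target.get("lane"),
            "gpu_ids": [gpu["id"] for gpu in gpus],
            "gpu_names": sorted({gpu["name"] for gpu in gpus}),
            "gpu_slots": list(slots),
        })
    return summary


def mismatched(summary):
    """Return lanes whose listed GPU ids differ from their GPU slots."""
    return [
        entry["lane"]
        for entry in summary
        if sorted(entry["gpu_ids"]) != sorted(entry["gpu_slots"])
    ]


# Abbreviated, illustrative sample shaped like the output above.
sample = """
[
  {"lane": "g5-worker-2", "hostname": "",
   "gpus": [{"id": 0, "mem": 23836098560, "name": "A10G"},
            {"id": 1, "mem": 23836098560, "name": "A10G"}],
   "resource_slots": {"GPU": [0, 1]}}
]
"""

summary = summarize_targets(json.loads(sample))
for entry in summary:
    print(entry["lane"], entry["gpu_names"], entry["gpu_ids"])
print("Lanes with mismatched GPU slots:", mismatched(summary))
```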

If you still run into the issue where multiple jobs get scheduled to the same GPU, can you immediately send a system error report?

We will turn off exclusive mode, watch whether this problem recurs, and report back.

So far, since turning off exclusive mode, jobs have returned to queuing normally: no errors, and no multiple jobs running on one GPU simultaneously. Thanks!
