Blackwell GPU worker stuck at “Launched”

Hi CryoSPARC team,

I’m trying to add a new NVIDIA Blackwell GPU worker to an existing cluster running CryoSPARC v5. Although the GPU is detected and the worker connects correctly, no CryoSPARC job progresses beyond the “launched” stage on this worker. On the other, non-Blackwell workers, jobs run normally.

The cluster setup:

Master node: NVIDIA driver 575.xx
Workers 1 and 2 (non-Blackwell): NVIDIA driver 575.xx → work fine, run jobs normally
Worker 3 (Blackwell GPU): NVIDIA driver 580.xx → visible to CryoSPARC, GPU detected, but all jobs hang at “launched.”

What I observe is that the Pixi environment under cryosparc_worker/.pixi does not build the correct versions of Torch / CUDA.

So I attempted to reinstall CUDA 13 and PyTorch (e.g., torch 2.12.0 + cu130 wheels), but these also fail to launch kernels on the Blackwell GPU from within the CryoSPARC environment.

Any guidance or updated installation instructions would be helpful.
Thank you very much for your time and support.

-bharat

Welcome @bharat to the forum.

A (non-exhaustive) checklist for worker 3 related to this issue:

  1. Is the shared project directory mounted under the same path as on the master?
  2. Is the project directory writeable?
  3. Does the Linux account associated with the CryoSPARC instance exist, and does it have the same numeric ID as on the master?
  4. Is the cryosparc_worker/ directory available?
  5. Is the cryosparc_worker/ directory fully installed?
  6. Do the contents of the version (and, if existent, patch) files inside cryosparc_worker/ match their “siblings” inside cryosparc_master/?
  7. If worker 3 is of type node (not: cluster), can it be accessed via ssh from the master, requiring neither password nor manual host key confirmation?
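
Items 1 to 3 can be checked with a small script that is run under the CryoSPARC Linux account, once on the master and once on worker 3, and the two outputs compared. This is only a sketch: the function name `check_project_dir` and the report keys are made up for illustration and are not part of CryoSPARC, and the path passed in is assumed to be the project directory as the master reports it.

```python
import os
import sys
import tempfile


def check_project_dir(project_dir):
    """Report how project_dir looks from the host this script runs on."""
    report = {
        "path": project_dir,
        "exists": os.path.isdir(project_dir),  # item 1: same path is mounted here
        "writable": False,                     # item 2: set below by a real write
        "uid": os.getuid(),                    # item 3: compare with the master's value
    }
    if report["exists"]:
        try:
            # Create and remove a real file instead of trusting os.access(),
            # which can give misleading answers on network file systems.
            with tempfile.NamedTemporaryFile(dir=project_dir, prefix=".write_test_"):
                pass
            report["writable"] = True
        except OSError:
            report["writable"] = False
    return report


if __name__ == "__main__":
    # Hypothetical usage: pass the project directory as the first argument;
    # falls back to the current directory when none is given.
    target = sys.argv[1] if len(sys.argv) > 1 else os.getcwd()
    for key, value in check_project_dir(target).items():
        print(f"{key}: {value}")
```

If `exists` or `writable` is False on worker 3 but True on the master, or the `uid` values differ, the corresponding checklist item has failed.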

If all items of the checklist pass, but newly queued jobs on worker 3 are still stuck at Launched, please post

  1. the outputs of these commands (run on the master, replacing P99 with the relevant project ID):
    cryosparcm resources
    cryosparcm cli "api.projects.find_one('P99').project_dir"
    
  2. the ID of a job in that project that’s stuck in Launched state after queuing to worker 3

Thank you. Your questions helped. The project directory was missing write permissions on worker 3.

Thanks @bharat for the update.