CryoSPARC freezing at CUDA init

Hi all,

I’m not a researcher, but I’m trying to install CryoSPARC on our cluster for the research team.

I followed the installation guide with no apparent issues.

However, jobs can’t seem to get past the CUDA init call.
Once a job gets stuck, the child processes spawned by CryoSPARC become zombies: even after killing their parent process I can’t get rid of the zombie children, and I have to reboot the compute node.
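A quick way to spot these stuck processes on the node is something like the following (standard procps, nothing CryoSPARC-specific):

  # list processes in uninterruptible (D) or zombie (Z) state together with
  # the kernel function they are waiting in
  ps -eo pid,ppid,stat,wchan:32,cmd | awk 'NR==1 || $3 ~ /^[DZ]/'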

For instance, I’m trying the tutorial on the EMPIAR dataset available on the website.

Where do I start debugging my installation?

Here is the output of nvidia-smi

[root@gn01 ~]# nvidia-smi
Thu Sep 26 11:36:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           On  | 00000000:21:00.0 Off |                    0 |
| N/A   32C    P0              25W / 250W |     10MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           On  | 00000000:E2:00.0 Off |                    0 |
| N/A   34C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     17520      C   python                                        8MiB |
+---------------------------------------------------------------------------------------+

Notice the 8 MiB python process; its PID (17520) matches the “Calling CUDA init from 17520” line at the end of the job log below.
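In case it helps, that leftover process can be inspected with something like the following (standard procps tools, run as root on the node; nothing CryoSPARC-specific):

  # state (D = uninterruptible sleep, Z = zombie), kernel wait channel and age
  # of the leftover python process from the nvidia-smi output above
  ps -o pid,ppid,stat,wchan:32,etime,cmd -p 17520

  # kernel stack of the process (root only), to see where it is blocked
  cat /proc/17520/stack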

Here is the output of cryosparcw gpulist:

  Detected 2 CUDA devices.

   id           pci-bus  name
   ---------------------------------------------------------------
       0                33  Tesla P100-PCIE-16GB                                                                
       1               226  Tesla P100-PCIE-16GB                                                                
   ---------------------------------------------------------------

Here is the log output of the job:


====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash

#SBATCH --job-name=cryosparc_P2_J2
#SBATCH --output=/mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/P2_J2_slurm.out
#SBATCH --error=/mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/P2_J2_slurm.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --reservation gpu
#SBATCH --partition gpu

ulimit -s unlimited
export CRYOSPARC_TIFF_IO_SHM=false
export CUDA_VISIBLE_DEVICES=1

module load ohpc

source ~/.bashrc

nvidia-smi

srun /mnt/beegfs_compat/home/cryosparcuser/cryosparc_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname headnode.mendeleyev.abtlus.org.br --master_command_core_port 61002 > /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/queue_sub_script.sh

-------- Cluster Job ID: 
65359

-------- Queued on cluster at 2024-09-26 11:29:33.921784

-------- Cluster job status at 2024-09-26 11:29:34.461618 (0 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
             65359       gpu cryospar cryospar  R       0:00      1 gn01 

[CPU:   79.5 MB]
Job J2 Started

[CPU:   79.7 MB]
Master running v4.6.0, worker running v4.6.0

[CPU:   79.9 MB]
Working in directory: /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2

[CPU:   79.9 MB]
Running on lane cluster_gpu_one

[CPU:   79.9 MB]
Resources allocated: 

[CPU:   79.9 MB]
  Worker:  cluster_gpu_one

[CPU:   79.9 MB]
  CPU   :  [0, 1, 2, 3, 4, 5]

[CPU:   79.9 MB]
  GPU   :  [0]

[CPU:   79.9 MB]
  RAM   :  [0, 1]

[CPU:   79.9 MB]
  SSD   :  False

[CPU:   79.9 MB]
--------------------------------------------------------------

[CPU:   79.9 MB]
Importing job module for job type patch_motion_correction_multi...

[CPU:  209.6 MB]
Job ready to run

[CPU:  209.7 MB]
***************************************************************

[CPU:  209.9 MB]
Job will process this many movies:  20

[CPU:  209.9 MB]
Job will output denoiser training data for this many movies:  20

[CPU:  209.9 MB]
Random seed: 155870792

[CPU:  209.9 MB]
parent process is 17495

[CPU:  161.5 MB]
Calling CUDA init from 17520

Welcome to the forum @bfocassio.
If your cluster is limiting a job’s access to GPU devices using cgroups, as I would recommend, then the line

export CUDA_VISIBLE_DEVICES=1

in your submission script would refer to a (virtually) non-existent device, given the request for a single device (#SBATCH --gres=gpu:1).

In that case, export CUDA_VISIBLE_DEVICES=0 might work, but it might be better to omit all CUDA_VISIBLE_DEVICES definitions from the script template (see CUDA_ERROR_NO_DEVICE - but only when AF2 is running! - #9 by wtempel).
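For illustration only, a sketch of what the template could then look like, based on the script you posted, with the CUDA_VISIBLE_DEVICES line removed, two diagnostic lines added, and the srun line abbreviated:

  #!/usr/bin/env bash
  #SBATCH --job-name=cryosparc_P2_J2
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=6
  #SBATCH --gres=gpu:1
  #SBATCH --gres-flags=enforce-binding
  #SBATCH --reservation gpu
  #SBATCH --partition gpu
  # (--output and --error directives as in your original script)

  ulimit -s unlimited
  export CRYOSPARC_TIFF_IO_SHM=false
  # no CUDA_VISIBLE_DEVICES here: with cgroup-confined GPUs the job only
  # sees the device(s) Slurm allocated to it

  module load ohpc
  source ~/.bashrc

  # diagnostics: show what the job actually sees inside its cgroup
  echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
  nvidia-smi -L

  srun /mnt/beegfs_compat/home/cryosparcuser/cryosparc_worker/bin/cryosparcw run ...   # rest of the line as in your original

With cgroup confinement, the single allocated GPU should then show up to the job as device 0, regardless of its physical index on the node.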

Thank you for answering.

I did try those changes and it didn’t work. In fact, all applications that use GPUs show the same problem.
We have been tracing it to some odd interaction between our current kernel (Oracle Linux 8.3, 5.4.17-2036.104.5.el8uek.x86_64) and recent NVIDIA drivers: no driver version above 520 works.
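For reference, generic checks along these lines show the driver/kernel side of things on the node (standard tools, nothing CryoSPARC-specific):

  # driver version as seen by the running kernel, and the kernel the module was built for
  cat /proc/driver/nvidia/version
  modinfo nvidia | grep -E '^(version|vermagic):'

  # kernel ring buffer: NVRM / Xid messages logged around a hanging CUDA init
  dmesg -T | grep -Ei 'nvrm|xid'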

Our latest working driver version is 465.19.01 with CUDA 11.3. What is the latest CryoSPARC version compatible with these?