Hi all,
I’m not a researcher, but I’m trying to install CryoSPARC on our cluster for the research team.
I followed the installation guide with no apparent issues.
However, jobs never seem to get past the CUDA init call.
Once a job gets stuck, the processes created by CryoSPARC turn into zombies: even after I kill the parent process, the child process cannot be killed, and I have to reboot the compute node.
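One thing that may be worth checking first is the real state of the stuck process. A process that survives kill -9 is usually in uninterruptible sleep (state D, blocked inside the kernel) rather than a true zombie (state Z). This is a minimal sketch, not something from the CryoSPARC docs; the function name proc_state is just an illustrative helper.

```shell
#!/usr/bin/env bash
# Print the kernel-reported state of a process from /proc/<pid>/status.
#   Z = zombie (already dead, waiting to be reaped by its parent)
#   D = uninterruptible sleep (blocked in the kernel; signals wait until it returns)
proc_state() {
    local pid="$1"
    # The line looks like "State:<TAB>S (sleeping)"; print what follows the tab.
    awk -F'\t' '/^State:/ {print $2}' "/proc/${pid}/status"
}

# Harmless demo: inspect the current shell.
proc_state "$$"
```

On the compute node this would be called with the PID of the stuck child, for example proc_state 17520.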
For instance, I’m running the tutorial from the website on the EMPIAR dataset.
Where do I start debugging my installation?
Here is the output of nvidia-smi:
[root@gn01 ~]# nvidia-smi
Thu Sep 26 11:36:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           On  | 00000000:21:00.0 Off |                    0 |
| N/A   32C    P0              25W / 250W |     10MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           On  | 00000000:E2:00.0 Off |                    0 |
| N/A   34C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     17520      C   python                                        8MiB |
+---------------------------------------------------------------------------------------+
Notice the 8 MiB python process (PID 17520) on GPU 0; it is the same PID that calls CUDA init in the job log below.
Here is the output of cryosparcw gpulist:
  Detected 2 CUDA devices.
   id           pci-bus  name
   ---------------------------------------------------------------
       0                33  Tesla P100-PCIE-16GB                                                                
       1               226  Tesla P100-PCIE-16GB                                                                
   ---------------------------------------------------------------
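The pci-bus column here appears to be the decimal form of the hex bus number in the nvidia-smi Bus-Id column, so both tools seem to agree on the two cards. A small sketch to confirm the mapping (bus_to_hex is a made-up helper name):

```shell
#!/usr/bin/env bash
# Convert a decimal PCI bus number (as printed by cryosparcw gpulist)
# to the two-digit hex bus field used in the nvidia-smi Bus-Id column.
bus_to_hex() {
    printf '%02X\n' "$1"
}

bus_to_hex 33    # prints 21, matching Bus-Id 00000000:21:00.0
bus_to_hex 226   # prints E2, matching Bus-Id 00000000:E2:00.0
```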
Here is the log output of the job:
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_P2_J2
#SBATCH --output=/mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/P2_J2_slurm.out
#SBATCH --error=/mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/P2_J2_slurm.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --reservation gpu
#SBATCH --partition gpu
ulimit -s unlimited
export CRYOSPARC_TIFF_IO_SHM=false
export CUDA_VISIBLE_DEVICES=1
module load ohpc
source ~/.bashrc
nvidia-smi
srun /mnt/beegfs_compat/home/cryosparcuser/cryosparc_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname headnode.mendeleyev.abtlus.org.br --master_command_core_port 61002 > /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/job.log 2>&1 
==========================================================================
==========================================================================
-------- Submission command: 
sbatch /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/queue_sub_script.sh
-------- Cluster Job ID: 
65359
-------- Queued on cluster at 2024-09-26 11:29:33.921784
-------- Cluster job status at 2024-09-26 11:29:34.461618 (0 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
             65359       gpu cryospar cryospar  R       0:00      1 gn01 
[CPU:   79.5 MB]
Job J2 Started
[CPU:   79.7 MB]
Master running v4.6.0, worker running v4.6.0
[CPU:   79.9 MB]
Working in directory: /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2
[CPU:   79.9 MB]
Running on lane cluster_gpu_one
[CPU:   79.9 MB]
Resources allocated: 
[CPU:   79.9 MB]
  Worker:  cluster_gpu_one
[CPU:   79.9 MB]
  CPU   :  [0, 1, 2, 3, 4, 5]
[CPU:   79.9 MB]
  GPU   :  [0]
[CPU:   79.9 MB]
  RAM   :  [0, 1]
[CPU:   79.9 MB]
  SSD   :  False
[CPU:   79.9 MB]
--------------------------------------------------------------
[CPU:   79.9 MB]
Importing job module for job type patch_motion_correction_multi...
[CPU:  209.6 MB]
Job ready to run
[CPU:  209.7 MB]
***************************************************************
[CPU:  209.9 MB]
Job will process this many movies:  20
[CPU:  209.9 MB]
Job will output denoiser training data for this many movies:  20
[CPU:  209.9 MB]
Random seed: 155870792
[CPU:  209.9 MB]
parent process is 17495
[CPU:  161.5 MB]
Calling CUDA init from 17520
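Since the log stops at the CUDA init call, one way to separate CryoSPARC from the driver might be to run a bare CUDA initialization under a time limit on the same node. The sketch below is only an assumption about how to test this: check_hang is a hypothetical helper, and the commented command assumes python3 and the driver library libcuda.so.1 are available on gn01.

```shell
#!/usr/bin/env bash
# Run a command under a time limit and report whether it finished or hung.
# Relies on coreutils timeout, which exits with status 124 when the limit is hit.
# Caveat: if the child is stuck in uninterruptible sleep (state D), timeout
# itself may not return, in which case the hang is already the answer.
check_hang() {
    local limit="$1"; shift
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HUNG after ${limit}s"
    else
        echo "exited with status ${rc}"
    fi
}

# Harmless demo: sleep 3 is cut off after 1 second.
check_hang 1 sleep 3

# On the GPU node: call the CUDA driver's cuInit(0) directly; it returns 0 on success.
# check_hang 60 python3 -c 'import ctypes; print("cuInit returned", ctypes.CDLL("libcuda.so.1").cuInit(0))'
```

If the bare cuInit call also hangs, the problem is probably below CryoSPARC (driver or node setup); if it returns 0 quickly, the job environment (for example the CUDA_VISIBLE_DEVICES=1 export in the submission script) would be the next thing to look at.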