Hi all,
I’m not a researcher, but I’m trying to install CryoSPARC on our cluster for the research team.
I followed the installation guide with no apparent issues.
However, jobs can’t seem to get past the CUDA init call.
Once stuck, the processes spawned by CryoSPARC turn into zombies; even after killing their parent process, I can’t kill the zombie children and have to reboot the compute node.
For instance, I’m running the tutorial on the EMPIAR dataset available on the website.
Where do I start debugging my installation?
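For what it’s worth, the only checks I’ve thought of so far are generic driver-level ones on the compute node, run while a job was stuck (these are standard Linux/NVIDIA tools, nothing CryoSPARC-specific, and nothing in their output looked obviously wrong to me):

# Kernel log: NVIDIA Xid errors or hung-task warnings around the time
# the job got stuck would point at a driver or hardware problem.
dmesg -T | grep -iE 'xid|nvrm|hung' | tail -n 20

# Driver/GPU health summary (ECC errors, retired pages).
nvidia-smi -q -d ECC,PAGE_RETIREMENT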
Here is the output of nvidia-smi:
[root@gn01 ~]# nvidia-smi
Thu Sep 26 11:36:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           On  | 00000000:21:00.0 Off |                    0 |
| N/A   32C    P0              25W / 250W |     10MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           On  | 00000000:E2:00.0 Off |                    0 |
| N/A   34C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      17520     C   python                                        8MiB |
+---------------------------------------------------------------------------------------+
Notice the 8 MiB python process (PID 17520); the job log below shows the same PID calling CUDA init.
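This is how I check the state of that process (a sketch; 17520 is the PID from this particular run). A Z in the STAT column means zombie; D means uninterruptible sleep inside a kernel/driver call, which kill -9 cannot interrupt:

# Process state of the stuck child: Z = zombie, D = uninterruptible sleep.
ps -o pid,ppid,stat,wchan:32,cmd -p 17520

# Where the task is blocked in the kernel, if it still has a stack (needs root).
cat /proc/17520/stack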
Here is the output of cryosparcw gpulist:
Detected 2 CUDA devices.

  id    pci-bus  name
  ---------------------------------------------------------------
   0         33  Tesla P100-PCIE-16GB
   1        226  Tesla P100-PCIE-16GB
  ---------------------------------------------------------------
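Since gpulist sees both cards, my next idea was to try a minimal CUDA init inside the worker environment directly on the node, bypassing Slurm. I’m assuming here that the worker’s bundled Python ships numba (which is what I understand gpulist itself uses), so treat this as a sketch:

# Minimal CUDA init in the worker environment, outside the scheduler.
# If this also hangs, the problem is the node/driver rather than Slurm.
/mnt/beegfs_compat/home/cryosparcuser/cryosparc_worker/bin/cryosparcw call \
    python -c "from numba import cuda; cuda.select_device(0); print(cuda.get_current_device().name)"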
Here is the log output of the job:
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#SBATCH --job-name=cryosparc_P2_J2
#SBATCH --output=/mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/P2_J2_slurm.out
#SBATCH --error=/mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/P2_J2_slurm.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --reservation gpu
#SBATCH --partition gpu
ulimit -s unlimited
export CRYOSPARC_TIFF_IO_SHM=false
export CUDA_VISIBLE_DEVICES=1
module load ohpc
source ~/.bashrc
nvidia-smi
srun /mnt/beegfs_compat/home/cryosparcuser/cryosparc_worker/bin/cryosparcw run --project P2 --job J2 --master_hostname headnode.mendeleyev.abtlus.org.br --master_command_core_port 61002 > /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/job.log 2>&1
==========================================================================
==========================================================================
-------- Submission command:
sbatch /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2/queue_sub_script.sh
-------- Cluster Job ID:
65359
-------- Queued on cluster at 2024-09-26 11:29:33.921784
-------- Cluster job status at 2024-09-26 11:29:34.461618 (0 retries)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65359 gpu cryospar cryospar R 0:00 1 gn01
[CPU:  79.5 MB]  Job J2 Started
[CPU:  79.7 MB]  Master running v4.6.0, worker running v4.6.0
[CPU:  79.9 MB]  Working in directory: /mnt/beegfs_compat/home/cryosparcuser/projects_data/tutorial/CS-tutorial-gpu-test/J2
[CPU:  79.9 MB]  Running on lane cluster_gpu_one
[CPU:  79.9 MB]  Resources allocated:
[CPU:  79.9 MB]    Worker:  cluster_gpu_one
[CPU:  79.9 MB]    CPU   :  [0, 1, 2, 3, 4, 5]
[CPU:  79.9 MB]    GPU   :  [0]
[CPU:  79.9 MB]    RAM   :  [0, 1]
[CPU:  79.9 MB]    SSD   :  False
[CPU:  79.9 MB]  --------------------------------------------------------------
[CPU:  79.9 MB]  Importing job module for job type patch_motion_correction_multi...
[CPU: 209.6 MB]  Job ready to run
[CPU: 209.7 MB]  ***************************************************************
[CPU: 209.9 MB]  Job will process this many movies: 20
[CPU: 209.9 MB]  Job will output denoiser training data for this many movies: 20
[CPU: 209.9 MB]  Random seed: 155870792
[CPU: 209.9 MB]  parent process is 17495
[CPU: 161.5 MB]  Calling CUDA init from 17520
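The job never gets past this last line, and 17520 matches the python process in the nvidia-smi output above. For completeness, this is the sequence that leaves me with the unkillable child (17495 is the parent PID from this run):

# Killing the parent should let init reap the child, but here the child
# survives and ignores SIGKILL; only a reboot of the node clears it.
kill -9 17495
sleep 5
ps -o pid,ppid,stat,cmd -p 17520    # child still listed; kill -9 17520 has no effect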