Hi,
Yes, other GPU-accelerated jobs run fine now.
The error is identical to the one in the original post in this thread:
> [2023-02-17 14:11:16.09] [CPU: 169.4 MB] CPU : [0]
> [2023-02-17 14:11:16.09] [CPU: 169.4 MB] GPU : [0]
> [2023-02-17 14:11:16.09] [CPU: 169.4 MB] RAM : [0, 1]
> [2023-02-17 14:11:16.10] [CPU: 169.4 MB] SSD : False
> [2023-02-17 14:11:16.10] [CPU: 169.4 MB] --------------------------------------------------------------
> [2023-02-17 14:11:16.10] [CPU: 169.4 MB] Importing job module for job type deep_picker_train...
> [2023-02-17 14:11:33.00] [CPU: 387.4 MB] Job ready to run
> [2023-02-17 14:11:33.00] [CPU: 387.4 MB] ***************************************************************
> [2023-02-17 14:11:33.77] [CPU: 447.9 MB] Using TensorFlow version 2.4.4
> [2023-02-17 14:11:33.91] [CPU: 473.7 MB] Traceback (most recent call last):
>   File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
>   File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
> AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.
`cryosparcm joblog` output for the same job:
....
'tpl_vars': ['job_dir_abs', 'cluster_job_id', 'num_cpu', 'run_args', 'cryosparc_username', 'job_log_path_abs', 'ram_gb', 'worker_bin_path', 'job_uid', 'run_cmd', 'command', 'project_uid', 'project_dir_abs', 'job_creator', 'num_gpu'], 'type': 'cluster', 'worker_bin_path': '/path/to//cryosparc/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 67, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Input number of GPUs must be less than or equal to number of available GPUs. Please check job log for more information.
set status to failed
========= main process now complete.
========= monitor process now complete.
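If I read the message correctly, the failing check presumably boils down to something like the sketch below (my own guess at the logic, not the actual CryoSPARC code; using TensorFlow to count the available GPUs is my assumption):

```python
# Sketch of the kind of check that could raise this AssertionError.
# NOT CryoSPARC's actual code; tf.config.list_physical_devices('GPU') is my
# assumption about how "available GPUs" are counted on the worker.
import tensorflow as tf

def check_gpu_request(num_gpus_requested: int) -> None:
    available = tf.config.list_physical_devices('GPU')  # GPUs TensorFlow can see
    assert num_gpus_requested <= len(available), (
        "Input number of GPUs must be less than or equal to number of available GPUs."
    )

check_gpu_request(1)  # would fail like my job if TensorFlow sees zero GPUs
```

So the question seems to be why TensorFlow would report zero GPUs on a node where the GPU is clearly present.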
I logged into the worker node and here is the info:
For `env | grep PATH`, I replaced some paths with ‘XYZ’ since they contain cluster-specific paths that could identify the cluster. I can send the unredacted output privately if it matters.
env | grep PATH
LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:/path/to/cryosparc/cryosparc_worker/deps/external/cudnn/lib
CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.6
__LMOD_REF_COUNT_MODULEPATH=/XYZ/modules/el7/modules/all:2;/etc/modulefiles:1;/usr/share/modulefiles:1;/usr/share/modulefiles/Linux:1;/usr/share/modulefiles/Core:1;/usr/share/lmod/lmod/modulefiles/Core:1
CRYOSPARC_PATH=/path/to/cryosparc/cryosparc_worker/bin
PYTHONPATH=/path/to/cryosparc/cryosparc_worker
MANPATH=/usr/share/lmod/lmod/share/man::/XYZ/share/man
MODULEPATH=/XYZ/modules/el7/modules/all:/etc/modulefiles:/usr/share/modulefiles:/usr/share/modulefiles/Linux:/usr/share/modulefiles/Core:/usr/share/lmod/lmod/modulefiles/Core
MODULEPATH_ROOT=/usr/share/modulefiles
PATH=/usr/local/cuda-11.6/bin:/path/to/cryosparc/cryosparc_worker/bin:/path/to/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/path/to/cryosparc/cryosparc_worker/deps/anaconda/condabin:/XYZ/modules/el7/software/Anaconda3/2020.11/condabin:/path/to/cryosparc/cryosparc_master/bin:/XYZ/home/cryosparcuser/.local/bin:/XYZc/home/cryosparcuser/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin
which nvcc
/usr/local/cuda-11.6/bin/nvcc
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
python -c "import pycuda.driver; print(pycuda.driver.get_version())"
(11, 6, 0)
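If useful, I can also check how many devices pycuda itself sees from this environment with something like the following (standard pycuda.driver API; I have not pasted its output here):

```python
# How many CUDA devices pycuda can see from the worker environment
# (suggestion only; output not shown above).
import pycuda.driver as cuda

cuda.init()                 # initialize the CUDA driver API
print(cuda.Device.count())  # number of devices visible to this process
```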
uname -a
Linux node.name 5.15.82-1.el8.name.x86_64 #1 SMP Tue Dec 13 15:02:32 CET 2022 x86_64 x86_64 x86_64 GNU/Linux
free -g
total used free shared buff/cache available
Mem: 125 7 91 0 27 116
Swap: 7 0 7
nvidia-smi
Mon Mar 6 22:02:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... On | 00000000:41:00.0 Off | 0 |
| N/A 31C P0 36W / 250W | 962MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 619708 C python 958MiB |
+-----------------------------------------------------------------------------+
Just a thought: could this be connected to TensorFlow?
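If it is, a quick test would be to check what TensorFlow itself can enumerate from the cryosparc_worker Python environment on the node (a sketch of what I would run; the exact way to activate that environment may differ):

```python
# Run inside the cryosparc_worker Python environment on the worker node.
# Prints the TensorFlow version and the GPUs it can enumerate; an empty
# list would be consistent with the "available GPUs" assertion above.
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
```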