================= CRYOSPARCW ======= 2023-12-04 13:00:03.524000 =========
Project P11 Job J25
Master ip-XXXXX Port 39002
===========================================================================
========= monitor process now starting main process at 2023-12-04 13:00:03.524035
MAINPROCESS PID 13285
MAIN PID 13285
refine.newrun cryosparc_compute.jobs.jobregister
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "cryosparc_master/cryosparc_compute/run.py", line 189, in cryosparc_master.cryosparc_compute.run.run
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/cryosparc_compute/get_gpu_info.py", line 30, in get_driver_version
return get_version()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 3318, in get_version
return driver.get_version()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 465, in get_version
version = driver.cuDriverGetVersion()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
self.ensure_initialized()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
self.cuInit(0)
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "cuda/cuda.pyx", line 11325, in cuda.cuda.cuInit
File "cuda/ccuda.pyx", line 17, in cuda.ccuda.cuInit
File "cuda/_cuda/ccuda.pyx", line 2353, in cuda._cuda.ccuda._cuInit
RuntimeError: Function "cuInit" not found
***************************************************************
Running job J25 of type homo_refine_new
Running job on hostname %s g5-singlegpu-queue
Allocated Resources : {'fixed': {'SSD': True}, 'hostname': 'g5-singlegpu-queue', 'lane': 'g5-singlegpu-queue', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2]}, 'target': {'cache_path': '/scratch', 'cache_quota_mb': 1000000, 'cache_reserve_mb': 10000, 'custom_var_names': [], 'custom_vars': {}, 'desc': None, 'hostname': 'g5-singlegpu-queue', 'lane': 'g5-singlegpu-queue', 'name': 'g5-singlegpu-queue', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': None, 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed.\n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name {{ cryosparc_username }}_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH -N 1\n#SBATCH --constraint="g5.2xlarge|g5.4xlarge|g5.8xlarge"\n#SBATCH --gres=gpu:{{ num_gpu }}\n##SBATCH --mem={{ (ram_gb)|int }}G\n#SBATCH -o {{ job_dir_abs }}/slurm-%j.out\n#SBATCH -e {{ job_dir_abs }}/slurm-%j.err\n#SBATCH --exclusive --partition=g5-singlegpu\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'g5-singlegpu-queue', 'tpl_vars': ['command', 'cluster_job_id', 'job_log_path_abs', 'job_dir_abs', 'run_cmd', 'run_args', 'cryosparc_username', 'job_uid', 'worker_bin_path', 'project_dir_abs', 'job_creator', 'num_cpu', 'project_uid', 'ram_gb', 'num_gpu'], 'type': 'cluster', 'worker_bin_path': '/wekahome/apps/cryosparc/current/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/cryosparc_compute/jobs/motioncorrection/mic_utils.py:95: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@jit(nogil=True)
/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/cryosparc_compute/micrographs.py:563: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
def contrast_normalization(arr_bin, tile_size = 128):
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_master.cryosparc_compute.run.main
File "cryosparc_master/cryosparc_compute/jobs/refine/newrun.py", line 359, in cryosparc_master.cryosparc_compute.jobs.refine.newrun.run_homo_refine
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/cryosparc_compute/alignment.py", line 112, in align_symmetry
gpucore.initialize([cuda_dev])
File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 47, in cryosparc_master.cryosparc_compute.gpu.gpucore.initialize
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 220, in get_context
return _runtime.get_or_create_context(devnum)
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 144, in get_or_create_context
return self._activate_context_for(devnum)
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 176, in _activate_context_for
gpu = self.gpus[devnum]
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 40, in __getitem__
return self.lst[devnum]
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
numdev = driver.get_device_count()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
return self.cuDeviceGetCount()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
self.ensure_initialized()
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
self.cuInit(0)
File "/wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
File "cuda/cuda.pyx", line 11325, in cuda.cuda.cuInit
File "cuda/ccuda.pyx", line 17, in cuda.ccuda.cuInit
File "cuda/_cuda/ccuda.pyx", line 2353, in cuda._cuda.ccuda._cuInit
RuntimeError: Function "cuInit" not found
set status to failed
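[Note: the RuntimeError: Function "cuInit" not found above is raised by the cuda-python bindings when the library they load as the CUDA driver does not export the cuInit symbol. A minimal standalone check (our sketch, assuming the driver is installed as libcuda.so.1 on the linker's default search path) can be run directly on the GPU node:

import ctypes

# Load the CUDA driver library the same way the dynamic linker resolves it.
lib = ctypes.CDLL("libcuda.so.1")

# ctypes looks up symbols lazily via dlsym, so hasattr() reports whether the
# loaded library actually exports cuInit.
print("cuInit exported:", hasattr(lib, "cuInit"))

If this prints False, or CDLL fails to load the library at all, whatever the linker resolves as libcuda.so.1 is not a complete NVIDIA driver library.]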
@wtempel The OS is "alinux2"
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
Amazon Linux release 2 (Karoo)
NVIDIA driver on the GPU node:
Tue Dec 5 05:59:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 25C P8 23W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
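[Note: nvidia-smi sees driver 520.61.05 (CUDA 11.8), so the remaining question is whether the same library is visible inside the worker's Python environment. A quick cross-check, run from the cryosparc_worker directory after eval $(bin/cryosparcw env) (this uses numba's standard detect() helper, nothing cryoSPARC-specific):

python -c "from numba import cuda; cuda.detect()"

On a healthy node this lists the A10G; here it would be expected to reproduce the same cuInit failure, which would point at the library the worker environment loads rather than at the driver itself.]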
Thanks @Praveen. What are the outputs of these commands on that GPU server?
uname -a
/sbin/ldconfig -p | grep cu
[Added 2023-12-08:]
In case you have not yet resolved the RuntimeError, could you please compress and email us the file /tmp/libsdebug.txt that is created by this command sequence:
cd /wekahome/apps/cryosparc/v4.4.0_231114/cryosparc_worker/
eval $(bin/cryosparcw env)
LD_DEBUG=libs python -c "from cuda import cuda; cuda.cuInit(0)" 2> /tmp/libsdebug.txt
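[Note: LD_DEBUG=libs makes the dynamic linker log every library search and binding to stderr, so the lines mentioning libcuda in /tmp/libsdebug.txt show which file was actually opened for the CUDA driver. A small convenience script for pulling those lines out (a hypothetical helper, not a cryoSPARC tool):

# Print only the libcuda-related lines from the LD_DEBUG trace, to see
# which file the dynamic linker actually opened for the CUDA driver.
with open("/tmp/libsdebug.txt") as f:
    for line in f:
        if "libcuda" in line:
            print(line.rstrip())
]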