CUDA issue after updating to v4.4

Hello,

I just upgraded our lab’s workstations after hearing about v4.4 (looking forward to trying out all the new features!). However, when I try to run an Extract From Micrographs job, I get the following error:

Error occurred while processing micrograph J2/motioncorrected/011693211652979670157_20211206_1006_A002_G000_H100_D001_patch_aligned_doseweighted.mrc
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 3007, in add_ptx
    driver.cuLinkAddData(self.handle, input_ptx, ptx, len(ptx),
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_UNSUPPORTED_PTX_VERSION] Call to cuLinkAddData results in CUDA_ERROR_UNSUPPORTED_PTX_VERSION

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/pipeline.py", line 61, in exec
    return self.process(item)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/run.py", line 508, in process
    result = extraction_gpu.do_extract_particles_single_mic_gpu(mic=mic, bg_bin=bg_bin,
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/extract/extraction_gpu.py", line 161, in do_extract_particles_single_mic_gpu
    ET.patches_gpu.fill(0, stream=stream)
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/gpu/gpuarray.py", line 110, in fill
    from .elementwise.fill import fill
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/gpu/elementwise/fill.py", line 24, in <module>
    def fill(arr, x, out):
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/np/ufunc/decorators.py", line 203, in wrap
    guvec.add(fty)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/np/ufunc/deviceufunc.py", line 475, in add
    kernel = self._compile_kernel(fnobj, sig=tuple(outertys))
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/vectorizers.py", line 241, in _compile_kernel
    return cuda.jit(sig)(fnobj)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/decorators.py", line 133, in _jit
    disp.compile(argtypes)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 928, in compile
    kernel.bind()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 207, in bind
    self._codelibrary.get_cufunc()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/codegen.py", line 184, in get_cufunc
    cubin = self.get_cubin(cc=device.compute_capability)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/codegen.py", line 159, in get_cubin
    linker.add_ptx(ptx.encode())
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 3010, in add_ptx
    raise LinkerError("%s\n%s" % (e, self.error_log))
numba.cuda.cudadrv.driver.LinkerError: [CUresult.CUDA_ERROR_UNSUPPORTED_PTX_VERSION] Call to cuLinkAddData results in CUDA_ERROR_UNSUPPORTED_PTX_VERSION
ptxas application ptx input, line 9; fatal   : Unsupported .version 7.8; current version is '7.3'

Marking J2/motioncorrected/011693211652979670157_20211206_1006_A002_G000_H100_D001_patch_aligned_doseweighted.mrc as incomplete and continuing...

The error repeats for every micrograph it tries to extract particles from, so it looks like a CUDA incompatibility introduced by the update. Should I just try updating CUDA, or is there another solution? Thanks for the help!

Best,
Kyle

@KyleBarrie …make sure your NVIDIA driver version is 520.61.05 or newer. If not, upgrade the driver and restart your workstation. That should fix it.
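
As a quick check (assuming nvidia-smi is installed and on your PATH), you can confirm the currently loaded driver version with:

nvidia-smi --query-gpu=driver_version --format=csv,noheader

Anything older than 520.61.05 would be consistent with the “Unsupported .version 7.8” line in your log, since the CUDA toolkit bundled with v4.4 emits PTX that older drivers cannot JIT-compile.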

Thanks for the response, @wilnart. I think that’s probably the problem. I’ll update our driver and report back on whether that solves the issue.

@KyleBarrie
Was your issue solved? I am facing the same issue, so an update would help me.
Thank you

Sorry for the late response. We’ve been having trouble updating our NVIDIA driver, but we will provide an update once it’s finished.

We updated our CUDA version and Nvidia driver:

[screenshot showing the updated NVIDIA driver and CUDA versions]

Now when we try to run a GPU job in cryoSPARC, it stalls while starting the pipeline and never begins running. There still seems to be a CUDA error. For one, cryoSPARC cannot ‘see’ our GPUs:

(base) [cryosparc@C05195 ~]$ /home/cryosparc/cryosparc_worker/bin/cryosparcw gpulist
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_UNKNOWN] Call to cuInit results in CUDA_ERROR_UNKNOWN

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/cryosparc/cryosparc_worker/bin/connect.py", line 28, in print_gpu_list
    num_devs = len(cuda.gpus)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 49, in __len__
    return len(self.lst)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
    return self.cuDeviceGetCount()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_UNKNOWN (999)

Additionally, when we try to connect the cryoSPARC worker to our nodes, the CPUs connect fine, but a very similar CUDA error occurs when it tries to connect to the GPUs:

(base) [cryosparc@C05195 cryosparc_worker]$ bin/cryosparcw connect \
> --worker localhost \
> --master localhost \
> --port 39000 \
> --update \
> --gpus 0,1,2,3
 ---------------------------------------------------------------
  CRYOSPARC CONNECT --------------------------------------------
 ---------------------------------------------------------------
  Attempting to register worker localhost to command localhost:39002
  Connecting as unix user cryosparc
  Will register using ssh string: cryosparc@localhost
  If this is incorrect, you should re-run this command with the flag --sshstr <ssh string> 
 ---------------------------------------------------------------
  Connected to master.
 ---------------------------------------------------------------
  Current connected workers:
    localhost
 ---------------------------------------------------------------
  Worker will be registered with 48 CPUs.
 ---------------------------------------------------------------
  Updating target localhost
  Current configuration:
               cache_path :  /data
           cache_quota_mb :  None
         cache_reserve_mb :  10000
                     desc :  None
                     gpus :  [{'id': 0, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 1, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 2, 'mem': 11554717696, 'name': 'NVIDIA GeForce RTX 2080 Ti'}, {'id': 3, 'mem': 11554324480, 'name': 'NVIDIA GeForce RTX 2080 Ti'}]
                 hostname :  localhost
                     lane :  default
             monitor_port :  None
                     name :  localhost
           resource_fixed :  {'SSD': True}
           resource_slots :  {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
                  ssh_str :  cryosparc@localhost
                    title :  Worker node localhost
                     type :  node
          worker_bin_path :  /home/cryosparc/cryosparc_worker/bin/cryosparcw
 ---------------------------------------------------------------
  Autodetecting available GPUs...
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_UNKNOWN] Call to cuInit results in CUDA_ERROR_UNKNOWN

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/connect.py", line 165, in <module>
    gpu_devidxs = check_gpus()
  File "bin/connect.py", line 101, in check_gpus
    num_devs = print_gpu_list()
  File "bin/connect.py", line 28, in print_gpu_list
    num_devs = len(cuda.gpus)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 49, in __len__
    return len(self.lst)
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 425, in get_device_count
    return self.cuDeviceGetCount()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_UNKNOWN (999)

We’re wondering if there is some incompatibility, or lack of cross-talk, between the CUDA version we installed system-wide and the CUDA toolkit that comes bundled with v4.4. Should we uninstall and reinstall cryoSPARC and then import our existing projects, or is there a simpler solution? Thanks!

@KyleBarrie The v535 NVIDIA driver is the latest version I have tested, but I am not sure the driver version is the problem. Could you please provide some additional information:

  1. From which repository did you obtain the driver?
  2. With which command(s) did you upgrade the driver?
  3. How did you install the nvidia-smi utility, whose version seems to differ from the driver’s?
  4. Did you reboot the computer after upgrading the driver?
  5. What are the outputs of these commands:
    /home/cryosparc/cryosparc_worker/bin/cryosparcw call env | grep PATH
    uname -a
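
  If you would like to rule out a mismatch between the loaded kernel module and the user-space driver libraries (one common cause of cuInit returning CUDA_ERROR_UNKNOWN after a driver change), you could also compare, for example:
    nvidia-smi
    cat /proc/driver/nvidia/version
    lsmod | grep nvidia
  These are only suggestions for narrowing things down; the outputs of the two commands above are what I would look at first.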
    

Thanks for your quick response. Please find below the answers to your questions:

  1. We obtained it from the Nvidia website.
  2. We used the following commands and tried to follow the Nvidia guide:
sudo sh ./NVIDIA-Linux-x86_64-535.129.03.run

sudo rpm -i cuda-repo-rhel7-12-3-local-12.3.1_545.23.08-1.x86_64.rpm

sudo yum clean all

sudo yum install nvidia-driver-latest-dkms

sudo yum install cuda-toolkit

sudo yum install cuda-drivers
  3. We believe the nvidia-smi utility was installed as part of the driver update (from the CUDA toolkit).
  4. Yes.
  5. Output of the indicated commands:
(base) [cryosparc@C05195 exx]$ /home/cryosparc/cryosparc_worker/bin/cryosparcw call env | grep PATH
NUMBA_CUDA_INCLUDE_PATH=/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
LD_LIBRARY_PATH=/usr/local/relion-3.1/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-10.1/lib:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/lib64:/usr/local/cuda-9.2/lib:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.1/lib:/usr/local/cuda-9.1/lib64:/usr/local/cuda-8.0/lib:/usr/local/cuda-8.0/lib64:/usr/local/cuda-7.5/lib:/usr/local/cuda-7.5/lib64:/usr/local/IMOD/lib:
PATH=/home/cryosparc/cryosparc_worker/bin:/home/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/cryosparc/cryosparc_worker/deps/anaconda/condabin:/usr/local/EMAN_2.31/bin:/home/cryosparc/cryosparc_master/bin:/home/cryosparc/cryosparc_master/bin:/home/cryosparc/cryosparc_master/bin:/home/cryosparc/cryosparc_master/bin:/home/cryosparc/cryosparc_master/bin:/home/cryosparc/cryosparc_master/bin:/home/cryosparc/cryosparc_master/bin:/usr/local/relion-3.1/bin:/usr/local/mpich-3.2.1/bin:/usr/local/cuda/bin:/home/exx/cryoEF_v1.1.0/PreCompiled/centos5.11:/home/exx/eman2-sphire-sparx/condabin:/usr/local/relion-3.1/bin:/usr/local/mpich-3.2.1/bin:/usr/local/cuda/bin:/usr/local/IMOD/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/motioncorr_v2.1/bin:/usr/local/Gctf_v1.06/bin:/usr/local/Gctf_v0.50/bin:/usr/local/ResMap:/usr/local/cistem-1.0.0-beta:/usr/local/EMAN_2.31/bin:/home/exx/.local/bin:/home/exx/bin:/usr/local/motioncorr_v2.1/bin:/usr/local/Gctf_v1.06/bin:/usr/local/Gctf_v0.50/bin:/usr/local/ResMap:/usr/local/cistem-1.0.0-beta:/usr/local/EMAN_2.31/bin
CRYOSPARC_PATH=/home/cryosparc/cryosparc_worker/bin
PYTHONPATH=/home/cryosparc/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda-11.1

(base) [cryosparc@C05195 exx]$ uname -a
Linux C05195 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I am not sure whether a run-file installation is compatible with a subsequent rpm-based installation. Did you try reversing the run-file installation before installing the drivers with yum?
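
For reference, and only as a rough sketch rather than official guidance, a driver installed from the .run file can usually be removed with the installer’s own uninstall routine before switching to the rpm packages:

sudo nvidia-uninstall

or, equivalently, by re-running the same installer file with its uninstall flag:

sudo sh ./NVIDIA-Linux-x86_64-535.129.03.run --uninstall

followed by a reboot.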

Meanwhile, I confirmed that NVIDIA driver 545 on Ubuntu 22.04 is compatible with CryoSPARC v4.4 as far as

  • the cryosparcw gpulist command
  • the first few jobs of the Extensive Validation workflow (non-uniform refinement within that workflow is still ongoing)

are concerned.

Hi wtempel,

Thanks for your suggestions and insight. We got to thinking that the issue might be the nvidia-smi utility’s incompatibility with the new driver, as you indicated in your previous message. We therefore:

  1. Uninstalled the old nvidia-smi utility
  2. Installed a new version (same version number as the driver)
  3. Rebooted the workstation

These steps appear to have resolved the issue; we can now run cryoSPARC jobs without failures. Thanks again for your help!
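
For anyone else hitting this, a quick way to verify that the worker can see the GPUs again is to re-run the command that failed earlier (adjust the path to your own installation):

/home/cryosparc/cryosparc_worker/bin/cryosparcw gpulist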

Best,
Kyle
