CUDA missing after Cryosparc 4.4 update

BrianCuttler · November 9, 2023, 2:15pm

Updated from Cryosparc 3.3.x to 4.4.

The env setting for CUDA is removed, but Cryosparc is supposed to install its own CUDA 11.8.
I don’t see 11.8. Looking at the cryosparc user’s .bashrc, I think we expect to find CUDA on the path at
/usr/lib/nvidia-cuda-toolkit/, but that directory does not exist.

So my update seems to have broken Cryosparc. I need to either repoint CUDA to /usr/local/cuda-11.8, which I’d been running, fix the path in .bashrc, or actually install the package in the expected location.

Best path for quick resolution?

thanks in advance,
Brian

KiSchnelle · November 9, 2023, 2:56pm

I have found a possibly related issue. I was updating some nodes normally from 4.3 to 4.4 and, while reading the patch notes, found out about the CUDA change. Out of curiosity, I removed the variable from config.sh before updating, and it worked without a problem. But I was confused because the output said the cudnn dependencies did not change, so out of more curiosity I tried updating another node with --override, and then I got an error. In summary:

plain update → no error
delete CUDA path variable from config, then update → no error
delete CUDA path variable from config, then update --override → error
keep CUDA path variable in config, then update --override → same error

Forcing reinstall for dependency cudnn...
  ------------------------------------------------------------------------
cp: cannot stat 'deps_bundle/external/cudnn/include': No such file or directory
cp: cannot stat 'deps_bundle/external/cudnn/lib': No such file or directory
  cudnn 8.1.0.77 for CUDA 11 installation successful.
  ------------------------------------------------------------------------
Completed.

I then looked a bit further into the install scripts and found this in install_cudnn.sh, which is called when using --override:

echo "  ------------------------------------------------------------------------"
cd ${CRYOSPARC_ROOT_DIR}
rm -rf deps/external/cudnn
mkdir -p deps/external/cudnn
rm -rf deps/external/cudnn # comment by me: why are these two lines repeated? :D
mkdir -p deps/external/cudnn
cp -r deps_bundle/external/cudnn/include deps_bundle/external/cudnn/lib deps/external/cudnn
echo "  cudnn 8.1.0.77 for CUDA 11 installation successful."
echo "  ------------------------------------------------------------------------"
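A note on why the output above still says "installation successful" after both cp errors: in the quoted excerpt, the final echo runs unconditionally after cp. The following is a minimal sketch of a guarded variant, assuming only the directory layout shown in the excerpt. The function name install_cudnn_checked and its argument are hypothetical, and this is not the actual CryoSPARC script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT CryoSPARC's install_cudnn.sh: copy the bundled cudnn
# files only if they exist, and report success only if the copy succeeded.
install_cudnn_checked() {
    local root_dir="$1"    # stands in for CRYOSPARC_ROOT_DIR
    local src="$root_dir/deps_bundle/external/cudnn"
    local dst="$root_dir/deps/external/cudnn"

    # Leave the existing install alone when the bundle is incomplete.
    if [ ! -d "$src/include" ] || [ ! -d "$src/lib" ]; then
        echo "  cudnn bundle not found under $src; leaving $dst untouched." >&2
        return 1
    fi

    rm -rf "$dst"
    mkdir -p "$dst"
    cp -r "$src/include" "$src/lib" "$dst" || return 1
    echo "  cudnn installation successful."
}
```

Checking for the source directories before the rm -rf is the design point: with the bundle missing, the previously installed include and lib directories would survive instead of being emptied.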

I then looked into the folder, and the things it is supposed to copy are just not there:

cryosparcuser@bert103:~/cryosparc_worker$ ls -al deps_bundle/external/
ctffind/ gctf/

Edit1:
Checking the folder, I see include and lib only on the node updated without --override; with --override it is empty, as expected.

# this node with --override
cryosparcuser@bert103:~/cryosparc_worker$ ls deps/external/cudnn/
cryosparcuser@bert103:~/cryosparc_worker$

# this node without --override
cryosparcuser@bert105:~/cryosparc_worker$ ls deps/external/cudnn/
include  lib
cryosparcuser@bert105:~/cryosparc_worker$

I have not yet tried to start jobs on the node updated with --override; that’s next :)

cheers
Kilian

wtempel · November 9, 2023, 4:26pm

CUDA-related entries should not be needed in shell startup files. CryoSPARC v4.4 no longer depends on an externally installed CUDA toolkit as CUDA-related dependencies are now bundled with CryoSPARC.

BrianCuttler · November 9, 2023, 4:40pm

I could be barking up the wrong tree, let me instead do what I ask my end-users to do and
present the error message that was reported. Should have done that first…

Traceback (most recent call last):
  File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 3007, in add_ptx
    driver.cuLinkAddData(self.handle, input_ptx, ptx, len(ptx),
  File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 352, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
  File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 412, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [CUresult.CUDA_ERROR_UNSUPPORTED_PTX_VERSION] Call to cuLinkAddData results in CUDA_ERROR_UNSUPPORTED_PTX_VERSION

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 2192, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 117, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 118, in cryosparc_master.cryosparc_compute.gpu.gpucore.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1069, in cryosparc_master.cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 141, in cryosparc_master.cryosparc_compute.engine.engine.EngineThread.load_image_data_gpu
  File "cryosparc_master/cryosparc_compute/engine/cuda_kernels.py", line 1784, in cryosparc_master.cryosparc_compute.engine.cuda_kernels.prepare_real
  File "cryosparc_master/cryosparc_compute/gpu/gpucore.py", line 450, in cryosparc_master.cryosparc_compute.gpu.gpucore.context_dependent_memoize.wrapper
  File "cryosparc_master/cryosparc_compute/engine/cuda_kernels.py", line 1698, in cryosparc_master.cryosparc_compute.engine.cuda_kernels.get_util_kernels
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/compiler.py", line 214, in get_function
    cufunc = self.get_module().get_function(name)
  File "/home/cryosparc_user/cryosparc_worker/cryosparc_compute/gpu/compiler.py", line 170, in get_module
    linker.add_cu(s, k)
  File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 3022, in add_cu
    self.add_ptx(program.ptx, ptx_name)
  File "/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 3010, in add_ptx
    raise LinkerError("%s\n%s" % (e, self.error_log))
numba.cuda.cudadrv.driver.LinkerError: [CUresult.CUDA_ERROR_UNSUPPORTED_PTX_VERSION] Call to cuLinkAddData results in CUDA_ERROR_UNSUPPORTED_PTX_VERSION
ptxas application ptx input, line 9; fatal : Unsupported .version 7.8; current version is '7.4'
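A note on reading the last line of this traceback: the kernels were compiled to PTX ISA 7.8, while the installed driver accepts PTX only up to 7.4. Within the CUDA 11 series, PTX ISA 7.N was introduced with CUDA 11.N, so the message amounts to "CUDA 11.8 code on a driver that supports up to CUDA 11.4". The sketch below decodes the message under that assumption. The helper names ptx_to_cuda and explain_ptx_error are hypothetical, and the mapping deliberately covers only CUDA 11.x.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a PTX ISA version to the CUDA release that
# introduced it. Covers only PTX 7.0 to 7.8, i.e. CUDA 11.0 to 11.8.
ptx_to_cuda() {
    case "$1" in
        7.[0-8]) echo "11.${1#7.}" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Pull both PTX versions out of a ptxas "Unsupported .version" message
# and translate them into CUDA release numbers.
explain_ptx_error() {
    local msg="$1"
    local wanted supported
    wanted=$(printf '%s\n' "$msg" | sed -n 's/.*Unsupported \.version \([0-9.]*\);.*/\1/p')
    supported=$(printf '%s\n' "$msg" | sed -n "s/.*current version is '\([0-9.]*\)'.*/\1/p")
    echo "kernel PTX $wanted (CUDA $(ptx_to_cuda "$wanted")), driver supports up to PTX $supported (CUDA $(ptx_to_cuda "$supported"))"
}
```

Calling explain_ptx_error with the final line of the traceback (using plain ASCII quotes around 7.4) reports PTX 7.8 as CUDA 11.8 and PTX 7.4 as CUDA 11.4, which points at the driver rather than at a missing toolkit.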

wtempel · November 9, 2023, 5:25pm

Does this computer meet the nvidia driver version ≥ 520 requirement?

BrianCuttler · November 9, 2023, 5:34pm

I believe so, unless there are some missing modules.

nvidia-smi seems to give valid output.

(base) cryosparc_user@suraj:~/cryosparc_worker/deps_bundle$ apt list nvidia* | grep installed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-modprobe/unknown,now 535.113.01-0lambda1 amd64 [installed]
nvidia-prime/focal-updates,focal-updates,now 0.8.16~0.20.04.2 all [installed,automatic]
nvidia-settings/unknown,now 535.113.01-0lambda1 amd64 [installed]

wtempel · November 9, 2023, 6:00pm

@BrianCuttler Did the error occur on a single workstation, connected workers, or cluster-type installation of CryoSPARC?
What is the output of the commands (on the GPU worker where the error occurred):

/home/cryosparc_user/cryosparc_worker/bin/cryosparcw call env | grep PATH
/home/cryosparc_user/cryosparc_worker/bin/cryosparcw call nvidia-smi --query-gpu=name,driver_version --format=csv

BrianCuttler · November 9, 2023, 7:38pm

(base) cryosparc_user@suraj:~/cryosparc_worker$ bin/cryosparcw call env | grep PATH
CRYOSPARC_PATH=/home/cryosparc_user/cryosparc_worker/bin
PYTHONPATH=/home/cryosparc_user/cryosparc_worker
CRYOSPARC_CUDA_PATH=/usr/local/cuda
NUMBA_CUDA_INCLUDE_PATH=/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/include
PATH=/home/cryosparc_user/cryosparc_worker/bin:/home/cryosparc_user/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/home/cryosparc_user/cryosparc_worker/deps/anaconda/condabin:/home/cryodrgn/anaconda3/bin:/home/cryodrgn/anaconda3/condabin:/home/cryosparc_user/cryosparc_master/bin:/home/cryosparc_user/cryosparc2_master/bin:/usr/local:/usr/lib/cuda:/usr/local:/usr/lib/nvidia-cuda-toolkit:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/snap/bin

(base) cryosparc_user@suraj:~/cryosparc_worker$ bin/cryosparcw call nvidia-smi --query-gpu=name,driver_version --format=csv
name, driver_version
NVIDIA GeForce RTX 2080 Ti, 470.199.02
NVIDIA GeForce RTX 2080 Ti, 470.199.02
NVIDIA GeForce RTX 2080 Ti, 470.199.02
NVIDIA GeForce RTX 2080 Ti, 470.199.02

For what it’s worth, there is no /home/cryodrgn/anaconda3, but I don’t know if we reach that far down the path or anything after that.

(base) cryosparc_user@suraj:~/cryosparc_worker$ ls /home/cryodrgn
Anaconda3-2021.11-Linux-x86_64.sh Desktop Documents Downloads Music Pictures Public Templates Videos anaconda3 cryodrgn

Unclear to me where CRYOSPARC_CUDA_PATH is defined, I see it in output, but I don’t see it in any files.

I have another system, upgraded within minutes of this one, that seems to be working; initial runs are in progress now. The working system runs Ubuntu 22.04, whereas this one is a 20.04.6 LTS release.

I don’t know that that matters, but I often look at differential between systems.

wtempel · November 9, 2023, 9:29pm

The NVIDIA driver on host suraj needs to be updated to v520 or higher, as mentioned earlier, and the host may need to be rebooted after the driver update. nvidia-smi indicates the driver is currently at version 470.
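The comparison against the v520 minimum can be scripted with the same nvidia-smi query used earlier in the thread. This is a minimal sketch, not part of CryoSPARC: the function names driver_meets_minimum and check_host_driver are hypothetical, and the default of 520 is taken from the requirement stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the driver's major version is at least
# the required minimum (default 520, the v4.4 requirement from this thread).
driver_meets_minimum() {
    local version="$1"
    local minimum="${2:-520}"
    local major="${version%%.*}"    # "470.199.02" -> "470"
    [ "$major" -ge "$minimum" ]
}

# Report the result for every GPU that nvidia-smi lists on this host.
check_host_driver() {
    local v
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | while read -r v; do
        if driver_meets_minimum "$v"; then
            echo "driver $v: OK"
        else
            echo "driver $v: too old, need >= 520"
        fi
    done
}
```

Running check_host_driver on each worker before an update reads the version from the loaded driver, which is what matters here; package listings such as the apt output above can show a newer version than the one actually in use.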

The CRYOSPARC_CUDA_PATH definition may be present inside /home/cryosparc_user/cryosparc_worker/config.sh for historic reasons and is obsolete in CryoSPARC v4.4.

BrianCuttler · November 10, 2023, 3:41pm

Because the apt output showed a newer version, and because I did not parse the worker output correctly (in part due to formatting), I had misread 470 as a microcode version rather than the driver version.

The driver version was the culprit. The issue has been corrected, and Cryosparc jobs have run successfully.

Thank you!

wtempel · November 15, 2023, 10:55pm

Thanks for reporting this bug, which we are planning to fix in a future release.
Once the fix has been released and applied, it will take an additional cryosparcw update for the fix to become effective. Should cryosparcw update --override become necessary before the fix is released, applied and effective, please manually remove the directory

cryosparc_worker/deps_bundle_hashes/external/cudnn/

before running cryosparcw update --override.
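The workaround above can be sketched as a small shell helper. The function name clear_cudnn_hashes is hypothetical, and the worker path in the usage comment is an assumption based on the installation path shown earlier in this thread; adjust it to the local install.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove the cudnn hash directory named in the
# workaround so that a later "cryosparcw update --override" does not
# hit the missing-bundle problem.
clear_cudnn_hashes() {
    local worker_dir="$1"    # path to the cryosparc_worker directory
    local hashes="$worker_dir/deps_bundle_hashes/external/cudnn"
    if [ -d "$hashes" ]; then
        rm -rf "$hashes"
        echo "removed $hashes"
    else
        echo "nothing to remove at $hashes"
    fi
}

# Usage (commented out so that sourcing this file changes nothing):
# clear_cudnn_hashes /home/cryosparc_user/cryosparc_worker
# /home/cryosparc_user/cryosparc_worker/bin/cryosparcw update --override
```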

NDietz · November 17, 2023, 8:15am

Unfortunately, we ran into the same issue as Brian on our distributed CS in an HPC environment while updating from 4.3 to 4.4.4 yesterday.
It would have saved us some headache if the installation routine checked the NVIDIA driver version on the workers for compliance with the (changed) installation requirements… Maybe you can implement this in the future?

wtempel · November 20, 2023, 8:59pm

After running cryosparcm update and, if needed, cryosparcw update, you may test the validity of the GPU configuration either through the command line or the GUI.
As of CryoSPARC v4.4, the test includes a check of the nvidia driver version.

mvup · November 28, 2023, 10:48am

We ran into the same problem as @NDietz . Unfortunately, it is not possible to “simply update your nvidia driver” in an HPC environment. Thus, a check before running through the whole update would be really helpful.

Regarding the current situation: is it possible to still use an external installation of the cuda toolkit? Is it possible to simply downgrade from 4.4 to 4.3 again? (edit: yes, downgrading worked)

nfrasser · December 5, 2023, 5:30pm

FYI, the fix described by @wtempel above (post #11 in this thread) is out in the latest release, v4.4.1.