skcuda.cublas.cublasNotInitialized after upgrade to version 3.0

Dear Cryosparc team,

I updated one of our instances of Cryosparc to version 3.0.0 yesterday. This is a setup where the workers share an install directory. The initial worker installation (on the machine which is also the master) did not seem to run correctly, so I had to install afresh by copying the cryosparc2_worker.tar.gz file and unpacking it as per the instructions.

Installation then seems to run happily, but there is an issue with cublas.py, as shown below, when launching a job. I also saw some other issues with Cuda, but solved these by updating the drivers and cuda version. I see the error with fresh installs (removing the worker directory and reinstalling from the .tar.gz) against both cuda-10-2 and cuda 11-1.

nvidia-smi doesn’t report any issues and other cuda software seems to run fine under 10-2 (although I have not yet explicitly tried anything which needs cublas).

The OS is Centos-7.5 and the previous Cryosparc2 version ran without issues.

Any advice would be gratefully received.

======================================================================
[CPU: 198.0 MB]  Traceback (most recent call last):
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 280, in _get_cublas_version
    utils.get_soname(cublas_path)).groups()
AttributeError: 'NoneType' object has no attribute 'groups'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 71, in cryosparc_compute.run.main
  File "/data/cryosparc2/cryosparc_worker/cryosparc_compute/jobs/jobregister.py", line 360, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_worker/cryosparc_compute/jobs/class2D/run.py", line 13, in init cryosparc_compute.jobs.class2D.run
  File "/data/cryosparc2/cryosparc_worker/cryosparc_compute/engine/__init__.py", line 8, in <module>
    from .engine import *  # noqa
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 11, in init cryosparc_compute.engine.engine
  File "cryosparc_worker/cryosparc_compute/engine/gfourier.py", line 6, in init cryosparc_compute.engine.gfourier
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/fft.py", line 20, in <module>
    from . import misc
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/misc.py", line 25, in <module>
    from . import cublas
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 292, in <module>
    _cublas_version = int(_get_cublas_version())
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 285, in _get_cublas_version
    h = cublasCreate()
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 203, in cublasCreate
    cublasCheckStatus(status)
  File "/data/cryosparc2/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 179, in cublasCheckStatus
    raise e
skcuda.cublas.cublasNotInitialized

Hi @AndyPurk,

Can you output the contents of your cryosparc_worker/config.sh file?
For some reason, scikit-cuda is not able to find the cublas library inside your LD_LIBRARY_PATH.

contents of cryosparc_worker/config.sh

export CRYOSPARC_LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/usr/local/cuda-11"
export CRYOSPARC_DEVELOP=false

There is a libcublas.so in /usr/local/cuda-11/lib64/.

Edit to add that there is no LD_LIBRARY_PATH set in the shell.

Thanks

Hi @AndyPurk,

When you run the command eval $(cryosparc_worker/bin/cryosparcw env),
do you see /usr/local/cuda-11/lib64/ in your LD_LIBRARY_PATH?

Yes, as follows:

$ echo $LD_LIBRARY_PATH
/data/cryosparc2/cryosparc_worker/cryosparc_compute/blobio:/usr/local/cuda-11/lib64:

Thanks.
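
For anyone repeating this check, the same question can be answered from Python. The following is a minimal sketch, not part of cryoSPARC or scikit-cuda: `find_cublas_dirs` is a hypothetical helper that walks a colon-separated LD_LIBRARY_PATH value and reports which directories contain a file whose name starts with libcublas.so. It only checks that the file is present; it does not show that the library loads or that it matches the driver.

```python
import glob
import os


def find_cublas_dirs(ld_library_path):
    """Return the directories in a colon-separated search path that
    contain at least one file named libcublas.so*.

    Hypothetical helper for illustration; it checks file presence only.
    """
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            # Empty entries (e.g. from a trailing ':') are skipped here.
            continue
        pattern = os.path.join(glob.escape(entry), "libcublas.so*")
        if glob.glob(pattern):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Run this inside the environment produced by `cryosparcw env`.
    print(find_cublas_dirs(os.environ.get("LD_LIBRARY_PATH", "")))
```

An empty list would suggest that the path really is missing the library, whereas a non-empty list (as in this thread) points the search elsewhere.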

Ok, after a lot of trial and error I managed to solve the issue by removing all Cuda versions, upgrading the Nvidia driver (from version 455.xx.xx to version 460.27.04) and installing just cuda-11.0 to keep things simple.

I think that the issue was the driver, but can’t be sure. Before I updated the driver, all of the installed Cuda versions failed to run the NVIDIA cublas test samples, each giving an error in device allocation. I also had to reinstall the worker from scratch, as just pointing to the fresh Cuda install had issues during the re-compilation.

Other software compiled against Cuda-9 seems to run OK, which was why I didn’t think of the driver initially.
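
Since the cublas error here was probably masking a driver problem, it can help to test the driver layer on its own, below cublas and skcuda. The following is a minimal sketch, assuming a Linux machine where the driver library is installed as libcuda.so.1. `probe_cuda_driver` and `decode_cuda_version` are hypothetical helper names; `cuInit` and `cuDriverGetVersion` are the CUDA driver API calls they wrap.

```python
import ctypes


def decode_cuda_version(encoded):
    """Split CUDA's integer version encoding (1000*major + 10*minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def probe_cuda_driver(libname="libcuda.so.1"):
    """Try to initialise the CUDA driver API directly.

    Returns (cuInit status or None, (major, minor) or None).
    A status of 0 means cuInit succeeded; the version is the newest
    CUDA release the installed driver reports that it supports.
    """
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        # Driver library not found or not loadable on this machine.
        return None, None

    status = libcuda.cuInit(0)
    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return status, None
    return status, decode_cuda_version(version.value)


if __name__ == "__main__":
    print(probe_cuda_driver())
```

A non-zero status would point at the driver or device rather than at Cryosparc. A reported version lower than the toolkit under CRYOSPARC_CUDA_PATH would suggest the driver is too old for that toolkit.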

Hi @AndyPurk,

That’s great to hear. I wouldn’t have guessed it was the driver either. Thanks for the update!

For anyone else who may stumble on this, we ran into this error as well but the resolution was different. In our case the import of skcuda.cublas failed because the environment variable CUDA_VISIBLE_DEVICES was becoming unset on some nodes during initialization.

It looks like anything that prevents skcuda from creating a cublas context raises this same error, regardless of the reason. I’m opening an issue upstream with scikit-cuda to hopefully get a more useful error message.

Here’s the repro sequence, in case it’s useful.

user@node:~$ echo $CUDA_VISIBLE_DEVICES

user@node:~$ python -c 'import skcuda.cublas'
/snipped/miniconda3/envs/scipy_tk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
Traceback (most recent call last):
  File "/snipped/miniconda3/envs/scipy_tk/lib/python3.7/site-packages/skcuda/cublas.py", line 280, in _get_cublas_version
    utils.get_soname(cublas_path)).groups()
AttributeError: 'NoneType' object has no attribute 'groups'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/snipped/miniconda3/envs/test_env/lib/python3.7/site-packages/skcuda/cublas.py", line 292, in <module>
    _cublas_version = int(_get_cublas_version())
  File "/snipped/miniconda3/envs/test_env/lib/python3.7/site-packages/skcuda/cublas.py", line 285, in _get_cublas_version
    h = cublasCreate()
  File "/snipped/miniconda3/envs/test_env/lib/python3.7/site-packages/skcuda/cublas.py", line 203, in cublasCreate
    cublasCheckStatus(status)
  File "/snipped/miniconda3/envs/test_env/lib/python3.7/site-packages/skcuda/cublas.py", line 179, in cublasCheckStatus
    raise e
skcuda.cublas.cublasNotInitialized
user@node:~$ export CUDA_VISIBLE_DEVICES=0
user@node:~$ echo $CUDA_VISIBLE_DEVICES
0
user@node:~$ python -c 'import skcuda.cublas'
/snipped/miniconda3/envs/test_env/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
user@node:~$
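
One caveat about the repro above: `echo $CUDA_VISIBLE_DEVICES` prints an empty line both when the variable is unset and when it is set to an empty string, so the output alone does not say which case applied. Below is a minimal sketch for telling the two apart before importing skcuda. `describe_visible_devices` is a hypothetical helper, and the comment about an empty value hiding all GPUs is an assumption about CUDA's behaviour, not something verified in this thread.

```python
import os


def describe_visible_devices(environ):
    """Classify the CUDA_VISIBLE_DEVICES setting in a mapping such as os.environ.

    Returns one of 'unset', 'empty', or 'set'. Hypothetical helper.
    """
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset"
    if value.strip() == "":
        # Assumption: an empty value hides every GPU from CUDA.
        return "empty"
    return "set"


if __name__ == "__main__":
    state = describe_visible_devices(os.environ)
    print("CUDA_VISIBLE_DEVICES is", state)
    if state == "set":
        print("value:", os.environ["CUDA_VISIBLE_DEVICES"])
```

Running this on a failing node and on a working node would show whether the scheduler or a login script is changing the variable between them.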

I ran into the same error and tried many different fixes, all in vain. Surprisingly, a reboot solved the problem. Hope this helps.
