skcuda.cublas.cublasNotInitialized Error in Filament Tracer

Hi All,

Since cryoSPARC added beta functionality for filamentous/helical proteins, I have recently started learning the program; however, during the Filament Tracer job I have run into some errors:

**File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 61, in cryosparc_compute.jobs.template_picker_gpu.run.run
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 246, in cryosparc_compute.jobs.template_picker_gpu.run.do_pick
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 373, in cryosparc_compute.jobs.template_picker_gpu.run.do_pick
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/fft.py", line 127, in __init__
    onembed, ostride, odist, self.fft_type, self.batch)
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 742, in cufftMakePlanMany
    cufftCheckStatus(status)
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cufft.py", line 117, in cufftCheckStatus
    raise e
skcuda.cufft.cufftAllocFailed**

This process/job at least “begins” - I am shown several prepared templates at the start before cryoSPARC reports an error. I am running on a workstation with 4x RTX 2080 cards, initially with NVIDIA drivers and CUDA 10.1. I saw advice for similar issues recommending an update to newer drivers with CUDA 11.x, which I have done; however, after doing that, updating the CUDA path in the config.sh file, and restarting the cryoSPARC process, I am given a new error that occurs as the job type is imported (no templates are written out):
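For reference, the config.sh change was just a matter of pointing the CUDA path variable at the new toolkit, roughly like this (the variable name below is what a typical cryosparc2_worker/config.sh uses; check your own file):

export CRYOSPARC_CUDA_PATH="/usr/local/cuda"   # previously pointed at the CUDA 10.1 install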

**Traceback (most recent call last):
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 280, in _get_cublas_version
    utils.get_soname(cublas_path)).groups()
AttributeError: 'NoneType' object has no attribute 'groups'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cryosparc_worker/cryosparc_compute/run.py", line 71, in cryosparc_compute.run.main
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/jobregister.py", line 362, in get_run_function
    runmod = importlib.import_module(".."+modname, __name__)
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 1050, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "cryosparc_worker/cryosparc_compute/jobs/template_picker_gpu/run.py", line 11, in init cryosparc_compute.jobs.template_picker_gpu.run
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/cryosparc_compute/engine/__init__.py", line 8, in <module>
    from .engine import *  # noqa
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 11, in init cryosparc_compute.engine.engine
  File "cryosparc_worker/cryosparc_compute/engine/gfourier.py", line 6, in init cryosparc_compute.engine.gfourier
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/fft.py", line 20, in <module>
    from . import misc
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/misc.py", line 25, in <module>
    from . import cublas
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 292, in <module>
    _cublas_version = int(_get_cublas_version())
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 285, in _get_cublas_version
    h = cublasCreate()
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 203, in cublasCreate
    cublasCheckStatus(status)
  File "/home/cryosparc_user/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/skcuda/cublas.py", line 179, in cublasCheckStatus
    raise e
skcuda.cublas.cublasNotInitialized**

Thank you,
Russell McFarland

Hi @RussellM,

Can you confirm that there is nothing in your .bashrc related to CUDA that may be interfering? Can you also check cryosparc2_worker/config.sh?
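For example, something along these lines should surface any CUDA-related settings that could interfere (just a quick check; adjust paths to your install):

grep -i cuda ~/.bashrc                     # look for CUDA_HOME, PATH or LD_LIBRARY_PATH entries pointing at another CUDA install
grep -i cuda cryosparc2_worker/config.sh   # confirm the CUDA path variable points at the intended toolkit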

On CUDA 10.2, one of the issues was that the libcublas library had moved from inside /usr/local/cuda to /usr/lib64/, which meant creating the following symbolic link so that scikit-cuda could find it:
sudo ln -s /usr/lib64/libcublas.so /usr/local/cuda-10.2/lib64/libcublas.so

If you’re on CUDA 11, this shouldn’t be an issue.
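If you want to double-check where libcublas actually lives on your system, something along these lines should show it (paths can vary by distro, so treat this as a sanity check rather than a fix):

ldconfig -p | grep libcublas             # libraries registered with the dynamic linker
ls -l /usr/local/cuda/lib64/libcublas*   # what scikit-cuda will find under the CUDA path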
Can you paste the output of nvidia-smi?
Then, can you navigate to cryosparc2_worker and do the following:
eval $(./bin/cryosparcw env)
nvcc --version #paste the output here


Thank you for the response,

My .bashrc doesn’t have anything related to CUDA at all, and config.sh points to /usr/local/cuda, which is the proper path for the CUDA install.

Ah! I think there might have been a problem when I updated the CUDA/NVIDIA drivers. When I run nvidia-smi I’m given this error: Failed to initialize NVML: Driver/library version mismatch. I will go ahead and fix that before anything else.
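For anyone hitting the same NVML mismatch: the usual fix is a reboot, or unloading and reloading the NVIDIA kernel modules so they match the newly installed user-space driver, roughly along these lines (if the modules are in use, e.g. by X, a reboot is the simpler option):

sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia   # unload stale modules; names may vary
sudo modprobe nvidia                                     # load the module that matches the new driver
nvidia-smi                                               # should now report the new driver version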

Best,
Russell


Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:3B:00.0  On |                  N/A |
| 25%   33C    P8     5W / 215W |     94MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    On   | 00000000:5E:00.0 Off |                  N/A |
| 22%   30C    P8    15W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    On   | 00000000:86:00.0 Off |                  N/A |
| 23%   32C    P8     5W / 215W |      8MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    On   | 00000000:D8:00.0 Off |                  N/A |
| 24%   31C    P8     4W / 215W |      8MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3665      G   /usr/bin/X                         74MiB |
|    0   N/A  N/A      4473      G   /usr/bin/gnome-shell               17MiB |
|    2   N/A  N/A      3665      G   /usr/bin/X                          6MiB |
|    3   N/A  N/A      3665      G   /usr/bin/X                          6MiB |
+-----------------------------------------------------------------------------+

output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Thank you,
Russell

Hi @RussellM,

After fixing the nvidia-smi issue, do you still get the skcuda.cublas.cublasNotInitialized error?

Yes, I do - sorry, I wasn’t clear about that in my post above.

Once I correct the driver mismatch, I’m given the first error in my original post once again.

Hi @RussellM,

The template/blob/filament pickers all use GPU memory in proportion to the size of the low-pass-filtered micrograph; the default low-pass filter of 20 Å internally determines how much GPU memory the micrograph takes up. The number of templates and the rotational sampling also affect how much memory is used. Since the first error appears to be a genuine out-of-memory error, you can try the following to reduce the job’s memory requirements (see the rough example below the list):

  • Increase the low-pass filter parameter to 30-40 Å
  • Increase the angular step (from 1° to 5°, for example)
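As a rough example (assuming, purely for illustration, a ~1 Å/pixel, 4k × 4k micrograph): at a 20 Å low-pass the micrograph only needs to be sampled at about 10 Å/pixel internally, on the order of 400 × 400 pixels, while at a 40 Å low-pass it can be sampled at about 20 Å/pixel, roughly 200 × 200 pixels - about a 4× reduction in the per-micrograph footprint before templates and rotations are accounted for.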

Please let us know if either of these changes helps alleviate the memory error.

Best,
Michael


I modified the parameters as suggested and the job is running now. Thank you!
