I guess this is likely something in my setup rather than in cryoSPARC, but I only notice it while running cryoSPARC: my RTX 3090 runs most job types with no problem, but more memory-demanding jobs fail with a “cuMemHostAlloc failed: OS call failed or operation not supported on this OS” error, for instance an Ab-initio reconstruction with more than 6 classes. Meanwhile, my RTX 2080 SUPER TURBO cards (8 GB memory) are able to carry out the same jobs, at least until their effective memory runs out.
Error
[CPU: 1.04 GB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 304, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1149, in cryosparc_compute.engine.engine.process
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1150, in cryosparc_compute.engine.engine.process
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1089, in cryosparc_compute.engine.engine.process.work
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 305, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 333, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS
I know that this error is possibly related to an incorrect GPU driver setup, but I wonder why so many jobs run fine, yet above a certain memory requirement threshold it gives this error.
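For what it's worth, the failing call in the traceback is cuMemHostAlloc, i.e. pinned (page-locked) host memory allocation. A minimal sketch like the one below (my own quick test, not part of cryoSPARC; it assumes pycuda and numpy are available, e.g. from the cryosparc_worker environment) goes through the same cuMemHostAlloc path and can help check whether the failure also happens outside cryoSPARC:

# Quick pinned host-memory allocation test; pagelocked_empty() allocates
# page-locked host memory via cuMemHostAlloc, the call that fails above.
python - <<'EOF'
import numpy as np
import pycuda.autoinit            # creates a context on the default GPU
import pycuda.driver as cuda

# Try a 4 GiB pinned buffer; adjust the size to match the failing job
buf = cuda.pagelocked_empty((4 * 1024**3,), dtype=np.uint8)
print("allocated", buf.nbytes // 1024**3, "GiB of pinned host memory")
EOF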
As seen below, I am running Driver Version: 470.42.01 and CUDA Version: 11.4
nvidia-smi
Fri Oct 29 10:20:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   On  | 00000000:03:00.0  On |                  N/A |
| 27%   39C    P8    17W / 250W |    361MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...   On  | 00000000:21:00.0 Off |                  N/A |
| 30%   44C    P8    25W / 350W |      1MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...   On  | 00000000:4B:00.0 Off |                  N/A |
| 27%   30C    P8     9W / 250W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2416      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      3397      G   /usr/bin/gnome-shell               69MiB |
|    0   N/A  N/A      3842      G   /usr/lib/xorg/Xorg                199MiB |
|    0   N/A  N/A      3998      G   ...mviewer/tv_bin/TeamViewer       13MiB |
|    0   N/A  N/A      4004      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A      4524      G   ...4E128C00922B04BFB3CDD04F3       11MiB |
|    0   N/A  N/A     13077      G   ...1/usr/lib/firefox/firefox       15MiB |
|    0   N/A  N/A     13311      G   ...1/usr/lib/firefox/firefox        2MiB |
+-----------------------------------------------------------------------------+
We have had issues before when trying to use the CUDA cores on GPUs that were also being used by the X server; you might want to try hiding GPU 0 from CUDA and see if that helps (export CUDA_VISIBLE_DEVICES=1,2). When we had this issue, the CUDA workloads would fail with various non-specific errors even when there was plenty of GPU memory available.
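Something like this in the shell that starts the cryoSPARC worker (just a sketch; adapt it to however you launch the worker on your system):

# Hide the X-server GPU (device 0) from CUDA for this session
export CUDA_VISIBLE_DEVICES=1,2

# Sanity check: only the two remaining GPUs should now be visible to CUDA
python -c "import pycuda.driver as cuda; cuda.init(); print(cuda.Device.count())"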
Thanks @nimgs-it, your suggestion helped me narrow the problem down a bit, although it was not the full answer.
I think the problem is somewhere in the CUDA configuration, or in the installation of CUDA and/or the driver. I thought these driver and CUDA versions were compatible and fine for an RTX 3090, no?
Is there anyone out there with an RTX 3090 running cryoSPARC? What driver and CUDA versions do you have?
We are using driver 470.82.00 with CUDA 11.2.2 on our 3090 cryoSPARC nodes. We are also on Ubuntu 20.04.3 LTS.
I was originally using the 460.32.03 driver that is bundled with the CUDA 11.2.2 installer, but was having issues with GPUs going missing if the system was up for more than a few weeks.
The GA release of 18.04 shipped with kernel 4.15; you would have to switch to the HWE kernel to get 5.4, or move to 20.04, which uses 5.4 as the GA kernel and 5.11 as the HWE kernel.
You’ll have to install the linux-generic-hwe-18.04 metapackage to switch.
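For reference, the switch is roughly this (standard Ubuntu packages; a reboot is required):

# Install the HWE kernel metapackage on Ubuntu 18.04 and reboot into it
sudo apt update
sudo apt install --install-recommends linux-generic-hwe-18.04
sudo reboot

# After the reboot, confirm the running kernel
uname -r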
@nimgs-it, thank you so much for the help trying to sort out this problem.
For Linux to work properly with an RTX 3090, I definitely needed the kernel update, which I had not thought of before starting this thread. In the end, the fix was a combination of installing the right versions of CUDA, the NVIDIA driver, and the kernel, in a specific order.
With the system running Ubuntu 18.04.6 LTS:
I updated the kernel to 5.4.0-91-generic x86_64 (HWE)
Removed/purged the system of any NVIDIA installation