RTX 3090 - Memory allocation/driver problem: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

Hi everyone,

I suspect this is something in my setup rather than in cryoSPARC itself, but I only notice it while running cryoSPARC: my RTX 3090 runs all job types fine, yet more memory-demanding jobs (for instance an ab-initio reconstruction with more than 6 classes) fail with “cuMemHostAlloc failed: OS call failed or operation not supported on this OS”. Meanwhile, my RTX 2080 SUPER TURBO cards (8 GB memory) are able to run the same jobs, at least until their memory actually runs out.

Error

[CPU: 1.04 GB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/abinit/run.py", line 304, in cryosparc_compute.jobs.abinit.run.run_homo_abinit
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1149, in cryosparc_compute.engine.engine.process
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1150, in cryosparc_compute.engine.engine.process
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1089, in cryosparc_compute.engine.engine.process.work
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 305, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 333, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

I know this error is possibly related to an incorrect setup of the GPU drivers, but I wonder why it runs so many jobs without issue and only throws this error above a certain memory requirement.
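
In case it helps to isolate things, here is a minimal pycuda sketch (run outside cryoSPARC; the 4 GiB size is just an arbitrary test value, not anything cryoSPARC asks for) that exercises the same pinned host allocation (cuMemHostAlloc) that fails in the traceback above:

# Minimal reproduction sketch: pin a few GiB of host memory with pycuda.
# pagelocked_empty() goes through cuMemHostAlloc, the call that fails above.
import numpy as np
import pycuda.autoinit          # creates a context on the first visible GPU
import pycuda.driver as cuda

gib = 4                          # arbitrary test size
n_floats = gib * (1024 ** 3) // 4
try:
    buf = cuda.pagelocked_empty(n_floats, dtype=np.float32)
    print("pinned %.1f GiB of host memory OK" % (buf.nbytes / 1024 ** 3))
except cuda.LogicError as e:
    print("pinned allocation failed:", e)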

As seen below, I am running Driver Version: 470.42.01 and CUDA Version: 11.4

nvidia-smi

Fri Oct 29 10:20:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …    On   | 00000000:03:00.0  On |                  N/A |
| 27%   39C    P8    17W / 250W |    361MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce …    On   | 00000000:21:00.0 Off |                  N/A |
| 30%   44C    P8    25W / 350W |      1MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce …    On   | 00000000:4B:00.0 Off |                  N/A |
| 27%   30C    P8     9W / 250W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2416      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      3397      G   /usr/bin/gnome-shell               69MiB |
|    0   N/A  N/A      3842      G   /usr/lib/xorg/Xorg                199MiB |
|    0   N/A  N/A      3998      G   …mviewer/tv_bin/TeamViewer         13MiB |
|    0   N/A  N/A      4004      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A      4524      G   …4E128C00922B04BFB3CDD04F3         11MiB |
|    0   N/A  N/A     13077      G   …1/usr/lib/firefox/firefox         15MiB |
|    0   N/A  N/A     13311      G   …1/usr/lib/firefox/firefox          2MiB |
+-----------------------------------------------------------------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:15:15_PDT_2021

Any idea why this could happen?

Hey @AndreGraca,

Check out this post, might be helpful:

Also, can you report your OS?
uname -a

Hi Stephan,

Thanks for the suggestion!
Unfortunately, it seems like that post does not help.

Sorry that I missed reporting my OS; I assumed it was such a common OS for running cryoSPARC that I skipped the information: Ubuntu 18.04

Linux xxxx 4.15.0-161-generic #169-Ubuntu SMP Fri Oct 15 13:41:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Something I noticed that might tell you something: when I run

echo $CUDA_VISIBLE_DEVICES

I get an empty answer from my bash. I am not sure whether that is the answer I should expect.
Could this be related?

We have had issues before when trying to use the CUDA cores on GPUs that were also being used by the X server; you might want to try hiding GPU 0 from CUDA and see if that helps (export CUDA_VISIBLE_DEVICES=1,2). When we had this issue, the CUDA workloads would fail with various non-specific errors even when there was plenty of GPU memory available.
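
If you want to double-check what CUDA actually sees once GPU 0 is hidden, a quick pycuda enumeration along these lines (just a sketch; the 1,2 indices come from the nvidia-smi output above and may differ on other machines) should list only the two compute-only cards:

# Sketch only: CUDA_VISIBLE_DEVICES must be set before the driver is
# initialized, so the environment variable is set before any CUDA init.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"   # hide GPU 0 (the X-server GPU)

import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print(i, dev.name(), "%.0f MiB" % (dev.total_memory() / 1024 ** 2))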

Hi everyone!

Thanks @nimgs-it, your suggestion helped me narrow down the problem a bit, although it was not the answer to it.
I think the problem is somewhere in the CUDA configuration, or in the installation of CUDA and/or the driver. I thought these driver and CUDA versions were compatible and fine for the RTX 3090, no?

Is there anyone out there with an RTX 3090 running cryoSPARC? Which driver and CUDA versions do you have?

Thanks!

We are using driver 470.82.00 with CUDA 11.2.2 on our 3090 cryoSPARC nodes. We are also on Ubuntu 20.04.3 LTS.

I was originally using the 460.32.03 driver that is bundled with the CUDA 11.2.2 installer, but was having issues with GPUs going missing if the system was up for more than a few weeks.

Thanks @nimgs-it

I think that I will play a bit with the installation of both as soon as possible.

@stephan, do you know of any workstations running cryoSPARC on 3090s with CUDA 11.4?

@nimgs-it and @stephan

I have started to realise that the problem might be an incompatibility between the kernel version and the CUDA toolkit.

I still have kernel version 4.15.0, which apparently was not updated when I upgraded the machine from Ubuntu 16.04 to 18.04.
The CUDA toolkit documentation mentions that for Ubuntu 18.04.z the supported kernel version is 5.4.0 (https://docs.nvidia.com/cuda/archive/11.2.2/cuda-installation-guide-linux/index.html).
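
Just to double-check my own reading of that table (assuming 5.4.0 really is the minimum listed there for 18.04), a trivial comparison confirms that my running kernel is below it:

# Compare the running kernel against the 5.4.0 minimum that the linked
# CUDA 11.2.2 installation guide lists for Ubuntu 18.04 (assumed from that table).
import platform

MIN_KERNEL = (5, 4, 0)
release = platform.release()                               # e.g. "4.15.0-161-generic"
running = tuple(int(x) for x in release.split("-")[0].split("."))
print("running kernel:", release)
print("meets documented minimum:", running >= MIN_KERNEL)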

Does that sound plausible? What kernel versions do you have?

In your opinion, what is the best way to update the kernel with the least chance of side effects?

Have a good weekend,
André

The GA release of 18.04 shipped with kernel 4.15; you would have to switch to the HWE kernel to get 5.4, or move to 20.04, which uses 5.4 as the GA kernel and 5.11 as the HWE kernel.

You’ll have to install the linux-generic-hwe-18.04 metapackage to switch.

Hi everyone!

@nimgs-it, thank you so much for the help trying to sort out this problem.

For Linux to work properly with an RTX 3090, I definitely needed the kernel update, which I had not thought of before starting this thread. Beyond that, the fix was a combination of installing the right versions of CUDA, the NVIDIA driver and the kernel, in a specific order.

With the machine running Ubuntu 18.04.6 LTS:

  1. I updated the kernel to 5.4.0-91-generic x86_64 (HWE)
  2. Removed/purged any existing NVIDIA installation from the system
  3. Installed the NVIDIA 470.42.01 driver
  4. Installed CUDA 11.5 by meticulously following the instructions from NVIDIA (https://docs.nvidia.com/cuda/archive/11.5.0/)

That did it, and it has been working for 2 weeks now =)
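
In case it is useful to anyone hitting the same error, the pinned-allocation sketch from earlier in the thread now runs cleanly here; a quick check along these lines (just a sketch, nothing cryoSPARC-specific) is an easy way to confirm the driver/toolkit/kernel combination before queuing jobs:

# Post-reinstall check (a sketch, not a cryoSPARC step): report the version the
# CUDA driver API exposes and confirm that pinned host allocation now succeeds.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

print("CUDA driver version seen by pycuda:", cuda.get_driver_version())
buf = cuda.pagelocked_empty(256 * 1024 * 1024, dtype=np.float32)   # ~1 GiB pinned
print("pinned %.1f GiB of host memory without error" % (buf.nbytes / 1024 ** 3))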

/André
