LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

dgoetschius · August 7, 2019, 3:41pm

We are having persistent errors with specific GPU-accelerated jobs (2D Classification, Homogeneous Refinement) on our machines with Titan XPs, running cryoSPARC v2.9.0.

No errors on these machines:

Dual RTX Titan; Debian GNU/Linux 8.10
GTX 1060 3GB; Debian GNU/Linux 8.11 (apart from the obvious box size limitations)

Errors on these machines:

Dual Titan XP; Debian GNU/Linux 8.11
Dual 1080 Ti; Debian GNU/Linux 8.11

These jobs ran successfully on older versions of cryoSPARC, but it is unclear when the issue arose.

We have updated to CUDA 10.1, with no resolution of the issue.

Our cryosparc2_worker/config.sh file from the Titan XP machine:

export CRYOSPARC_LICENSE_ID="<our license number is here>"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/opt/cuda/10.1.168"
export CRYOSPARC_DEVELOP=false

The 2D Classification job always errors out immediately after “Start of Iteration 0”:

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 830, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4625)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4576)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.process.work (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:27291)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 83, in cryosparc2_compute.engine.engine.EngineThread.load_image_data_gpu (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:5179)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:9489)
LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

Homogeneous Refinement always errors out at “Estimating scale of initial reference”:

====== Refinement ======
Refining Structure with volume size 500.
Starting at initial resolution 30.000A (radwn 29.333).
Aligning initial model to symmetry.
Estimating scale of initial reference.

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 830, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4625)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4576)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1053, in cryosparc2_compute.engine.engine.process.work (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:28374)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 308, in cryosparc2_compute.engine.engine.EngineThread.compute_resid_pow (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:11165)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:9489)
LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

The two existing forum threads that I’ve found are not helpful.

Any ideas what may be the root of the issue? Thanks in advance!

dgoetschius · August 7, 2019, 6:40pm

Also relevant:
When the GTX 1060 3GB is added to the Dual Titan XP machine, 2D Classification and Homogeneous Refinement work without issue, but only if the job is assigned to the weaker card (GTX 1060). If it grabs either of the Titan XPs, the job fails as described above.
Incredibly confusing…

apunjani · August 14, 2019, 3:40am

This is indeed very odd… I don’t think we have seen this error message from CUDA before (@stephan?). And the GPU kernels for 2D Classification and Homogeneous Refinement haven’t changed in several versions…
There also seems to be no information online about similar errors. We have passed this report along to the pyCUDA developers’ list (the error is coming from CUDA via pyCUDA).

apunjani · August 14, 2019, 4:38am

Hi @dgoetschius,

Unfortunately, the pyCUDA developers are also baffled by this error. Their reply:

“I'm sorry to say that I've never seen or heard of this error message. One thing that comes to mind is that this might be an issue of PCIe versioning. The 1060 might be PCIe3, while the XP might be PCIe2 (guessing, might be better to check), and driver support might differ.”

In case you haven’t already, it may be worthwhile to upgrade/downgrade the NVIDIA driver version and see if that changes anything.

dgoetschius · August 14, 2019, 2:30pm

Thanks for the response. I’ll let you know if we have any more luck troubleshooting things on our end.

dgoetschius · September 5, 2019, 5:02pm

Update in case it helps anyone else troubleshoot:

So far we’ve isolated the problem to OS version Debian GNU/Linux 8.11 (jessie). As 8.10 worked perfectly, something in the 8.11 release seems to have broken compatibility.

After an update to Debian GNU/Linux 9.9 (stretch), these job types are working. Next up we’ll be testing on Debian 10. Fingers crossed.
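
For anyone comparing a working and a failing machine in the same way, here is a minimal sketch for recording the host details that differed in our case. It is only an illustration and not part of cryoSPARC: the function names are hypothetical, and the page-locked memory limit (RLIMIT_MEMLOCK) is included on the assumption that it may be worth comparing, since cuMemHostAlloc allocates page-locked host memory. Nothing in this thread confirms that limit as the cause. On Debian, the point release (e.g. 8.10 vs 8.11) is stored in /etc/debian_version rather than in /etc/os-release.

```python
import platform
import resource


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def read_first_line(path):
    """Return the first line of a text file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip() or None
    except OSError:
        return None


def format_limit(value):
    """Render an rlimit value, which may be RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def host_snapshot(os_release_path="/etc/os-release",
                  debian_version_path="/etc/debian_version"):
    """Collect OS release, kernel release and the memlock limits of this process."""
    try:
        with open(os_release_path) as handle:
            os_info = parse_os_release(handle.read())
    except OSError:
        os_info = {}
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "os": os_info.get("PRETTY_NAME", "unknown"),
        "debian_version": read_first_line(debian_version_path),
        "kernel": platform.release(),
        "memlock_soft": format_limit(soft),
        "memlock_hard": format_limit(hard),
    }


if __name__ == "__main__":
    for key, value in host_snapshot().items():
        print(f"{key}: {value}")
```

Running this on each machine and comparing the output side by side makes it easier to spot which OS-level detail changed between a working and a failing setup.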

kpahil · April 23, 2023, 1:12am

Sorry to resurrect an old thread. We have started seeing this error on our Quadro P4000 node. We’re on CryoSPARC v4.1.1, driver version 460.84, CUDA version 11.2, CentOS Linux 7. It also seems to happen only sporadically.

wtempel · April 24, 2023, 12:06am

@kpahil Because you are using a different version from the one in the older posts, could you please post:

- error messages and preceding context from the event and job logs
- the output of
  /path/to/cryosparc_worker/bin/cryosparcw env | grep -v LICENSE
- CryoSPARC worker environment details as described here

kpahil · April 24, 2023, 3:14am

I’ve since cleared the relevant jobs and tried to rerun them, and they have sometimes worked. I don’t have a record of the “LogicError: cuMemHostAlloc failed: OS call failed” error log at the moment (I’ll update if/when it happens again). I have also been getting “cuMemHostAlloc failed: out of memory” errors very frequently, which is new for this node and these job types: the GPUs have 8 GB and have historically had no problem with NU refinement, heterogeneous refinement, and 2D classification, all of which have now given this error. When I see the error, nvidia-smi consistently shows nothing else using the relevant GPUs (often none of the GPUs are in use at all). I’m guessing the two errors are related, since they’re both CUDA memory errors and started at the same time, but maybe I’m mistaken. Here’s the relevant log:

[CPU: 2.53 GB]
– Effective number of classes per image: min 1.00 | 25-pct 1.00 | median 1.04 | 75-pct 1.32 | max 2.00

[CPU: 2.53 GB]
– Class 0: 83.40%

[CPU: 2.53 GB]
– Class 1: 16.60%

[CPU: 2.50 GB]
Learning rate 0.065

[CPU: 2.52 GB]
Done iteration 31 in 10.223s. Total time so far 362.996s

[CPU: 2.52 GB]
– Iteration 32

[CPU: 2.52 GB]
Batch size 2000

[CPU: 2.52 GB]
Using Alignment Radius 27.117 (8.519A)

[CPU: 2.52 GB]
Using Reconstruction Radius 41.000 (5.634A)

[CPU: 2.52 GB]
Number of BnB iterations 3

[CPU: 2.52 GB]
DEV 1 THR 1 NUM 500 TOTAL 2.5987076 ELAPSED 3.1239066 –

[CPU: 2.56 GB]
Processed 1000.000 images with 2 models in 4.237s.

[CPU: 2.56 GB]
Engine Started.

[CPU: 3.62 GB]
Traceback (most recent call last):
  File "/data1/cryosparc/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 2057, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1089, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 500, in cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 337, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory

job log:
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
exception in force_free_cufft_plan:
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in force_free_cufft_plan:
exception in force_free_cufft_plan:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
exception in cufft.Plan.__del__:
========= sending heartbeat
**custom thread exception hook caught something
**** handle exception rc
Traceback (most recent call last):
  File "/data1/cryosparc/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/runcommon.py", line 2057, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 1089, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_master/cryosparc_compute/engine/engine.py", line 500, in cryosparc_compute.engine.engine.EngineThread.cull_candidates
  File "cryosparc_master/cryosparc_compute/engine/cuda_core.py", line 337, in cryosparc_compute.engine.cuda_core.EngineBaseThread.ensure_allocated
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
set status to failed
========= main process now complete.
========= monitor process now complete.

Output of cryosparcw env below:
export "CRYOSPARC_USE_GPU=true"
export "CRYOSPARC_CONDA_ENV=cryosparc_worker_env"
export "CRYOSPARC_DEVELOP=false"
export "CRYOSPARC_ROOT_DIR=/data1/cryosparc/cryosparc2/cryosparc2_worker"
export "CRYOSPARC_PATH=/data1/cryosparc/cryosparc2/cryosparc2_worker/bin"
export "CRYOSPARC_CUDA_PATH=/data1/cryosparc/cuda/cuda_11.2.2_460.32.03"
export "PATH=/data1/cryosparc/cuda/cuda_11.2.2_460.32.03/bin:/data1/cryosparc/cryosparc2/cryosparc2_worker/bin:/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/bin:/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/condabin:/programs/x86_64-linux/system/sbgrid_bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/programs/share/bin:/programs/share/sbgrid/bin:/programs/x86_64-linux/sbgrid_installer/latest"
export "LD_LIBRARY_PATH=/data1/cryosparc/cuda/cuda_11.2.2_460.32.03/lib64:/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/external/cudnn/lib"
export "LD_PRELOAD=/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libpython3.8.so"
export "PYTHONPATH=/data1/cryosparc/cryosparc2/cryosparc2_worker"
export "PYTHONNOUSERSITE=true"
export "CONDA_SHLVL=1"
export "CONDA_PROMPT_MODIFIER=(cryosparc_worker_env)"
export "CONDA_EXE=/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/bin/conda"
export "CONDA_PREFIX=/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env"
export "CONDA_PYTHON_EXE=/data1/cryosparc/cryosparc2/cryosparc2_worker/deps/anaconda/bin/python"
export "CONDA_DEFAULT_ENV=cryosparc_worker_env"
Worker environment details below:
-bash: /data1/cryosparc/cryosparc2/cryosparc2_worker: Is a directory
SB_ORIG_LD_LIBRARY_PATH=
SB_ORIG_DYLD_LIBRARY_PATH=
SB_ORIG_PYTHONPATH=
SB_ORIG_CLASSPATH=
PATH=/programs/x86_64-linux/system/sbgrid_bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/programs/share/bin:/programs/share/sbgrid/bin:/programs/x86_64-linux/sbgrid_installer/latest
SB_ORIG_PATH=/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
SB_ORIG_MANPATH=

/programs/x86_64-linux/system/sbgrid_bin/nvcc

nvcc: command not accessible in current configuration, nvcc: is included in
eman2 :
Installed Versions : 2.99, 2.91, 2.31, 20211129, 2.07, nightly_20230215

ImportError: No module named pycuda.driver

3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

              total        used        free      shared  buff/cache   available
Mem:            251          16           0           1         234         232
Swap:             3           3           0

Thank you!!

wtempel · April 25, 2023, 8:02pm

cuMemHostAlloc allocates “system” (or “CPU”) RAM, as opposed to “GPU” RAM. On some CentOS 7 systems, we recommend adding the definition

export CRYOSPARC_NO_PAGELOCK=true

to the worker configuration. In your specific case, this would involve adding that line to

/data1/cryosparc/cryosparc2/cryosparc2_worker/config.sh

(details).
Does this help?
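
If you would rather script the change than edit the file by hand, the following is a minimal sketch that appends the line only when it is not already present. The helper name `ensure_setting` is hypothetical and not part of CryoSPARC; the setting and the path come from this thread. It assumes the config file already exists and that you keep a backup copy before modifying it.

```python
from pathlib import Path

# The definition suggested above for the worker's config.sh.
SETTING = "export CRYOSPARC_NO_PAGELOCK=true"


def ensure_setting(config_path, setting=SETTING):
    """Append `setting` to the config file unless an identical line exists.

    Returns True if the file was modified, False if the line was already there.
    """
    path = Path(config_path)
    text = path.read_text()
    if any(line.strip() == setting for line in text.splitlines()):
        return False
    # Make sure the new line starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + separator + setting + "\n")
    return True


# Example call (path taken from the post above):
# ensure_setting("/data1/cryosparc/cryosparc2/cryosparc2_worker/config.sh")
```

Calling the function a second time leaves the file unchanged, so it is safe to rerun.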