LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

#1

We are seeing persistent errors with specific GPU-accelerated jobs (2D Classification, Homogeneous Refinement) on our machines with Titan XPs, running cryoSPARC v2.9.0.

No errors on these machines:

  • Dual RTX Titan; Debian GNU/Linux 8.10
  • GTX 1060 3GB; Debian GNU/Linux 8.11 (apart from the obvious box-size limitations)

Errors on these machines:

  • Dual Titan XP; Debian GNU/Linux 8.11
  • Dual 1080 Ti; Debian GNU/Linux 8.11

These jobs ran successfully on older versions of cryoSPARC, but it is unclear when the issue arose.

We have updated to CUDA 10.1, with no resolution of the issue.

Our cryosparc2_worker/config.sh file from the Titan XP machine:

export CRYOSPARC_LICENSE_ID="<our license number is here>"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/opt/cuda/10.1.168"
export CRYOSPARC_DEVELOP=false

2D Classification job always errors out immediately after “Start of Iteration 0”:

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 830, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4625)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4576)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.process.work (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:27291)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 83, in cryosparc2_compute.engine.engine.EngineThread.load_image_data_gpu (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:5179)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:9489)
LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

Homogeneous Refinement always errors out at “Estimating scale of initial reference”:

====== Refinement ======
Refining Structure with volume size 500.
Starting at initial resolution 30.000A (radwn 29.333).
Aligning initial model to symmetry.
Estimating scale of initial reference.

Traceback (most recent call last):
  File "cryosparc2_compute/jobs/runcommon.py", line 830, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4625)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4576)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1053, in cryosparc2_compute.engine.engine.process.work (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:28374)
  File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 308, in cryosparc2_compute.engine.engine.EngineThread.compute_resid_pow (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:11165)
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:9489)
LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS

The two existing forum threads that I’ve found are not helpful:


Any ideas what may be the root of the issue? Thanks in advance!


#2

Also relevant:
When the GTX 1060 3GB is added to the Dual Titan XP machine, 2D Classification and Homogeneous Refinement work without issue, but only if the job is assigned to the weaker card (GTX 1060). If it grabs either of the Titan XPs, the job fails as described above.
Incredibly confusing…
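
For anyone wanting to check this card by card outside of cryoSPARC, here is a minimal sketch that attempts a page-locked (pinned) host allocation through pyCUDA with a context on a chosen GPU. It assumes pyCUDA and NumPy are importable in the Python environment being used, and it relies on `pycuda.driver.pagelocked_empty`, which goes through `cuMemHostAlloc`. Whether cryoSPARC's `ensure_allocated` takes the same path is an assumption; the function name `try_pinned_alloc` and the 64 MB size are made up for illustration.

```python
import sys


def try_pinned_alloc(device_index, nbytes=64 * 1024 * 1024):
    """Try a pinned host allocation while a context on the given GPU is active.

    Returns (ok, message). Hypothetical helper, not part of cryoSPARC or pyCUDA.
    """
    try:
        import numpy as np
        import pycuda.driver as cuda
    except ImportError as exc:
        return False, "pyCUDA/NumPy not importable: %s" % exc

    try:
        cuda.init()
        ctx = cuda.Device(device_index).make_context()
    except cuda.Error as exc:
        return False, "GPU %d: could not create context: %s" % (device_index, exc)

    try:
        # pagelocked_empty requests pinned host memory from the CUDA driver.
        buf = cuda.pagelocked_empty(nbytes, dtype=np.uint8)
        size = buf.nbytes
        del buf  # release the pinned buffer while the context is still current
        return True, "GPU %d: pinned allocation of %d bytes succeeded" % (device_index, size)
    except cuda.Error as exc:
        return False, "GPU %d: %s" % (device_index, exc)
    finally:
        ctx.pop()


if __name__ == "__main__":
    # GPU indices come from the command line; default to GPU 0.
    indices = [int(arg) for arg in sys.argv[1:]] or [0]
    for index in indices:
        ok, message = try_pinned_alloc(index)
        print(("OK   " if ok else "FAIL ") + message)
```

Running it with several indices (for example `0 1 2`) would show whether the failure follows the card or the machine, independent of any cryoSPARC job.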


#3

This is indeed very odd… I don’t think we have seen this error message from CUDA before (@sarulthasan?). And the GPU kernels for 2D Classification and Homogeneous Refinement haven’t changed in several versions…
There also seems to be no info online about similar errors. We have passed this report along to the pyCUDA developers list (the error is coming from CUDA via pyCUDA).


#4

Hi @dgoetschius,

The pyCUDA developers are also baffled by this error unfortunately…

"I'm sorry to say that I've never seen or heard of this error message. One thing that comes to mind is that this might be an issue of PCIe versioning. The 1060 might be PCIe3, while the XP might be PCIe2 (guessing, might be better to check), and driver support might differ."

In case you haven’t already, it may be worthwhile to upgrade/downgrade the NVIDIA driver version and see if that changes anything.
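
To compare driver version and PCIe generation across cards, something like the following could be used. It is only a sketch: it assumes `nvidia-smi` is on the PATH and shells out to its `--query-gpu` CSV output. The helper names `parse_gpu_report` and `query_gpus` are made up for illustration.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = [
    "index",
    "name",
    "driver_version",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
]


def parse_gpu_report(text):
    """Turn 'csv,noheader' output from nvidia-smi into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row
    ]


def query_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader",
        ],
        universal_newlines=True,
    )
    return parse_gpu_report(output)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(
            "GPU %(index)s %(name)s: driver %(driver_version)s, "
            "PCIe gen %(pcie.link.gen.current)s (max %(pcie.link.gen.max)s)" % gpu
        )
```

Note that the current link generation can read lower than the maximum while a card is idle, so the max value is probably the more useful one for this comparison.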


#5

Thanks for the response. I’ll let you know if we have any more luck troubleshooting things on our end.


#6

Update in case it helps anyone else troubleshoot:

So far we’ve isolated the problem to OS version Debian GNU/Linux 8.11 (jessie). As 8.10 worked perfectly, something in the 8.11 release seems to have broken compatibility.

After an update to Debian GNU/Linux 9.9 (stretch), these job types are working. Next up we’ll be testing on Debian 10. Fingers crossed.
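
In case it is useful for checking other workers, here is a rough sketch that reads the Debian point release and compares it against what we observed in this thread. It assumes the point release is available in `/etc/debian_version`. The `REPORTED` table only contains the three versions mentioned above, and the helper names are made up for illustration.

```python
# Outcomes observed in this thread only; anything else is untested here.
REPORTED = {
    "8.10": "worked",
    "8.11": "failed",
    "9.9": "worked",
}


def read_debian_version(path="/etc/debian_version"):
    """Return the Debian version string (e.g. '8.11'), or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except (IOError, OSError):
        return None


def thread_status(version):
    """Map a Debian version to the outcome reported in this thread."""
    return REPORTED.get(version, "not tested in this thread")


if __name__ == "__main__":
    version = read_debian_version()
    print("Debian version: %s -> %s" % (version, thread_status(version)))
```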