We are seeing persistent errors with specific GPU-accelerated jobs (2D Classification, Homogeneous Refinement) on our machines with Titan XPs and 1080 Tis, running cryoSPARC v2.9.0.
No errors on these machines:
- Dual RTX Titan; Debian GNU/Linux 8.10
- GTX 1060 3GB; Debian GNU/Linux 8.11 (apart from the obvious box size limitations)
Errors on these machines:
- Dual Titan XP; Debian GNU/Linux 8.11
- Dual 1080 Ti; Debian GNU/Linux 8.11
These jobs ran successfully on older versions of cryoSPARC, but it is unclear when the issue arose.
We have updated to CUDA 10.1, with no resolution of the issue.
Our cryosparc2_worker/config.sh file from the Titan XP machine:
export CRYOSPARC_LICENSE_ID="<our license number is here>"
export CRYOSPARC_USE_GPU=true
export CRYOSPARC_CUDA_PATH="/opt/cuda/10.1.168"
export CRYOSPARC_DEVELOP=false
The 2D Classification job always errors out immediately after “Start of Iteration 0”:
Traceback (most recent call last):
File "cryosparc2_compute/jobs/runcommon.py", line 830, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4625)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4576)
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 991, in cryosparc2_compute.engine.engine.process.work (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:27291)
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 83, in cryosparc2_compute.engine.engine.EngineThread.load_image_data_gpu (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:5179)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:9489)
LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS
The Homogeneous Refinement job always errors out at “Estimating scale of initial reference”:
====== Refinement ======
Refining Structure with volume size 500.
Starting at initial resolution 30.000A (radwn 29.333).
Aligning initial model to symmetry.
Estimating scale of initial reference.
Traceback (most recent call last):
File "cryosparc2_compute/jobs/runcommon.py", line 830, in run_with_except_hook
run_old(*args, **kw)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 110, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4625)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 111, in cryosparc2_compute.engine.cuda_core.GPUThread.run (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:4576)
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 1053, in cryosparc2_compute.engine.engine.process.work (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:28374)
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", line 308, in cryosparc2_compute.engine.engine.EngineThread.compute_resid_pow (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/engine.c:11165)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 293, in cryosparc2_compute.engine.cuda_core.EngineBaseThread.ensure_allocated (/home/installtest/deps_manage/cryosparc2_package/deploy/stage/cryosparc2_worker/cryosparc2_compute/engine/cuda_core.c:9489)
LogicError: cuMemHostAlloc failed: OS call failed or operation not supported on this OS
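Both jobs fail in the same call. Since cuMemHostAlloc allocates page-locked (pinned) host memory, one thing we can compare between the working and failing machines is the locked-memory limit of the account that runs the worker. This is only a guess at a contributing factor, not a confirmed cause. A minimal Python sketch for that check (the helper names `format_limit` and `memlock_limits` are ours for illustration, not part of cryoSPARC):

```python
import resource


def format_limit(value):
    """Render an rlimit value; RLIM_INFINITY means the limit is not set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return "%d bytes" % value


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    # Run as the same user that launches cryosparc2_worker, since
    # limits are per-process and inherited from the launching shell.
    soft, hard = memlock_limits()
    print("max locked memory: soft=%s hard=%s" % (soft, hard))
```

Running this on each of the four machines would show whether the limit differs between the ones that work and the ones that fail. In bash, `ulimit -l` reports the same soft limit, in kilobytes.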
The two existing forum threads that I’ve found are not helpful:
Any ideas what may be the root of the issue? Thanks in advance!