I am running into the following CUDA error when running certain job types on certain GPU nodes:
**** handle exception rc
set status to failed
Traceback (most recent call last):
File "cryosparc2_worker/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 62, in cryosparc2_compute.jobs.template_picker_gpu.run.run
File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 201, in cryosparc2_compute.jobs.template_picker_gpu.run.do_pick
File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 261, in cryosparc2_compute.jobs.template_picker_gpu.run.do_pick
File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/cuda_kernels.py", line 275, in cryosparc2_compute.jobs.motioncorrection.cuda_kernels.do_fcrop_shift_accumulate_gpu
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 362, in cryosparc2_compute.engine.cuda_core.context_dependent_memoize.wrapper
File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/cuda_kernels.py", line 267, in cryosparc2_compute.jobs.motioncorrection.cuda_kernels.get_fcrop_shift_accumulate_gpu
File "cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/compiler.py", line 294, in __init__
self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image is invalid - error : Binary format for key='0', ident='' is not recognized
========= main process now complete.
This has been discussed several times already on the forum:
- "Kernel panic - linux box with 2 GPUS" proposed a solution involving changing AVX settings. I have a different CPU, and my BIOS doesn’t provide these offsets. Plus, what does AVX have to do with CUDA?
- "Cryosparc unable to run any 2D or 3D job" suggests CUDA version mismatches. I’ve double-checked this (see below) and the versions seem to be fine.
- "Error during refinement with C4 symmetry" didn’t find a solution.
It seems like this is a common issue, so I would like to understand the cause more thoroughly.
I am able to reproduce the issue from Python:
eval `bin/cryosparcw env`
cd "$CRYOSPARC_ROOT_DIR"
python -c "import cryosparc2_compute.run as run; run.run()" --project P3 --job J14 --master_hostname merlin-l-002.psi.ch --master_command_core_port 39122
Since it seems to be a problem in pycuda, I used the following script (derived from the pycuda tutorial) to verify that pycuda works and can execute basic CUDA code:
import pycuda.autoinit
import pycuda.driver as drv
import numpy
print("Cuda Version: {}".format(drv.get_version()))
print("Driver Version: {}".format(drv.get_driver_version()))
print("GPU({}): {} # {}".format(pycuda.autoinit.device.count(), pycuda.autoinit.device.name(), pycuda.autoinit.device.pci_bus_id()))
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1), grid=(1,1))
print(dest - a*b)
This prints the expected CUDA toolkit and driver versions, as well as the expected [0...] output, so basic pycuda appears to work.
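One visible difference between the failing and working nodes is the GPU generation: the GTX 1080 and 1080Ti are Pascal cards (compute capability 6.1), while the RTX 2080Ti is Turing (7.5). Whether that is what makes the kernel image invalid is only a guess on my part, but it is cheap to record per node next to the versions printed above. A minimal sketch, where `arch_flag` and `describe_devices` are hypothetical helper names of my own and the only pycuda calls used are `drv.init`, `Device.count`, `Device.name`, `Device.pci_bus_id` and `Device.compute_capability`:

```python
def arch_flag(compute_capability):
    """Format a (major, minor) compute capability as an nvcc-style arch name."""
    major, minor = compute_capability
    return "sm_{}{}".format(major, minor)


def describe_devices():
    """Return one (name, pci_bus_id, arch) tuple per visible GPU.

    Needs pycuda and a working driver, so call it only on a GPU node.
    """
    # Imported lazily so that arch_flag can be used on machines without pycuda.
    import pycuda.driver as drv

    drv.init()
    devices = []
    for index in range(drv.Device.count()):
        dev = drv.Device(index)
        arch = arch_flag(dev.compute_capability())
        devices.append((dev.name(), dev.pci_bus_id(), arch))
    return devices
```

Running `for row in describe_devices(): print(row)` on each node would show whether the failures line up with one architecture, which is the assumption this sketch is meant to check rather than a confirmed cause.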
My next step would normally be to look at the CUDA code from cryosparc2_compute/jobs/motioncorrection/cuda_kernels.py:267, but this is from a compiled extension in cryoSPARC. Is the source for that available anywhere, or is it considered proprietary? If the latter, could the devs provide any hints about what CUDA features might be in use there that are causing the error?
BTW, a secondary bug report: cryosparcw exits with status 0 after this error. Thus it is reported as a “successful” job in our monitoring system.
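Until the exit status is fixed, a monitoring wrapper could inspect the captured job output instead. The sketch below is a workaround under stated assumptions: the two marker strings are copied from the output quoted at the top of this post, treating them as reliable failure indicators is my assumption and not documented cryoSPARC behaviour, and the function names and the idea of passing a log file path are hypothetical.

```python
# Marker strings copied from the failed job's output; assuming they only
# appear when a job has actually failed.
FAILURE_MARKERS = (
    "set status to failed",
    "Traceback (most recent call last):",
)


def log_indicates_failure(log_text):
    """Return True if any failure marker appears in the captured job output."""
    return any(marker in log_text for marker in FAILURE_MARKERS)


def exit_code_for(log_text, reported_status=0):
    """Keep a non-zero status unchanged; turn a 0 into 1 if the log shows a failure."""
    if reported_status != 0:
        return reported_status
    return 1 if log_indicates_failure(log_text) else 0


def main(argv):
    # Hypothetical usage: argv[1] is the path to a file holding the job output.
    with open(argv[1]) as handle:
        return exit_code_for(handle.read())
```

A wrapper script would end with `sys.exit(main(sys.argv))` so that the monitoring system sees a non-zero status for jobs like the one above.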
The errors seem to happen on nodes with Nvidia GTX 1080 or 1080Ti GPUs. Oddly, the same job succeeded on a different node with an RTX 2080Ti. However, we have few 2080Ti nodes available, so I’d like to solve it on the older models.
Current cryoSPARC version: v2.13.2
GPUs with errors: Nvidia GTX 1080, Nvidia GTX 1080Ti
GPUs without errors: Nvidia RTX 2080Ti
CUDA Driver: 440.64.00 (all systems)
CUDA Toolkit: 10.0.130 (all systems)