Error compiling kernel

sbliven · August 14, 2020, 8:36pm

I am running into the following cuda error when running certain job types on certain GPU nodes.

**** handle exception rc
set status to failed
Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 78, in cryosparc2_compute.run.main
  File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 62, in cryosparc2_compute.jobs.template_picker_gpu.run.run
  File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 201, in cryosparc2_compute.jobs.template_picker_gpu.run.do_pick
  File "cryosparc2_worker/cryosparc2_compute/jobs/template_picker_gpu/run.py", line 261, in cryosparc2_compute.jobs.template_picker_gpu.run.do_pick
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/cuda_kernels.py", line 275, in cryosparc2_compute.jobs.motioncorrection.cuda_kernels.do_fcrop_shift_accumulate_gpu
  File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", line 362, in cryosparc2_compute.engine.cuda_core.context_dependent_memoize.wrapper
  File "cryosparc2_worker/cryosparc2_compute/jobs/motioncorrection/cuda_kernels.py", line 267, in cryosparc2_compute.jobs.motioncorrection.cuda_kernels.get_fcrop_shift_accumulate_gpu
  File "cryosparc2_worker/deps/anaconda/lib/python2.7/site-packages/pycuda/compiler.py", line 294, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image is invalid - error   : Binary format for key='0', ident='' is not recognized
========= main process now complete.

This has been discussed several times already on the forum:

Kernel panic - linux box with 2 GPUS proposed a solution involving changing AVX settings. I have a different CPU, and my BIOS doesn’t provide these offsets. Plus, what does AVX have to do with Cuda?
Cryosparc unable to run any 2D or 3D job suggests CUDA version mismatches. I’ve double checked this (see below) and it seems to be fine.
Error during refinement with C4 symmetry didn’t find a solution

It seems like this is a common issue, so I would like to understand the cause more thoroughly.

I am able to reproduce the issue from python:

eval `bin/cryosparcw env`
cd "$CRYOSPARC_ROOT_DIR"
python -c "import cryosparc2_compute.run as run; run.run()" --project P3 --job J14 --master_hostname merlin-l-002.psi.ch --master_command_core_port 39122

Since it seems to be a problem in pycuda, I use the following script (derived from their tutorial) to verify that pycuda works and can execute basic CUDA code:

import pycuda.autoinit
import pycuda.driver as drv
import numpy

print("Cuda Version: {}".format(drv.get_version()))
print("Driver Version: {}".format(drv.get_driver_version()))
print("GPU({}): {} # {}".format(pycuda.autoinit.device.count(), pycuda.autoinit.device.name(), pycuda.autoinit.device.pci_bus_id()))

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print(dest - a*b)

This prints the expected cuda toolkit and driver versions, as well as the expected [0...] output. So it appears basic pycuda works.

My next step would normally be to look at the cuda code from cryosparc2_compute/jobs/motioncorrection/cuda_kernels.py:267, but this is from a compiled extension in cryosparc. Is the source for that available anywhere, or is it considered proprietary? If the latter, could the devs provide any hints about what cuda features might be in use there that are causing the error?

BTW, a secondary bug report: cryosparcw exits with status 0 after this error. Thus it is reported as a “successful” job in our monitoring system.

The errors seem to happen on nodes with Nvidia GTX 1080 or 1080Ti GPUs. Weirdly it was successful on a different node with an RTX 2080Ti. However we have few 2080Ti nodes available, so I’d like to solve it on the older models.

Current cryoSPARC version: v2.13.2
GPUs with errors: Nvidia GTX 1080, Nvidia GTX 1080Ti
GPUs without errors: Nvidia RTX 2080Ti
Cuda Driver: 440.64.00 (all systems)
Cuda Toolkit: 10.0.130 (all systems)

sbliven · August 19, 2020, 2:57pm

I’ve determined that the failing jobs (e.g. blob picker, template picker, probably others) fail consistently on GTX 1080 and 1080Ti GPUs, and complete successfully on RTX 2080Ti cards. That’s weird, since all machines have the same driver and toolkit versions installed, and it should be compatible with all three cards.

Some more info about the cuda code being executed would be really helpful here, but I’m unsure how to get it.

gebauer · August 15, 2023, 1:09pm

Hi, I am curretnly trying to include an “older” GTX1080Ti system in my otherwise working CryoSparc installation. I have the same problem with pycuda and it works fine on our 3090 machines. Have you managed to get your systems runnig?

Best
Jan

sbliven · August 17, 2023, 2:02pm

Yes, this is solved for us now and the 1080s work fine. I think it just went away at some point after a combination of driver updates and cryosparc updates. We’re now running driver 530.30.02 and cuda 11.5.1 with cryosparc 4.2.1, if that helps.

gebauer · August 17, 2023, 2:17pm

Thanks a lot.

I just solved it by installing Cuda 10.1 and Nvidia 530.30.02 and CS 4.3.

Maybe I’ll try CUDA 11.5.1 at some point, but in the moment I am just happy that I found a combination that worked. If you ever update your CUDA - I would be interested to hear if it works for you.
I had no luck with CUDA 11.8 and I guess I also tried CUDA 11.4, but maybe the Driver is more important here than the CUDA version.

Thanks again for your reply
Jan