RTX 6000 Ada - device kernel image is invalid

Thanks @UCBKurt.
Please can you also post the output of this command:

/cryosparc-worker/cryosparc_worker/bin/cryosparcw call python -c $'import time;from pycuda import driver;from pycuda.compiler import SourceModule;driver.init();ctx = driver.Device(0).retain_primary_context();ctx.push()\ntry:print(SourceModule("__global__ void f(float *a, float val) { }").get_function("f"))\nexcept Exception as e: print(e)\nfinally:ctx.pop();time.sleep(10)' & \
  (CSPID=$! && sleep 5 && cat /proc/${CSPID}/maps | grep cu)
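
If the shell quoting of that one-liner gives you trouble, the same check can be saved as a standalone script (a sketch; the file name maps_check.py and device index 0 are placeholders) and run with the same cryosparcw call python invocation:

from pycuda import driver
from pycuda.compiler import SourceModule

driver.init()
ctx = driver.Device(0).retain_primary_context()
ctx.push()
try:
    # Compiling a trivial kernel forces the CUDA driver/toolkit libraries to be loaded
    print(SourceModule("__global__ void f(float *a, float val) { }").get_function("f"))
    # Show which libcuda/libcurand/... this process actually mapped
    with open("/proc/self/maps") as maps:
        for line in maps:
            if "cu" in line:
                print(line.rstrip())
except Exception as e:
    print(e)
finally:
    ctx.pop()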

Sorry for the delay, here is the output:

<pycuda._driver.Function object at 0x7f02e9d6b0b0>
7f02e3400000-7f02e8072000 r-xp 00000000 08:12 48005144                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libcurand.so.10.3.0.86
7f02e8072000-7f02e8272000 ---p 04c72000 08:12 48005144                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libcurand.so.10.3.0.86
7f02e8272000-7f02e8278000 r--p 04c72000 08:12 48005144                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libcurand.so.10.3.0.86
7f02e8278000-7f02e96a4000 rw-p 04c78000 08:12 48005144                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libcurand.so.10.3.0.86
7f02e9cdc000-7f02e9cdd000 rw-p 060a4000 08:12 48005144                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/libcurand.so.10.3.0.86
7f02e9e00000-7f02e9edc000 r--p 00000000 08:12 1085705                    /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
7f02e9edc000-7f02ea382000 r-xp 000dc000 08:12 1085705                    /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
7f02ea382000-7f02eb967000 r--p 00582000 08:12 1085705                    /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
7f02eb967000-7f02eb968000 ---p 01b67000 08:12 1085705                    /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
7f02eb968000-7f02eb97f000 r--p 01b67000 08:12 1085705                    /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
7f02eb97f000-7f02eba85000 rw-p 01b7e000 08:12 1085705                    /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
7f02ebc06000-7f02ebcdc000 r--p 00000000 08:12 48377361                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/_driver.cpython-38-x86_64-linux-gnu.so
7f02ebcdc000-7f02ebd6f000 r-xp 000d6000 08:12 48377361                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/_driver.cpython-38-x86_64-linux-gnu.so
7f02ebd6f000-7f02ebdab000 r--p 00169000 08:12 48377361                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/_driver.cpython-38-x86_64-linux-gnu.so
7f02ebdab000-7f02ebdb6000 r--p 001a5000 08:12 48377361                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/_driver.cpython-38-x86_64-linux-gnu.so
7f02ebdb6000-7f02ebdbf000 rw-p 001b0000 08:12 48377361                   /cryosparc-worker/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.8/site-packages/pycuda/_driver.cpython-38-x86_64-linux-gnu.so

We have so far not been able to determine a reliable cure for the cuModuleLoadDataEx failed problem. That said, my next suggestion would be to run CryoSPARC in a simplified environment.
Caution: Following this suggestion is not guaranteed to establish the desired CryoSPARC functionality and may adversely affect other functionality. Moreover, according to the related post 2d classification kernel error - #8 by sheff_diamond_em, simplification was apparently not necessary to resolve cuModuleLoadDataEx failed.
But, in case you want to try it:

  1. Omit cuda directories from PATH or LD_LIBRARY_PATH definitions (outside the cryosparcw environment). cryosparcw alone should handle inclusion of cuda-related directories in those variables’ definitions.
  2. /sbin/ldconfig -p output should not include any libraries inside cuda directories. I would aim for output like this (note the absence of libraries under /usr/local/cuda; a quick Python check for both points is sketched after the example output):
$ /sbin/ldconfig -p | grep -e libcu -e cuda
	libicudata.so.70 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.70
	libcurl.so.4 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcurl.so.4
	libcurl-gnutls.so.4 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcurl-gnutls.so.4
	libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
	libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
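
A quick way to check both points from Python (just a sketch; run it in a regular shell, outside of cryosparcw call):

import os
import subprocess

# Point 1: no cuda directories in PATH or LD_LIBRARY_PATH
for var in ("PATH", "LD_LIBRARY_PATH"):
    hits = [p for p in os.environ.get(var, "").split(":") if "cuda" in p.lower()]
    print(var, "->", hits or "no cuda directories")

# Point 2: no cuda libraries in the dynamic linker cache
out = subprocess.run(["/sbin/ldconfig", "-p"], capture_output=True, text=True).stdout
cuda_libs = [line.strip() for line in out.splitlines() if "/cuda" in line]
print(cuda_libs or "no libraries under cuda directories")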

I wonder whether the presence of i386 libraries could be a problem (it apparently wasn’t in the aforementioned related post).

See Ubuntu or Linux documentation for information on ldconfig and related configuration files.
To keep track of experimental changes to configuration files, you may find etckeeper helpful.

So, it’s the exact same error after simplifying everything. We need the i386 libraries due to another application, so removing them isn’t possible. Also, we don’t have this issue on other machines using i386 libraries. Only on our RTX 6000 Ada and RTX 4000 SFF machines. Could the issue just be that these GPUs require CUDA 12 (since driver 520 doesn’t even recognize them)?

According to NVIDIA’s documentation, Ada is officially supported with CUDA 11.6 (with a supported driver), so I don’t think that’s the issue.

The RTX 6000 Ada has only been supported since driver 525, so 520 not identifying it is unsurprising.

2D classification is failing? So motion correction, CTF estimation and blob picking work and provide results which look appropriate? If you just feed the particles to ab initio or 3D (with [pick a reference]) does it also crash?

How interesting. Other documentation suggests that v11.8 is needed. It would be interesting to hear from other users about their experience with Hopper or Ada Lovelace devices.
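
If anyone wants to report what their worker actually sees, a short script like this (a sketch; save it e.g. as versions.py and run it with cryosparcw call python versions.py) prints the relevant combination:

import pycuda.driver as cuda

cuda.init()
print("pycuda built against CUDA:", ".".join(map(str, cuda.get_version())))
print("driver reports CUDA:", cuda.get_driver_version())  # e.g. 11080 for CUDA 11.8
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print("device", i, dev.name(), "compute capability %d.%d" % dev.compute_capability())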

Just tested and CUDA 11.6 and 11.7 fail to recognize the GPUs. Seems like 11.8 is the oldest that will work.

2D classification is failing? So motion correction, CTF estimation and blob picking work and provide results which look appropriate? If you just feed the particles to ab initio or 3D (with [pick a reference]) does it also crash?

Nothing else outright fails, but local refinement randomly freezes partway through. Doesn’t seem to be consistent when it happens. The freezing happens on different datasets as well, not just one specific one.

Does this happen with caching enabled and a local nvme cache device?
Is there plenty of RAM “available” when that happens?

Does this happen with caching enabled and a local nvme cache device?

Yes, the server has 16TB of U.3 NVMe dedicated specifically for CryoSPARC

Is there plenty of RAM “available” when that happens?

Yes, free says there is over 500GB of memory free and available for use

Thanks for this information. What does the Event Log show at that point? Does the job log (Metadata | Log) show

  • any errors
  • evidence for continuing heartbeats?

No errors. The heartbeats continue to show up, but there are no more cufft logs in the job log and the web UI doesn’t update.

It hung at “Computing FFTs on GPU.” this time, but I’ve seen it hang at “Non-uniform regularization with compute option: GPU” at other points.

Yes, that confused me as well because I was sure 11.8 was the only CUDA 11 with official support for Lovelace. I guess PR and development didn’t communicate well about that…

@UCBKurt Random hangs don’t sound like a CryoSPARC issue directly, but rather a system issue which CryoSPARC happens to expose. What do you see in the system logs? I’m thinking specifically of PCI-E bus errors, memory errors or the GPUs “dropping off the bus”. If you’re not getting problems consistently with everything CryoSPARC does on the GPU… hm. I recently had a problem where CryoSPARC was the only thing which had issues; it turned out to be two bad sticks of RAM, where the ECC had managed to cover it up during early testing, but after a few months of heavy use the errors became uncorrectable.
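
Something like this left running in another terminal while a job hangs (a rough sketch; dmesg may need elevated permissions and the polling interval is arbitrary) can catch those messages and show whether the GPU is still doing anything:

import subprocess
import time

def check():
    # Driver/PCIe trouble usually shows up as Xid, NVRM or AER lines in the kernel log
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in dmesg.splitlines():
        if "Xid" in line or "NVRM" in line or "AER" in line:
            print(line)
    # Is the GPU still busy, or has the job truly stalled?
    smi = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,temperature.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True).stdout
    print(smi.strip())

while True:
    check()
    time.sleep(60)  # poll once a minute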

It’s times like these I fall back on stress-testing methodologies: Prime95 and y-cruncher for the CPU/RAM, memtest, and Unigine Heaven/rthdribl for the GPUs…
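
For the GPU memory specifically, even a crude pycuda soak (a sketch, not a substitute for the tools above: it just copies random data back and forth and checks that it survives) can sometimes shake out marginal hardware:

import numpy as np
import pycuda.autoinit  # creates a context on the first GPU
import pycuda.driver as cuda

a = np.random.rand(64 * 1024 * 1024).astype(np.float64)  # 512 MB buffer
dev_buf = cuda.mem_alloc(a.nbytes)
for i in range(1000):
    cuda.memcpy_htod(dev_buf, a)
    b = np.empty_like(a)
    cuda.memcpy_dtoh(b, dev_buf)
    assert np.array_equal(a, b), "data mismatch on iteration %d" % i
print("soak completed without mismatches")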


Stress testing is actually what I tried first. I didn’t find any hardware issues, plus it’s an entirely brand-new machine. Only local refinement and 2D classification have issues on it.

Is there any ETA on when this will be addressed, either through a patch or a pycuda update? CryoSPARC uses pycuda 2020.1, which may be another reason why it fails.
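
(For reference, the pycuda version bundled with the worker can be confirmed with a snippet like this, run via cryosparcw call python; as far as I know VERSION_TEXT is the attribute pycuda exposes for this.)

import pycuda
print(pycuda.VERSION_TEXT)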

@wtempel @rbs_sci We got our hands on a 4090 and can confirm that it does not have any issues with 11.8 and CryoSPARC. The problem happens exclusively with the RTX 4000 SFF and the RTX 6000 Ada.

Please advise on when CUDA 12 support will be added to CryoSPARC. Right now our RTX 6000 Ada server is mostly useless since CryoSPARC can’t run on it.

We do not currently have an ETA for CUDA v12 support in CryoSPARC. I will send you a direct message with another suggestion based on the specific circumstances you described so far.

Hi,

Looks like something does not recognize the sm_89 architecture of the GPU?
I mean, when I search for this cubin issue, as far as I understand it relates to checking/compiling for the compute capability/code/arch… of the GPU model.

What happens when you run under
CUDA version: 11.8
Driver: 525

cryosparc_worker/bin/cryosparcw call python /path/to/hello_gpu.py, where hello_gpu.py is from the pycuda 2022.2.2 documentation.
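
For convenience, hello_gpu.py boils down to something like this (paraphrased from the pycuda examples, so treat it as a sketch; it compiles and launches a trivial kernel, which is exactly the step that fails with “device kernel image is invalid”):

import numpy
import pycuda.autoinit  # creates a context on the first GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Compilation/loading is where a cuModuleLoadDataEx / invalid kernel image error would surface
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1), grid=(1, 1))
print(dest - a * b)  # should print (near-)zeros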
Also, can you send the output when you run cryosparc_worker/bin/cryosparcw call python /path/to/code.py, where code.py is:

import pycuda
import pycuda.driver as cuda
import pycuda.autoinit  # creates a context on the first available device
cuda.init()

print('Detected {} CUDA Capable device(s)\n'.format(cuda.Device.count()))

for i in range(cuda.Device.count()):
    gpu_device = cuda.Device(i)
    print('Device {}: {}'.format(i, gpu_device.name()))
    compute_capability = float('%d.%d' % gpu_device.compute_capability())
    print('\t Compute Capability: {}'.format(compute_capability))
    print('\t Total Memory: {} megabytes'.format(gpu_device.total_memory() // (1024**2)))

print('%d device(s) found.' % cuda.Device.count())

# Dump every driver attribute of the first device
dev = cuda.Device(0)
print('Device: %s' % dev.name())
print(' Compute Capability: %d.%d' % dev.compute_capability())
print(' Total Memory: %s KB' % (dev.total_memory() // 1024))
atts = [(str(att), value) for att, value in dev.get_attributes().items()]
atts.sort()

for att, value in atts:
    print(' %s: %s' % (att, value))

Maybe this post will help?
https://discuss.cryosparc.com/t/cryosparc-unable-to-run-any-2d-or-3d-job/4391/4?u=mo_o

Best,

Mo

Unfortunately, we do not currently have access to these models. We hypothesize that supporting them requires a CUDA version > 11.8 and are working on supporting a newer CUDA version, but we still do not have an ETA for that.
[Update 2023-09-12: the hypothesis regarding CUDA > 11.8 is likely incorrect]

As a general note to anyone else who might be experiencing this same issue, this will be resolved in a future release of CryoSPARC.


Everyone who was experiencing this issue is encouraged to update to CryoSPARC v4.4, which should resolve it!
