cuMemcpyHtoDAsync failed

BWise · February 21, 2022, 6:09pm

Hello!
We have started getting this error message when running various jobs (see
the example screenshot below from a blob picking job).

After the cryosparc job fails, I get this message from systemlogd:
kernel:[253932.878477] watchdog: BUG: soft lockup - CPU#2 stuck for 48s! [gnome-shell:1808] (this message keeps repeating with a different CPU# until the system is physically rebooted).

We are not sure what the problem is… Some googling led me to believe it was something to do with the swapfile. However, deleting or replacing it had no effect. Also, if I monitor the system with htop while the job is running I notice the memory cache slowly increases until it reaches the maximum at which point the job crashes…
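
For anyone wanting to log the same observation instead of watching htop, here is a minimal sketch, assuming a Linux host with `/proc/meminfo` and Python 3. The function names (`parse_meminfo`, `summarize`, `watch`) are made up for illustration. Note that on Linux the page cache normally grows to fill otherwise unused RAM and is reclaimable, so `MemAvailable` is usually the more telling figure than `Cached`.

```python
import time


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
            info[key.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return a few memory fields converted from kB to GiB (one decimal)."""
    kb_per_gib = 1024 ** 2
    wanted = ("MemTotal", "MemAvailable", "Cached", "SwapFree")
    return {name: round(info[name] / kb_per_gib, 1) for name in wanted if name in info}


def watch(samples=10, interval=30, path="/proc/meminfo"):
    """Print a timestamped memory summary `samples` times, `interval` seconds apart."""
    for _ in range(samples):
        with open(path) as handle:
            print(time.strftime("%H:%M:%S"), summarize(parse_meminfo(handle.read())))
        time.sleep(interval)
```

Calling `watch()` in a second terminal while the job runs would give a record of whether available memory really runs out before the crash, or whether only the cache fills up.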

If anyone has encountered this before or has any ideas on what to do, it would be very helpful!

thanks!
Ben

wtempel · February 22, 2022, 4:58pm

@BWise Which other job types gave you this same cuMemcpyHtoD error?
Is there anything remarkable about the input data, such as the size of the structure determination target, the data format or the data quality?
Please can you also provide:

GPU model(s) and driver version (nvidia-smi)
Linux kernel (uname -a)
DRAM size (free -g)
worker CUDA version:

eval $(cryosparc_worker/bin/cryosparcw env)
python -c "import pycuda.driver; print(pycuda.driver.get_version())"
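
The first three items in the list above could also be gathered in one go. This is only a sketch, assuming Python 3 and that `nvidia-smi`, `uname` and `free` are installed; the names `COMMANDS`, `run` and `collect` are made up for illustration. It does not cover the pycuda check, which needs the worker environment loaded first as shown above.

```python
import shutil
import subprocess

# Label -> command, mirroring the items requested in the list above.
COMMANDS = {
    "GPU model(s) and driver version": ["nvidia-smi"],
    "Linux kernel": ["uname", "-a"],
    "DRAM size": ["free", "-g"],
}


def run(cmd):
    """Run one command and return its output, or a note if it is not installed."""
    if shutil.which(cmd[0]) is None:
        return f"(command not found: {cmd[0]})"
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return (result.stdout or result.stderr).strip()


def collect():
    """Run every command in COMMANDS and return {label: output}."""
    return {label: run(cmd) for label, cmd in COMMANDS.items()}


if __name__ == "__main__":
    for label, output in collect().items():
        print(f"== {label} ==")
        print(output)
        print()
```

The printed text can then be pasted straight into a reply.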

BWise · February 22, 2022, 6:15pm

Hi, thanks for the reply!

No, I don’t think there is anything remarkable about this data. The target is about 120 kDa and 200-300 angstroms in size, and the raw movies are in .tiff format. It is a new target and I just recently collected a first dataset of about 8000 movies. I followed my usual workflow: import movies → patch motion → patch CTF → curate exposures, all without problems. The error appeared after that.

Error occurs so far during Blob picking and Extract from micrographs.
cryosparc is up to date with the latest patches.
GPU model(s): 2xQuadro RTX5000 and driver version: 495.29.05
Linux kernel: 5.13.0-30-generic
worker CUDA version: (11, 5, 0)
DRAM size:
              total        used        free      shared  buff/cache   available
Mem:            251           2         150           0          98         247
Swap:           255           0         255

thanks for your help!

wtempel · February 23, 2022, 5:48pm

@BWise As a first step in troubleshooting this issue, please configure your worker and test the failed job with CUDA-11.2:

BWise · February 24, 2022, 4:50pm

Hi,
I followed your advice and configured the worker to use CUDA-11.2. I cloned and reran the extraction job. It worked!!

Again I monitored the system with htop while the job was running, and although the memory cache reached the maximum like last time, the job did not crash. 2D classification is running now, so far without any problems. So hopefully that solves the problem.

One thing though: when I tried to change the CUDA path with cryosparcw, I got the error “cryosparcw: command not found”. Am I missing something?
Instead, I used nano to edit the cryosparc_worker/config.sh file, but for future reference it would be useful to know why the cryosparcw command does not work for me.

thanks a lot!

wtempel · February 24, 2022, 6:08pm

@BWise unless <path-to-cryosparc-worker-software>/bin has been added to the PATH environment variable, you should run (based on the output you posted earlier):
/usr/local/cryosparc/cryosparc/cryosparc_worker/bin/cryosparcw newcuda <cuda-path>
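
To illustrate why the bare command fails: the shell only finds `cryosparcw` if its directory is listed in PATH, otherwise the full path is needed. A minimal sketch of that lookup, assuming Python 3; `resolve_cryosparcw` is a made-up helper name, and the directory is the one from the command above.

```python
import os
import shutil


def resolve_cryosparcw(worker_bin):
    """Return a usable path to cryosparcw, or None if it cannot be found.

    Looks on PATH first (what the shell does for a bare command name),
    then falls back to the given worker bin directory.
    """
    on_path = shutil.which("cryosparcw")
    if on_path:
        return on_path
    candidate = os.path.join(worker_bin, "cryosparcw")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    # Install location taken from the command above; adjust for other systems.
    worker_bin = "/usr/local/cryosparc/cryosparc/cryosparc_worker/bin"
    print(resolve_cryosparcw(worker_bin) or "cryosparcw not found")
```

Adding that bin directory to PATH in the shell profile would make the bare `cryosparcw` command work as well.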

BWise · February 25, 2022, 11:25am

ah yes.
my mistake
that works, thanks!

BWise · February 25, 2022, 12:50pm

Hi again. Unfortunately, I spoke too soon.

2D classification ran without problems.
However, the exact same error described above occurred again when extracting the same particles with a smaller box size.

Originally, with CUDA 11.5, the extraction job would consistently crash at around 800 micrographs out of 4000. Now, with CUDA 11.2, the job crashed at around 3300 out of 4000. Why it completed successfully the first time with a larger box size, I do not know…

Other ideas? thanks again!
Ben

wtempel · February 25, 2022, 9:59pm

@BWise Unfortunately, I don’t have any suggestions from the software side at this time. Are you confident in the integrity of your GPU hardware, i.e., that none of the GPUs is failing?

BWise · March 3, 2022, 10:35am

Hi
Yes thanks.
While monitoring the GPUs with nvtop during a running job I did indeed notice one of the GPUs was exhibiting strange behavior. So, a failing GPU seems to be the culprit.

So far I have not seen this error anymore if I avoid using this failing GPU.
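
For readers who want a per-GPU record rather than watching nvtop, a minimal sketch, assuming Python 3 and an `nvidia-smi` that supports the `--query-gpu` option. The helper names `parse_gpu_csv` and `query_gpus` and the choice of fields are illustrative assumptions, not something taken from this thread.

```python
import csv
import io
import shutil
import subprocess

# Fields passed to nvidia-smi --query-gpu; one CSV row is returned per GPU.
FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu", "memory.used"]


def parse_gpu_csv(text):
    """Turn 'csv,noheader,nounits' output from nvidia-smi into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Ask nvidia-smi for the current state of every GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found")
    else:
        for gpu in query_gpus():
            print(gpu)
```

Running this at intervals during a job and comparing the two cards side by side may show which one misbehaves, for example a GPU that drops out of the list or reports values very different from its twin under the same load.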