I am running CryoSPARC v4.0.1, and recently I have been getting the following error intermittently during 2D classification and 3D ab initio reconstruction jobs:
Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1925, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1082, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 329, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 313, in cryosparc_compute.engine.cuda_core.EngineBaseThread.toc
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 309, in cryosparc_compute.engine.cuda_core.EngineBaseThread.wait
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered
I have two Nvidia GTX 1080 graphics cards running driver version 520.61.05 with CUDA version 11.8.
The job that crashed had around 650k particles which were Fourier cropped to 128 px, so this job is not extremely demanding on the hardware…
Unfortunately, I could not identify what causes this error. Sometimes the jobs run fine and sometimes they all crash, even jobs with as few as 5000 particles, which I ran to check whether this is a data-size issue.
We have a few more questions that may help us find out what’s wrong:
Did you observe the error on any other dataset(s)?
Is there one GPU such that the error can reliably be avoided if you directly target the job to that GPU?
Did you experience this error on CryoSPARC before upgrading to v4?
Yes, this error happened for different datasets of various sizes. Previously it occurred rarely, and simply restarting the job was enough for it to complete. Recently, however, I have been getting it much more frequently. The error also occurred on the previous version I was running, v3.3.2.
In addition, I tried to find out whether it fails on a specific GPU. As a test, I ran a 2D classification with 50 classes and 290k particles. It completed on one GPU, but when I ran it on the other GPU the whole PC crashed. However, after a reboot I launched the same job again on the GPU where it had failed earlier, and it completed successfully.
I also noticed that the troubleshooting website states that CUDA versions greater than 11.7.1 are not supported by CryoSPARC. I am using version 11.8, could that be a problem?
I also monitored the GPU temperature during the second run on the GPU that caused the crash, and it never went above 80 °C. Less than half of the available GPU memory was in use (max. 2.5 GB of 8 GB).
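For anyone wanting to repeat this kind of monitoring, here is a minimal Python sketch that polls `nvidia-smi` and appends each sample to a log file, so the last readings survive if the whole machine goes down. The function names, the log path and the sampling interval are made up for illustration; the query fields are standard `nvidia-smi --query-gpu` properties, although some cards or drivers may report `power.draw` as not available.

```python
import csv
import io
import subprocess
import time

# Properties requested from nvidia-smi, in output order.
QUERY_FIELDS = ["index", "temperature.gpu", "memory.used", "memory.total", "power.draw"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the current readings for every GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def log_gpus(path="gpu_monitor.log", samples=720, interval=5.0):
    """Append `samples` readings to `path` (hypothetical file name), one every `interval` seconds."""
    with open(path, "a") as log:
        for _ in range(samples):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for gpu in query_gpus():
                log.write(f"{stamp} {gpu}\n")
            log.flush()  # push each sample out of Python's buffer right away
            time.sleep(interval)
```

Calling `log_gpus()` in a separate terminal while the job runs would record roughly an hour of samples with the defaults above. Values are kept as strings because some fields may come back as text such as `[N/A]`.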
I noticed that the required specs for CryoSPARC say that a GPU with 11 GB of RAM or more is needed. However, I have two Nvidia GTX 1080 cards with 8 GB each. Could that be a problem?
It seems strange to me, since sometimes even jobs with only 50k particles fail, while at other times all kinds of jobs complete without any issues.
We decided to downgrade to version 3.4.0 and that seemed to run smoothly. However, sometimes I still get the same error and I cannot tell why this is happening. We also made sure that the CUDA version is now as suggested on the webpage (we now run v11.6).
It would still be nice to hear back on whether these issues could be due to the specs of our workstation, i.e. whether CryoSPARC v4 really needs that much more power to run.
I think wtempel is right to suspect the PSU. A failing or faulty PSU can manifest in all sorts of odd behaviour. (Bad memories after moving home for that…)
Another option is RAM. I would boot to memtest and run for at least two full passes (although errors usually manifest before that).
The other option is to test the GPU with something like RTHDRIBL. That has always shown GPU errors (usually memory related) for me, and isn’t too demanding, unlike Furmark or other power-virus-like GPU stress tests, which I avoid like the plague as their loads are utterly unrealistic. The slightly older “Heaven” GPU benchmark might also be useful.
If RAM and GPU both pass testing (the GPU test should be run for at least 20 minutes), find a new PSU and see if the problem continues.
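Alongside those tests, it may be worth checking whether the NVIDIA kernel driver logged any Xid errors around the time of a crash, since these are the driver's own record of GPU faults. Below is a rough Python sketch that picks such lines out of kernel log text. The function name is made up for illustration, and the pattern assumes the usual `NVRM: Xid (PCI:...): <code>, ...` layout, which may differ between driver versions.

```python
import re

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return (device, xid_code, full_line) for every Xid line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code")), line.strip()))
    return events
```

The input could be the output of `dmesg`, or of `journalctl -k -b -1` for the boot before a hard crash (the latter only works if the journal is kept across reboots). If the same PCI address shows up repeatedly, that points at one card; events spread across both cards would fit better with a shared cause such as the PSU.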