Error during 2D-classification

MichaelZ · October 17, 2022, 12:24pm

Hi,
I am running CryoSPARC v4.0.1 and recently I have been getting the following error from time to time during 2D-classification and 3D-ab initio reconstruction jobs.

Traceback (most recent call last):
  File "/home/cryosparc/cryosparc_worker/cryosparc_compute/jobs/runcommon.py", line 1925, in run_with_except_hook
    run_old(*args, **kw)
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 131, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 132, in cryosparc_compute.engine.cuda_core.GPUThread.run
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 1082, in cryosparc_compute.engine.engine.process.work
  File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 329, in cryosparc_compute.engine.engine.EngineThread.compute_resid_pow
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 313, in cryosparc_compute.engine.cuda_core.EngineBaseThread.toc
  File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 309, in cryosparc_compute.engine.cuda_core.EngineBaseThread.wait
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered

I have 2 Nvidia GTX 1080 graphics cards running the driver version: 520.61.05 and CUDA version 11.8.
The job that crashed had around 650k particles which were Fourier cropped to 128 px, so this job is not extremely demanding on the hardware…
Unfortunately, I could not identify what causes this error: sometimes the jobs run fine and sometimes they all crash, even jobs with as few as 5000 particles, which I ran to check whether data size is the issue.

Any helpful inputs are greatly appreciated.

Cheers
Michael

wtempel · October 17, 2022, 1:40pm

Does the worker node also handle GPU workloads that are launched independently of the CryoSPARC scheduler?

MichaelZ · October 17, 2022, 2:43pm

This PC is only used for cryosparc. Unfortunately, I don’t have anything else that uses the GPU intensively to test if it is because of the load.

wtempel · October 17, 2022, 5:41pm

We have a few more questions that may help us find out what’s wrong:
Did you observe the error on any other dataset(s)?
Is there one GPU such that the error can reliably be avoided if you directly target the job to that GPU?
Did you experience this error on CryoSPARC before upgrading to v4?

wtempel · October 18, 2022, 1:39pm

A post was split to a new topic: cufftInternalError during 2d classification

MichaelZ · October 18, 2022, 9:25am

Yes, this error happened for different datasets of various sizes. At first it occurred rarely, and simply restarting the job helped and it would complete fine. Recently, however, I have been getting it a lot more frequently. The error also occurred on the previous version I was running, which was v3.3.2.

In addition, I tried to find out whether it fails on a specific GPU. To test this, I ran a 2D classification with 50 classes and 290k particles. It completed on one GPU, but when I ran it on the other GPU the whole PC crashed. However, after a reboot I launched the same job again on the GPU where it had failed earlier, and it completed successfully.

I also noticed that the troubleshooting website states that CUDA versions greater than 11.7.1 are not supported by CryoSPARC. I am using version 11.8, could that be a problem?

I also monitored the GPU temperature during the second run on the GPU that caused the crash, but it never went higher than 80 °C. In addition, less than half of the available GPU memory was used (max. 2.5 GB / 8 GB).
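For anyone who wants to log these readings over the course of a job rather than watch them live, here is a minimal sketch that polls `nvidia-smi` and parses its CSV output. The function names (`parse_gpu_csv`, `query_gpus`, `log_gpus`) are made up for illustration, and the set of fields is an assumption about what is useful here; some cards or drivers may report `[N/A]` for a field such as `power.draw`, so the values are kept as strings.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi (see `nvidia-smi --help-query-gpu`).
QUERY_FIELDS = ["index", "temperature.gpu", "memory.used", "memory.total", "power.draw"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def log_gpus(interval_s=5, samples=10):
    """Print a timestamped reading for every GPU, `samples` times."""
    for _ in range(samples):
        stamp = time.strftime("%H:%M:%S")
        for gpu in query_gpus():
            print(stamp, gpu, flush=True)
        time.sleep(interval_s)
```

Calling `log_gpus(interval_s=5, samples=720)` in a second terminal while the job runs would cover about an hour; redirecting the output to a file keeps the last readings before a crash.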

wtempel · October 18, 2022, 2:42pm

That to me suggests a hardware or power issue.

MichaelZ · October 24, 2022, 12:05pm

I noticed that the required specs for CryoSPARC say that a GPU with 11 GB RAM or more is needed. However, I have 2 Nvidia GTX 1080 with 8 GB each. Could that be a problem?
It seems strange to me, since sometimes jobs with only 50k particles also fail, and sometimes all kinds of jobs complete without any issues.

MichaelZ · October 27, 2022, 1:21pm

We decided to downgrade to version 3.4.0 and that seemed to run smoothly. However, sometimes I still get the same error and I cannot tell why this is happening. We also made sure that the CUDA version is now as suggested on the webpage (we now run v11.6).

It would still be nice to hear back if these issues could be because of the specs of our workstation, i.e. that cryosparc v4 really needs that much more power to run.

wtempel · December 14, 2022, 10:30pm

I have not compared the demands on hardware between similar jobs run with version 3 or 4, respectively, but my intuition is that v4 does not need

Your findings suggest that neither the complexity of the computational problem nor the software version is a reliable predictor of job failures or even system crashes. You may want to try to eliminate:

- power supply issues (the computer’s power supply unit(s) and the power rating of the circuit to which the computer is connected)
- faulty hardware. Various stress tests for various computer components are available. Carefully evaluate the potential risks of a specific test before deciding to run it.
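Before running any stress test, it may also be worth checking whether the NVIDIA kernel driver logged anything around the time of a failure. The driver reports GPU faults to the kernel log as "Xid" messages. Below is a minimal sketch that picks such lines out of saved log text; the function name `find_xid_events` is made up for illustration, and the regular expression assumes the commonly seen `NVRM: Xid (PCI:...): <code>, ...` layout, which may differ between driver versions.

```python
import re

# Assumed layout of an Xid line, e.g.
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, full_line) for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events
```

The input could be the output of `dmesg` or `journalctl -k` saved to a file. If Xid lines always name the same PCI address, that points at one card (or its slot or power cabling) rather than at the software.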

rbs_sci · December 15, 2022, 2:03am

Intermittent errors are the hardest to pin down…

I think wtempel is right to suspect the PSU. A failing or faulty PSU can manifest in all sorts of odd behaviour. (Bad memories after moving home for that…)

Another option is RAM. I would boot to memtest and run for at least two full passes (although errors usually manifest before that).

The other option is to test the GPU with something like RTHDRIBL. That has always shown GPU errors (usually memory related) for me, and isn’t too demanding, unlike Furmark or other power-virus-like GPU stress tests, which I avoid like the plague as their loads are utterly unrealistic. The slightly older “Heaven” GPU benchmark might also be useful.

If RAM and GPU pass testing (the GPU test should be run for at least 20 minutes), find a new PSU and see if the problem continues.