Failing jobs suddenly

LTP · June 15, 2023, 11:27am

Hello,

For some reason I’ve been getting the following when running CryoSPARC on two different workstations:

Up until this point, all jobs ran smoothly. Then, suddenly, jobs started to crash. I saw that the temp folder had filled up, but even after clearing some space the failure remains and I can no longer run any job.

For instance, this is the error I’m getting now:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/local_refine/newrun.py", line 401, in cryosparc_compute.jobs.local_refine.newrun.run_local_refine
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2877, in cryosparc_compute.engine.newengine.get_initial_noise_estimate
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2897, in cryosparc_compute.engine.newengine.get_initial_noise_estimate
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 538, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 532, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 22, in cryosparc_compute.engine.newgfourier.get_plan_R2C_2D
  File "/data/loewith/tafurpet/software/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 115, in __init__
    self.handle = gpufft.gpufft_get_plan(
RuntimeError: cuFFT failure: cufftSetStream(plan_cache.plans[idx].handle, device_stream)
-> CUFFT_INVALID_PLAN

Again, everything runs normally up until this point.

What can be the reason and how can I solve it?

Thanks in advance for your help.

LTP · June 15, 2023, 12:22pm

As an update, I also ran these tests, but unfortunately they did not give any information on how to solve the problem:

test workers P22 --test-pytorch
Using project P22
Enabling PyTorch test
Running worker tests…
2023-06-15 11:51:37,299 WORKER_TEST log CRITICAL | Worker test results
2023-06-15 11:51:37,299 WORKER_TEST log CRITICAL | fagus
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | ✕ LAUNCH
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | Error:
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | See P22 J148 for more information
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | SSD
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | Did not run: Launch test failed
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | GPU
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | Did not run: Launch test failed

test workers P22 --test-tensorflow
Using project P22
Enabling Tensorflow test
Running worker tests…
2023-06-15 11:55:22,202 WORKER_TEST log CRITICAL | Worker test results
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | fagus
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | ✕ LAUNCH
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | Error:
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | See P22 J149 for more information
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | SSD
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | Did not run: Launch test failed
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | GPU
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | Did not run: Launch test failed

wtempel · June 15, 2023, 5:34pm

Please can you provide additional information:

version and patch information for your CryoSPARC instance
the path to the temp folder you mentioned. Did you mean /tmp?
do all jobs fail with cuFFT failure?
have there been any software updates on the workstations, for example automatic OS updates?
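
To help answer the question about the temp folder, here is a minimal sketch of how its free space could be checked from Python. It assumes the folder is /tmp and uses a hypothetical 10 GiB threshold; the function name and the threshold are illustrative only and are not part of CryoSPARC.

```python
import shutil


def temp_space_report(path="/tmp", min_free_gib=10.0):
    """Return (free_gib, percent_used, low_space) for the filesystem holding `path`.

    `min_free_gib` is an arbitrary example threshold, not a CryoSPARC requirement.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_gib = usage.free / 1024**3
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    low_space = free_gib < min_free_gib
    return free_gib, percent_used, low_space
```

Calling temp_space_report() on the affected workstation returns the free space in GiB, the percentage used, and a flag that is True when free space falls below the chosen threshold. A different path can be passed in if the temp folder lives somewhere else.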

LTP · June 20, 2023, 9:57pm

Sorry for the late reply:

v4.2.1 (March 15 2023)
Yes, /tmp
Yes, all jobs that require GPU fail with the same error
Not sure about that, but I don’t think so

Restarting the workstation and/or reinstalling cryoSPARC has solved the issue, but it is strange, as other users have run into the same issue at random, without any change to the workstation.

Thanks

wtempel · June 21, 2023, 1:47pm

So as of now, the issue is resolved?

Could there still have been any automated system or driver software updates?

LTP · June 21, 2023, 3:08pm

So as of now, the issue is resolved?

Yes.

Could there still have been any automated system or driver software updates?

Perhaps; it appears to be happening to different users (with different instances of cryoSPARC) on the same workstation when they start processing with cryoSPARC again after not using it for a while.