Jobs suddenly failing

Hello,

For some reason I’ve been getting the following problem when running CryoSPARC on two different workstations:

Up until a certain point, all jobs run smoothly. Then, suddenly, jobs crash and fail. I noticed that the temp folder had filled up, but even after clearing some space the failure remains and I cannot run any jobs anymore.

For instance, this is the error I’m getting now:

Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/local_refine/newrun.py", line 401, in cryosparc_compute.jobs.local_refine.newrun.run_local_refine
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2877, in cryosparc_compute.engine.newengine.get_initial_noise_estimate
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 2897, in cryosparc_compute.engine.newengine.get_initial_noise_estimate
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 538, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_master/cryosparc_compute/engine/newengine.py", line 532, in cryosparc_compute.engine.newengine.EngineThread.preprocess_image_data
  File "cryosparc_master/cryosparc_compute/engine/newgfourier.py", line 22, in cryosparc_compute.engine.newgfourier.get_plan_R2C_2D
  File "/data/loewith/tafurpet/software/cryosparc/cryosparc_worker/cryosparc_compute/skcuda_internal/fft.py", line 115, in __init__
    self.handle = gpufft.gpufft_get_plan(
RuntimeError: cuFFT failure: cufftSetStream(plan_cache.plans[idx].handle, device_stream)
-> CUFFT_INVALID_PLAN

Again, everything runs normally up until that point.

What could be the reason, and how can I solve it?
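
In case it helps to narrow things down, below is a minimal standalone check of the same kind of real-to-complex 2D FFT plan that newgfourier.get_plan_R2C_2D builds. This is only a sketch and assumes CuPy is installed somewhere on the workstation; CuPy is a separate package and not part of the cryosparc_worker environment.

# Minimal standalone cuFFT check outside of CryoSPARC. CuPy is an assumption
# here and has to be installed separately; it is not shipped with CryoSPARC.
import cupy as cp

x = cp.random.standard_normal((256, 256)).astype(cp.float32)
X = cp.fft.rfft2(x)                  # creates and runs a real-to-complex 2D cuFFT plan
cp.cuda.runtime.deviceSynchronize()  # surface any asynchronous CUDA/cuFFT error
print("cuFFT R2C transform OK, output shape:", X.shape)

If even this small transform fails outside of CryoSPARC, the problem is more likely in the CUDA driver/runtime stack than in CryoSPARC itself.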

Thanks in advance for your help.

As an update, I also ran the tests below, but unfortunately they did not give me any information on how to solve the problem:

test workers P22 --test-pytorch
Using project P22
Enabling PyTorch test
Running worker tests…
2023-06-15 11:51:37,299 WORKER_TEST log CRITICAL | Worker test results
2023-06-15 11:51:37,299 WORKER_TEST log CRITICAL | fagus
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | ✕ LAUNCH
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | Error:
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | See P22 J148 for more information
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | ⚠️ SSD
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | Did not run: Launch test failed
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | ⚠️ GPU
2023-06-15 11:51:37,300 WORKER_TEST log CRITICAL | Did not run: Launch test failed

test workers P22 --test-tensorflow
Using project P22
Enabling Tensorflow test
Running worker tests…
2023-06-15 11:55:22,202 WORKER_TEST log CRITICAL | Worker test results
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | fagus
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | ✕ LAUNCH
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | Error:
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | See P22 J149 for more information
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | ⚠️ SSD
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | Did not run: Launch test failed
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | ⚠️ GPU
2023-06-15 11:55:22,203 WORKER_TEST log CRITICAL | Did not run: Launch test failed

Please can you provide some additional information (one way to collect part of it is sketched below the list):

  • version and patch information for your CryoSPARC instance
  • the path to the temp folder you mentioned. Did you mean /tmp?
  • do all jobs fail with cuFFT failure?
  • have there been any software updates on the workstations, for example automatic OS updates?
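
If it is easier, something along these lines can collect the free space in the temp folder and the current GPU/driver state in one go. This is only a sketch: it assumes Python 3 and nvidia-smi are available on the worker and that the temp folder really is /tmp, so adjust as needed.

# Sketch for gathering a couple of the requested details; the /tmp path and
# the availability of nvidia-smi are assumptions, adjust for your workstation.
import shutil
import subprocess

total, used, free = shutil.disk_usage("/tmp")
print(f"/tmp: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")

# Driver and GPU state as reported by the NVIDIA userspace tools
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
print(result.stdout or result.stderr)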

Sorry for the late reply:

  • v4.2.1 (March 15 2023)
  • Yes, /tmp
  • Yes, all jobs that require GPU fail with the same error
  • Not sure about that, but I don’t think so

What has solved the issue is restarting the workstation and/or reinstalling CryoSPARC, but it is strange because other users have hit the same issue at random (it happened without any change to the workstation).
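
For what it is worth, a quick check along these lines shows whether the loaded NVIDIA driver and the CUDA runtime still agree, which seems worth looking at whenever an unattended driver update is suspected. Again, this is only a sketch that assumes CuPy is available; it is not something CryoSPARC itself runs.

# Print the CUDA driver and runtime versions as CuPy sees them. A driver that
# was updated in the background but not yet reloaded (no reboot) is one
# plausible way to end up with failing cuFFT plans. CuPy is an assumption here.
import cupy as cp

print("CUDA driver version: ", cp.cuda.runtime.driverGetVersion())
print("CUDA runtime version:", cp.cuda.runtime.runtimeGetVersion())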

Thanks

So as of now, the issue is resolved?

Could there still have been any automated system or driver software updates?

> So as of now, the issue is resolved?

Yes.

> Could there still have been any automated system or driver software updates?

Perhaps; it appears to be happening to different users (with different CryoSPARC instances) on the same workstation when they come back to process with CryoSPARC again after a while.