Sudden - Job is unresponsive - no heartbeat received in 30 seconds

All of a sudden my jobs won’t run. I see this for both heterogeneous and homogeneous refinements; the jobs failed twice in a row with different errors.

First het ref:
[CPU: 2.58 GB] Using Alignment Radius 16.600 (20.000A)
[CPU: 2.58 GB] Using Reconstruction Radius 24.900 (13.333A)
[CPU: 2.58 GB] Randomizing assignments for identical classes…
[CPU: 2.58 GB] Number of BnB iterations 2
[CPU: 2.58 GB] Engine Started.
[CPU: 907.1 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/hetero_refine/run.py", line 279, in cryosparc_compute.jobs.hetero_refine.run.run_hetero_refine
File "cryosparc_worker/cryosparc_compute/jobs/hetero_refine/run.py", line 257, in cryosparc_compute.jobs.hetero_refine.run.run_hetero_refine.process_images
File "cryosparc_worker/cryosparc_compute/engine/engine.py", line 907, in cryosparc_compute.engine.engine.process
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 33, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal
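Since the failing frame is cuda_core.initialize, a quick sanity check is to enumerate devices with pycuda directly, from the same environment the worker runs in (a minimal sketch, assuming pycuda is importable there; any CUDA_VISIBLE_DEVICES setting will change the visible ordinals):

import pycuda.driver as cuda

cuda.init()
count = cuda.Device.count()
print(f"Driver reports {count} device(s)")
for i in range(count):
    dev = cuda.Device(i)
    print(i, dev.name(), dev.pci_bus_id())
# Requesting an ordinal >= count reproduces the job's failure:
# pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal

If the count comes back lower than the number of GPUs configured for the worker, the driver has lost track of a card and the job's requested ordinal no longer exists.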

Second het ref (cloned job):
[CPU: 7.07 GB] Number of BnB iterations 3
[CPU: 7.07 GB] DEV 0 THR 1 NUM 1000 TOTAL 4.9532749 ELAPSED 3.9750611 --
[CPU: 7.66 GB] Processed 1500.000 images with 3 models in 5.470s.
[CPU: 7.66 GB] DEV 0 THR 0 NUM 1000 TOTAL 4.8519780 ELAPSED 3.9477028 --
Job is unresponsive - no heartbeat received in 30 seconds.
[CPU: 8.25 GB] Processed 1500.000 images with 3 models in 5.458s.
[CPU: 8.25 GB] -- Effective number of classes per image: min 1.07 | 25-pct 2.51 | median 2.75 | 75-pct 2.90 | max 3.00
[CPU: 8.25 GB] -- Class 0: 34.12%

First hom ref:
[CPU: 1.69 GB] ====== Initial Model ======
[CPU: 1.69 GB] Resampling initial model to specified volume representation size and pixel-size…
[CPU: 1.89 GB] Aligning initial model to symmetry.
[CPU: 1.59 GB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 84, in cryosparc_compute.run.main
File "cryosparc_worker/cryosparc_compute/jobs/refine/newrun.py", line 330, in cryosparc_compute.jobs.refine.newrun.run_homo_refine
File "/home/hendlab/software/cryosparc2/cryosparc2_worker/cryosparc_compute/alignment.py", line 188, in align_symmetry
cuda_core.initialize([cuda_dev])
File "cryosparc_worker/cryosparc_compute/engine/cuda_core.py", line 33, in cryosparc_compute.engine.cuda_core.initialize
pycuda._driver.LogicError: cuDeviceGet failed: invalid device ordinal

The cryoSPARC GUI shows the second hom ref as failed, but there is no error message, and htop shows the job is still running.

What does nvidia-smi show? This would seem to indicate cryoSPARC can’t find a GPU with the specified ID. Have you tried resetting the GPUs?
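For reference, an in-place reset can be attempted with nvidia-smi --gpu-reset -i <index> (run as root, with nothing using that GPU), though many GeForce cards refuse it, in which case a reboot is the practical fallback.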

We did have to reboot yesterday due to the error “Unable to determine the device handle for GPU.”
I did not see that error today, and cryosparcm was not running this morning after the job failed overnight.
Everything seems to be running fine now with no changes on our end. Maybe a hardware issue?

nvidia-smi output from around the time of the error:
Tue Jul 20 10:08:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:0A:00.0 Off |                  N/A |
| 24%   32C    P8    21W / 250W |      5MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:42:00.0 Off |                  N/A |
| 24%   30C    P8     1W / 250W |    123MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1876      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1876      G   /usr/lib/xorg/Xorg                 94MiB |
|    1   N/A  N/A      2042      G   /usr/bin/gnome-shell               26MiB |
+-----------------------------------------------------------------------------+
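Given yesterday’s “unable to determine the device handle for GPU” error, it could well be a card intermittently falling off the bus. A small watchdog along these lines (a hypothetical helper, not part of cryoSPARC; assumes nvidia-smi is on PATH) logs whenever the driver’s view of the GPUs changes, which helps separate flaky hardware from a cryoSPARC issue; it is also worth checking dmesg for NVRM/Xid messages around any timestamps it records:

import subprocess
import time

def gpu_indices():
    # Ask the driver which GPU indices it can currently see.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return result.stdout.split() if result.returncode == 0 else None  # None: driver not responding

expected = gpu_indices()
print(f"{time.ctime()}: baseline GPU list: {expected}")
while True:
    seen = gpu_indices()
    if seen != expected:
        print(f"{time.ctime()}: GPU list changed: {expected} -> {seen}", flush=True)
    time.sleep(60)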