Topaz Train Failing

We ran Topaz manually on gpu49, which has A5000 GPUs. It ran for a while (roughly 10 minutes?) and then failed again with the error below. We are trying to understand why.

We have Topaz version 0.2.4 and CryoSPARC v4.4.1:

cweidle@gpu49:~$ /software/cryosparc/topaz --version
apptainer exec --nv /software/cryosparc/topaz_latest.sif /usr/local/conda/bin/topaz --version
TOPAZ 0.2.4

The error seems to be:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function
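
One possible clue is in the path itself. The directory name `pytorch_1544176307774` looks like a conda-build work folder, and those are normally suffixed with a Unix timestamp in milliseconds. Assuming that convention holds here, the minimal sketch below decodes the suffix to estimate when the PyTorch bundled in the container was built. The helper name `build_date_from_path` is made up for illustration.

```python
import re
from datetime import datetime, timezone


def build_date_from_path(path):
    """Return the UTC build time encoded in a conda-bld path, or None."""
    # Assumption: conda-build names work dirs <package>_<epoch milliseconds>.
    match = re.search(r"conda-bld/[^/]*_(\d{13})(?:/|$)", path)
    if match is None:
        return None
    millis = int(match.group(1))
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Path copied from the THCudaCheck line above.
path = "/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp"
built = build_date_from_path(path)
print(built.date() if built else "no timestamp found")
```

For the path above this should print 2018-12-07, which is earlier than the Ampere generation that the A5000 belongs to. If that reading is right, "invalid device function" would be consistent with a PyTorch binary that contains no kernels compiled for this GPU, but that is a guess and not a confirmed diagnosis.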

cweidle@gpu49:/projects/em/cweidle/CS-connor-krios-q2-2023/J1413$ apptainer exec --nv /software/cryosparc/topaz_latest.sif /usr/local/conda/bin/topaz train --train-images /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/image_list_train.txt --train-targets /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/topaz_particles_processed_train.txt --test-images /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/image_list_test.txt --test-targets /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/topaz_particles_processed_test.txt --num-particles 20 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 8 --cross-validation-seed 341860829 --radius 3 --num-particles 20 --device 0 --no-pretrained --save-prefix=/projects/em/cweidle/CS-connor-krios-q2-2023/J1413/models/model -o /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/train_test_curve.txt

# Loading model: resnet8
# Model parameters: units=32, dropout=0.0, bn=on
# Receptive field: 71
# Using device=0 with cuda=True
WARNING: no coordinates are observed with x_coord > 86 or y_coord > 60. Did you scale the micrographs and particle coordinates correctly?
# Loaded 16 training micrographs with 171 labeled particles
# Loaded 4 test micrographs with 39 labeled particles
# source      split  p_observed num_positive_regions      total_regions
# 0 train     3.24e-06      4894   1508474880
# 0 test      2.98e-06       1123    377118720
# Specified expected number of particle per micrograph = 20.0
# With radius = 3
# Setting pi = 6.1519088736830675e-06
# minibatch_size=128, epoch_size=5000, num_epochs=10
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function
Traceback (most recent call last):
  File "/usr/local/conda/bin/topaz", line 11, in <module>
    load_entry_point('topaz-em==0.2.4', 'console_scripts', 'topaz')()
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/main.py", line 148, in main
    args.func(args)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/commands/train.py", line 695, in main
    , save_prefix=save_prefix, use_cuda=use_cuda, output=output)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/commands/train.py", line 577, in fit_epochs
    , use_cuda=use_cuda, output=output)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/commands/train.py", line 557, in fit_epoch
    metrics = step_method.step(X, Y)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/methods.py", line 103, in step
    score = self.model(X).view(-1)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/model/classifier.py", line 28, in forward
    z = self.features(x)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 54, in forward
    z = self.features(x)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 272, in forward
    y = self.bn(y)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1623, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
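
To narrow this down, it may help to ask the PyTorch inside the container what it was built against and what it sees on the GPU. The sketch below is only a suggestion of what to check. It relies on `torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()`, `torch.cuda.get_device_name` and `torch.cuda.get_device_capability`, and returns empty fields if `torch` cannot be imported. The function name `describe_torch_env` and the file name `check_torch.py` are hypothetical.

```python
def describe_torch_env(device=0):
    """Collect version and GPU facts from whatever torch is importable here."""
    info = {
        "torch": None,
        "cuda_build": None,
        "cudnn": None,
        "gpu_name": None,
        "compute_capability": None,
    }
    try:
        import torch
    except Exception:
        return info  # torch is missing or fails to import in this interpreter

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA toolkit torch was compiled with
    try:
        info["cudnn"] = torch.backends.cudnn.version()
        if torch.cuda.is_available():
            info["gpu_name"] = torch.cuda.get_device_name(device)
            info["compute_capability"] = torch.cuda.get_device_capability(device)
    except Exception as exc:
        # Record the failure but keep whatever was gathered so far.
        info["error"] = repr(exc)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Assuming the Python interpreter sits next to the `topaz` entry point, this could be run as `apptainer exec --nv /software/cryosparc/topaz_latest.sif /usr/local/conda/bin/python check_torch.py`. An A5000 should report compute capability (8, 6). If the reported CUDA build is older than 11, that would fit the "invalid device function" error, since support for Ampere GPUs arrived with CUDA 11.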