We ran Topaz manually on gpu49, which has A5000 GPUs. It ran for a while (about 10 minutes?) and then failed again with the error below. We are trying to understand why.
We have Topaz version 0.2.4 and CryoSPARC v4.4.1:
cweidle@gpu49:~$ /software/cryosparc/topaz --version
apptainer exec --nv /software/cryosparc/topaz_latest.sif /usr/local/conda/bin/topaz --version
TOPAZ 0.2.4
The error seems to be:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function
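One clue may be in the path itself. The sketch below dates the PyTorch build inside the container, assuming the numeric suffix in `pytorch_1544176307774` is the millisecond Unix timestamp that conda-build appends to its work directories. The helper name `conda_build_time` is made up for illustration.

```python
import re
from datetime import datetime, timezone

def conda_build_time(path):
    """Return the UTC build time encoded in a conda-bld work path, or None.

    Assumption: the 13-digit suffix after the package name is a Unix
    timestamp in milliseconds, as conda-build names its work directories.
    """
    m = re.search(r"conda-bld/[A-Za-z0-9_.-]+?_(\d{13})/", path)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group(1)) / 1000, tz=timezone.utc)

# Path copied from the THCudaCheck message above.
path = "/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp"
print(conda_build_time(path).date())  # 2018-12-07
```

If that reading is right, the PyTorch in this container was built in December 2018, before Ampere cards such as the A5000 existed. That would be consistent with "invalid device function", which typically appears when a binary contains no kernels for the GPU's architecture.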
cweidle@gpu49:/projects/em/cweidle/CS-connor-krios-q2-2023/J1413$ apptainer exec --nv /software/cryosparc/topaz_latest.sif /usr/local/conda/bin/topaz train --train-images /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/image_list_train.txt --train-targets /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/topaz_particles_processed_train.txt --test-images /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/image_list_test.txt --test-targets /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/topaz_particles_processed_test.txt --num-particles 20 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 8 --cross-validation-seed 341860829 --radius 3 --num-particles 20 --device 0 --no-pretrained --save-prefix=/projects/em/cweidle/CS-connor-krios-q2-2023/J1413/models/model -o /projects/em/cweidle/CS-connor-krios-q2-2023/J1413/train_test_curve.txt
# Loading model: resnet8
# Model parameters: units=32, dropout=0.0, bn=on
# Receptive field: 71
# Using device=0 with cuda=True
WARNING: no coordinates are observed with x_coord > 86 or y_coord > 60. Did you scale the micrographs and particle coordinates correctly?
# Loaded 16 training micrographs with 171 labeled particles
# Loaded 4 test micrographs with 39 labeled particles
# source split p_observed num_positive_regions total_regions
# 0 train 3.24e-06 4894 1508474880
# 0 test 2.98e-06 1123 377118720
# Specified expected number of particle per micrograph = 20.0
# With radius = 3
# Setting pi = 6.1519088736830675e-06
# minibatch_size=128, epoch_size=5000, num_epochs=10
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function
Traceback (most recent call last):
File "/usr/local/conda/bin/topaz", line 11, in <module>
load_entry_point('topaz-em==0.2.4', 'console_scripts', 'topaz')()
File "/usr/local/conda/lib/python3.7/site-packages/topaz/main.py", line 148, in main
args.func(args)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/commands/train.py", line 695, in main
, save_prefix=save_prefix, use_cuda=use_cuda, output=output)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/commands/train.py", line 577, in fit_epochs
, use_cuda=use_cuda, output=output)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/commands/train.py", line 557, in fit_epoch
metrics = step_method.step(X, Y)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/methods.py", line 103, in step
score = self.model(X).view(-1)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/model/classifier.py", line 28, in forward
z = self.features(x)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 54, in forward
z = self.features(x)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 272, in forward
y = self.bn(y)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
exponential_average_factor, self.eps)
File "/usr/local/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1623, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
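To narrow this down, it may help to compare what the container's PyTorch was built for with what the GPU reports. This is a minimal diagnostic sketch, not part of Topaz or CryoSPARC: the function name `gpu_report` and the script name `check_gpu.py` are hypothetical, and it only uses `torch.__version__`, `torch.version.cuda`, and the `torch.cuda` device queries.

```python
def gpu_report():
    """Collect the PyTorch version, the CUDA version it was built with,
    and the name and compute capability of each visible GPU."""
    report = {"torch": None, "cuda_build": None, "devices": []}
    try:
        import torch
    except ImportError:
        return report  # torch is not installed in this environment

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            name = torch.cuda.get_device_name(i)
            report["devices"].append((name, "sm_%d%d" % (major, minor)))
    return report

if __name__ == "__main__":
    print(gpu_report())
```

It would need to run inside the same image, for example with `apptainer exec --nv /software/cryosparc/topaz_latest.sif /usr/local/conda/bin/python check_gpu.py` (the interpreter path is an assumption based on where `topaz` lives). The A5000 has compute capability 8.6, so if the reported CUDA build version is from the 9.x or 10.x series, the PyTorch binaries most likely do not include kernels for this card.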