Hi all,
It seems this is an issue that has been seen before, but I am running a Topaz Train job and it hangs on the last few exposures.
I can confirm that our topaz executable (actually a topaz wrapper script) works fine outside of CS, and I can see via nvidia-smi and htop that no topaz processes are running anymore. For example, with 4489 exposures, the job stalls with 4475 .mrc files in the preprocessed folder of our CS job directory and goes no further.
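For reference, these are roughly the checks I ran on the node while the job was stuck (the path below is a placeholder, not our actual job directory):

```
# confirm no topaz processes are still alive
pgrep -af topaz
nvidia-smi        # GPUs idle, no topaz processes listed

# count how many preprocessed micrographs were written by the CS job
ls /path/to/CS-project/JXX/preprocessed/*.mrc | wc -l   # 4475 of 4489
```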
We are running CS v4.7.1 and Topaz 0.2.5a, installed with:

```
conda install topaz=0.2.5 pytorch=1.10.2=py3.6_cuda11.3_cudnn8.2.0_0 mkl=2024.0.0 -c tbepler -c pytorch -c conda-forge
```
This is running on a local cluster that uses Slurm to distribute jobs, and we should not be limited in any way by RAM/CPU/GPU availability.
Any help would be great.
EDIT: I should add that I am using the default values for this job, apart from specifying the particle size and the estimated number of particles per image.