Hi all,
It seems this is an issue that has been seen before, but I am running a Topaz Train job and it hangs on the last few exposures.
I can confirm that our topaz executable (actually a topaz wrapper script) works fine outside of CS, and I can see via nvidia-smi and htop that no topaz processes are running anymore. For example, with 4489 exposures, the job stalls with 4475 .mrc files in the preprocessed folder of our CS job directory and goes no further.
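For reference, these are roughly the checks I ran on the node while the job was stuck (the path below is a placeholder, not our actual job directory):

```
# confirm no topaz processes are still alive
pgrep -af topaz
nvidia-smi        # GPUs idle, no topaz processes listed

# count how many preprocessed micrographs were written by the CS job
ls /path/to/CS-project/JXX/preprocessed/*.mrc | wc -l   # 4475 of 4489
```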
We are running CS v4.7.1 and Topaz 0.2.5a, installed with:

```
conda install topaz=0.2.5 pytorch=1.10.2=py3.6_cuda11.3_cudnn8.2.0_0 mkl=2024.0.0 -c tbepler -c pytorch -c conda-forge
```
This is running on a local cluster that uses Slurm to distribute jobs, and we should not be limited in any way by RAM/CPU/GPU availability.
Any help would be great.
EDIT: I should add that I am using the default values for this job, apart from specifying the particle size and the estimated number of particles per image.