When deploying Topaz 0.3 on our cluster, I discovered inefficient logic in how CryoSPARC selects the number of CPUs for the Topaz Train job (and it probably also applies to other Topaz-based jobs).
You have two settings controlling the number of CPUs/threads/processes for Topaz: num_distribute=2 and num_workers=4.
With the default values of 2 and 4, respectively, the number of CPUs requested for Slurm jobs is num_distribute*num_workers, so 8. However, during the training step, only the num_workers value is passed as the Topaz num_workers parameter, which wastes resources and limits performance.
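A minimal sketch of the arithmetic described above, based only on the behavior observed in this post. The helper `cpu_usage` is hypothetical and is not part of CryoSPARC or Topaz; it simply compares the CPU count requested from the scheduler with the count that reaches the training step.

```python
def cpu_usage(num_distribute: int, num_workers: int) -> dict:
    """Compare CPUs requested from the scheduler with CPUs used in training.

    Assumption (from the observation above): the job requests
    num_distribute * num_workers CPUs, but only num_workers is
    passed on to the Topaz training step.
    """
    requested = num_distribute * num_workers  # CPUs asked of the scheduler
    used_in_training = num_workers            # value passed to Topaz training
    idle = requested - used_in_training       # allocated but unused in training
    return {
        "requested": requested,
        "used_in_training": used_in_training,
        "idle": idle,
        "idle_fraction": idle / requested,
    }


# Default settings: num_distribute=2, num_workers=4
defaults = cpu_usage(num_distribute=2, num_workers=4)
print(defaults)
```

With the defaults this reports 8 CPUs requested and 4 used during training, so half of the allocation does nothing in that step. Setting num_distribute=1 makes the two numbers match, at the cost of fewer preprocessing processes.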
Also, the preprocessing step correctly supports the num_threads parameter, so I don’t know why the work is split between multiple processes (maybe it’s historical). I did a quick check on the EMPIAR-10025 extensive validation subset. The differences are very evident with the new Topaz, but the current behavior seems unnecessary for 0.2.5 as well.
It was a Topaz Train CryoSPARC job with par_diam=10, num_particles=2 (which is probably a wrong setting for this dataset, but I wasn’t after meaningful results, just timings) and num_distribute and compute_num_workers set as above. The input was from the bench/extract_template_picks/1 job from the extensive validation on EMPIAR-10025.
I’ll conduct more tests of the full Topaz workflow on a larger dataset to establish reasonable defaults for our users. I can share the results here if you want.
Here are the full timing results.
Training was done on 100 micrographs from EMPIAR-288, and extraction on 2653 micrographs. CryoSPARC version 5.0.4.
I also included a larger job configuration with 16 threads, as it is the best fit for our setup, where worker nodes have 16 CPU cores per GPU (EPYC Rome and A100). I didn’t change any settings other than num_distribute and num_workers.
These results might be partially affected by the nature of the setup (a shared HPC cluster with Lustre-based storage), but the trends are significant.
Strangely, in the Train job I also observed differences in the precision obtained with different configurations with Topaz 0.3.20 (see attachments). However, this might be more of a Topaz issue than an issue with the CS wrapper (unless the values are presented or calculated incorrectly).
I’m a system administrator, not a cryo-EM expert, so it’s hard for me to evaluate this properly.
The differences persist over multiple runs, so they are unlikely to be a randomness effect in training.
On the topic of training precision, when doing a training run, we would advise a couple changes to your setup to ensure that the training precision comparisons are fair. The particles present in bench/extract_template_picks/1 will contain junk including small particles of gold which might, in conjunction with your parameter choices, lead to inconsistent model quality.
Recommended changes would include:
Setting par_diam = 150 and num_particles = 500
Curating the particles a bit before training by queuing a Select 2D job off of bench/template_class_2D_100/1 and ensuring that only good T20S classes are selected, like this:
The tests on EMPIAR-288 were already using more reasonable settings. Nevertheless, here are the results with what you suggested, and they are very similar. All parameters were kept at their defaults, except for those specified in the table below and for par_diam, num_particles and seed, which I set to the same values for all jobs.
Looks like 0.3.20 heavily overfits the model already after the first epoch.
Nothing in particular. I just wanted to publish a new version for our users. Once in a while, someone hits the training set size limit and discovers the limitation of Topaz 0.2 (whether it’s a good idea to have a training set larger than, iirc, 20k, is a different story).