Topaz train failure

Hello cryoSPARCers,
While trying to run Topaz Train with all curated exposures and particles from heterogeneous refinement, I get the error below:
File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
File "/cryosparc/v3/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 357, in run_topaz_wrapper_train

I also tried using particles from an Extract From Micrographs job downstream of the heterogeneous refinement; that Topaz run failed with the same error.
I also tried using micrographs from Patch CTF (multi) together with the particles from the heterogeneous refinement; that run failed with the same error as well.

Does anyone have a fix for this?

regards
Ani

Hi Ani,
I’m not sure if it’s the same problem, but you can try to take a look at my reply on this thread: CryoSPARC v2.16 beta - Topaz Train error
Hopefully that solves it for you!

Sincerely,
Martin

Thanks, Martin, I will try your script. The bizarre part is that the Topaz runs did work previously with the current dataset and with other datasets (for various numbers of images). It has stopped working for the last few runs with the current dataset.
Can this happen?

With regards,
Ani

Actually, looking back at the error message I got earlier, which could be fixed by renaming the preprocessed micrographs, it seems that you're getting a different error. This is what I got:

" File “cryosparc_worker/cryosparc_compute/run.py”, line 85, in cryosparc_compute.run.main
File “/opt/bioxray/programs/cryosparc2/cryosparc2_worker/cryosparc_compute/jobs/topaz/run_topaz.py”, line 347, in run_topaz_wrapper_train
assert len(glob.glob(os.path.join(model_dir, ‘*’))) > 0, “Training failed, no models were created.”
AssertionError: Training failed, no models were created."

So I hit the error earlier in run_topaz.py than you do.
However, assuming it is due to the same bug, it wouldn't be surprising if the job works with one particle output (e.g. directly from the picker) but not another (e.g. from a subsequent inspection job), as the job seems to have a nonsensical way of deciding which micrograph versions the particles originate from, depending on which job the particle output was taken from.
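For reference, the failing check only verifies that topaz wrote at least one file into the job's model directory. Here is a minimal sketch of the same check that you could run by hand; the path is a placeholder, not the real cryoSPARC layout, so point it at your actual job directory:

import glob
import os

# Placeholder path; substitute the model directory of your Topaz Train job.
model_dir = "/path/to/project/JXXX/topaz_train/models"

# Same test that run_topaz.py performs before raising the AssertionError.
model_files = glob.glob(os.path.join(model_dir, "*"))
if not model_files:
    print("No model files found - training produced no output.")
else:
    print(f"Found {len(model_files)} model file(s):", model_files)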

Thanks Martin! This bug is interesting 🙂
I ran a Topaz Train in the old workspace where the movies were originally imported and eventually curated. I used the heterogeneous refinement output from a subset of images in another workspace as input for the Topaz Train in that older workspace. This job worked.
The micrograph suffix (fractions_rigid_aligned.mrc vs …fractions_patch_aligned_doseweighted.mrc) did not matter in this case.

@Ani Can you post the additional lines that I suspect followed in the job log?

@wtempel
Please see below:

[CPU: 6.4 MB] # Loading model: resnet8

[CPU: 7.7 MB] # Model parameters: units=32, dropout=0.0, bn=on

[CPU: 8.0 MB] # Receptive field: 71

[CPU: 8.0 MB] # Using device=0 with cuda=True

[CPU: 8.0 MB] # Loaded 5384 training micrographs with 111880 labeled particles

[CPU: 8.0 MB] # Loaded 1346 test micrographs with 27503 labeled particles

[CPU: 8.0 MB] # source split p_observed num_positive_regions total_regions

[CPU: 8.0 MB] # 0 train 0.000409 3244520 7931278080

[CPU: 8.0 MB] # 0 test 0.000402 797587 1982819520

[CPU: 8.1 MB] # Specified expected number of particle per micrograph = 600.0

[CPU: 8.1 MB] # With radius = 3

[CPU: 8.1 MB] # Setting pi = 0.011811665037471488

[CPU: 8.1 MB] # minibatch_size=128, epoch_size=5000, num_epochs=10

[CPU: 5.1 MB] /xtal/apps/cryosparc/topaz.sh: line 20: 305258 Killed topaz $@

[CPU: 7.2 MB]
Training command complete.

[CPU: 7.2 MB] Training done in 8061.346s.

[CPU: 7.3 MB] --------------------------------------------------------------

[CPU: 9.6 MB] Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
File "/xtal/apps/cryosparc/v3/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 357, in run_topaz_wrapper_train
assert len(glob.glob(os.path.join(model_dir, '*'))) > 0, "Training failed, no models were created."
AssertionError: Training failed, no models were created.

What might have sent the kill signal to topaz? Is the worker node under cluster resource management?

@wtempel
Could you please elaborate on the question a bit more, so I can understand?
The Topaz runs worked in other workspaces; only in this workspace have 5 Topaz runs failed.
What could be the reason?

@Ani

I think you might be using too many micrographs and particles.
See your log:

[CPU: 8.0 MB] # Loaded 5384 training micrographs with 111880 labeled particles

[CPU: 8.0 MB] # Loaded 1346 test micrographs with 27503 labeled particles

Your totals are 6730 micrographs and 139383 particles.

So this might be too heavy for training, depending on your workstation's resources.
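If you want to test that hypothesis, one option is to train on a random subset of the micrographs first. In cryoSPARC you would split the exposures with a subset/curation job; the snippet below is only an illustration of the idea, with made-up paths, not a cryoSPARC setting:

import glob
import random

# Hypothetical directory of preprocessed micrographs; adjust to your data.
mic_dir = "/path/to/preprocessed_micrographs"
fraction = 0.25  # try roughly a quarter of the data first

mics = sorted(glob.glob(mic_dir + "/*.mrc"))
random.seed(0)  # reproducible subset
subset = random.sample(mics, max(1, int(len(mics) * fraction)))
print(f"Selected {len(subset)} of {len(mics)} micrographs for a test training run")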

Jinseo

@Ani The log indicates that the topaz process was killed. Because we would expect different log entries if the kill signal had been sent by cryoSPARC, we are looking for a different source of that signal.
For example, resource management software on some computer clusters might kill a job that runs for longer than the allocated time or that uses more than the allocated memory. I do not know whether such management software runs on your cryoSPARC worker(s); the kill signal could originate elsewhere.
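One quick way to check for a memory-related kill is to look for OOM-killer messages in the kernel log on the worker around the time the job died. A minimal sketch, assuming dmesg is readable on that node (on a managed cluster you may need to check the scheduler's accounting logs instead):

import subprocess

def find_kill_messages(keyword="topaz"):
    # Read the kernel ring buffer with human-readable timestamps.
    log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
    # Keep lines that mention an out-of-memory condition or a killed topaz process.
    return [line for line in log.splitlines()
            if "out of memory" in line.lower()
            or ("killed process" in line.lower() and keyword in line.lower())]

for line in find_kill_messages():
    print(line)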
Since you mentioned that the problem is confined to the current workspace: are both the current and old workspaces part of the same cryoSPARC project? If so, is there a pair of topaz jobs that completed in the old workspace and failed in the new, but are otherwise identical? Did all the topaz jobs run on the same worker node? Worker information is available in a job’s Overview tab, after pushing Show from top.
