Issue while running Topaz train

I am getting the runtime warning shown at the end of the log below. The job has been running for 60 hours now and is still not done. I gave it a set of ~1400 particles.

License is valid.
Launching job on lane gpu_4 target gpu_4 ...
Launching job on cluster gpu_4

====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash

#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /opt/cryosparc2_worker/bin/cryosparcw run --project P4 --job J86 --master_hostname gpu01 --master_command_core_port 39002 > /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/job.log 2>&1             - the complete command string to run the job
## 4            - the number of CPUs needed
## 1            - the number of GPUs needed.
##                            Note: the code will use this many GPUs starting from dev id 0
##                                  the cluster scheduler or this script have the responsibility
##                                  of setting CUDA_VISIBLE_DEVICES so that the job code ends up
##                                  using the correct cluster-allocated GPUs.
## 8.0             - the amount of RAM needed in GB
## /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86        - absolute path to the job directory
## /ssd/proc/cryosparc/kstachowski/TRAP/P4    - absolute path to the project dir
## /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/job.log   - absolute path to the log file for the job
## /opt/cryosparc2_worker/bin/cryosparcw    - absolute path to the cryosparc worker command
## --project P4 --job J86 --master_hostname gpu01 --master_command_core_port 39002           - arguments to be passed to cryosparcw run
## P4        - uid of the project
## J86            - uid of the job
## stachowski.7@osu.edu        - name of the user that created the job (may contain spaces)
## stachowski.7@osu.edu - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
##SBATCH --ntasks=1

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

#SBATCH --job-name=cryosparc2_stachowski.7@osu.edu_P4_J86
#SBATCH --gres=gpu:4
#SBATCH --partition=gpu
#SBATCH -o /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/job.log
#SBATCH -e /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/job.log

unset SLURM_NTASKS_PER_NODE
unset SLURM_NTASKS


/opt/cryosparc2_worker/bin/cryosparcw run --project P4 --job J86 --master_hostname gpu01 --master_command_core_port 39002 > /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/job.log 2>&1 
==========================================================================
==========================================================================

-------- Submission command: 
sbatch /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/queue_sub_script.sh

-------- Cluster Job ID: 
5

-------- Queued on cluster at 2020-03-05 07:53:50.391611

-------- Job status at 2020-03-05 07:53:50.436162
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 5       gpu cryospar cryospar PD       0:00      1 (None)

[CPU: 95.8 MB]   Project P4 Job J86 Started
[CPU: 95.8 MB]   Master running v2.14.0, worker running v2.14.0
[CPU: 95.9 MB]   Running on lane gpu_4
[CPU: 95.9 MB]   Resources allocated: 
[CPU: 95.9 MB]     Worker:  gpu_4
[CPU: 95.9 MB]     CPU   :  [0, 1, 2, 3]
[CPU: 95.9 MB]     GPU   :  [0]
[CPU: 95.9 MB]     RAM   :  [0]
[CPU: 95.9 MB]     SSD   :  False
[CPU: 95.9 MB]   --------------------------------------------------------------
[CPU: 95.9 MB]   Importing job module for job type topaz_train...
[CPU: 190.9 MB]  Job ready to run
[CPU: 190.9 MB]  ***************************************************************
[CPU: 190.9 MB]  Topaz is a particle detection tool created by Tristan Bepler and Alex J. Noble.
Citations:
- Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153-1160 (2019) doi:10.1038/s41592-019-0575-8
- Bepler, T., Noble, A.J., Berger, B. Topaz-Denoise: general deep denoising models for cryoEM. bioRxiv 838920 (2019) doi: https://doi.org/10.1101/838920

Structura Biotechnology Inc. and cryoSPARC do not license Topaz nor distribute Topaz binaries. Please ensure you have your own copy of Topaz licensed and installed under the terms of its GNU General Public License v3.0, available for review at: https://github.com/tbepler/topaz/blob/master/LICENSE.
***************************************************************
[CPU: 192.4 MB]  Starting Topaz process using version 0.2.3...
[CPU: 192.4 MB]  Random seed used is 1316918258
[CPU: 192.4 MB]  --------------------------------------------------------------
[CPU: 192.4 MB]  Starting preprocessing...
[CPU: 192.4 MB]  Starting micrograph preprocessing by running command /usr/local/bin/topaz preprocess --scale 4 --niters 200 --num-workers 4 -o /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/preprocessed [MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]
[CPU: 192.4 MB]  Preprocessing over 4 processes...
[CPU: 192.6 MB]  Inverting negative staining...
[CPU: 192.6 MB]  Inverting negative staining complete.
[CPU: 192.6 MB]  Micrograph preprocessing command complete.
[CPU: 192.6 MB]  Starting particle pick preprocessing by running command /usr/local/bin/topaz convert --down-scale 4 --threshold 0 -o /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_processed.txt /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_raw.txt
[CPU: 192.6 MB]  Particle pick preprocessing command complete.
[CPU: 192.6 MB]  Preprocessing done in 40.045s.
[CPU: 192.6 MB]  --------------------------------------------------------------
[CPU: 192.6 MB]  Starting train-test splitting...
[CPU: 192.6 MB]  Starting dataset splitting by running command /usr/local/bin/topaz train_test_split --number 12 --seed 1316918258 --image-dir /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/preprocessed /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_processed.txt
[CPU: 192.6 MB]  # splitting 62 micrographs with 1488 labeled particles into 50 train and 12 test micrographs
[CPU: 192.6 MB]  # writing: /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_processed_train.txt
[CPU: 192.6 MB]  # writing: /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_processed_test.txt
[CPU: 192.6 MB]  # writing: /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/image_list_train.txt
[CPU: 192.6 MB]  # writing: /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/image_list_test.txt
[CPU: 192.6 MB]  
Dataset splitting command complete.
[CPU: 192.6 MB]  Train-test splitting done in 0.810s.
[CPU: 192.6 MB]  --------------------------------------------------------------
[CPU: 192.6 MB]  Starting training...
[CPU: 192.6 MB]  Starting training by running command /usr/local/bin/topaz train --train-images /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/image_list_train.txt --train-targets /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_processed_train.txt --test-images /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/image_list_test.txt --test-targets /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/topaz_particles_processed_test.txt --num-particles 800 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 1000 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 4 --cross-validation-seed 1316918258 --device 0 --no-pretrained --save-prefix=/ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/models/model -o /ssd/proc/cryosparc/kstachowski/TRAP/P4/J86/train_test_curve.txt
[CPU: 192.6 MB]  # Loading model: resnet8
[CPU: 192.6 MB]  # Model parameters: units=32, dropout=0.0, bn=on
[CPU: 192.6 MB]  # Receptive field: 71
[CPU: 192.6 MB]  # Using device=0 with cuda=True
[CPU: 192.6 MB]  # Loaded 50 training micrographs with 1264 labeled particles
[CPU: 192.6 MB]  # Loaded 12 test micrographs with 224 labeled particles
[CPU: 192.6 MB]  # source	split	p_observed	num_positive_regions	total_regions
[CPU: 192.6 MB]  # 0	train	0.000498	36656	73656000
[CPU: 192.6 MB]  # 0	test	0.000367	6496	17677440
[CPU: 192.6 MB]  # Specified expected number of particle per micrograph = 800.0
[CPU: 192.6 MB]  # With radius = 3
[CPU: 192.6 MB]  # Setting pi = 0.0157488867166
[CPU: 192.6 MB]  # minibatch_size=128, epoch_size=5000, num_epochs=1000
[CPU: 192.6 MB]  RuntimeWarning: overflow encountered in exp

Hi @kyestachowski,

As the job is training for 1000 epochs, it is expected to take a long time. Try setting the number of epochs to a much smaller value, such as 10 or 20, and increase it later only if that seems likely to help; you can judge this from the second plot output at the end of the job.
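For reference, here is a sketch of what the reduced setting would look like if you ran Topaz directly on the command line (in cryoSPARC you would instead change the "Number of epochs" parameter of the Topaz Train job; the paths below are illustrative placeholders, and all flags other than `--num-epochs` are taken unchanged from the command in your log):

```shell
# Same training command as in the job log, but with --num-epochs
# reduced from 1000 to 20; less important flags omitted for brevity.
topaz train \
    --train-images image_list_train.txt \
    --train-targets topaz_particles_processed_train.txt \
    --test-images image_list_test.txt \
    --test-targets topaz_particles_processed_test.txt \
    --num-particles 800 \
    --num-epochs 20 \
    --model resnet8 \
    --device 0 \
    -o train_test_curve.txt
```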

Since the current job is training on a small set of particles for a large number of epochs, it is likely to overfit, so it is probably best to kill the current job and run a new one with a smaller number of epochs.
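As for the `RuntimeWarning: overflow encountered in exp` line itself: that is NumPy's standard warning when `np.exp` overflows float64 (inputs above roughly 709.78 saturate to `inf`), so on its own it is usually harmless and does not mean the job has crashed. A minimal reproduction, unrelated to Topaz:

```python
import warnings

import numpy as np

# float64 tops out around 1.8e308, so np.exp overflows for inputs
# above ~709.78; the result saturates to inf and NumPy emits
# "RuntimeWarning: overflow encountered in exp".
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.exp(np.array([1000.0]))

print(result[0])                       # inf
print(caught[0].category.__name__)     # RuntimeWarning
```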

Regards,
Jay Yoo
