Topaz training MemoryError

Hi, I launched Topaz Cross Validation jobs (using Topaz v0.2.3 or v0.2.4). However, all of them failed with AssertionError: Subprocess exited with status -9.
Has anyone come across the same problem?
I used CryoSPARC v4.0.3.
Thank you so much.
Tinghai

PS:
Topaz Cross Validation:
Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 794, in run_topaz_wrapper_cross_validation
assert len(tables) > 0, "All subsidiary training jobs failed or were killed."
AssertionError: All subsidiary training jobs failed or were killed.
Topaz Train:
Traceback (most recent call last):
File “cryosparc_worker/cryosparc_compute/run.py”, line 93, in cryosparc_compute.run.main
File “/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py”, line 360, in run_topaz_wrapper_train
utils.run_process(train_command)
File “/data/cryosparc/cryosparc_worker/cryosparc_compute/jobs/topaz/topaz_utils.py”, line 98, in run_process
assert process.returncode == 0, f"Subprocess exited with status {process.returncode} ({str_command})"
AssertionError: Subprocess exited with status 1 (/primary/vari/software/topaz/default-cryosparc/topaz train --num-particles 200 --k-fold 2 --fold 0 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size …)

Please can you post the wrapper script that you configured as Path to Topaz executable.
Have you tried running:

  1. a Topaz Train job inside CryoSPARC?
  2. Topaz training outside CryoSPARC using the same Topaz installation that is referenced in your wrapper script?

Starting micrograph preprocessing by running command /primary/software/topaz/default-cryosparc/topaz preprocess --scale 4 --niters 200 --num-workers 8 -o /home/user/project.
I also tried a Topaz Train job inside CryoSPARC. It shows the same error.
I did not try it outside CryoSPARC.

Please can you post text from job.log and the full event log of the failed topaz preprocess job.
You may also want to test

/primary/software/topaz/default-cryosparc/topaz

outside CryoSPARC on some processing tasks and watch for error messages.
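
As a starting point for that test, here is a minimal Python sketch (not part of CryoSPARC or Topaz) that runs a command and decodes its exit status the way Python's subprocess module reports it. A negative value such as -9 means the process was killed by that signal number (9 is SIGKILL on Linux, which is commonly, though not only, sent when the system runs out of memory). A positive value such as 1 means the program itself exited with an error. The function names are made up for illustration, and the demo command is a stand-in: you would substitute your own wrapper path and Topaz arguments.

```python
import shlex
import signal
import subprocess
import sys
from typing import Sequence


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # subprocess reports "killed by signal N" as the negative value -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"exited with error status {returncode}"


def run_and_report(command: Sequence[str]) -> int:
    """Run a command, print how it ended, and return its exit status."""
    process = subprocess.run(list(command), capture_output=True, text=True)
    print(f"$ {shlex.join(command)}")
    print(describe_returncode(process.returncode))
    if process.stderr.strip():
        # The last stderr line is usually the exception message.
        print(process.stderr.strip().splitlines()[-1])
    return process.returncode


# Stand-in command (an assumption for the demo): replace it with the path to
# your Topaz wrapper script followed by the arguments shown in the event log.
run_and_report([sys.executable, "-c", "raise MemoryError('demo')"])
```

Running the real Topaz command this way from a shell on the worker node should show the same traceback that CryoSPARC captures, without waiting for the job to be scheduled.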

Here is the job.log file.

================= CRYOSPARCW ======= 2023-01-06 10:55:32.053004 =========
Project P2 Job J175
Master node064.cm.cluster Port 39002
===========================================================================
========= monitor process now starting main process
MAINPROCESS PID 193765
========= monitor process now waiting for main process
MAIN PID 193765
topaz.run_topaz cryosparc_compute.jobs.jobregister
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
========= sending heartbeat
***************************************************************
Running job on hostname %s node064.cm.cluster
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'node064.cm.cluster', 'lane': 'default', 'lane_type': 'node', 'license': False, 'licenses_acquired': 0, 'slots': {'CPU': [8, 9, 10, 11, 12, 13, 14, 15], 'GPU': [1], 'RAM': [1]}, 'target': {'cache_path': '/data/fs1/cryosparc2/', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 1, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 2, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}, {'id': 3, 'mem': 11721506816, 'name': 'GeForce GTX 1080 Ti'}], 'hostname': 'node064.cm.cluster', 'lane': 'default', 'monitor_port': None, 'name': 'node064.cm.cluster', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]}, 'ssh_str': 'user@node064.cm.cluster', 'title': 'Worker node node064.cm.cluster', 'type': 'node', 'worker_bin_path': '/data/fs1/cryosparc2/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
set status to failed
========= main process now complete.
========= monitor process now complete.

And here is the event log.

License is valid.
Launching job on lane default target node064.cm.cluster ...
Running job on master node hostname node064.cm.cluster
[CPU: 84.5 MB]
Job J175 Started
[CPU: 84.5 MB]
Master running v4.0.3, worker running v4.0.3
[CPU: 84.7 MB]
Working in directory: /home/user/user_folder/CryoSPARC/P34/J175
[CPU: 84.7 MB]
Running on lane default
[CPU: 84.7 MB]
Resources allocated:
[CPU: 84.7 MB]
Worker: node064.cm.cluster
[CPU: 84.7 MB]
CPU : [8, 9, 10, 11, 12, 13, 14, 15]
[CPU: 84.7 MB]
GPU : [1]
[CPU: 84.7 MB]
RAM : [1]
[CPU: 84.7 MB]
SSD : False
[CPU: 84.7 MB]
--------------------------------------------------------------
[CPU: 84.7 MB]
Importing job module for job type topaz_train...
[CPU: 210.4 MB]
Job ready to run
[CPU: 210.4 MB]
***************************************************************
[CPU: 210.4 MB]
Topaz is a particle detection tool created by Tristan Bepler and Alex J. Noble. Citations: - Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153-1160 (2019) doi:10.1038/s41592-019-0575-8 - Bepler, T., Noble, A.J., Berger, B. Topaz-Denoise: general deep denoising models for cryoEM. bioRxiv 838920 (2019) doi: https://doi.org/10.1101/838920 Structura Biotechnology Inc. and cryoSPARC do not license Topaz nor distribute Topaz binaries. Please ensure you have your own copy of Topaz licensed and installed under the terms of its GNU General Public License v3.0, available for review at: https://github.com/tbepler/topaz/blob/master/LICENSE. ***************************************************************
[CPU: 423.7 MB]
Starting Topaz process using version 0.2.3...
[CPU: 423.7 MB]
Random seed used is 1558944723
[CPU: 427.4 MB]
--------------------------------------------------------------
[CPU: 427.4 MB]
Starting preprocessing...
[CPU: 427.5 MB]
Starting micrograph preprocessing by running command /primary/vari/software/topaz/default-cryosparc/topaz preprocess --scale 4 --niters 200 --num-workers 8 -o /home/user/user_folder/CryoSPARC/P34/J175/preprocessed [17304 MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]
[CPU: 427.5 MB]
Preprocessing over 8 processes...
[CPU: 430.5 MB]
Inverting negative staining...
[CPU: 434.6 MB]
Inverting negative staining complete.
[CPU: 434.6 MB]
Micrograph preprocessing command complete.
[CPU: 454.7 MB]
Starting particle pick preprocessing by running command /primary/vari/software/topaz/default-cryosparc/topaz convert --down-scale 4 --threshold -6 -o /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed.txt /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_raw.txt
[CPU: 454.7 MB]
Particle pick preprocessing command complete.
[CPU: 454.7 MB]
Preprocessing done in 15105.236s.
[CPU: 454.7 MB]
--------------------------------------------------------------
[CPU: 454.7 MB]
Starting train-test splitting...
[CPU: 454.7 MB]
Starting dataset splitting by running command /primary/vari/software/topaz/default-cryosparc/topaz train_test_split --number 3460 --seed 1558944723 --image-dir /home/user/user_folder/CryoSPARC/P34/J175/preprocessed /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed.txt
[CPU: 454.7 MB]
# splitting 17303 micrographs with 449689 labeled particles into 13843 train and 3460 test micrographs
[CPU: 454.7 MB]
# writing: /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed_train.txt
[CPU: 454.7 MB]
# writing: /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed_test.txt
[CPU: 454.7 MB]
# writing: /home/user/user_folder/CryoSPARC/P34/J175/image_list_train.txt
[CPU: 454.7 MB]
# writing: /home/user/user_folder/CryoSPARC/P34/J175/image_list_test.txt
[CPU: 454.7 MB]
Dataset splitting command complete.
[CPU: 454.7 MB]
Train-test splitting done in 619.567s.
[CPU: 454.7 MB]
--------------------------------------------------------------
[CPU: 454.7 MB]
Starting training...
[CPU: 454.7 MB]
Starting training by running command /primary/vari/software/topaz/default-cryosparc/topaz train --train-images /home/user/user_folder/CryoSPARC/P34/J175/image_list_train.txt --train-targets /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed_train.txt --test-images /home/user/user_folder/CryoSPARC/P34/J175/image_list_test.txt --test-targets /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed_test.txt --num-particles 200 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 8 --cross-validation-seed 1558944723 --radius 32 --num-particles 200 --device 1 --no-pretrained --save-prefix=/home/user/user_folder/CryoSPARC/P34/J175/models/model -o /home/user/user_folder/CryoSPARC/P34/J175/train_test_curve.txt
[CPU: 454.7 MB]
# Loading model: resnet8
[CPU: 454.7 MB]
# Model parameters: units=32, dropout=0.0, bn=on
[CPU: 454.7 MB]
# Receptive field: 71
[CPU: 454.7 MB]
# Using device=1 with cuda=True
[CPU: 454.7 MB]
# Loaded 13843 training micrographs with 359603 labeled particles
[CPU: 454.7 MB]
# Loaded 3460 test micrographs with 90086 labeled particles
[CPU: 454.7 MB]
# source split p_observed num_positive_regions total_regions
[CPU: 454.7 MB]
# 0 train 0.0549 1119066428 20392400160
[CPU: 454.7 MB]
# 0 test 0.055 280371738 5096995200
[CPU: 454.7 MB]
# Specified expected number of particle per micrograph = 200.0
[CPU: 454.7 MB]
# With radius = 32
[CPU: 454.7 MB]
# Setting pi = 0.43567394373845986
[CPU: 454.7 MB]
# minibatch_size=128, epoch_size=5000, num_epochs=10
[CPU: 454.7 MB]
Traceback (most recent call last):
[CPU: 454.7 MB]
File "/primary/vari/software/topaz/default-cryosparc/topaz", line 11, in <module>
[CPU: 454.7 MB]
load_entry_point('topaz-em==0.2.3', 'console_scripts', 'topaz')()
[CPU: 454.7 MB]
File "/primary/vari/software/topaz/topaz-0.2.3-cryosparc/anaconda3/envs/topaz/lib/python3.8/site-packages/topaz/main.py", line 146, in main
[CPU: 454.7 MB]
args.func(args)
[CPU: 454.7 MB]
File "/primary/vari/software/topaz/topaz-0.2.3-cryosparc/anaconda3/envs/topaz/lib/python3.8/site-packages/topaz/commands/train.py", line 675, in main
[CPU: 454.7 MB]
train_iterator,test_iterator = make_data_iterators(train_images, train_targets,
[CPU: 454.7 MB]
File "/primary/vari/software/topaz/topaz-0.2.3-cryosparc/anaconda3/envs/topaz/lib/python3.8/site-packages/topaz/commands/train.py", line 491, in make_data_iterators
[CPU: 454.7 MB]
sampler = StratifiedCoordinateSampler(labels, size=epoch_size*minibatch_size
[CPU: 454.7 MB]
File "/primary/vari/software/topaz/topaz-0.2.3-cryosparc/anaconda3/envs/topaz/lib/python3.8/site-packages/topaz/utils/data/sampler.py", line 92, in __init__
[CPU: 454.7 MB]
P,N = enumerate_pn_coordinates(group)
[CPU: 454.7 MB]
File "/primary/vari/software/topaz/topaz-0.2.3-cryosparc/anaconda3/envs/topaz/lib/python3.8/site-packages/topaz/utils/data/sampler.py", line 20, in enumerate_pn_coordinates
[CPU: 454.7 MB]
N = np.zeros(N_size, dtype=[('image', np.uint32), ('coord', np.uint32)])
[CPU: 454.7 MB]
MemoryError: Unable to allocate 144. GiB for an array with shape (19273333732,) and data type [('image', '<u4'), ('coord', '<u4')]
[CPU: 454.8 MB]
Traceback (most recent call last):
File "cryosparc_worker/cryosparc_compute/run.py", line 93, in cryosparc_compute.run.main
File "/data/fs1/cryosparc2/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 360, in run_topaz_wrapper_train
utils.run_process(train_command)
File "/data/fs1/cryosparc2/cryosparc_worker/cryosparc_compute/jobs/topaz/topaz_utils.py", line 98, in run_process
assert process.returncode == 0, f"Subprocess exited with status {process.returncode} ({str_command})"
AssertionError: Subprocess exited with status 1 (/primary/vari/software/topaz/default-cryosparc/topaz train --train-images /home/user/user_folder/CryoSPARC/P34/J175/image_list_train.txt --train-targets /home/user/user_folder/CryoSPARC/P34/J175/topaz_particles_processed_train.txt --test-im…)

I am not sure how to run Topaz outside CryoSPARC.
Thank you so much.

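This may be the problem: the MemoryError near the end of your event log (Unable to allocate 144. GiB for an array with shape (19273333732,)).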

It is possible that another forum user who is familiar with Topaz can suggest how to overcome this error. Alternatively, you could look for help in the Issues section of the tbepler/topaz repository on GitHub.
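
For what it is worth, the numbers in the event log seem to explain the size of the failed allocation. The sketch below is only arithmetic on values copied from the log: total_regions minus num_positive_regions for the train split gives exactly the array length in the MemoryError, and at 8 bytes per element (two uint32 fields) that comes to roughly 144 GiB. The helper max_train_micrographs and the 32 GiB budget are hypothetical, and the estimate assumes this array grows linearly with the number of training micrographs and ignores every other allocation, so treat the result as a rough guide only.

```python
# Values copied from the "train" row and the MemoryError in the event log.
TOTAL_REGIONS = 20_392_400_160
NUM_POSITIVE_REGIONS = 1_119_066_428
NUM_TRAIN_MICROGRAPHS = 13_843
ARRAY_LENGTH_IN_ERROR = 19_273_333_732

# Each element has two uint32 fields ('image' and 'coord'): 4 + 4 bytes.
ITEM_SIZE_BYTES = 4 + 4
GIB = 2**30

# Unlabeled (negative) regions are everything that is not a positive region.
num_negative_regions = TOTAL_REGIONS - NUM_POSITIVE_REGIONS
bytes_needed = num_negative_regions * ITEM_SIZE_BYTES


def max_train_micrographs(budget_gib: float) -> int:
    """Rough count of training micrographs whose negative-region array
    would fit in budget_gib, assuming linear scaling (an assumption)."""
    bytes_per_micrograph = bytes_needed / NUM_TRAIN_MICROGRAPHS
    return int(budget_gib * GIB // bytes_per_micrograph)


# Does the difference match the array shape reported in the error?
print(num_negative_regions == ARRAY_LENGTH_IN_ERROR)
# Size of the requested array in GiB.
print(f"{bytes_needed / GIB:.1f} GiB")
# Hypothetical 32 GiB budget for this one array.
print(max_train_micrographs(32.0))
```

If the estimate is in the right ballpark, training on a subset of the micrographs, or on a node with more RAM, might get past this allocation, but I have not verified that on this dataset.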

Thank you so much.
I will try that.