Topaz Extract completes, but not really

I have an issue with the Topaz Extract job. After running into the “too many arguments” problem others have reported, I split my dataset into batches of 4000 or 5000 micrographs, and I am running Topaz Extract jobs on them simultaneously with the same pretrained model. Normally, each job takes about 30-45 minutes, but sometimes I get a memory error and the job fails after up to 12 hours. This is not a problem, as I can restart the job when I notice it is taking too long. However, I had failed to notice that many of the jobs that complete within the expected time also have issues: sometimes the job stops picking particles before reaching the end of the dataset (see attached images). The number of picked particles does not correlate with any parameter except the micrograph index.
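To find where picking stops without paging through images, one option is to count picks per micrograph in the Topaz output table. The helpers below are a hypothetical sketch, not part of Topaz or cryoSPARC. They assume the prediction file (e.g. `topaz_particles_prediction.txt`) is tab-separated with an `image_name` column and one row per pick, and that those names match the micrograph names you pass in (check how they appear in your file).

```python
import csv
from collections import Counter
from typing import Iterable, List, Sequence


def count_picks(lines: Iterable[str]) -> Counter:
    """Count picks per micrograph in a Topaz-style prediction table.

    Assumes tab-separated rows with a header containing an 'image_name'
    column, one row per picked particle.
    """
    counts = Counter()
    for row in csv.DictReader(lines, delimiter="\t"):
        counts[row["image_name"]] += 1
    return counts


def micrographs_without_picks(counts: Counter, image_names: Iterable[str]) -> List[str]:
    """Return the micrographs (in input order) that received zero picks."""
    # Counter returns 0 for missing keys, so unpicked names need no special case.
    return [name for name in image_names if counts[name] == 0]


def unpicked_tail_start(counts: Counter, image_names: Sequence[str]) -> int:
    """Index where the trailing run of unpicked micrographs begins.

    Returns len(image_names) if the last micrograph has picks.
    """
    i = len(image_names)
    while i > 0 and counts[image_names[i - 1]] == 0:
        i -= 1
    return i


# Hypothetical usage; names.txt is an assumed list of micrograph names, one per line.
# with open("topaz_particles_prediction.txt", newline="") as fh:
#     counts = count_picks(fh)
# with open("names.txt") as fh:
#     names = [line.strip() for line in fh if line.strip()]
# print(unpicked_tail_start(counts, names), "of", len(names))
```

A few isolated micrographs with no picks can be legitimate (empty ice, for example), so the trailing run is the more telling number: a value of `unpicked_tail_start` well below the number of micrographs matches picking that simply stops partway through a batch.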



When inspecting the jobs with incomplete picking, I found a memory error in those as well:



-------- Submission command: 
sbatch /home/cryoemuppsala/cryosparc_projects/CS-/J268/queue_sub_script.sh

-------- Cluster Job ID: 
474681

-------- Queued on cluster at 2023-11-07 07:30:20.304041

-------- Cluster job status at 2023-11-07 07:30:31.004047 (1 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            474681    cryoem cryospar cryospar  R       0:11      1 a012
[CPU:  164.4 MB]

Job J268 Started
[CPU:  164.4 MB]

Master running v4.3.1, worker running v4.3.1
[CPU:  164.5 MB]

Working in directory: /home/cryoemuppsala/cryosparc_projects/CS-/J268
[CPU:  164.5 MB]

Running on lane Ampere
[CPU:  164.5 MB]

Resources allocated: 
[CPU:  164.5 MB]

  Worker:  Ampere
[CPU:  164.5 MB]

  CPU   :  [0, 1, 2, 3, 4, 5, 6, 7]
[CPU:  164.5 MB]

  GPU   :  [0]
[CPU:  164.5 MB]

  RAM   :  [0]
[CPU:  164.5 MB]

  SSD   :  False
[CPU:  164.5 MB]

--------------------------------------------------------------
[CPU:  164.5 MB]

Importing job module for job type topaz_extract...
[CPU:  221.2 MB]

Job ready to run
[CPU:  221.2 MB]

***************************************************************
[CPU:  221.2 MB]

Topaz is a particle detection tool created by Tristan Bepler and Alex J. Noble.
Citations:
- Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153-1160 (2019) doi:10.1038/s41592-019-0575-8
- Bepler, T., Noble, A.J., Berger, B. Topaz-Denoise: general deep denoising models for cryoEM. bioRxiv 838920 (2019) doi: https://doi.org/10.1101/838920

Structura Biotechnology Inc. and cryoSPARC do not license Topaz nor distribute Topaz binaries. Please ensure you have your own copy of Topaz licensed and installed under the terms of its GNU General Public License v3.0, available for review at: https://github.com/tbepler/topaz/blob/master/LICENSE.
***************************************************************

[CPU:  231.0 MB]

Starting Topaz process using version 0.2.5...
[CPU:  231.0 MB]

Skipping preprocessing.
[CPU:  231.0 MB]

Using preprocessed micrographs from  J213/preprocessed
[CPU:  231.2 MB]

--------------------------------------------------------------
[CPU:  231.2 MB]

Starting preprocessing...

[CPU:  231.2 MB]

Starting micrograph preprocessing by running command /home/opt/wrappers/topaz preprocess --scale 4 --niters 200 --num-workers 8 -o /home/cryoemuppsala/cryosparc_projects/CS-/J213/preprocessed [3943 MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]

[CPU:  231.2 MB]

Preprocessing over 8 processes...
[CPU:  190.9 MB]

multiprocessing.pool.RemoteTraceback:
[CPU:  191.1 MB]

"""
[CPU:  191.1 MB]

Traceback (most recent call last):
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 119, in worker
[CPU:  191.3 MB]

result = (True, func(*args, **kwds))
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 64, in __call__
[CPU:  191.3 MB]

x = downsample(x, self.scale)
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/utils/image.py", line 18, in downsample
[CPU:  191.3 MB]

F = np.fft.rfft2(x)
[CPU:  191.3 MB]

File "<__array_function__ internals>", line 6, in rfft2
[CPU:  191.3 MB]

multiprocessing.pool.RemoteTraceback:
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 1162, in rfft2
[CPU:  191.3 MB]

"""
[CPU:  191.3 MB]

Traceback (most recent call last):
[CPU:  191.3 MB]

return rfftn(a, s, axes, norm)
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 119, in worker
[CPU:  191.3 MB]

File "<__array_function__ internals>", line 6, in rfftn
[CPU:  191.3 MB]

result = (True, func(*args, **kwds))
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 1121, in rfftn
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 64, in __call__
[CPU:  191.3 MB]

a = rfft(a, s[-1], axes[-1], norm)
[CPU:  191.3 MB]

x = downsample(x, self.scale)
[CPU:  191.3 MB]

File "<__array_function__ internals>", line 6, in rfft
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/utils/image.py", line 18, in downsample
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 371, in rfft
[CPU:  191.3 MB]

F = np.fft.rfft2(x)
[CPU:  191.3 MB]

output = _raw_fft(a, n, axis, True, True, inv_norm)
[CPU:  191.3 MB]

File "<__array_function__ internals>", line 6, in rfft2
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 74, in _raw_fft
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 1162, in rfft2
[CPU:  191.3 MB]

r = pfi.execute(a, is_real, is_forward, fct)
[CPU:  191.3 MB]

MemoryError
[CPU:  191.3 MB]

return rfftn(a, s, axes, norm)
[CPU:  191.3 MB]

File "<__array_function__ internals>", line 6, in rfftn
[CPU:  191.3 MB]

"""
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 1121, in rfftn
[CPU:  191.3 MB]

a = rfft(a, s[-1], axes[-1], norm)
[CPU:  191.3 MB]


[CPU:  191.3 MB]

File "<__array_function__ internals>", line 6, in rfft
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 371, in rfft
[CPU:  191.3 MB]

The above exception was the direct cause of the following exception:
[CPU:  191.3 MB]

output = _raw_fft(a, n, axis, True, True, inv_norm)
[CPU:  191.3 MB]


[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/numpy/fft/_pocketfft.py", line 74, in _raw_fft
[CPU:  191.3 MB]

Traceback (most recent call last):
[CPU:  191.3 MB]

r = pfi.execute(a, is_real, is_forward, fct)
[CPU:  191.3 MB]

MemoryError
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/bin/topaz", line 33, in <module>
[CPU:  191.3 MB]

"""
[CPU:  191.3 MB]

sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
[CPU:  191.3 MB]


[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[CPU:  191.3 MB]

The above exception was the direct cause of the following exception:
[CPU:  191.3 MB]

args.func(args)
[CPU:  191.3 MB]


[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 129, in main
[CPU:  191.3 MB]

Traceback (most recent call last):
[CPU:  191.3 MB]

for name in pool.imap_unordered(process, paths):
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/bin/topaz", line 33, in <module>
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 735, in next
[CPU:  191.3 MB]

sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
[CPU:  191.3 MB]

raise value
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[CPU:  191.3 MB]

MemoryError
[CPU:  191.3 MB]

args.func(args)
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 129, in main
[CPU:  191.3 MB]

for name in pool.imap_unordered(process, paths):
[CPU:  191.3 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 735, in next
[CPU:  191.3 MB]

raise value
[CPU:  191.3 MB]

MemoryError
[CPU:  190.4 MB]

multiprocessing.pool.RemoteTraceback:
[CPU:  190.6 MB]

"""
[CPU:  190.6 MB]

Traceback (most recent call last):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 119, in worker
[CPU:  190.6 MB]

result = (True, func(*args, **kwds))
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 61, in __call__
[CPU:  190.6 MB]

x = np.array(load_image(path), copy=False).astype(np.float32)
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/utils/data/loader.py", line 105, in load_image
[CPU:  190.6 MB]

image = load_mrc(path, standardize=standardize)
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/utils/data/loader.py", line 50, in load_mrc
[CPU:  190.6 MB]

content = f.read()
[CPU:  190.6 MB]

MemoryError
[CPU:  190.6 MB]

"""
[CPU:  190.6 MB]


[CPU:  190.6 MB]

The above exception was the direct cause of the following exception:
[CPU:  190.6 MB]


[CPU:  190.6 MB]

Traceback (most recent call last):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/bin/topaz", line 33, in <module>
[CPU:  190.6 MB]

sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[CPU:  190.6 MB]

args.func(args)
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 129, in main
[CPU:  190.6 MB]

for name in pool.imap_unordered(process, paths):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 735, in next
[CPU:  190.6 MB]

raise value
[CPU:  190.6 MB]

MemoryError
[CPU:  190.4 MB]

Traceback (most recent call last):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/bin/topaz", line 33, in <module>
[CPU:  190.6 MB]

sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[CPU:  190.6 MB]

args.func(args)
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 129, in main
[CPU:  190.6 MB]

for name in pool.imap_unordered(process, paths):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 735, in next
[CPU:  190.6 MB]

raise value
[CPU:  190.6 MB]

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f461fe70f28>'. Reason: 'PicklingError("Can't pickle <class 'MemoryError'>: it's not the same object as builtins.MemoryError",)'
[CPU:  190.6 MB]

Traceback (most recent call last):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/bin/topaz", line 33, in <module>
[CPU:  190.6 MB]

sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[CPU:  190.6 MB]

args.func(args)
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/normalize.py", line 129, in main
[CPU:  190.6 MB]

for name in pool.imap_unordered(process, paths):
[CPU:  190.6 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 735, in next
[CPU:  190.6 MB]

raise value
[CPU:  190.6 MB]

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7facf37ff400>'. Reason: 'PicklingError("Can't pickle <class 'MemoryError'>: it's not the same object as builtins.MemoryError",)'
[CPU:  190.8 MB]

Preprocessing command complete.

[CPU:  187.0 MB]

Preprocessing done in 1809.205s.
[CPU:  187.0 MB]

Inverting negative staining...
[CPU:  188.0 MB]

Inverting negative staining complete.

[CPU:  188.0 MB]

--------------------------------------------------------------
[CPU:  188.0 MB]

Starting extraction...

[CPU:  188.1 MB]

Starting extraction by running command /home/opt/wrappers/topaz extract --radius 7 --threshold -6 --up-scale 4 --assignment-radius -1 --min-radius 5 --max-radius 100 --step-radius 5 --num-workers 8 --device 0 --model /home/cryoemuppsala/cryosparc_projects/CS-/J213/models/model_epoch34.sav -o /home/cryoemuppsala/cryosparc_projects/CS-/J268/topaz_particles_prediction.txt [3943 MICROGRAPH PATHS EXCLUDED FOR LEGIBILITY]

[CPU:  188.8 MB]

Traceback (most recent call last):
[CPU:  188.8 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/bin/topaz", line 33, in <module>
[CPU:  188.8 MB]

sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
[CPU:  188.8 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[CPU:  188.8 MB]

args.func(args)
[CPU:  188.8 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/extract.py", line 288, in main
[CPU:  188.8 MB]

for path,score,coords in nms_iterator(stream, radius, threshold, pool=pool):
[CPU:  188.8 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/site-packages/topaz/commands/extract.py", line 79, in nms_iterator
[CPU:  188.8 MB]

for name,score,coords in pool.imap_unordered(process, scores):
[CPU:  188.8 MB]

File "/home/opt/miniconda3/envs/topaz-0.2.5/lib/python3.6/multiprocessing/pool.py", line 735, in next
[CPU:  188.8 MB]

raise value
[CPU:  188.8 MB]

FileNotFoundError: [Errno 2] No such file or directory: '/home/cryoemuppsala/cryosparc_projects/CS-/J213/preprocessed/017569311607672022970_FoilHole_2020105_Data_2013980_2013982_20231102_191229_fractions_patch_aligned_doseweighted.mrc'
[CPU:  188.8 MB]

Extraction command complete.

[CPU:  188.8 MB]

Starting particle pick thresholding by running command /home/opt/wrappers/topaz convert -t 0 -o /home/cryoemuppsala/cryosparc_projects/CS-/J268/topaz_particles_prediction_thresholded.txt /home/cryoemuppsala/cryosparc_projects/CS-/J268/topaz_particles_prediction.txt

[CPU:  188.8 MB]

Particle pick thresholding command complete.

[CPU:   1.21 GB]

Extraction done in 139.818s.
[CPU:   1.21 GB]

--------------------------------------------------------------
[CPU:   1.21 GB]

Finished Topaz process in 1949.93s
[CPU:   1.21 GB]

--------------------------------------------------------------
[CPU:   1.21 GB]

Compiling job outputs...
[CPU:  216.0 MB]

Passing through outputs for output group micrographs from input group micrographs
[CPU:  216.0 MB]

This job outputted results ['micrograph_blob']
[CPU:  216.0 MB]

  Loaded output dset with 3943 items
[CPU:  216.0 MB]

Passthrough results ['ctf', 'mscope_params', 'movie_blob', 'background_blob', 'micrograph_thumbnail_blob_1x', 'micrograph_thumbnail_blob_2x', 'ctf_stats', 'rigid_motion', 'spline_motion', 'micrograph_blob_non_dw']
[CPU:  216.0 MB]

  Loaded passthrough dset with 3943 items
[CPU:  216.0 MB]

  Intersection of output and passthrough has 3943 items
[CPU:  216.1 MB]

Checking outputs for output group micrographs
[CPU:  224.5 MB]

Updating job size...
[CPU:  224.5 MB]

Exporting job and creating csg files...
[CPU:  224.6 MB]

***************************************************************
[CPU:  224.6 MB]

Job complete. Total time 1950.88s
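
In the log above, several `topaz preprocess` workers die with `MemoryError`, yet the job still reports "Preprocessing command complete.", and `topaz extract` then aborts on a preprocessed micrograph it cannot find. Because the extract traceback names only the first missing file it hit, a small check like the hypothetical helper below could list every input without a preprocessed counterpart. It assumes each input is written to `<preprocessed dir>/<input file name without extension>.mrc`, which matches the file name in the `FileNotFoundError` but is an assumption on my part, not documented behaviour.

```python
import os
from typing import Iterable, List


def missing_preprocessed(micrograph_paths: Iterable[str], preprocessed_dir: str) -> List[str]:
    """Return the inputs that have no preprocessed counterpart on disk.

    Assumption: each input <dir>/<stem>.<ext> is expected at
    <preprocessed_dir>/<stem>.mrc (this matches the missing file in the
    FileNotFoundError above, but is not guaranteed for every setup).
    """
    missing = []
    for path in micrograph_paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        expected = os.path.join(preprocessed_dir, stem + ".mrc")
        if not os.path.isfile(expected):
            missing.append(path)
    return missing
```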

Working with incomplete picks can of course be avoided by inspecting them (which is good practice anyway), and the error only seems to happen when I run multiple Topaz Extract jobs at the same time. I still wanted to mention it because, in my opinion, the job should not complete if it did not manage to process all the input files.
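
A minimal sketch of the kind of check I mean, with hypothetical helper names (this is not how cryoSPARC's Topaz wrapper is actually written): treat a non-zero exit status from the external command as a failure, and additionally compare the produced outputs against the inputs before marking the step complete. An uncaught Python exception, like the tracebacks above, normally makes the process exit with a non-zero status, although a shell wrapper such as `/home/opt/wrappers/topaz` may or may not pass that status through.

```python
import subprocess
from typing import Sequence


class IncompleteStepError(RuntimeError):
    """Raised when a step exits cleanly but did not cover all of its inputs."""


def run_step(cmd: Sequence[str]) -> None:
    """Run one external step and fail loudly on a non-zero exit status.

    check=True makes subprocess.run raise CalledProcessError instead of
    returning normally, so a crashed step cannot be logged as 'complete'.
    """
    subprocess.run(list(cmd), check=True)


def require_all_processed(expected: Sequence[str], produced: Sequence[str]) -> None:
    """Raise if any expected input is absent from the produced outputs."""
    missing = sorted(set(expected) - set(produced))
    if missing:
        raise IncompleteStepError(
            f"{len(missing)} of {len(expected)} inputs produced no output, e.g. {missing[0]}"
        )
```

Checking both conditions matters here: the exit status catches the crash itself, while the output comparison would also catch a run that exits cleanly but silently skips inputs.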