I’m trying to run a Topaz extract job and ran into the following error. I used about 9k particles for Topaz train and it ran fine. I have a pretty large dataset (>13k images), and I’m wondering if that’s what is causing Topaz to fail. Any ideas? Maybe @alexjamesnoble has some input on this?
[ CPU: 226.7 MB] Traceback (most recent call last):
  File "cryosparc2_worker/cryosparc2_compute/run.py", line 85, in cryosparc2_compute.run.main
  File "cryosparc2_compute/jobs/topaz/run_topaz.py", line 1090, in run_topaz_wrapper_extract
    utils.run_process(extract_command)
  File "cryosparc2_compute/jobs/topaz/topaz_utils.py", line 37, in run_process
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, close_fds=True, universal_newlines=newlines)
  File "/home/vamsee/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/home/vamsee/software/cryosparc/cryosparc2_worker/deps/anaconda/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long
I repeated the steps and was able to reproduce the error. I’m still unsure why this is happening. The Topaz extract job starts fine and runs for a little while, then fails after a certain amount of time.
So, apparently, splitting the micrographs into 4 sets (~3500 each) is okay for Topaz to handle. I’m not sure what the upper limit is, but it definitely fails at 13k images, and probably well below that.
Sorry for the delay. Judging by the Cryosparc traceback, this looks like a Cryosparc file handling issue, not a Topaz issue. If you run the Topaz command shown in the Cryosparc run (with proper changes to the micrographs list), does Topaz work?
This happens because the command that calls Topaz becomes too long when many micrographs are assigned to each thread. The limit itself comes from the operating system, which caps the total size of a command’s arguments; the subprocess module only reports it. There are a few ways to circumvent this issue:
Split the dataset into subsets using the Exposure Sets Tool job, then run inference on each subset.
Create more threads to decrease the number of micrographs per thread. This can be done by increasing the Number of parallel threads parameter. This may create many threads, so if performance issues arise, it is recommended to decrease the Number of CPUs parameter accordingly.
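Both workarounds come down to keeping each command's micrograph list short enough. A minimal Python 3 sketch of that idea, assuming you have the micrograph paths as a list: it groups paths so that each group stays under a byte budget you choose. The function name and the budget are hypothetical, and this is not how CryoSPARC assigns micrographs to threads.

```python
import os

def split_by_budget(paths, budget_bytes):
    """Greedily group paths so each group's combined size stays within budget_bytes.

    Each path is counted as its encoded length plus one byte for the
    terminating NUL, which approximates how arguments are stored.
    """
    groups, current, used = [], [], 0
    for path in paths:
        size = len(os.fsencode(path)) + 1
        # Start a new group when adding this path would exceed the budget.
        if current and used + size > budget_bytes:
            groups.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        groups.append(current)
    return groups

# Hypothetical file names, 19 characters each (20 bytes with the NUL).
paths = ["/data/mic_%05d.mrc" % i for i in range(10)]
groups = split_by_budget(paths, budget_bytes=50)
print(len(groups), "groups of up to", max(len(g) for g in groups), "paths")
```

Splitting by bytes rather than by a fixed count accounts for the fact that long absolute paths hit the limit with fewer micrographs.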
No, I haven’t tried what you suggested. I’ll give that a shot too and report back. As @jyoo suggested, this is a known limit on the length of the command. I was, however, able to split the dataset into 4, and Topaz extract worked like a charm.
Hi @jyoo - this is a frustrating error to encounter after running Topaz extract for a few hours. It should be possible for cryosparc to detect the number of input micrographs and split the dataset accordingly - or at least run a pre-check to determine the number of micrographs and fail before starting the job, no?
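A pre-check of this kind could look roughly like the Python 3 sketch below. It estimates the size of the argument list and compares it with the limit the system reports through os.sysconf("SC_ARG_MAX"). The function names, the 8-byte pointer size (64-bit system), and the 50% safety margin are assumptions for illustration; the environment variables also count against the same limit, which is why a margin is used. This is not CryoSPARC code.

```python
import os

POINTER_BYTES = 8  # assumption: 64-bit system

def estimated_argv_bytes(args):
    """Rough size of an argument list: each string, its NUL, and one pointer."""
    return sum(len(os.fsencode(a)) + 1 + POINTER_BYTES for a in args)

def system_arg_limit():
    """Limit on the combined size of exec() arguments and environment."""
    return os.sysconf("SC_ARG_MAX")

def check_command_fits(args, limit=None, safety=0.5):
    """Raise ValueError before launching if the command is likely too long."""
    if limit is None:
        limit = system_arg_limit()
    needed = estimated_argv_bytes(args)
    if needed > limit * safety:
        raise ValueError(
            "command needs about %d bytes but the budget is %d; "
            "split the micrographs into smaller sets" % (needed, int(limit * safety))
        )
    return needed

# Hypothetical command with 13,000 micrograph paths.
command = ["topaz", "extract"] + ["/data/mic_%05d.mrc" % i for i in range(13000)]
print("estimated bytes:", estimated_argv_bytes(command))
print("system limit:", system_arg_limit())
```

Running the check before any processing starts would turn a failure after a few hours into an immediate one.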
Hi @olibclarke, if it is of any help, I have generally had luck splitting into sets of fewer than 5k micrographs. Anything above that seems iffy, but 5k has worked every time.
Topaz can read paths to micrographs from a text file, which makes the argument list to the command much simpler (point topaz to the text file containing micrograph paths). This seems like something fixable in cryosparc, and it would be much more user-friendly than having to split the dataset.
Actually, according to the commands’ help messages, it seems that this feature only exists for the training part, not for picking, in topaz version 0.2.4:
$ topaz train --help
[...]
  --train-images TRAIN_IMAGES
                        path to file listing the training images. also accepts
                        directory path from which all images are loaded.
[...]
$ topaz extract --help
[...]
positional arguments:
  paths       paths to image files for processing
[...]
But in version 0.2.5 this help line says:
[...]
positional arguments:
  paths       paths to image files for processing, can also be streamed from stdin
[...]
Not sure exactly what this means, and I don’t have version 0.2.5 installed to test this (I read the help strings from the GitHub repo). But this seems like there is a way other than passing all paths as command-line arguments (and hitting the limit from the shell).
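For what it's worth, streaming paths over stdin from Python would look something like the sketch below. Data sent over stdin does not count toward the argument-list limit. I have not verified how topaz 0.2.5 reads its stdin, so the one-path-per-line format is an assumption, and the example uses a small stand-in command that just counts the lines it receives instead of a real topaz call.

```python
import subprocess
import sys

def run_with_paths_on_stdin(command, paths):
    """Run command, feeding paths to its stdin one per line instead of as arguments."""
    payload = "\n".join(paths) + "\n"
    result = subprocess.run(
        command,
        input=payload,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=True,
    )
    return result.stdout

# Stand-in for the real program: counts the lines it receives on stdin.
COUNT_LINES = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]

# Hypothetical file names; 20,000 paths pass without touching the argument limit.
paths = ["/data/mic_%05d.mrc" % i for i in range(20000)]
print(run_with_paths_on_stdin(COUNT_LINES, paths).strip())
```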
Ah, I just found out that topaz now has a lot more documentation (than back when I first used it). This might be helpful to you: check the documentation for the train and extract commands: Topaz Commands — Topaz 0.2.5 documentation
The new default settings for topaz (discussed here) are very helpful! But every time I use topaz on a large dataset, I invariably forget to split it beforehand and systematically run into this infamous Argument list too long error.
I can always work around this error by splitting the exposures into random subsets of 5000 exposures (I have not tried larger subsets, and I suspect the maximum number of arguments depends on the configuration of the underlying Linux system anyway). But this trips up every newcomer I teach how to use topaz from within CryoSPARC, and since the error message doesn’t suggest a solution, these people remain puzzled until I explain what the problem is and how to fix it.
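To illustrate what a more helpful message could look like, here is a small Python 3 sketch that recognizes this specific failure (errno 7, E2BIG) and attaches a suggestion. The wrapper name and the hint text are made up for illustration; this is not the actual CryoSPARC run_process function.

```python
import errno
import subprocess
import sys

HINT = (
    "The command line is too long for the operating system. "
    "Split the exposures into smaller sets (for example with the "
    "Exposure Sets Tool job) or increase the number of parallel threads."
)

def hint_for(exc):
    """Return a suggestion for 'Argument list too long', or None for other errors."""
    if isinstance(exc, OSError) and exc.errno == errno.E2BIG:
        return HINT
    return None

def run_process_with_hint(command):
    """Run command; re-raise E2BIG failures with an actionable message."""
    try:
        return subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        hint = hint_for(exc)
        if hint is None:
            raise
        raise RuntimeError(hint) from exc

# A short command runs normally.
print(run_process_with_hint([sys.executable, "-c", "print('ok')"]).stdout.strip())
```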