Topaz train job fails in v3.3.1 but worked in v3.2.0

jcheung · March 28, 2022, 4:56pm

I cloned a Topaz train job that completed without any issues in v3.2.0, but it now fails in v3.3.1 with the following error:

“image_list_train.txt does not exist”

and then the process ends with the following message.

File "cryosparc_worker/cryosparc_compute/run.py", line 85, in cryosparc_compute.run.main
File "/srv/cryosparc/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 357, in run_topaz_wrapper_train
assert len(glob.glob(os.path.join(model_dir, '*'))) > 0, "Training failed, no models were created."
AssertionError: Training failed, no models were created

Has anyone encountered this issue? From what I can tell, Topaz is installed in a python 3.6 environment.
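
The assertion in the traceback only checks that the model directory is non-empty, so it is likely a downstream symptom of an earlier step failing. As a rough sketch (not part of cryoSPARC), the same check can be run by hand against a job directory. The function name `diagnose_topaz_outputs`, the `job_dir` argument, and the `models` subdirectory name are assumptions for illustration; the real job layout may differ.

```python
import glob
import os


def diagnose_topaz_outputs(job_dir, model_subdir="models"):
    """Report whether the expected training artifacts are present.

    `job_dir` and `model_subdir` are hypothetical; adjust them to the
    actual layout of the failing job directory.
    """
    image_list = os.path.join(job_dir, "image_list_train.txt")
    model_dir = os.path.join(job_dir, model_subdir)
    return {
        # The file named in the "does not exist" message.
        "image_list_exists": os.path.isfile(image_list),
        # Same glob-based count the assertion in the traceback relies on.
        "model_count": len(glob.glob(os.path.join(model_dir, "*"))),
    }
```

If `image_list_exists` is false, the failure happened before training started, and the missing models are a consequence rather than the cause.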

wtempel · March 31, 2022, 10:25pm

@jcheung The full content of the Overview tab (Show from top) of the job may include additional, upstream indications of failure. The putative full path of image_list_train.txt should be revealed a few lines below the occurrence of Starting train-test splitting…. But it is possible that the sequence of commands had already failed before reaching that point.

jcheung · April 1, 2022, 4:25pm

I’m attaching a screenshot of the overview. Nothing has changed with respect to the Topaz installation when it was working in v3.2.0.

I believe some patches have been released for v3.3.1 since it came out. I’m not sure if those have been applied yet. Are there known Topaz issues in v3.3.1 that the patches fix?

wtempel · April 1, 2022, 4:45pm

@jcheung I suspect there’s useful diagnostic information in Overview (if you scroll up) that might reveal why image_list_train.txt is missing.

jcheung · April 1, 2022, 5:03pm

I’m attaching a screenshot showing the beginning of the overview tab. Does this provide clues as to what could be wrong? This run is a clone of a successful training run that completed without issues before upgrading to v3.3.1. The beginning of the overview is different than the overview from the same run in v3.2.0. I’m not sure if this is due to changes in the Topaz wrapper script or whether something changed in the system, which isn’t something I maintain.

wtempel · April 1, 2022, 5:53pm

We now know the topaz sequence failed during the train/test split or even earlier. To find out more, one can push Show from top in the window’s upper left corner.
There have been changes in the topaz job implementation between cryoSPARC versions 3.2.0 and 3.3.1. Would it be feasible to create a “clean” topaz job via the Job Builder button?

jcheung · April 1, 2022, 6:22pm

Here’s a screenshot of the run showing from top using a clean Topaz job. The job failed with the same error as before.

wtempel · April 1, 2022, 6:43pm

This step may already be failing.

There may be more information as you scroll down. Has topaz been installed in a conda environment and wrapped in a script as described in the guide?

jcheung · April 1, 2022, 6:50pm

I believe it must be installed properly because it was working in v3.2.0 and, as far as I know, it hasn’t been re-installed. I can forward that guide to IT to double check. The log file doesn’t list any errors, but maybe something has changed in the system.

jcheung · April 5, 2022, 4:53pm

I’ve worked with IT to make sure that we are running Topaz through the wrapped script as per the installation instructions. Note that we are able to run Topaz jobs independently of CryoSPARC. I was told that a test run fails in CryoSPARC because the server is running pytorch. Is this meaningful in terms of working towards a resolution? Any insight would be greatly appreciated.

Update: The pytorch issue was only related to having low memory on our card. Once we killed some jobs we were able to verify that Topaz was running properly with the wrapper.

I have now narrowed down the issue. I can run Topaz in projects (wrapper or no wrapper) that were created since the upgrade. In older projects created before the upgrade, Topaz fails (wrapper or no wrapper) for an unknown reason.

wtempel · April 6, 2022, 7:21pm

@jcheung If not being able to run topaz in older projects is a show stopper:

Can you run other jobs in those older projects?
Could project characteristics other than project creation time, such as box size, data volume, etc., cause topaz to fail?
Do you have additional error messages to share?

jcheung · April 6, 2022, 8:06pm

I can re-run other jobs in older projects. I believe I found the source of the issue though. It turns out some of the files in old projects were deleted to free up disk space. I was previously aware of this. I think this part of the log file shows what’s missing.

I can confirm that I am able to run Topaz properly in a newly created project.

Sorry for the confusion.
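
For anyone who lands here with the same error: since the cause turned out to be input files deleted from the old project, a quick existence check on the files a job depends on can save some time. This is only a sketch under assumptions. The helper name `find_missing_files` and the idea of feeding it a list of paths (for example, micrograph paths copied from the job log) are illustrative and not a cryoSPARC feature.

```python
import os


def find_missing_files(paths, base_dir="."):
    """Return the entries of `paths` that do not exist on disk.

    Relative paths are resolved against `base_dir` (hypothetically the
    project directory); absolute paths are checked as given.
    """
    missing = []
    for p in paths:
        full = p if os.path.isabs(p) else os.path.join(base_dir, p)
        if not os.path.exists(full):
            missing.append(p)
    return missing
```

An empty result means every listed file is still present; any returned path is a candidate explanation for a job that works in new projects but fails in an old one.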