Errors while running Topaz train jobs

Hello everyone,

I have recently been getting errors while running Topaz train jobs.

It seems inconsistent: sometimes it works, and other times it just crashes.

Please see the error below:

[screenshot of the error message]

This is my GPU workstation configuration:

[screenshot of the GPU configuration]

My CryoSPARC version is v4.5.3.

I would greatly appreciate your help and insights on how to fix this issue. Do I need to upgrade CryoSPARC, or is the CUDA version incompatible?

Many thanks,

Salima

Welcome to the forum, Salima (@S_12_Daou). Please can you:

  1. post the outputs of these commands on the worker node where Topaz training failed:
    uname -a
    free -h
    
  2. let us know whether the node also runs significant workloads outside CryoSPARC
  3. post the outputs of these commands on the CryoSPARC master host. Please replace /path/to with the actual path of the directory that contains the cryosparc_master/ directory.
    cd /path/to/cryosparc_master/ # replace with actual path
    csprojectid='P19'
    csjobid='J220'
    ./bin/cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run', 'input_slot_groups', 'started_at')" 
    ./bin/cryosparcm eventlog $csprojectid $csjobid | tail -n 40
    ./bin/cryosparcm joblog $csprojectid $csjobid | tail -n 20
    cd - # back to previous directory
    

[edited 2025-11-24 with corrected instructions]

Hello @wtempel,

Thanks a lot for your reply,

1- Please see below the outputs of the commands:

csprojectid='P19'
csjobid='J220'
cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status', 'params_spec', 'errors_run', 'input_slot_groups', 'started_at')"
cryosparcm eventlog $csprojectid $csjobid | tail -n 40
cryosparcm joblog $csprojectid $csjobid | tail -n 20
uname -a
free -h

bash: cryosparcm: command not found…
bash: cryosparcm: command not found…
bash: cryosparcm: command not found…

Linux quad3.XXX 5.14.0-570.28.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 24 10:32:22 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

               total        used        free      shared  buff/cache   available
Mem:           125Gi        10Gi        12Gi        21Mi       104Gi       115Gi
Swap:           31Gi       2.4Gi        29Gi

2- I am not sure I understand the second question. Would you please clarify what you mean?

Many thanks for your help; I greatly appreciate it.

Salima

Apologies for the incorrect instructions: cryosparcm is typically not on a regular user's PATH, which is why the bare cryosparcm commands failed with "command not found". Please can you post the outputs of these commands on the CryoSPARC master host. Please replace /path/to with the actual path of the directory that contains the cryosparc_master/ directory:

cd /path/to/cryosparc_master/ # replace with actual path
csprojectid='P19'
csjobid='J220'
./bin/cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run', 'input_slot_groups', 'started_at')" 
./bin/cryosparcm eventlog $csprojectid $csjobid | tail -n 40
./bin/cryosparcm joblog $csprojectid $csjobid | tail -n 20
cd - # back to previous directory

Hello @wtempel,

Thanks a lot for reaching out again; I appreciate it.

Please find below the output of the commands:

[sdaou@quad3 cryoDB]$ cd /cryoDB/cryosparc_master
[sdaou@quad3 cryosparc_master]$ csprojectid='P19'
csjobid='J220'
./bin/cryosparcm cli "get_job('$csprojectid', '$csjobid', 'job_type', 'version', 'instance_information', 'status',  'params_spec', 'errors_run', 'input_slot_groups', 'started_at')"
./bin/cryosparcm eventlog $csprojectid $csjobid | tail -n 40
./bin/cryosparcm joblog $csprojectid $csjobid | tail -n 20

{'_id': '690633ee1b25f930335eeec0', 'errors_run': [{'message': 'Subprocess exited with status 1 (/programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train --train-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_train.txt --train-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_parti…)', 'warning': False}], 'input_slot_groups': [{'connections': [{'group_name': 'exposures_accepted', 'job_uid': 'J219', 'slots': [{'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'micrograph_blob', 'result_type': 'exposure.micrograph_blob', 'slot_name': 'micrograph_blob', 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'mscope_params', 'result_type': 'exposure.mscope_params', 'slot_name': 'mscope_params', 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'movie_blob', 'result_type': 'exposure.movie_blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'background_blob', 'result_type': 'exposure.stat_blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'micrograph_thumbnail_blob_1x', 'result_type': 'exposure.thumbnail_blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'micrograph_thumbnail_blob_2x', 'result_type': 'exposure.thumbnail_blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'ctf', 'result_type': 'exposure.ctf', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'ctf_stats', 'result_type': 'exposure.ctf_stats', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'rigid_motion', 'result_type': 'exposure.motion', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'spline_motion', 'result_type': 'exposure.motion', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'micrograph_blob_non_dw', 'result_type': 'exposure.micrograph_blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'micrograph_blob_non_dw_AB', 'result_type': 'exposure.micrograph_blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'exposures_accepted', 'job_uid': 'J219', 'result_name': 'gain_ref_blob', 'result_type': 'exposure.gain_ref_blob', 'slot_name': None, 'version': 'F'}]}], 'count_max': inf, 'count_min': 1, 'description': 'Micrographs for training Topaz', 'name': 'micrographs', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'micrograph_blob', 'optional': False, 'title': 'Raw micrograph data', 'type': 'exposure.micrograph_blob'}, {'description': '', 'name': 'micrograph_blob_denoised', 'optional': True, 'title': 'Denoised micrograph data', 'type': 'exposure.micrograph_blob'}, {'description': '', 'name': 'mscope_params', 'optional': True, 'title': 'Microscope parameters for identifying negatively stained data', 'type': 'exposure.mscope_params'}], 'title': 'Micrographs', 'type': 'exposure'}, {'connections': [{'group_name': 'particles_accepted', 'job_uid': 'J219', 'slots': [{'group_name': 'particles_accepted', 'job_uid': 'J219', 'result_name': 'location', 'result_type': 'particle.location', 'slot_name': 'location', 'version': 'F'}, {'group_name': 'particles_accepted', 'job_uid': 'J219', 'result_name': 'pick_stats', 'result_type': 'particle.pick_stats', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_accepted', 'job_uid': 'J219', 'result_name': 'alignments3D', 'result_type': 'particle.alignments3D', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_accepted', 'job_uid': 'J219', 'result_name': 'ctf', 'result_type': 'particle.ctf', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_accepted', 'job_uid': 'J219', 'result_name': 'blob', 'result_type': 'particle.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles_accepted', 'job_uid': 'J219', 'result_name': 'alignments2D', 'result_type': 'particle.alignments2D', 'slot_name': None, 'version': 'F'}]}], 'count_max': inf, 'count_min': 1, 'description': 'Particle locations for training Topaz', 'name': 'particles', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'location', 'optional': False, 'title': 'Particle locations', 'type': 'particle.location'}], 'title': 'Particles', 'type': 'particle'}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '117.76GB', 'cpu_model': 'AMD Ryzen Threadripper PRO 5965WX 24-Cores', 'driver_version': '12.9', 'gpu_info': [{'id': 0, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:01:00'}, {'id': 1, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:21:00'}, {'id': 2, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:41:00'}, {'id': 3, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000', 'pcie': '0000:42:00'}], 'ofd_hard_limit': 524288, 'ofd_soft_limit': 1024, 'physical_cores': 24, 'platform_architecture': 'x86_64', 'platform_node': 'quad3.mshri.on.ca', 'platform_release': '5.14.0-570.28.1.el9_6.x86_64', 'platform_version': '#1 SMP PREEMPT_DYNAMIC Thu Jul 24 10:32:22 UTC 2025', 'total_memory': '125.13GB', 'used_memory': '6.17GB'}, 'job_type': 'topaz_train', 'params_spec': {'compute_num_workers': {'value': 1}, 'exec_path': {'value': '/programs/x86_64-linux/topaz/0.2.5-2/bin/topaz'}, 'num_distribute': {'value': 1}, 'num_particles': {'value': 2000}, 'par_diam': {'value': 130}, 'pretrained': {'value': True}}, 'project_uid': 'P19', 'started_at': 'Sun, 02 Nov 2025 21:14:30 GMT', 'status': 'failed', 'uid': 'J220', 'version': 'v4.5.3'}

[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/main.py", line 148, in main
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] args.func(args)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 695, in main
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] , save_prefix=save_prefix, use_cuda=use_cuda, output=output)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 577, in fit_epochs
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] , use_cuda=use_cuda, output=output)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 557, in fit_epoch
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] metrics = step_method.step(X, Y)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/methods.py", line 103, in step
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] score = self.model(X).view(-1)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] result = self.forward(*input, **kwargs)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/model/classifier.py", line 28, in forward
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] z = self.features(x)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] result = self.forward(*input, **kwargs)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/model/features/resnet.py", line 54, in forward
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] z = self.features(x)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] result = self.forward(*input, **kwargs)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] input = module(input)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] result = self.forward(*input, **kwargs)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/topaz/model/features/resnet.py", line 270, in forward
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] y = self.conv(x)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] result = self.forward(*input, **kwargs)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] return self.conv2d_forward(input, self.weight)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] File "/programs/x86_64-linux/topaz/0.2.5-2/topaz_extlib/miniconda3-4.8.2-b5qb/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] self.padding, self.dilation, self.groups)
[Sun, 02 Nov 2025 21:58:26 GMT] [CPU RAM used: 326 MB] RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
[Sun, 02 Nov 2025 21:58:27 GMT] [CPU RAM used: 326 MB] Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "/cryoDB/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 384, in run_topaz_wrapper_train
    utils.run_process(train_command)
  File "/cryoDB/cryosparc_worker/cryosparc_compute/jobs/topaz/topaz_utils.py", line 99, in run_process
    assert process.returncode == 0, f"Subprocess exited with status {process.returncode} ({str_command})"
AssertionError: Subprocess exited with status 1 (/programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train --train-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_train.txt --train-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_parti…)

========= sending heartbeat at 2025-11-02 16:57:36.636152
========= sending heartbeat at 2025-11-02 16:57:46.650376
========= sending heartbeat at 2025-11-02 16:57:56.664358
========= sending heartbeat at 2025-11-02 16:58:06.678569
========= sending heartbeat at 2025-11-02 16:58:16.692996
========= sending heartbeat at 2025-11-02 16:58:26.709436
*****************************************************************
Running job on hostname %s quad3.mshri.on.ca
Allocated Resources :  {'fixed': {'SSD': False}, 'hostname': 'quad3.mshri.on.ca', 'lane': 'default', 'lane_type': 'node', 'license': False, 'licenses_acquired': 0, 'slots': {'CPU': [0], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/ssd_cache', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000'}, {'id': 1, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000'}, {'id': 2, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000'}, {'id': 3, 'mem': 50897289216, 'name': 'NVIDIA RTX A6000'}], 'hostname': 'quad3.mshri.on.ca', 'lane': 'default', 'monitor_port': None, 'name': 'quad3.mshri.on.ca', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'ssh_str': 'cryspc@quad3.mshri.on.ca', 'title': 'Worker node quad3.mshri.on.ca', 'type': 'node', 'worker_bin_path': '/cryoDB/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 115, in cryosparc_master.cryosparc_compute.run.main
  File "/cryoDB/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 384, in run_topaz_wrapper_train
    utils.run_process(train_command)
  File "/cryoDB/cryosparc_worker/cryosparc_compute/jobs/topaz/topaz_utils.py", line 99, in run_process
    assert process.returncode == 0, f"Subprocess exited with status {process.returncode} ({str_command})"
AssertionError: Subprocess exited with status 1 (/programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train --train-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_train.txt --train-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_parti…)
set status to failed
========= main process now complete at 2025-11-02 16:58:36.724043.
========= monitor process now complete at 2025-11-02 16:58:36.728125.

Thanks a lot,

Salima

Thanks @S_12_Daou. Please can you also post the output of the command:

/cryoDB/cryosparc_master/bin/cryosparcm eventlog P19 J220 | grep -m1 "topaz train"

Hello @wtempel,

Thank you,

Please find below the output for the command:

[sdaou@quad3 bin]$ /cryoDB/cryosparc_master/bin/cryosparcm eventlog P19 J220 | grep -m1 "topaz train"
[Sun, 02 Nov 2025 21:50:57 GMT] [CPU RAM used: 326 MB] Starting dataset splitting by running command /programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train_test_split --number 90 --seed 278009619 --image-dir /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/preprocessed /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_particles_processed.txt
[sdaou@quad3 bin]$

Thanks @S_12_Daou. It may be time to try the topaz train command outside CryoSPARC, as suggested in Topaz: CUDNN error: CUDNN STATUS MAPPING ERROR - #2 by alexjamesnoble. What is the output of the following command (note the extra trailing space after "train" inside the quotes)?

/cryoDB/cryosparc_master/bin/cryosparcm eventlog P19 J220 | grep -m1 "topaz train "

(For the run outside CryoSPARC, you may want to modify some parameters to avoid writing into the CryoSPARC project directory.)

Hello @wtempel,

Thank you,

Please find below the outputs of the commands:

[sdaou@quad3 bin]$ eventlog P19 J220 | grep -m1 "topaz train"
bash: eventlog: command not found…
[sdaou@quad3 bin]$ /cryoDB/cryosparc_master/bin/cryosparcm eventlog P19 J220 | grep -m1 "topaz train"
[Sun, 02 Nov 2025 21:50:57 GMT] [CPU RAM used: 326 MB] Starting dataset splitting by running command /programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train_test_split --number 90 --seed 278009619 --image-dir /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/preprocessed /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_particles_processed.txt
[sdaou@quad3 bin]$ /cryoDB/cryosparc_master/bin/cryosparcm eventlog P19 J220 | grep -m1 "topaz train "
[Sun, 02 Nov 2025 21:51:01 GMT] [CPU RAM used: 326 MB] Starting training by running command /programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train --train-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_train.txt --train-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_particles_processed_train.txt --test-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_test.txt --test-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_particles_processed_test.txt --num-particles 2000 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 1 --cross-validation-seed 278009619 --radius 3 --num-particles 2000 --device 0 --save-prefix=/Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/models/model -o /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/train_test_curve.txt
[sdaou@quad3 bin]$

If I want to run it outside CryoSPARC, where do I run it? In which directory on the computer?

Would you please let me know how to modify the parameters?

Many thanks,

Salima

Create and change into an empty directory.

You will want to modify the --save-prefix= and -o parameters such that outputs are created inside the new, empty directory, for example:

cd /where/to/create/new/dir/ # replace with actual, suitable path
mkdir topaz_train_1
cd topaz_train_1/
/usr/bin/nohup /programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train \
  --train-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_train.txt \
  --train-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_particles_processed_train.txt \
  --test-images /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/image_list_test.txt \
  --test-targets /Stx2Data3/Salima/CS-mark3-crbn-ddb1-protac-glacios-october-2025/J220/topaz_particles_processed_test.txt \
  --num-particles 2000 --learning-rate 0.0002 --minibatch-size 128 \
  --num-epochs 10 --method GE-binomial --slack -1 --autoencoder 0 \
  --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 \
  --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 1 \
  --cross-validation-seed 278009619 --radius 3 --num-particles 2000 --device 0 \
  --save-prefix=model -o train_test_curve.txt &
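
Once launched, you can keep an eye on the run; a minimal sketch, assuming nohup's default behavior of writing output to nohup.out in the current directory:

tail -f nohup.out   # follow Topaz training progress and any error messages
nvidia-smi          # confirm the run is using the GPU you intended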

When you run this job outside CryoSPARC, the CryoSPARC internal scheduler will not be "aware" that Topaz is using a GPU and may schedule another job on the same GPU. Depending on how busy the workstation and its GPUs are, you may therefore also want to modify the --device parameter. Instead of --device 0, you could specify, for example, --device 3. Confusingly, this number may not match the GPU # shown by nvidia-smi; it corresponds instead to the id reported by the command

/cryoDB/cryosparc_worker/bin/cryosparcw gpulist
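
If the mapping is unclear, one way to compare the two numbering schemes side by side (a sketch; the --query-gpu fields below are standard nvidia-smi options, not CryoSPARC-specific):

/cryoDB/cryosparc_worker/bin/cryosparcw gpulist              # ids as CryoSPARC reports them
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv    # indices and PCI bus ids as the driver reports them

Alternatively, assuming you would rather sidestep the mapping question, you could pin the process to one physical GPU by setting CUDA_VISIBLE_DEVICES to that GPU's UUID from nvidia-smi -L (UUIDs avoid any ordering ambiguity; PyTorch, which Topaz uses, respects this variable) and keep --device 0 within that restricted view, for example (GPU-xxxxxxxx is a placeholder UUID):

CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx /programs/x86_64-linux/topaz/0.2.5-2/bin/topaz train ... --device 0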