Hi, I tried to run Topaz training in CryoSPARC v4.3.1, but the job failed. The error message is below:
Running job on hostname %s gpu04
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'gpu04', 'lane': 'gpu04', 'lane_type': 'node', 'license': False, 'licenses_acquired': 0, 'slots': {'CPU': [0, 1, 2, 3, 4, 5], 'GPU': [0], 'RAM': [0]}, 'target': {'cache_path': '/mnt/sdb/cryoem', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'desc': None, 'gpus': [{'id': 0, 'mem': 16945709056, 'name': 'Tesla V100-SXM2-16GB'}, {'id': 1, 'mem': 16945709056, 'name': 'Tesla V100-SXM2-16GB'}, {'id': 2, 'mem': 16945709056, 'name': 'Tesla V100-SXM2-16GB'}, {'id': 3, 'mem': 16945709056, 'name': 'Tesla V100-SXM2-16GB'}], 'hostname': 'gpu04', 'lane': 'gpu04', 'monitor_port': None, 'name': 'gpu04', 'resource_fixed': {'SSD': True}, 'resource_slots': {'CPU': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]}, 'ssh_str': '1901110494@gpu04', 'title': 'Worker node gpu04', 'type': 'node', 'worker_bin_path': '/gpfs/share/home/1901110494/cryosparc_v4.3.1/cryosparc_worker/bin/cryosparcw'}}
**** handle exception rc
Traceback (most recent call last):
File "cryosparc_master/cryosparc_compute/run.py", line 95, in cryosparc_compute.run.main
File "/gpfs/share/home/1901110494/cryosparc_v4.3.1/cryosparc_worker/cryosparc_compute/jobs/topaz/run_topaz.py", line 359, in run_topaz_wrapper_train
utils.run_process(train_command)
File "/gpfs/share/home/1901110494/cryosparc_v4.3.1/cryosparc_worker/cryosparc_compute/jobs/topaz/topaz_utils.py", line 98, in run_process
assert process.returncode == 0, f"Subprocess exited with status {process.returncode} ({str_command})"
AssertionError: Subprocess exited with status -7 (/gpfs/share/software/tools/anaconda/3-5.2.0/bin/topaz train --train-images /gpfs/share/home/1901110494/Cryosparc_2/CS-gt-417-91-yb07-legobody-nanodisc/J36/image_list_train.txt --train-targets /gpfs/share/home/1901110494/Cryosparc_2/CS-gt-417-91-yb07-legobo…)
set status to failed
========= main process now complete at 2024-01-31 14:08:47.263247.
========= monitor process now complete at 2024-01-31 14:08:47.268762.
Does anyone know how to solve this problem?
Have you confirmed that the topaz training command succeeds outside CryoSPARC?
What is the output of the command
/gpfs/share/software/tools/anaconda/3-5.2.0/bin/python -V
?
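In case it helps, here is a minimal Python sketch for reproducing the call outside CryoSPARC, mirroring the subprocess invocation in the traceback above. The training-list paths are placeholders, since the full command is truncated in the log:

```python
import subprocess

# Minimal reproduction sketch: invoke the same topaz binary the failed job
# used. The --train-images/--train-targets paths are placeholders; substitute
# the full paths from your own job directory (the logged command is truncated).
cmd = [
    "/gpfs/share/software/tools/anaconda/3-5.2.0/bin/topaz", "train",
    "--train-images", "image_list_train.txt",    # placeholder path
    "--train-targets", "image_list_targets.txt", # placeholder path
]
proc = subprocess.run(cmd)
print("topaz exited with status", proc.returncode)
```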
Given the exit status of -7, I wonder whether topaz received a SIGBUS signal, which may be logged in the worker computer's system log. (A negative subprocess exit status means the process was killed by the signal with that number, and signal 7 is SIGBUS on Linux; see the sketch below.) Please can you

- ask your sysadmin to check for related system log entries around 2024-01-31 14:08:47
- try a clone of the training job where you reduce some resource-related settings from their defaults, such as
  - Number of parallel processes: 2
  - Number of CPUs: 2

  as suggested under Topaz Preprocessing very slow
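Here is that sketch, decoding the -7 reported in the log (illustrative Python only, not CryoSPARC's own code):

```python
import signal

# In Python's subprocess module, a negative returncode means the child was
# killed by a signal: returncode == -7 -> signal 7, which is SIGBUS on Linux.
returncode = -7  # the status reported in the failed job's log
if returncode < 0:
    sig = signal.Signals(-returncode)
    print(f"Process was killed by signal {sig.value} ({sig.name})")
```

On a Linux worker this prints "Process was killed by signal 7 (SIGBUS)". SIGBUS often points to memory or memory-mapped I/O problems on the worker, which is why a system-log check around the failure time is worthwhile.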