I'm sorry for posting such a long job.log, but as far as I can see, there is no other error message. BTW, I ran the job on a computer cluster.
================= CRYOSPARCW ======= 2023-07-16 20:07:03.980277 =========
Project P8 Job J30
Master sh02-13n06.int Port 39008
===========================================================================
========= monitor process now starting main process at 2023-07-16 20:07:03.980350
MAINPROCESS PID 14080
========= monitor process now waiting for main process
MAIN PID 14080
deep_picker.run_deep_picker cryosparc_compute.jobs.jobregister
2023-07-16 20:07:18.598910: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
========= sending heartbeat at 2023-07-16 20:07:25.949015
========= sending heartbeat at 2023-07-16 20:07:35.965724
2023-07-16 20:07:39.657276: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-07-16 20:07:39.661111: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-07-16 20:07:39.675523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.675848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:03:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.676139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:82:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.676423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:83:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.676459: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-07-16 20:07:39.683412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-16 20:07:39.683472: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-16 20:07:39.686416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-07-16 20:07:39.687542: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-07-16 20:07:39.692914: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-07-16 20:07:39.695505: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-07-16 20:07:39.702830: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-07-16 20:07:39.705159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
***************************************************************
Running job on hostname %s brunger_gpu
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'brunger_gpu', 'lane': 'brunger_gpu', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 4, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4]}, 'target': {'cache_path': '$L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['command'], 'desc': None, 'hostname': 'brunger_gpu', 'lane': 'brunger_gpu', 'name': 'brunger_gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p brunger \n##SBATCH --nodelist=sh02-13n07\n#SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=120:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'ccw0820', 'tpl_vars': ['command', 'num_cpu', 'worker_bin_path', 'run_args', 'job_log_path_abs', 'project_uid', 'job_creator', 'project_dir_abs', 'run_cmd', 'cryosparc_username', 'num_gpu', 'cluster_job_id', 'job_dir_abs', 'job_uid', 'ram_gb'], 'type': 'cluster', 'worker_bin_path': '/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw'}}
========= sending heartbeat at 2023-07-16 20:07:45.982703
......
========= sending heartbeat at 2023-07-16 20:21:01.711007
**** handle exception rc
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 199, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Child process with PID 14152 has terminated unexpectedly!
set status to failed
========= main process now complete at 2023-07-16 20:21:11.955807.
========= monitor process now complete at 2023-07-16 20:21:11.961037.