I'm sorry for posting such a long job.log, but as far as I can see, there is no other error message. BTW, I ran the job on a computer cluster.
================= CRYOSPARCW ======= 2023-07-16 20:07:03.980277 =========
Project P8 Job J30
Master sh02-13n06.int Port 39008
===========================================================================
========= monitor process now starting main process at 2023-07-16 20:07:03.980350
MAINPROCESS PID 14080
========= monitor process now waiting for main process
MAIN PID 14080
deep_picker.run_deep_picker cryosparc_compute.jobs.jobregister
2023-07-16 20:07:18.598910: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
========= sending heartbeat at 2023-07-16 20:07:25.949015
========= sending heartbeat at 2023-07-16 20:07:35.965724
2023-07-16 20:07:39.657276: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-07-16 20:07:39.661111: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-07-16 20:07:39.675523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.675848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:03:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.676139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:82:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.676423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:83:00.0 name: NVIDIA TITAN Xp computeCapability: 6.1
coreClock: 1.582GHz coreCount: 30 deviceMemorySize: 11.90GiB deviceMemoryBandwidth: 510.07GiB/s
2023-07-16 20:07:39.676459: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-07-16 20:07:39.683412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-16 20:07:39.683472: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-16 20:07:39.686416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-07-16 20:07:39.687542: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-07-16 20:07:39.692914: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-07-16 20:07:39.695505: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-07-16 20:07:39.702830: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-07-16 20:07:39.705159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
***************************************************************
Running job on hostname %s brunger_gpu
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'brunger_gpu', 'lane': 'brunger_gpu', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 4, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0, 1, 2, 3], 'RAM': [0, 1, 2, 3, 4]}, 'target': {'cache_path': '$L_SCRATCH', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['command'], 'desc': None, 'hostname': 'brunger_gpu', 'lane': 'brunger_gpu', 'name': 'brunger_gpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n##SBATCH -N 1\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH -p brunger \n##SBATCH --nodelist=sh02-13n07\n#SBATCH --mem={{ (ram_gb)|int }}G \n#SBATCH -o {{ job_dir_abs }}\n#SBATCH -e {{ job_dir_abs }}\n#SBATCH --time=120:00:00\n#SBATCH --error=job.err\n#SBATCH --output=job.out\n\n\navailable_devs=""\nfor devidx in $(seq 0 15);\ndo\n    if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n        if [[ -z "$available_devs" ]] ; then\n            available_devs=$devidx\n        else\n            available_devs=$available_devs,$devidx\n        fi\n    fi\ndone\nexport CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n\n', 'send_cmd_tpl': '{{ command }}', 'title': 'ccw0820', 'tpl_vars': ['command', 'num_cpu', 'worker_bin_path', 'run_args', 'job_log_path_abs', 'project_uid', 'job_creator', 'project_dir_abs', 'run_cmd', 'cryosparc_username', 'num_gpu', 'cluster_job_id', 'job_dir_abs', 'job_uid', 'ram_gb'], 'type': 'cluster', 'worker_bin_path': '/home/groups/brunger/software/cryosparc/cryosparc_worker/bin/cryosparcw'}}
========= sending heartbeat at 2023-07-16 20:07:45.982703
......
========= sending heartbeat at 2023-07-16 20:21:01.711007
**** handle exception rc
Traceback (most recent call last):
  File "cryosparc_master/cryosparc_compute/run.py", line 96, in cryosparc_compute.run.main
  File "cryosparc_master/cryosparc_compute/jobs/deep_picker/run_deep_picker.py", line 199, in cryosparc_compute.jobs.deep_picker.run_deep_picker.run_deep_picker_train
AssertionError: Child process with PID 14152 has terminated unexpectedly!
set status to failed
========= main process now complete at 2023-07-16 20:21:11.955807.
========= monitor process now complete at 2023-07-16 20:21:11.961037.