@wtempel My apologies for the delay in my reply. Please find below the output of the four commands:
cryosparcm cli "get_job('$cspid', '$csjid', 'job_type', 'version', 'params_spec', 'instance_information', 'input_slot_groups', 'status')"
{'_id': '66a08adab4fd87e3b66cc706', 'input_slot_groups': [{'connections': [{'group_name': 'flex_model', 'job_uid': 'J986', 'slots': [{'group_name': 'flex_model', 'job_uid': 'J986', 'result_name': 'checkpoint', 'result_type': 'flex_model.checkpoint', 'slot_name': 'checkpoint', 'version': 'F'}]}], 'count_max': 1, 'count_min': 1, 'description': '', 'name': 'flex_model', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'checkpoint', 'optional': False, 'title': 'Checkpoint', 'type': 'flex_model.checkpoint'}], 'title': '3DFlex model', 'type': 'flex_model'}, {'connections': [{'group_name': 'particles', 'job_uid': 'J983', 'slots': [{'group_name': 'particles', 'job_uid': 'J1004', 'result_name': 'blob', 'result_type': 'particle.blob', 'slot_name': 'blob_fullres', 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'ctf', 'result_type': 'particle.ctf', 'slot_name': 'ctf', 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'alignments3D', 'result_type': 'particle.alignments3D', 'slot_name': 'alignments3D', 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'blob_train', 'result_type': 'particle.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'blob_train_ctf', 'result_type': 'particle.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'blob', 'result_type': 'particle.blob', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'alignments2D', 'result_type': 'particle.alignments2D', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'filament', 'result_type': 'particle.filament', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'pick_stats', 'result_type': 'particle.pick_stats', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'location', 'result_type': 'particle.location', 'slot_name': None, 'version': 'F'}, {'group_name': 'particles', 'job_uid': 'J983', 'result_name': 'sym_expand', 'result_type': 'particle.sym_expand', 'slot_name': None, 'version': 'F'}]}], 'count_max': 1, 'count_min': 1, 'description': 'Particle stacks to use. 
Multiple stacks will be concatenated.', 'name': 'particles', 'repeat_allowed': False, 'slots': [{'description': '', 'name': 'blob_fullres', 'optional': False, 'title': 'Particle data blobs', 'type': 'particle.blob'}, {'description': '', 'name': 'ctf', 'optional': False, 'title': 'Particle ctf parameters', 'type': 'particle.ctf'}, {'description': '', 'name': 'alignments3D', 'optional': False, 'title': 'Particle 3D alignments', 'type': 'particle.alignments3D'}], 'title': 'Prepared particles', 'type': 'particle'}], 'instance_information': {'CUDA_version': '11.8', 'available_memory': '1.90TB', 'cpu_model': 'AMD EPYC 7763 64-Core Processor', 'driver_version': '12.2', 'gpu_info': [{'id': 0, 'mem': 84987740160, 'name': 'NVIDIA A100-SXM4-80GB'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 128, 'platform_architecture': 'x86_64', 'platform_node': 'saion-gpu26.oist.jp', 'platform_release': '4.18.0-477.15.1.el8_8.x86_64', 'platform_version': '#1 SMP Wed Jun 28 15:04:18 UTC 2023', 'total_memory': '1.97TB', 'used_memory': '59.24GB'}, 'job_type': 'flex_highres', 'params_spec': {'flex_do_noflex_recon': {'value': False}, 'scheduler_no_check_inputs_ready': {'value': True}}, 'project_uid': 'P3', 'status': 'failed', 'uid': 'J1005', 'version': 'v4.4.1'}
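In case the single-line dict above is hard to read, here is a quick Python sketch of my own (not part of CryoSPARC) that pretty-prints the relevant parts; the file name is just a placeholder for wherever the pasted output is saved:

```python
# Sketch only: the get_job() output above is a Python literal (single quotes,
# True/False/None), so ast.literal_eval + pprint is more convenient than json.
import ast
from pprint import pprint

raw_output = open("get_job_P3_J1005.txt").read()  # placeholder file holding the pasted dict
job = ast.literal_eval(raw_output)

pprint({k: job[k] for k in ("job_type", "version", "status", "params_spec")})
pprint(job["instance_information"])
```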
cryosparcm eventlog "$cspid" "$csjid" | head -n 40
License is valid.
Launching job on lane largegpu target largegpu ...
Launching job on cluster largegpu
====================== Cluster submission script: ========================
==========================================================================
#!/usr/bin/env bash
#### cryoSPARC cluster submission script template for SLURM
## Available variables:
## /path/to/cryosparc_worker/bin/cryosparcw run --project P3 --job J1005 --master_hostname hostname --master_command_core_port 39002 > /path/to/project/folder/J1005/job.log 2>&1 - the complete command string to run the job
## 4 - the number of CPUs needed
## 1 - the number of GPUs needed.
## Note: the code will use this many GPUs starting from dev id 0
## the cluster scheduler or this script have the responsibility
## of setting CUDA_VISIBLE_DEVICES so that the job code ends up
## using the correct cluster-allocated GPUs.
## 64.0 - the amount of RAM needed in GB
## /path/to/project/folder/J1005 - absolute path to the job directory
## /path/to/project/folder - absolute path to the project dir
## /path/to/project/folder/J1005/job.log - absolute path to the log file for the job
## /path/to/cryosparc_worker/bin/cryosparcw - absolute path to the cryosparc worker command
## --project P3 --job J1005 --master_hostname saion-login1.oist.jp --master_command_core_port 39002 - arguments to be passed to cryosparcw run
## P3 - uid of the project
## J1005 - uid of the job
## name_of_user - name of the user that created the job (may contain spaces)
## user@email.com - cryosparc username of the user that created the job (usually an email)
##
## What follows is a simple SLURM script:
#SBATCH --job-name cryosparc_P3_J1005
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH --partition=largegpu
#SBATCH --mem=499G
##SBATCH --mem=300000MB
#SBATCH --output=/path/to/project/folder/J1005/job.log
#SBATCH --error=/path/to/project/folder/J1005/job.log
##available_devs=""
##for devidx in $(seq 0 15);
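For reference, the #SBATCH --mem=499G line above is produced by the Jinja expression in our lane's script template (shown in full in the joblog output further down). A minimal sketch of how that expression evaluates, assuming jinja2 is installed; the 7.8 multiplier is only a value that reproduces 499G from the 64 GB request, since the actual ram_gb_multiplier configured for the lane is not shown in these outputs:

```python
# Sketch: evaluate the --mem expression from the SLURM template above.
# ram_gb=64 matches the "64.0 - the amount of RAM needed in GB" line;
# ram_gb_multiplier=7.8 is only a guess that reproduces "--mem=499G".
from jinja2 import Template

mem_tpl = Template("{{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int }}G")

print(mem_tpl.render(ram_gb=64))                         # 64G  (no multiplier set)
print(mem_tpl.render(ram_gb=64, ram_gb_multiplier=7.8))  # 499G (matches the rendered script)
```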
cryosparcm eventlog "$cspid" "$csjid" | tail -n 40
[CPU RAM used: 443 MB] ***************************************************************
[CPU RAM used: 452 MB] ====== 3D Flex Load Checkpoint =======
[CPU RAM used: 452 MB] Loading checkpoint from J986/J986_train_checkpoint_006600.tar ...
[CPU RAM used: 791 MB] Initializing torch..
[CPU RAM used: 791 MB] Initializing model from checkpoint...
Input tetramesh
[CPU RAM used: 1004 MB] Upscaling deformation model to match input volume size...
Upsampled mask
Upsampled tetramesh
[CPU RAM used: 2877 MB] ====== Load particle data =======
[CPU RAM used: 2878 MB] Reading in all particle data on the fly from files...
[CPU RAM used: 2878 MB] Loading a ParticleStack with 12000 items...
[CPU RAM used: 2878 MB] SSD cache : cache successfully synced in_use
[CPU RAM used: 2878 MB] SSD cache : cache successfully synced, found 32,029.75 MB of files on SSD.
[CPU RAM used: 2878 MB] SSD cache : cache successfully requested to check 12 files.
[CPU RAM used: 2878 MB] Detected file change due to modification time.
[CPU RAM used: 2878 MB] SSD cache : cache requires 8,943 MB more on the SSD for files to be downloaded.
[CPU RAM used: 2878 MB] SSD cache : cache has enough available space.
[CPU RAM used: 2878 MB] Needed | 8,943.07 MB
Available | 10,875,058.64 MB
Disk size | 10,984,866.00 MB
Usable space | 10,974,866.00 MB (reserve 10,000 MB)
[CPU RAM used: 2878 MB] Transferring across 2 threads: batch_000010_downsample.mrc (12/12)
Progress | 8,943 MB (100.00%)
Total | 8,943 MB
Average speed | 6,382.42 MB/s
ETA | 0h 0m 0s
[CPU RAM used: 2879 MB] SSD cache : complete, all requested files are available on SSD.
[CPU RAM used: 2884 MB] Done.
[CPU RAM used: 2884 MB] Preparing all particle CTF data...
[CPU RAM used: 2884 MB] Parameter "Force re-do GS split" was off. Using input split..
[CPU RAM used: 2884 MB] Split A contains 6000 particles
[CPU RAM used: 2884 MB] Split B contains 6000 particles
[CPU RAM used: 2884 MB] Setting up particle poses..
[CPU RAM used: 2884 MB] ====== High resolution flexible refinement =======
[CPU RAM used: 2884 MB] Max num L-BFGS iterations was set to 20
[CPU RAM used: 2884 MB] Starting L-BFGS.
[CPU RAM used: 2884 MB] Reconstructing half-map A
[CPU RAM used: 2884 MB] Iteration 0 : 5000 / 6000 particles
[CPU RAM used: 196 MB] ====== Job process terminated abnormally.
cryosparcm joblog "$cspid" "$csjid" | tail -n 40
Project P3 Job J1005
Master master_hostname Port 39002
===========================================================================
========= monitor process now starting main process at 2024-07-24 14:45:33.698634
MAINPROCESS PID 686645
MAIN PID 686645
flex_refine.run_highres cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat at 2024-07-24 14:45:48.301647
========= sending heartbeat at 2024-07-24 14:45:58.321484
========= sending heartbeat at 2024-07-24 14:46:08.341454
========= sending heartbeat at 2024-07-24 14:46:18.353252
========= sending heartbeat at 2024-07-24 14:46:28.367796
========= sending heartbeat at 2024-07-24 14:46:38.380905
========= sending heartbeat at 2024-07-24 14:46:48.401071
========= sending heartbeat at 2024-07-24 14:46:58.421064
========= sending heartbeat at 2024-07-24 14:47:08.441066
***************************************************************
Running job J1005 of type flex_highres
Running job on hostname %s largegpu
Allocated Resources : {'fixed': {'SSD': False}, 'hostname': 'largegpu', 'lane': 'largegpu', 'lane_type': 'cluster', 'license': True, 'licenses_acquired': 1, 'slots': {'CPU': [0, 1, 2, 3], 'GPU': [0], 'RAM': [0, 1, 2, 3, 4, 5, 6, 7]}, 'target': {'cache_path': '/scratch', 'cache_quota_mb': None, 'cache_reserve_mb': 10000, 'custom_var_names': ['ram_gb_multiplier'], 'custom_vars': {}, 'desc': None, 'hostname': 'largegpu', 'lane': 'largegpu', 'name': 'largegpu', 'qdel_cmd_tpl': 'scancel {{ cluster_job_id }}', 'qinfo_cmd_tpl': 'sinfo', 'qstat_cmd_tpl': 'squeue -j {{ cluster_job_id }}', 'qstat_code_cmd_tpl': 'squeue -j {{ cluster_job_id }} --format=%T | sed -n 2p', 'qsub_cmd_tpl': 'sbatch {{ script_path_abs }}', 'script_tpl': '#!/usr/bin/env bash\n#### cryoSPARC cluster submission script template for SLURM\n## Available variables:\n## {{ run_cmd }} - the complete command string to run the job\n## {{ num_cpu }} - the number of CPUs needed\n## {{ num_gpu }} - the number of GPUs needed. \n## Note: the code will use this many GPUs starting from dev id 0\n## the cluster scheduler or this script have the responsibility\n## of setting CUDA_VISIBLE_DEVICES so that the job code ends up\n## using the correct cluster-allocated GPUs.\n## {{ ram_gb }} - the amount of RAM needed in GB\n## {{ job_dir_abs }} - absolute path to the job directory\n## {{ project_dir_abs }} - absolute path to the project dir\n## {{ job_log_path_abs }} - absolute path to the log file for the job\n## {{ worker_bin_path }} - absolute path to the cryosparc worker command\n## {{ run_args }} - arguments to be passed to cryosparcw run\n## {{ project_uid }} - uid of the project\n## {{ job_uid }} - uid of the job\n## {{ job_creator }} - name of the user that created the job (may contain spaces)\n## {{ cryosparc_username }} - cryosparc username of the user that created the job (usually an email)\n##\n## What follows is a simple SLURM script:\n\n#SBATCH --job-name cryosparc_{{ project_uid }}_{{ job_uid }}\n#SBATCH -n {{ num_cpu }}\n#SBATCH --gres=gpu:{{ num_gpu }}\n#SBATCH --partition=largegpu\n#SBATCH --mem={{ (ram_gb|float * (ram_gb_multiplier|default(1))|float)|int}}G\n##SBATCH --mem={{ (300*1000)|int }}MB\n#SBATCH --output={{ job_log_path_abs }}\n#SBATCH --error={{ job_log_path_abs }}\n\n\n##available_devs=""\n##for devidx in $(seq 0 15);\n##do\n## if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then\n## if [[ -z "$available_devs" ]] ; then\n## available_devs=$devidx\n## else\n## available_devs=$available_devs,$devidx\n## fi\n## fi\n##done\n##export CUDA_VISIBLE_DEVICES=$available_devs\n\n{{ run_cmd }}\n', 'send_cmd_tpl': '{{ command }}', 'title': 'largegpu', 'tpl_vars': ['num_cpu', 'job_log_path_abs', 'command', 'cluster_job_id', 'ram_gb', 'num_gpu', 'run_cmd', 'run_args', 'project_uid', 'worker_bin_path', 'job_creator', 'project_dir_abs', 'ram_gb_multiplier', 'cryosparc_username', 'job_uid', 'job_dir_abs'], 'type': 'cluster', 'worker_bin_path': '/path/to/cryosparc_saion-login1/cryosparc_worker/bin/cryosparcw'}}
Transferring across 2 threads: (0/12)
Transferring across 2 threads: batch_000001_downsample.mrc (1/12)
Transferring across 2 threads: batch_000000_downsample.mrc (2/12)
Transferring across 2 threads: batch_000003_downsample.mrc (3/12)
Transferring across 2 threads: batch_000002_downsample.mrc (4/12)
Transferring across 2 threads: batch_000004_downsample.mrc (5/12)
Transferring across 2 threads: batch_000005_downsample.mrc (6/12)
Transferring across 2 threads: batch_000006_downsample.mrc (7/12)
Transferring across 2 threads: batch_000007_downsample.mrc (8/12)
Transferring across 2 threads: batch_000008_downsample.mrc (9/12)
Transferring across 2 threads: batch_000009_downsample.mrc (10/12)
Transferring across 2 threads: batch_000011_downsample.mrc (11/12)
Transferring across 2 threads: batch_000010_downsample.mrc (12/12)
========= sending heartbeat at 2024-07-24 14:47:18.455909
========= sending heartbeat at 2024-07-24 14:47:28.468209
========= sending heartbeat at 2024-07-24 14:47:38.488464
========= sending heartbeat at 2024-07-24 14:47:48.507843
========= main process now complete at 2024-07-24 14:47:55.151271.
========= monitor process now complete at 2024-07-24 14:47:55.230972.
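Since the event log only reports that the job process terminated abnormally, here is a sketch of how I could also query SLURM accounting for the cluster job's final state and peak memory use (the SLURM job id is not in the outputs above, so it is a placeholder, and this assumes sacct is available on the cluster):

```python
# Sketch only: ask SLURM accounting whether the scheduler ended the job
# (e.g. OUT_OF_MEMORY / CANCELLED) and what its peak RSS was.
import subprocess

slurm_job_id = "<slurm_job_id>"  # placeholder: the id printed by sbatch / shown in squeue
result = subprocess.run(
    ["sacct", "-j", slurm_job_id, "--format=JobID,State,ExitCode,MaxRSS,Elapsed"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```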