Child process with PID xxxxxxx terminated unexpectedly with exit code -9

Hi all,

I was running Patch Motion Correction and it stopped partway through, having corrected only roughly half of the micrographs. The error shows ‘Child process with PID 1116753 terminated unexpectedly with exit code -9’. I tried skipping that particular micrograph, but the job corrects only a couple more micrographs before stopping with the same error message. Can anyone tell me how to fix this?

CryoSPARC: v4.5.3
Movies are in TIFF format.

Welcome to the forum @Sdk.
Please can you post the outputs of these commands

cryosparcm eventlog P99 J199 | head -n 40
cryosparcm joblog P99 J199 | tail -n 40
cryosparcm cli "get_job('P99', 'J199', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"

where you replace P99, J199 with the import job’s project and job IDs, respectively.

Thank you so much for your reply!

Here is what I got:

[csparc@biomix ~]$ cryosparcm eventlog P5 J511 | head -n 40
[Sat, 13 Jul 2024 19:25:13 GMT]  License is valid.
[Sat, 13 Jul 2024 19:25:13 GMT]  Launching job on lane biomix target biomix ...
[Sat, 13 Jul 2024 19:25:13 GMT]  Launching job on cluster biomix
[Sat, 13 Jul 2024 19:25:13 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P5_J511
#SBATCH --partition=cryosparc
#SBATCH --output=/mnt/parashar/cspark_files/CS-sm74/J511/job.log
#SBATCH --error=/mnt/parashar/cspark_files/CS-sm74/J511/job.log
#SBATCH --nodes=1
#SBATCH --mem=24000M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:0
#SBATCH --gres-flags=enforce-binding
srun /usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw run --project P5 --job J511 --master_hostname biomix.dbi.udel.edu --master_command_core_port 39002 > /mnt/parashar/cspark_files/CS-sm74/J511/job.log 2>&1 
==========================================================================
==========================================================================
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Submission command: 
sbatch /mnt/parashar/cspark_files/CS-sm74/J511/queue_sub_script.sh
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Cluster Job ID: 
679600
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Queued on cluster at 2024-07-13 15:25:13.622264
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Cluster job status at 2024-07-13 15:25:13.638782 (0 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            679600 cryosparc cryospar   csparc PD       0:00      1 (None)
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Job J511 Started
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Master running v4.5.3, worker running v4.5.3
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Working in directory: /mnt/parashar/cspark_files/CS-sm74/J511
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Running on lane biomix
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Resources allocated:
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   Worker:  biomix
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   CPU   :  [0]
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   GPU   :  []
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   RAM   :  [0, 1, 2]
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   SSD   :  False
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Importing job module for job type import_movies...


[csparc@biomix ~]$ cryosparcm joblog P5 J511 | tail -n 40  
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.


================= CRYOSPARCW =======  2024-07-13 19:25:14.729172  =========
Project P5 Job J511
Master biomix.dbi.udel.edu Port 39002
===========================================================================
MAIN PROCESS PID 1114276
========= now starting main process at 2024-07-13 19:25:14.729602
MONITOR PROCESS PID 1114278
========= monitor process now waiting for main process
========= sending heartbeat at 2024-07-13 19:25:15.700782
imports.run cryosparc_compute.jobs.jobregister
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
========= sending heartbeat at 2024-07-13 19:25:25.714150
========= sending heartbeat at 2024-07-13 19:25:35.730154
========= sending heartbeat at 2024-07-13 19:25:45.746151
========= sending heartbeat at 2024-07-13 19:25:55.762149
***************************************************************
min: 1020.934750 max: 1277.435368
min: 15255.508331 max: 17065.165497
min: 438.812332 max: 589.635666
min: 92.065414 max: 165.044571
min: 1030.482346 max: 1277.246902
min: 1042.400784 max: 1273.001316
***************************************************************
========= main process now complete at 2024-07-13 19:26:05.021916
========= sending heartbeat at 2024-07-13 19:26:05.778165
  ========= heartbeat failed at 2024-07-13 19:26:05.786042: 


[csparc@biomix ~]$ cryosparcm cli "get_job('P5', 'J511', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"
{'_id': '6692d3fe03031811dbb40d88', 'instance_information': {'available_memory': '247.11GB', 'cpu_model': 'Intel(R) Xeon(R) Silver 4410Y', 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 24, 'platform_architecture': 'x86_64', 'platform_node': 'biomix10', 'platform_release': '5.15.0-105-generic', 'platform_version': '#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024', 'total_memory': '251.55GB', 'used_memory': '1.89GB'}, 'job_type': 'import_movies', 'params_spec': {'accel_kv': {'value': 300}, 'blob_paths': {'value': '/mnt/parashar/LBMS/July_2024_Data-collection/Sm_HEME-KCN/*.tiff'}, 'cs_mm': {'value': 2.7}, 'gainref_path': {'value': '/mnt/parashar/LBMS/July_2024_Data-collection/Sm_CDA/Gain_20240708_105kx.mrc'}, 'psize_A': {'value': 0.4125}, 'total_dose_e_per_A2': {'value': 60}}, 'project_uid': 'P5', 'status': 'completed', 'uid': 'J511', 'version': 'v4.5.3'}

@Sdk Thanks for posting the information. May I additionally ask for the output of the command

cryosparcm eventlog P5 J511 | tail -n 40
[csparc@biomix ~]$ cryosparcm eventlog P5 J511 | tail -n 40
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   SSD   :  False
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Importing job module for job type import_movies...
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 308 MB] Job ready to run
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 308 MB] ***************************************************************
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 308 MB] Importing movies from /mnt/parashar/LBMS/July_2024_Data-collection/Sm_HEME-KCN/*.tiff
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 309 MB] Importing 3461 files
[Sat, 13 Jul 2024 19:25:24 GMT] [CPU RAM used: 311 MB] Import paths were unique at level -1
[Sat, 13 Jul 2024 19:25:24 GMT] [CPU RAM used: 311 MB] Importing 3462 files
[Sat, 13 Jul 2024 19:25:24 GMT] [CPU RAM used: 311 MB] 'Skip Header Check' parameter enabled, checking first header only
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 311 MB] Reading headers of gain reference file /mnt/parashar/LBMS/July_2024_Data-collection/Sm_CDA/Gain_20240708_105kx.mrc
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Done importing.
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] ===========================================================
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Loaded 3461 movies.
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]   Common fields:
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                mscope_params/accel_kv :  {300.0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                   mscope_params/cs_mm :  {2.7}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]     mscope_params/total_dose_e_per_A2 :  {60.0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]            mscope_params/exp_group_id :  {25}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]             mscope_params/phase_plate :  {0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]               mscope_params/neg_stain :  {0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                    movie_blob/psize_A :  {0.4125}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                      movie_blob/shape :  [   50  8184 11520]
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]          movie_blob/is_gain_corrected :  {0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] ===========================================================
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Making example plots. Exposures will be displayed without defect correction.
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Reading file...
[Sat, 13 Jul 2024 19:25:38 GMT]  Raw data J511/imported/010048681913291041280_FoilHole_8936734_Data_8936159_8936161_20240709_124451_fractions.tiff
[Sat, 13 Jul 2024 19:25:40 GMT] [CPU RAM used: 1864 MB] Reading file...
[Sat, 13 Jul 2024 19:25:51 GMT]  Raw data J511/imported/009562861221805786508_FoilHole_8936735_Data_8936159_8936161_20240709_124502_fractions.tiff
[Sat, 13 Jul 2024 19:25:51 GMT] [CPU RAM used: 1896 MB] Reading file...
[Sat, 13 Jul 2024 19:26:03 GMT]  Raw data J511/imported/013620802636650892182_FoilHole_8936737_Data_8936159_8936161_20240709_125020_fractions.tiff
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 1926 MB] Done.
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 1926 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 1926 MB] Compiling job outputs...
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 404 MB] Updating job size...
[Sat, 13 Jul 2024 19:26:04 GMT] [CPU RAM used: 404 MB] Exporting job and creating csg files...
[Sat, 13 Jul 2024 19:26:05 GMT] [CPU RAM used: 404 MB] ***************************************************************
[Sat, 13 Jul 2024 19:26:05 GMT] [CPU RAM used: 404 MB] Job complete. Total time 44.92s

@Sdk My apologies for asking the wrong questions in my earlier posts. I should have asked:
Please can you post the outputs of these commands

cryosparcm eventlog P99 J199 | head -n 40
cryosparcm joblog P99 J199 | tail -n 40
cryosparcm cli "get_job('P99', 'J199', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"

where you replace P99, J199 with the failed patch motion correction job’s project and job IDs, respectively.

Here we go:

[csparc@biomix ~]$ cryosparcm eventlog P5 J515 | head -n 40
[Sun, 14 Jul 2024 14:57:30 GMT]  License is valid.
[Sun, 14 Jul 2024 14:57:30 GMT]  Launching job on lane biomix target biomix ...
[Sun, 14 Jul 2024 14:57:30 GMT]  Launching job on cluster biomix
[Sun, 14 Jul 2024 14:57:30 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P5_J515
#SBATCH --partition=cryosparc
#SBATCH --output=/mnt/parashar/cspark_files/CS-sm74/J515/job.log
#SBATCH --error=/mnt/parashar/cspark_files/CS-sm74/J515/job.log
#SBATCH --nodes=1
#SBATCH --mem=16000M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
srun /usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw run --project P5 --job J515 --master_hostname biomix.dbi.udel.edu --master_command_core_port 39002 > /mnt/parashar/cspark_files/CS-sm74/J515/job.log 2>&1
==========================================================================
==========================================================================
[Sun, 14 Jul 2024 14:57:30 GMT]  -------- Submission command:
sbatch /mnt/parashar/cspark_files/CS-sm74/J515/queue_sub_script.sh
[Sun, 14 Jul 2024 14:57:30 GMT]  -------- Cluster Job ID:
679604
[Sun, 14 Jul 2024 14:57:30 GMT]  -------- Queued on cluster at 2024-07-14 10:57:30.657604
[Sun, 14 Jul 2024 14:57:31 GMT]  -------- Cluster job status at 2024-07-14 10:57:31.069857 (0 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            679604 cryosparc cryospar   csparc  R       0:01      1 biomix10
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Job J515 Started
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Master running v4.5.3, worker running v4.5.3
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Working in directory: /mnt/parashar/cspark_files/CS-sm74/J515
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Running on lane biomix
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Resources allocated:
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   Worker:  biomix
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   CPU   :  [0, 1, 2, 3, 4, 5]
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   GPU   :  [0]
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   RAM   :  [0, 1]
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   SSD   :  False
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Importing job module for job type patch_motion_correction_multi...
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe

command 2

[csparc@biomix ~]$ cryosparcm joblog P5 J515 | tail -n 40  
========= sending heartbeat at 2024-07-14 22:02:27.167072
========= sending heartbeat at 2024-07-14 22:02:37.182244
========= sending heartbeat at 2024-07-14 22:02:47.198172
========= sending heartbeat at 2024-07-14 22:02:57.215540
========= sending heartbeat at 2024-07-14 22:03:07.230151
========= sending heartbeat at 2024-07-14 22:03:17.239818
========= sending heartbeat at 2024-07-14 22:03:27.253724
========= sending heartbeat at 2024-07-14 22:03:37.271657
========= sending heartbeat at 2024-07-14 22:03:47.289765
========= sending heartbeat at 2024-07-14 22:03:57.307739
========= sending heartbeat at 2024-07-14 22:04:07.326268
========= sending heartbeat at 2024-07-14 22:04:17.345515
========= sending heartbeat at 2024-07-14 22:04:27.362153
========= sending heartbeat at 2024-07-14 22:04:37.380736
========= sending heartbeat at 2024-07-14 22:04:47.399128
========= sending heartbeat at 2024-07-14 22:04:57.416897
========= sending heartbeat at 2024-07-14 22:05:07.434426
========= sending heartbeat at 2024-07-14 22:05:17.450146
========= sending heartbeat at 2024-07-14 22:05:27.467578
========= sending heartbeat at 2024-07-14 22:05:37.484845
========= sending heartbeat at 2024-07-14 22:05:47.502330
========= sending heartbeat at 2024-07-14 22:05:57.520148
========= sending heartbeat at 2024-07-14 22:06:07.537891
========= sending heartbeat at 2024-07-14 22:06:17.556618
========= sending heartbeat at 2024-07-14 22:06:27.574242
========= sending heartbeat at 2024-07-14 22:06:37.590242
========= sending heartbeat at 2024-07-14 22:06:47.608114
========= sending heartbeat at 2024-07-14 22:06:57.626769
========= sending heartbeat at 2024-07-14 22:07:07.645911
========= sending heartbeat at 2024-07-14 22:07:17.665957
  ========= heartbeat failed at 2024-07-14 22:07:17.674537:
========= sending heartbeat at 2024-07-14 22:07:27.684655
  ========= heartbeat failed at 2024-07-14 22:07:27.692649:
========= sending heartbeat at 2024-07-14 22:07:37.702753
  ========= heartbeat failed at 2024-07-14 22:07:37.710694:
 ************* Connection to cryosparc command lost. Heartbeat failed 3 consecutive times at 2024-07-14 22:07:37.710743.
/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 1116721 Killed                  python -c "import cryosparc_compute.run as run; run.run()" "$@"
slurmstepd-biomix10: error: Detected 1 oom-kill event(s) in StepId=679604.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: biomix10: task 0: Out Of Memory
slurmstepd-biomix10: error: Detected 1 oom-kill event(s) in StepId=679604.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

command 3

[csparc@biomix ~]$ cryosparcm cli "get_job('P5', 'J515', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"
{'_id': '6693e74d03031811dbc0e16c', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '247.14GB', 'cpu_model': 'Intel(R) Xeon(R) Silver 4410Y', 'driver_version': '12.3', 'gpu_info': [{'id': 0, 'mem': 47810936832, 'name': 'NVIDIA L40S', 'pcie': '0000:3d:00'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 24, 'platform_architecture': 'x86_64', 'platform_node': 'biomix10', 'platform_release': '5.15.0-105-generic', 'platform_version': '#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024', 'total_memory': '251.55GB', 'used_memory': '1.87GB'}, 'job_type': 'patch_motion_correction_multi', 'params_spec': {'output_fcrop_factor': {'value': '1/2'}}, 'project_uid': 'P5', 'status': 'completed', 'uid': 'J515', 'version': 'v4.5.3'}

command 4

[csparc@biomix ~]$ cryosparcm eventlog P5 J515 | tail -n 40
        Writing background estimate to J515/motioncorrected/010265144360898815905_FoilHole_8956123_Data_8936159_8936161_20240709_144445_fractions_background.mrc ...
        Done in 0.04s
        Writing motion estimates...
        Done in 0.01s
[Sun, 14 Jul 2024 22:06:34 GMT] [CPU RAM used: 10099 MB] -- 0.0: processing 1580 of 3461: J511/imported/012911121486586445851_FoilHole_8956124_Data_8936159_8936161_20240709_144456_fractions.tiff
        loading /mnt/parashar/cspark_files/CS-sm74/J511/imported/012911121486586445851_FoilHole_8956124_Data_8936159_8936161_20240709_144456_fractions.tiff
        Loading raw movie data from J511/imported/012911121486586445851_FoilHole_8956124_Data_8936159_8936161_20240709_144456_fractions.tiff ...
        Done in 4.98s
        Loading gain data from J511/imported/Gain_20240708_105kx.mrc ...
        Done in 0.00s
        Processing ...
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 382 MB] Child process with PID 1116753 terminated unexpectedly with exit code -9.
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 382 MB] ['uid', 'movie_blob/path', 'movie_blob/shape', 'movie_blob/psize_A', 'movie_blob/is_gain_corrected', 'movie_blob/format', 'movie_blob/has_defect_file', 'movie_blob/import_sig', 'micrograph_blob/path', 'micrograph_blob/idx', 'micrograph_blob/shape', 'micrograph_blob/psize_A', 'micrograph_blob/format', 'micrograph_blob/is_background_subtracted', 'micrograph_blob/vmin', 'micrograph_blob/vmax', 'micrograph_blob/import_sig', 'micrograph_blob_non_dw/path', 'micrograph_blob_non_dw/idx', 'micrograph_blob_non_dw/shape', 'micrograph_blob_non_dw/psize_A', 'micrograph_blob_non_dw/format', 'micrograph_blob_non_dw/is_background_subtracted', 'micrograph_blob_non_dw/vmin', 'micrograph_blob_non_dw/vmax', 'micrograph_blob_non_dw/import_sig', 'micrograph_blob_non_dw_AB/path', 'micrograph_blob_non_dw_AB/idx', 'micrograph_blob_non_dw_AB/shape', 'micrograph_blob_non_dw_AB/psize_A', 'micrograph_blob_non_dw_AB/format', 'micrograph_blob_non_dw_AB/is_background_subtracted', 'micrograph_blob_non_dw_AB/vmin', 'micrograph_blob_non_dw_AB/vmax', 'micrograph_blob_non_dw_AB/import_sig', 'micrograph_thumbnail_blob_1x/path', 'micrograph_thumbnail_blob_1x/idx', 'micrograph_thumbnail_blob_1x/shape', 'micrograph_thumbnail_blob_1x/format', 'micrograph_thumbnail_blob_1x/binfactor', 'micrograph_thumbnail_blob_1x/micrograph_path', 'micrograph_thumbnail_blob_1x/vmin', 'micrograph_thumbnail_blob_1x/vmax', 'micrograph_thumbnail_blob_2x/path', 'micrograph_thumbnail_blob_2x/idx', 'micrograph_thumbnail_blob_2x/shape', 'micrograph_thumbnail_blob_2x/format', 'micrograph_thumbnail_blob_2x/binfactor', 'micrograph_thumbnail_blob_2x/micrograph_path', 'micrograph_thumbnail_blob_2x/vmin', 'micrograph_thumbnail_blob_2x/vmax', 'background_blob/path', 'background_blob/idx', 'background_blob/binfactor', 'background_blob/shape', 'background_blob/psize_A', 'rigid_motion/type', 'rigid_motion/path', 'rigid_motion/idx', 'rigid_motion/frame_start', 'rigid_motion/frame_end', 'rigid_motion/zero_shift_frame', 'rigid_motion/psize_A', 'spline_motion/type', 'spline_motion/path', 'spline_motion/idx', 'spline_motion/frame_start', 'spline_motion/frame_end', 'spline_motion/zero_shift_frame', 'spline_motion/psize_A']
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 380 MB] --------------------------------------------------------------
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 380 MB] Compiling job outputs...
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 380 MB] Passing through outputs for output group micrographs from input group movies
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 380 MB] This job outputted results ['micrograph_blob_non_dw', 'micrograph_blob_non_dw_AB', 'micrograph_thumbnail_blob_1x', 'micrograph_thumbnail_blob_2x', 'movie_blob', 'micrograph_blob', 'background_blob', 'rigid_motion', 'spline_motion']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 380 MB]   Loaded output dset with 1579 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 380 MB] Passthrough results ['gain_ref_blob', 'mscope_params']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Loaded passthrough dset with 3461 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Intersection of output and passthrough has 1579 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Output dataset contains:  ['mscope_params', 'gain_ref_blob']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result gain_ref_blob
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result mscope_params
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Passing through outputs for output group micrographs_incomplete from input group movies
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] This job outputted results ['micrograph_blob']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Loaded output dset with 1882 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Passthrough results ['movie_blob', 'gain_ref_blob', 'mscope_params']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Loaded passthrough dset with 3461 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Intersection of output and passthrough has 1882 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Output dataset contains:  ['mscope_params', 'gain_ref_blob', 'movie_blob']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result movie_blob
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result gain_ref_blob
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result mscope_params
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Checking outputs for output group micrographs
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Checking outputs for output group micrographs_incomplete
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Updating job size...
[Sun, 14 Jul 2024 22:07:10 GMT] [CPU RAM used: 382 MB] Exporting job and creating csg files...
[Sun, 14 Jul 2024 22:07:10 GMT] [CPU RAM used: 382 MB] ***************************************************************
[Sun, 14 Jul 2024 22:07:10 GMT] [CPU RAM used: 382 MB] Job complete. Total time 25774.07s

The Slurm job may have been terminated because it used more RAM than requested.
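If Slurm accounting is enabled on your cluster, you may also be able to verify directly that the job exceeded its memory request by comparing peak usage against the requested amount, for example (679604 is the Cluster Job ID reported in the eventlog above; the available fields depend on your Slurm setup, and MaxRSS is sampled, so short spikes can be missed):

# on a node where Slurm accounting (sacct) is available
sacct -j 679604 --format=JobID,State,ReqMem,MaxRSS,Elapsed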
Your current Slurm script template likely includes a specification

#SBATCH --mem={{ (ram_gb * 1000) | int }}M

or similar, where ram_gb is a job type-specific estimate that may underestimate actual RAM usage for a particular combination of input data and job parameters.
To confirm this, you could dump and inspect the script template

# on biomix.dbi.udel.edu
mkdir /tmp/biomix.scripts
cd /tmp/biomix.scripts/
cryosparcm cluster dump biomix
grep mem cluster_script.sh

You could modify the script template using a constant

#SBATCH --mem={{ (ram_gb * 1000 * 2) | int }}M

or variable

#SBATCH --mem={{ (ram_gb * 1000 * my_ram_multiplier) | int }}M

multiplier, where the variable my_ram_multiplier would have to be defined as a cluster custom variable (a sketch of the arithmetic follows the list below). The latter (variable) approach helps avoid a scenario where a constant applied to all job submissions would in some cases request more RAM than a given job actually needs. That scenario could lead to:

  • an unnecessary delay in running the job, because the job has to wait longer for a larger RAM allocation even though the extra RAM is not actually needed
  • a job unnecessarily reserving RAM that would therefore be unavailable to other jobs
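As a concrete illustration of the arithmetic: the J515 submission script above was rendered with #SBATCH --mem=16000M, i.e. ram_gb = 16 for that job, so with my_ram_multiplier set to 2 the same template line would render as

# sketch: ram_gb = 16 (as in the J515 script above), my_ram_multiplier = 2
#SBATCH --mem=32000M

The value 2 is only a starting point; depending on your CryoSPARC version, the custom variable can typically be given a value when queuing a job or assigned a default in the cluster configuration.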

The modified script template needs to be uploaded to the CryoSPARC database using the command

cryosparcm cluster connect

Caution: that command overwrites an existing configuration unless a unique "name": is defined inside the cluster_info.json file.
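For completeness, one possible sequence for applying the change, reusing the dump directory from above (this assumes cryosparcm cluster connect reads cluster_info.json and cluster_script.sh from the current working directory, consistent with the dump example earlier):

# on biomix.dbi.udel.edu
cd /tmp/biomix.scripts/
# edit the --mem line in cluster_script.sh and confirm the "name" in cluster_info.json,
# then re-register the configuration
cryosparcm cluster connect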


Thank you so much! I changed the RAM multiplier and now it is working!