Child process with PID xxxxxxx terminated unexpectedly with exit code -9

Hi all,

I was running Patch Motion Correction and it stopped partway through, having corrected only roughly half of the micrographs. The error shows ‘Child process with PID 1116753 terminated unexpectedly with exit code -9’. I tried skipping that particular micrograph, but the job corrects only a couple more micrographs before stopping with the same error message. Can anyone tell me how to fix this?

CryoSPARC: v4.5.3
Movies are in TIFF format.

Welcome to the forum @Sdk.
Please can you post the outputs of these commands

cryosparcm eventlog P99 J199 | head -n 40
cryosparcm joblog P99 J199 | tail -n 40
cryosparcm cli "get_job('P99', 'J199', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"

where you replace P99, J199 with the import job’s project and job IDs, respectively.

Thank you so much for your reply!

Here is what I got:

[csparc@biomix ~]$ cryosparcm eventlog P5 J511 | head -n 40
[Sat, 13 Jul 2024 19:25:13 GMT]  License is valid.
[Sat, 13 Jul 2024 19:25:13 GMT]  Launching job on lane biomix target biomix ...
[Sat, 13 Jul 2024 19:25:13 GMT]  Launching job on cluster biomix
[Sat, 13 Jul 2024 19:25:13 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P5_J511
#SBATCH --partition=cryosparc
#SBATCH --output=/mnt/parashar/cspark_files/CS-sm74/J511/job.log
#SBATCH --error=/mnt/parashar/cspark_files/CS-sm74/J511/job.log
#SBATCH --nodes=1
#SBATCH --mem=24000M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:0
#SBATCH --gres-flags=enforce-binding
srun /usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw run --project P5 --job J511 --master_hostname biomix.dbi.udel.edu --master_command_core_port 39002 > /mnt/parashar/cspark_files/CS-sm74/J511/job.log 2>&1 
==========================================================================
==========================================================================
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Submission command: 
sbatch /mnt/parashar/cspark_files/CS-sm74/J511/queue_sub_script.sh
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Cluster Job ID: 
679600
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Queued on cluster at 2024-07-13 15:25:13.622264
[Sat, 13 Jul 2024 19:25:13 GMT]  -------- Cluster job status at 2024-07-13 15:25:13.638782 (0 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            679600 cryosparc cryospar   csparc PD       0:00      1 (None)
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Job J511 Started
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Master running v4.5.3, worker running v4.5.3
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Working in directory: /mnt/parashar/cspark_files/CS-sm74/J511
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Running on lane biomix
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Resources allocated:
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   Worker:  biomix
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   CPU   :  [0]
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   GPU   :  []
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   RAM   :  [0, 1, 2]
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   SSD   :  False
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Importing job module for job type import_movies...


[csparc@biomix ~]$ cryosparcm joblog P5 J511 | tail -n 40  
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/usr/local/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/libpython3.10.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.


================= CRYOSPARCW =======  2024-07-13 19:25:14.729172  =========
Project P5 Job J511
Master biomix.dbi.udel.edu Port 39002
===========================================================================
MAIN PROCESS PID 1114276
========= now starting main process at 2024-07-13 19:25:14.729602
MONITOR PROCESS PID 1114278
========= monitor process now waiting for main process
========= sending heartbeat at 2024-07-13 19:25:15.700782
imports.run cryosparc_compute.jobs.jobregister
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/usr/localMAIN/cryosparc/cryosparc_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
========= sending heartbeat at 2024-07-13 19:25:25.714150
========= sending heartbeat at 2024-07-13 19:25:35.730154
========= sending heartbeat at 2024-07-13 19:25:45.746151
========= sending heartbeat at 2024-07-13 19:25:55.762149
***************************************************************
min: 1020.934750 max: 1277.435368
min: 15255.508331 max: 17065.165497
min: 438.812332 max: 589.635666
min: 92.065414 max: 165.044571
min: 1030.482346 max: 1277.246902
min: 1042.400784 max: 1273.001316
***************************************************************
========= main process now complete at 2024-07-13 19:26:05.021916
========= sending heartbeat at 2024-07-13 19:26:05.778165
  ========= heartbeat failed at 2024-07-13 19:26:05.786042: 


[csparc@biomix ~]$ cryosparcm cli "get_job('P5', 'J511', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"
{'_id': '6692d3fe03031811dbb40d88', 'instance_information': {'available_memory': '247.11GB', 'cpu_model': 'Intel(R) Xeon(R) Silver 4410Y', 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 24, 'platform_architecture': 'x86_64', 'platform_node': 'biomix10', 'platform_release': '5.15.0-105-generic', 'platform_version': '#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024', 'total_memory': '251.55GB', 'used_memory': '1.89GB'}, 'job_type': 'import_movies', 'params_spec': {'accel_kv': {'value': 300}, 'blob_paths': {'value': '/mnt/parashar/LBMS/July_2024_Data-collection/Sm_HEME-KCN/*.tiff'}, 'cs_mm': {'value': 2.7}, 'gainref_path': {'value': '/mnt/parashar/LBMS/July_2024_Data-collection/Sm_CDA/Gain_20240708_105kx.mrc'}, 'psize_A': {'value': 0.4125}, 'total_dose_e_per_A2': {'value': 60}}, 'project_uid': 'P5', 'status': 'completed', 'uid': 'J511', 'version': 'v4.5.3'}

@Sdk Thanks for posting the information. May I additionally ask for the output of the command

cryosparcm eventlog P5 J511 | tail -n 40
[csparc@biomix ~]$ cryosparcm eventlog P5 J511 | tail -n 40
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB]   SSD   :  False
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:25:15 GMT] [CPU RAM used: 92 MB] Importing job module for job type import_movies...
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 308 MB] Job ready to run
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 308 MB] ***************************************************************
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 308 MB] Importing movies from /mnt/parashar/LBMS/July_2024_Data-collection/Sm_HEME-KCN/*.tiff
[Sat, 13 Jul 2024 19:25:20 GMT] [CPU RAM used: 309 MB] Importing 3461 files
[Sat, 13 Jul 2024 19:25:24 GMT] [CPU RAM used: 311 MB] Import paths were unique at level -1
[Sat, 13 Jul 2024 19:25:24 GMT] [CPU RAM used: 311 MB] Importing 3462 files
[Sat, 13 Jul 2024 19:25:24 GMT] [CPU RAM used: 311 MB] 'Skip Header Check' parameter enabled, checking first header only
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 311 MB] Reading headers of gain reference file /mnt/parashar/LBMS/July_2024_Data-collection/Sm_CDA/Gain_20240708_105kx.mrc
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Done importing.
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] ===========================================================
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Loaded 3461 movies.
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]   Common fields:
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                mscope_params/accel_kv :  {300.0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                   mscope_params/cs_mm :  {2.7}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]     mscope_params/total_dose_e_per_A2 :  {60.0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]            mscope_params/exp_group_id :  {25}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]             mscope_params/phase_plate :  {0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]               mscope_params/neg_stain :  {0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                    movie_blob/psize_A :  {0.4125}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]                      movie_blob/shape :  [   50  8184 11520]
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB]          movie_blob/is_gain_corrected :  {0}
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] ===========================================================
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Making example plots. Exposures will be displayed without defect correction.
[Sat, 13 Jul 2024 19:25:25 GMT] [CPU RAM used: 314 MB] Reading file...
[Sat, 13 Jul 2024 19:25:38 GMT]  Raw data J511/imported/010048681913291041280_FoilHole_8936734_Data_8936159_8936161_20240709_124451_fractions.tiff
[Sat, 13 Jul 2024 19:25:40 GMT] [CPU RAM used: 1864 MB] Reading file...
[Sat, 13 Jul 2024 19:25:51 GMT]  Raw data J511/imported/009562861221805786508_FoilHole_8936735_Data_8936159_8936161_20240709_124502_fractions.tiff
[Sat, 13 Jul 2024 19:25:51 GMT] [CPU RAM used: 1896 MB] Reading file...
[Sat, 13 Jul 2024 19:26:03 GMT]  Raw data J511/imported/013620802636650892182_FoilHole_8936737_Data_8936159_8936161_20240709_125020_fractions.tiff
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 1926 MB] Done.
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 1926 MB] --------------------------------------------------------------
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 1926 MB] Compiling job outputs...
[Sat, 13 Jul 2024 19:26:03 GMT] [CPU RAM used: 404 MB] Updating job size...
[Sat, 13 Jul 2024 19:26:04 GMT] [CPU RAM used: 404 MB] Exporting job and creating csg files...
[Sat, 13 Jul 2024 19:26:05 GMT] [CPU RAM used: 404 MB] ***************************************************************
[Sat, 13 Jul 2024 19:26:05 GMT] [CPU RAM used: 404 MB] Job complete. Total time 44.92s

@Sdk My apologies for asking the wrong questions in my earlier posts. I should have asked:
Please can you post the outputs of these commands

cryosparcm eventlog P99 J199 | head -n 40
cryosparcm joblog P99 J199 | tail -n 40
cryosparcm cli "get_job('P99', 'J199', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"

where you replace P99, J199 with the failed patch motion correction job’s project and job IDs, respectively.

Here we go:

[csparc@biomix ~]$ cryosparcm eventlog P5 J515 | head -n 40
[Sun, 14 Jul 2024 14:57:30 GMT]  License is valid.
[Sun, 14 Jul 2024 14:57:30 GMT]  Launching job on lane biomix target biomix ...
[Sun, 14 Jul 2024 14:57:30 GMT]  Launching job on cluster biomix
[Sun, 14 Jul 2024 14:57:30 GMT]  
====================== Cluster submission script: ========================
==========================================================================
#!/bin/bash
#SBATCH --job-name=cryosparc_P5_J515
#SBATCH --partition=cryosparc
#SBATCH --output=/mnt/parashar/cspark_files/CS-sm74/J515/job.log
#SBATCH --error=/mnt/parashar/cspark_files/CS-sm74/J515/job.log
#SBATCH --nodes=1
#SBATCH --mem=16000M
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
srun /usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw run --project P5 --job J515 --master_hostname biomix.dbi.udel.edu --master_command_core_port 39002 > /mnt/parashar/cspark_files/CS-sm74/J515/job.log 2>&1
==========================================================================
==========================================================================
[Sun, 14 Jul 2024 14:57:30 GMT]  -------- Submission command:
sbatch /mnt/parashar/cspark_files/CS-sm74/J515/queue_sub_script.sh
[Sun, 14 Jul 2024 14:57:30 GMT]  -------- Cluster Job ID:
679604
[Sun, 14 Jul 2024 14:57:30 GMT]  -------- Queued on cluster at 2024-07-14 10:57:30.657604
[Sun, 14 Jul 2024 14:57:31 GMT]  -------- Cluster job status at 2024-07-14 10:57:31.069857 (0 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            679604 cryosparc cryospar   csparc  R       0:01      1 biomix10
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Job J515 Started
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Master running v4.5.3, worker running v4.5.3
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Working in directory: /mnt/parashar/cspark_files/CS-sm74/J515
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Running on lane biomix
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Resources allocated:
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   Worker:  biomix
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   CPU   :  [0, 1, 2, 3, 4, 5]
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   GPU   :  [0]
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   RAM   :  [0, 1]
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB]   SSD   :  False
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] --------------------------------------------------------------
[Sun, 14 Jul 2024 14:57:32 GMT] [CPU RAM used: 92 MB] Importing job module for job type patch_motion_correction_multi...
Traceback (most recent call last):
  File "<string>", line 9, in <module>
BrokenPipeError: [Errno 32] Broken pipe

command 2

[csparc@biomix ~]$ cryosparcm joblog P5 J515 | tail -n 40  
========= sending heartbeat at 2024-07-14 22:02:27.167072
========= sending heartbeat at 2024-07-14 22:02:37.182244
========= sending heartbeat at 2024-07-14 22:02:47.198172
========= sending heartbeat at 2024-07-14 22:02:57.215540
========= sending heartbeat at 2024-07-14 22:03:07.230151
========= sending heartbeat at 2024-07-14 22:03:17.239818
========= sending heartbeat at 2024-07-14 22:03:27.253724
========= sending heartbeat at 2024-07-14 22:03:37.271657
========= sending heartbeat at 2024-07-14 22:03:47.289765
========= sending heartbeat at 2024-07-14 22:03:57.307739
========= sending heartbeat at 2024-07-14 22:04:07.326268
========= sending heartbeat at 2024-07-14 22:04:17.345515
========= sending heartbeat at 2024-07-14 22:04:27.362153
========= sending heartbeat at 2024-07-14 22:04:37.380736
========= sending heartbeat at 2024-07-14 22:04:47.399128
========= sending heartbeat at 2024-07-14 22:04:57.416897
========= sending heartbeat at 2024-07-14 22:05:07.434426
========= sending heartbeat at 2024-07-14 22:05:17.450146
========= sending heartbeat at 2024-07-14 22:05:27.467578
========= sending heartbeat at 2024-07-14 22:05:37.484845
========= sending heartbeat at 2024-07-14 22:05:47.502330
========= sending heartbeat at 2024-07-14 22:05:57.520148
========= sending heartbeat at 2024-07-14 22:06:07.537891
========= sending heartbeat at 2024-07-14 22:06:17.556618
========= sending heartbeat at 2024-07-14 22:06:27.574242
========= sending heartbeat at 2024-07-14 22:06:37.590242
========= sending heartbeat at 2024-07-14 22:06:47.608114
========= sending heartbeat at 2024-07-14 22:06:57.626769
========= sending heartbeat at 2024-07-14 22:07:07.645911
========= sending heartbeat at 2024-07-14 22:07:17.665957
  ========= heartbeat failed at 2024-07-14 22:07:17.674537:
========= sending heartbeat at 2024-07-14 22:07:27.684655
  ========= heartbeat failed at 2024-07-14 22:07:27.692649:
========= sending heartbeat at 2024-07-14 22:07:37.702753
  ========= heartbeat failed at 2024-07-14 22:07:37.710694:
 ************* Connection to cryosparc command lost. Heartbeat failed 3 consecutive times at 2024-07-14 22:07:37.710743.
/usr/localMAIN/cryosparc/cryosparc_worker/bin/cryosparcw: line 150: 1116721 Killed                  python -c "import cryosparc_compute.run as run; run.run()" "$@"
slurmstepd-biomix10: error: Detected 1 oom-kill event(s) in StepId=679604.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: biomix10: task 0: Out Of Memory
slurmstepd-biomix10: error: Detected 1 oom-kill event(s) in StepId=679604.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

command 3

[csparc@biomix ~]$ cryosparcm cli "get_job('P5', 'J515', 'version', 'job_type', 'params_spec', 'status', 'instance_information')"
{'_id': '6693e74d03031811dbc0e16c', 'instance_information': {'CUDA_version': '11.8', 'available_memory': '247.14GB', 'cpu_model': 'Intel(R) Xeon(R) Silver 4410Y', 'driver_version': '12.3', 'gpu_info': [{'id': 0, 'mem': 47810936832, 'name': 'NVIDIA L40S', 'pcie': '0000:3d:00'}], 'ofd_hard_limit': 131072, 'ofd_soft_limit': 1024, 'physical_cores': 24, 'platform_architecture': 'x86_64', 'platform_node': 'biomix10', 'platform_release': '5.15.0-105-generic', 'platform_version': '#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024', 'total_memory': '251.55GB', 'used_memory': '1.87GB'}, 'job_type': 'patch_motion_correction_multi', 'params_spec': {'output_fcrop_factor': {'value': '1/2'}}, 'project_uid': 'P5', 'status': 'completed', 'uid': 'J515', 'version': 'v4.5.3'}

command 4

[csparc@biomix ~]$ cryosparcm eventlog P5 J515 | tail -n 40
        Writing background estimate to J515/motioncorrected/010265144360898815905_FoilHole_8956123_Data_8936159_8936161_20240709_144445_fractions_background.mrc ...
        Done in 0.04s
        Writing motion estimates...
        Done in 0.01s
[Sun, 14 Jul 2024 22:06:34 GMT] [CPU RAM used: 10099 MB] -- 0.0: processing 1580 of 3461: J511/imported/012911121486586445851_FoilHole_8956124_Data_8936159_8936161_20240709_144456_fractions.tiff
        loading /mnt/parashar/cspark_files/CS-sm74/J511/imported/012911121486586445851_FoilHole_8956124_Data_8936159_8936161_20240709_144456_fractions.tiff
        Loading raw movie data from J511/imported/012911121486586445851_FoilHole_8956124_Data_8936159_8936161_20240709_144456_fractions.tiff ...
        Done in 4.98s
        Loading gain data from J511/imported/Gain_20240708_105kx.mrc ...
        Done in 0.00s
        Processing ...
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 382 MB] Child process with PID 1116753 terminated unexpectedly with exit code -9.
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 382 MB] ['uid', 'movie_blob/path', 'movie_blob/shape', 'movie_blob/psize_A', 'movie_blob/is_gain_corrected', 'movie_blob/format', 'movie_blob/has_defect_file', 'movie_blob/import_sig', 'micrograph_blob/path', 'micrograph_blob/idx', 'micrograph_blob/shape', 'micrograph_blob/psize_A', 'micrograph_blob/format', 'micrograph_blob/is_background_subtracted', 'micrograph_blob/vmin', 'micrograph_blob/vmax', 'micrograph_blob/import_sig', 'micrograph_blob_non_dw/path', 'micrograph_blob_non_dw/idx', 'micrograph_blob_non_dw/shape', 'micrograph_blob_non_dw/psize_A', 'micrograph_blob_non_dw/format', 'micrograph_blob_non_dw/is_background_subtracted', 'micrograph_blob_non_dw/vmin', 'micrograph_blob_non_dw/vmax', 'micrograph_blob_non_dw/import_sig', 'micrograph_blob_non_dw_AB/path', 'micrograph_blob_non_dw_AB/idx', 'micrograph_blob_non_dw_AB/shape', 'micrograph_blob_non_dw_AB/psize_A', 'micrograph_blob_non_dw_AB/format', 'micrograph_blob_non_dw_AB/is_background_subtracted', 'micrograph_blob_non_dw_AB/vmin', 'micrograph_blob_non_dw_AB/vmax', 'micrograph_blob_non_dw_AB/import_sig', 'micrograph_thumbnail_blob_1x/path', 'micrograph_thumbnail_blob_1x/idx', 'micrograph_thumbnail_blob_1x/shape', 'micrograph_thumbnail_blob_1x/format', 'micrograph_thumbnail_blob_1x/binfactor', 'micrograph_thumbnail_blob_1x/micrograph_path', 'micrograph_thumbnail_blob_1x/vmin', 'micrograph_thumbnail_blob_1x/vmax', 'micrograph_thumbnail_blob_2x/path', 'micrograph_thumbnail_blob_2x/idx', 'micrograph_thumbnail_blob_2x/shape', 'micrograph_thumbnail_blob_2x/format', 'micrograph_thumbnail_blob_2x/binfactor', 'micrograph_thumbnail_blob_2x/micrograph_path', 'micrograph_thumbnail_blob_2x/vmin', 'micrograph_thumbnail_blob_2x/vmax', 'background_blob/path', 'background_blob/idx', 'background_blob/binfactor', 'background_blob/shape', 'background_blob/psize_A', 'rigid_motion/type', 'rigid_motion/path', 'rigid_motion/idx', 'rigid_motion/frame_start', 'rigid_motion/frame_end', 'rigid_motion/zero_shift_frame', 'rigid_motion/psize_A', 'spline_motion/type', 'spline_motion/path', 'spline_motion/idx', 'spline_motion/frame_start', 'spline_motion/frame_end', 'spline_motion/zero_shift_frame', 'spline_motion/psize_A']
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 380 MB] --------------------------------------------------------------
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 380 MB] Compiling job outputs...
[Sun, 14 Jul 2024 22:07:04 GMT] [CPU RAM used: 380 MB] Passing through outputs for output group micrographs from input group movies
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 380 MB] This job outputted results ['micrograph_blob_non_dw', 'micrograph_blob_non_dw_AB', 'micrograph_thumbnail_blob_1x', 'micrograph_thumbnail_blob_2x', 'movie_blob', 'micrograph_blob', 'background_blob', 'rigid_motion', 'spline_motion']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 380 MB]   Loaded output dset with 1579 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 380 MB] Passthrough results ['gain_ref_blob', 'mscope_params']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Loaded passthrough dset with 3461 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Intersection of output and passthrough has 1579 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Output dataset contains:  ['mscope_params', 'gain_ref_blob']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result gain_ref_blob
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result mscope_params
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Passing through outputs for output group micrographs_incomplete from input group movies
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] This job outputted results ['micrograph_blob']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Loaded output dset with 1882 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Passthrough results ['movie_blob', 'gain_ref_blob', 'mscope_params']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Loaded passthrough dset with 3461 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Intersection of output and passthrough has 1882 items
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Output dataset contains:  ['mscope_params', 'gain_ref_blob', 'movie_blob']
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result movie_blob
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result gain_ref_blob
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB]   Outputting passthrough result mscope_params
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Checking outputs for output group micrographs
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Checking outputs for output group micrographs_incomplete
[Sun, 14 Jul 2024 22:07:05 GMT] [CPU RAM used: 381 MB] Updating job size...
[Sun, 14 Jul 2024 22:07:10 GMT] [CPU RAM used: 382 MB] Exporting job and creating csg files...
[Sun, 14 Jul 2024 22:07:10 GMT] [CPU RAM used: 382 MB] ***************************************************************
[Sun, 14 Jul 2024 22:07:10 GMT] [CPU RAM used: 382 MB] Job complete. Total time 25774.07s

The Slurm job may have been terminated because it used more RAM than requested.
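If Slurm accounting is enabled on your cluster, you may also be able to verify directly that the job exceeded its memory request by comparing peak usage against the requested amount, for example (679604 is the Cluster Job ID reported in the eventlog above; the available fields depend on your Slurm setup, and MaxRSS is sampled, so short spikes can be missed):

# on a node where Slurm accounting (sacct) is available
sacct -j 679604 --format=JobID,State,ReqMem,MaxRSS,Elapsed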
Your current Slurm script template likely includes a specification

#SBATCH --mem={{ (ram_gb * 1000) | int }}M

or similar, where ram_gb is a job type-specific estimate that may underestimate actual RAM usage for a particular combination of input data and job parameters.
To confirm this, you could dump and inspect the script template

# on biomix.dbi.udel.edu
mkdir /tmp/biomix.scripts
cd /tmp/biomix.scripts/
cryosparcm cluster dump biomix
grep mem cluster_script.sh

You could modify the script template using a constant

#SBATCH --mem={{ (ram_gb * 1000 * 2) | int }}M

or variable

#SBATCH --mem={{ (ram_gb * 1000 * my_ram_multiplier) | int }}M

multiplier, where the variable my_ram_multiplier would have to be defined as a cluster custom variable (a sketch of the arithmetic follows the list below). The latter (variable) approach helps avoid a scenario where a constant applied to all job submissions would in some cases request more RAM than a given job actually needs. That scenario could lead to:

  • an unnecessary delay in running the job, because the job has to wait longer for a larger RAM allocation even though the extra RAM is not actually needed
  • a job unnecessarily reserving RAM that would therefore be unavailable to other jobs
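As a concrete illustration of the arithmetic: the J515 submission script above was rendered with #SBATCH --mem=16000M, i.e. ram_gb = 16 for that job, so with my_ram_multiplier set to 2 the same template line would render as

# sketch: ram_gb = 16 (as in the J515 script above), my_ram_multiplier = 2
#SBATCH --mem=32000M

The value 2 is only a starting point; depending on your CryoSPARC version, the custom variable can typically be given a value when queuing a job or assigned a default in the cluster configuration.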

The modified script template needs to be uploaded to the CryoSPARC database using the command

cryosparcm cluster connect

Caution: that command overwrites an existing configuration unless a unique "name": is defined inside the cluster_info.json file.
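For completeness, one possible sequence for applying the change, reusing the dump directory from above (this assumes cryosparcm cluster connect reads cluster_info.json and cluster_script.sh from the current working directory, consistent with the dump example earlier):

# on biomix.dbi.udel.edu
cd /tmp/biomix.scripts/
# edit the --mem line in cluster_script.sh and confirm the "name" in cluster_info.json,
# then re-register the configuration
cryosparcm cluster connect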


Thank you so much! I changed the RAM multiplier and now it is working!